2026-05-15 · DataMesh Consulting

15 May — Intelligence dashboard gets weekly quotas, cached-vs-billable split, and live provider rate-limit panel

Three small but operationally important additions to the operator dashboard. Weekly budget cards stop us silently overspending across a week of normal-looking days. The 24-hour LLM call count now separates cached hits from real billable calls, so the headline number reflects actual spend instead of being padded by cache returns. And a new provider rate-limit panel captures Moonshot's `x-ratelimit-*` response headers in real time so we can see how close we are to throttling before it bites.

Where this picks up from yesterday

The V3 self-healing refactor that landed yesterday gave us the data: every Kimi call writes an analytics.llm_call_logs row with model, tokens in/out, duration, cached flag, and siteId. The Intelligence dashboard tab gave us the first read on it — daily token spend, per-site attribution, top keywords by cost.

What it didn't give us was forward visibility. Three blind spots remained, and today's commit closes them.

1 — Weekly budget cards

The existing dashboard tracked daily budget usage against AI_DAILY_TOKEN_BUDGET / AI_DAILY_USD_BUDGET. Useful, but deceptive at a weekly level: seven days at 70% of daily quota each look fine on every individual day card, yet together they can exceed a weekly target that is tighter than 7× daily and add up to a month-end surprise.

New cards alongside the daily ones:

Weekly tokens used / budget: accumulated total since ISO-Monday-00:00 UTC. Resets on the first second of every Monday morning UTC, regardless of operator timezone, so the card is stable across regions and DST shifts.

Weekly USD used / budget: same window, priced via the same LLM_PRICING table that powers daily.

Both default to 7× the daily budget when AI_WEEKLY_TOKEN_BUDGET / AI_WEEKLY_USD_BUDGET aren't set explicitly, so the cards render meaningfully even with no operator config. Override them when the actual weekly target isn't 7× daily, for example when weekend usage runs lower than weekday usage.
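A minimal sketch of the weekly window and the 7× fallback described above. The function names are hypothetical, and treating an empty or non-numeric override as "not set" is an assumption, not confirmed behaviour of the dashboard.

```typescript
// Sketch only: function names are hypothetical, not the dashboard's actual code.

// Monday 00:00:00.000 UTC of the week containing `now`.
function startOfIsoWeekUtc(now: Date): Date {
  // getUTCDay() returns 0 for Sunday through 6 for Saturday; remap so Monday is 0.
  const daysSinceMonday = (now.getUTCDay() + 6) % 7;
  // Date.UTC normalizes day values below 1 into the previous month.
  return new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() - daysSinceMonday),
  );
}

// Weekly budget: the explicit override if it parses to a positive number, else 7x daily.
// Assumption: an empty or non-numeric override counts as "not set".
function resolveWeeklyBudget(weeklyRaw: string | undefined, dailyBudget: number): number {
  const parsed = weeklyRaw === undefined ? NaN : Number(weeklyRaw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : dailyBudget * 7;
}

// Friday 15 May 2026 falls in the week that started on Monday 11 May.
const weekStart = startOfIsoWeekUtc(new Date("2026-05-15T13:45:00Z"));
console.log(weekStart.toISOString()); // 2026-05-11T00:00:00.000Z
console.log(resolveWeeklyBudget(undefined, 1000000)); // 7000000
```

Because the window is computed purely from UTC fields, the reset moment is the same for every operator regardless of local timezone or DST.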

2 — Cached vs billable, separated

The "LLM Calls (24h)" tile had a credibility problem. Every cached embedding hit, every dedup-layer short-circuit, every Layer-2 page-fingerprint hit produced an analytics.llm_call_logs row with cachedResult=true. Those are wins — they're the savings the V3 dedup layers deliver — but they were padding the headline call count by ~30-50% on busy days, making the tile look like we were doing far more LLM work than we actually were.

Today's rename and split:

Tile is now "Billable Calls (24h)": strictly cachedResult=false rows. This is the number that corresponds to actual Moonshot/Kimi cost.

Sub-line below: "Cached: N · Total events: M" so the caching effect is visible at a glance. A growing gap between billable and total is a good thing: it means the dedup layers are doing their job.
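The split itself is a simple partition of the last 24 hours of log rows. A sketch under stated assumptions: `cachedResult` comes from the post, while the `createdAt` field, the row type, and the function name are hypothetical stand-ins for whatever the real query returns.

```typescript
// Sketch only: LlmCallLogRow, createdAt and summarizeCalls are hypothetical names.
interface LlmCallLogRow {
  cachedResult: boolean; // true when a cache or dedup layer answered the call
  createdAt: Date;       // assumed timestamp column
}

interface CallTileSummary {
  billable: number; // headline number: cachedResult=false rows
  cached: number;
  total: number;
  subline: string;
}

function summarizeCalls(rows: LlmCallLogRow[], now: Date): CallTileSummary {
  const cutoff = now.getTime() - 24 * 60 * 60 * 1000;
  let billable = 0;
  let cached = 0;
  for (const row of rows) {
    if (row.createdAt.getTime() < cutoff) continue; // outside the 24h window
    if (row.cachedResult) cached += 1;
    else billable += 1;
  }
  const total = billable + cached;
  return { billable, cached, total, subline: `Cached: ${cached} · Total events: ${total}` };
}
```

Keeping `total` in the sub-line means the old headline figure is still recoverable for comparison with earlier screenshots or reports.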

Combined with yesterday's V3 S2 + S4 (Layer-3 tender-content dedup, Layer-2 page-fingerprint dedup), the savings are now plain in the UI rather than implicit in the cost-delta math.

3 — Provider rate-limit panel

The thing that scared us most about scaling Hermes calls was hitting Moonshot's per-key rate limit without warning. Up to now we'd see 429s in the application logs and infer "we got close." A new ProviderRatelimitService lifts the `x-ratelimit-*` response headers off every Moonshot HTTP call (we already had the underlying axios interceptors; this just captures the existing data) and writes a snapshot to Redis keyed on (provider, model) with a 1h TTL.
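As a rough illustration of the capture step, the sketch below filters a response's headers down to the rate-limit ones and prepares the Redis write. It is not the actual ProviderRatelimitService: the key format `ratelimit:<provider>:<model>`, the snapshot shape, and the function names are all assumptions, and the Redis call itself is left out so the example stays self-contained.

```typescript
// Sketch only: key format, snapshot shape and function names are hypothetical.
const SNAPSHOT_TTL_SECONDS = 60 * 60; // 1h TTL, as described in the post

interface PendingSnapshotWrite {
  key: string;        // assumed key layout: ratelimit:<provider>:<model>
  ttlSeconds: number;
  value: string;      // JSON-encoded snapshot
}

// Keep only headers whose name starts with "x-ratelimit-" (case-insensitive).
function pickRateLimitHeaders(headers: Record<string, unknown>): Record<string, string> {
  const picked: Record<string, string> = {};
  for (const [headerName, headerValue] of Object.entries(headers)) {
    const lower = headerName.toLowerCase();
    if (lower.startsWith("x-ratelimit-") && headerValue !== undefined && headerValue !== null) {
      picked[lower] = String(headerValue);
    }
  }
  return picked;
}

// Returns null when the response carried no rate-limit headers, so nothing is written.
function buildSnapshotWrite(
  provider: string,
  model: string,
  headers: Record<string, unknown>,
  now: Date,
): PendingSnapshotWrite | null {
  const picked = pickRateLimitHeaders(headers);
  if (Object.keys(picked).length === 0) return null;
  const snapshot = { provider, model, capturedAt: now.toISOString(), headers: picked };
  return {
    key: `ratelimit:${provider}:${model}`,
    ttlSeconds: SNAPSHOT_TTL_SECONDS,
    value: JSON.stringify(snapshot),
  };
}
```

In a real service the returned object would feed a Redis SET with an expiry. Keying on (provider, model) means each new response overwrites the previous snapshot for that pair, so the panel always reads the most recent view.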

Surfaced at GET /v1/dashboard/intelligence/ratelimits:

Remaining requests in window (x-ratelimit-remaining-requests)

Remaining tokens in window (x-ratelimit-remaining-tokens)

Reset timer (x-ratelimit-reset-*, normalized to seconds-until-reset for client display)

Usage bar showing consumed / limit, colour-coded: green > 30% remaining, amber 10–30%, red < 10%
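The colour thresholds above can be sketched as a small pure function. The function and type names are hypothetical, and treating exactly 30% and exactly 10% as amber is one reading of the "10–30%" band rather than confirmed behaviour.

```typescript
// Sketch only: UsageBand and usageBand are hypothetical names.
type UsageBand = "green" | "amber" | "red";

// Classify by the share of the limit still remaining.
// Integer cross-multiplication avoids floating-point edge cases at the boundaries.
function usageBand(remaining: number, limit: number): UsageBand {
  if (!(limit > 0) || remaining < 0) {
    throw new RangeError("limit must be positive and remaining non-negative");
  }
  if (remaining * 10 > limit * 3) return "green"; // more than 30% remaining
  if (remaining * 10 < limit) return "red";       // less than 10% remaining
  return "amber";                                 // 10% to 30% inclusive (assumed)
}

console.log(usageBand(450, 500)); // green
console.log(usageBand(100, 500)); // amber
console.log(usageBand(20, 500));  // red
```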

Wired into three call paths:

backend/.../kimi.service: direct Kimi calls (match, summary, keyword expansion).

backend/.../embedding.service: embedding requests.

Forwarded from Hermes HTTP fallback via /analyze-site → relearn.processor, so site-learner calls show up in the panel even though they originate in Hermes.

One honest gap to call out: CLI subprocess paths (Kimi Code CLI) don't surface response headers, so those calls produce no rate-limit snapshots. The panel only reflects HTTP-fallback usage. CLI quota is subscription-billed rather than per-token, so this is consistent: the panel is about per-token API limits, which only the HTTP path can hit.

Why these three together

Each of these is a piece of the same picture: before, we could only see what AI spend had happened. After today, we can see what's about to happen:

Weekly cards catch trends across days that look fine individually.

Cached/billable split shows whether our caching layers are actually saving spend, not just generating telemetry.

Rate-limit panel warns us before a 429 storm rather than after.

Combined with yesterday's site-health audit + auto-relearn, the operator dashboard is now closer to a system you can run without watching it constantly. That's the bar we've been trying to clear.

Status & what's next

Backend code shipped tonight. Dashboard UI shipped in the same commit (a single Intelligence-tab update). No data migration — analytics.llm_call_logs already had the columns we needed; the rate-limit Redis keys self-populate on first call after deploy.

What's queued next:

V3 S7: admin surfaces in web + iOS for the site-health audit verdicts and the incident timeline. Backend endpoints are live; just need the views.

V3 S8: docs sweep. AGENTS.md, SYSTEM-STATE.md, and HERMES_PIPELINE.md all need to cover V3 flags that didn't exist when those docs were last updated.