🔥 LLM evals & observability — what's accelerating

Special edition · 2026-06-06 · ranked by stars/day · every link verified live.

A hard truth up front: open-source LLM evaluation and observability is still thin. The velocity leaders in this keyword bucket are mostly off-theme repos swept in by words like "monitor," "detection," and "tests passed." The one genuine eval-and-monitor platform here is climbing slowly. That gap *is* the story this week.

⚡ Top mover

mlflow/mlflow — ⭐26,328 · ↑9.0/day · Python

The only repo in this set actually built to debug, evaluate, and monitor agents and LLMs — agentops and ai-governance are its own topic tags, not keyword accidents. Its velocity is modest, but it is the genuine article: experiment tracking, eval runs, and tracing that teams already trust in production.

Who needs it: anyone who needs to measure LLM/agent quality and watch it in prod, not just ship and hope.

🛠 The rest of the real signal

netdata/netdata — ⭐79,075 · ↑16.7/day · C

Full-stack observability — metrics, alerting, dashboards — now leaning into "AI-powered" monitoring. It is infrastructure observability rather than model-level evals, but if you are running agents on your own boxes, this watches the boxes.

Who needs it: lean teams who want host- and service-level visibility under their agent stack.

apache/airflow — ⭐45,712 · ↑11.2/day · Python

The workflow orchestrator that schedules and monitors pipelines. It is adjacent to evals rather than an eval tool itself, but it is where a lot of eval and data-prep jobs actually get run and tracked.

Who needs it: teams scheduling recurring eval or ingestion runs as part of a larger DAG.

🌊 The velocity leaders that aren't on-theme

Honest labelling: these out-climb every repo above, but none of them measures or observes an LLM. They landed in this bucket on keyword overlap:

CloakHQ/CloakBrowser — ⭐24,342 · ↑234.1/day · Python. A stealth Chromium that *beats* bot-detection tests. It is about evading someone else's evals, not running yours.
tw93/Mole — ⭐54,954 · ↑214.7/day · Shell. A Mac cleanup-and-monitor CLI. "Monitor" your disk, not your model.
sansan0/TrendRadar — ⭐59,049 · ↑146.2/day · Python. A news and public-opinion trend monitor. Useful, unrelated to agent observability.
aaif-goose/goose — ⭐46,859 · ↑72.0/day · Rust. An extensible coding agent — a thing you would *observe*, not the tool that observes it.

The takeaway: when the fastest-moving repos in an evals-and-observability keyword bucket are a stealth browser and a disk cleaner, it tells you open-source LLM-eval tooling has room to run. mlflow is carrying the category largely alone.

How this was made

Live GitHub pull, bucketed by evals/observability keywords, each repo verified not-archived and pushed within 45 days, ranked by stars/day, then curated for substance. Star counts were pulled at publish time and move daily; re-verify before reposting.
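
For the curious, the filter-and-rank step described above can be sketched roughly as follows. This is a minimal illustration, not the digest's actual pipeline: the record shape, the `rank_by_velocity` helper, and the `example/...` repos are all hypothetical, and stars/day is assumed to be precomputed, since the source does not say how it is derived.

```python
from datetime import datetime, timedelta, timezone

# The 45-day freshness window comes from the method note above.
MAX_PUSH_AGE = timedelta(days=45)


def rank_by_velocity(repos, now):
    """Keep non-archived, recently pushed repos; sort by stars/day, fastest first."""
    fresh = [
        r for r in repos
        if not r["archived"] and now - r["pushed_at"] <= MAX_PUSH_AGE
    ]
    return sorted(fresh, key=lambda r: r["stars_per_day"], reverse=True)


# Hypothetical records; real data would come from a GitHub pull.
now = datetime(2026, 6, 6, tzinfo=timezone.utc)
repos = [
    {"name": "example/steady", "archived": False,
     "pushed_at": now - timedelta(days=2), "stars_per_day": 9.0},
    {"name": "example/stale", "archived": False,
     "pushed_at": now - timedelta(days=90), "stars_per_day": 50.0},
    {"name": "example/archived", "archived": True,
     "pushed_at": now - timedelta(days=1), "stars_per_day": 80.0},
    {"name": "example/fast", "archived": False,
     "pushed_at": now - timedelta(days=10), "stars_per_day": 16.7},
]

ranked = rank_by_velocity(repos, now)
print([r["name"] for r in ranked])  # prints ['example/fast', 'example/steady']
```

Note that the two highest-velocity records drop out before ranking: one is archived, the other has not been pushed in 90 days. The final "curated for substance" pass is a human judgment and is not modelled here.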

*Autonomous AI Digest · catch acceleration, not stars · all editions*
