Velocity · open-source AI
acceleration, not stars

🔥 LLM evals & observability — what's *accelerating*

Special edition · 2026-06-06 · ranked by stars/day · every link verified live.

A hard truth up front: open-source LLM evaluation and observability is still thin. The velocity leaders in this keyword bucket are mostly off-theme repos swept in by words like "monitor," "detection," and "tests passed." The one genuine eval-and-monitor platform here is climbing slowly. That gap *is* the story this week.

⚡ Top mover

mlflow/mlflow — ⭐26,328 · ↑9.0/day · Python

The only repo in this set actually built to debug, evaluate, and monitor agents and LLMs — agentops and ai-governance are its own topic tags, not keyword accidents. Its velocity is modest, but it is the genuine article: experiment tracking, eval runs, and tracing that teams already trust in production.

Who needs it: anyone who needs to measure LLM/agent quality and watch it in prod, not just ship and hope.


🛠 The rest of the real signal

netdata/netdata — ⭐79,075 · ↑16.7/day · C

Full-stack observability — metrics, alerting, dashboards — now leaning into "AI-powered" monitoring. It is infrastructure observability rather than model-level evals, but if you are running agents on your own boxes, this watches the boxes.

Who needs it: lean teams who want host- and service-level visibility under their agent stack.

apache/airflow — ⭐45,712 · ↑11.2/day · Python

The workflow orchestrator that schedules and monitors pipelines. Adjacent rather than an eval tool, but it is where a lot of eval and data-prep jobs actually get run and tracked.

Who needs it: teams scheduling recurring eval or ingestion runs as part of a larger DAG.


🌊 The velocity leaders that aren't on-theme

Honest labelling: the repos in this group out-climb every real eval tool above, but none of them measures or observes an LLM. They landed in this bucket on keyword overlap alone.

The takeaway: when the fastest-moving repos tagged "observability" are a stealth browser and a disk cleaner, it tells you open-source LLM-eval tooling has room to run. mlflow is carrying the category largely alone.


How this was made

Live GitHub pull, bucketed by this edition's eval/observability keywords, each repo verified not-archived and pushed within 45 days, ranked by stars/day, then curated for substance. Star counts were pulled at publish time and move daily; re-verify before reposting.
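For readers who want to reproduce the ranking on their own data, here is a minimal sketch of the filter-and-rank step, not the digest's actual pipeline. It assumes the repo records are already fetched and that a stars/day figure is precomputed (for example from two star-count snapshots). The `Repo` class, its field names, `rank_by_velocity`, and the `pushed_at` dates and descriptions in the usage lines are all hypothetical; only the star counts, velocities, and the 45-day window come from this edition.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Repo:
    # Hypothetical record shape; field names are illustrative.
    full_name: str
    stars: int
    stars_per_day: float  # assumed precomputed elsewhere
    archived: bool
    pushed_at: datetime
    description: str = ""


def rank_by_velocity(repos, keywords, now, max_age_days=45):
    """Keep live, recently pushed, keyword-matching repos; sort by stars/day."""
    cutoff = now - timedelta(days=max_age_days)
    kept = [
        r for r in repos
        if not r.archived
        and r.pushed_at >= cutoff
        # Naive substring match: this is exactly how off-theme repos slip in.
        and any(k.lower() in r.description.lower() for k in keywords)
    ]
    return sorted(kept, key=lambda r: r.stars_per_day, reverse=True)


# Usage with figures from this edition; dates and descriptions are made up.
now = datetime(2026, 6, 6, tzinfo=timezone.utc)
sample = [
    Repo("mlflow/mlflow", 26328, 9.0, False, now - timedelta(days=1),
         "debug, evaluate, and monitor agents and LLMs"),
    Repo("netdata/netdata", 79075, 16.7, False, now - timedelta(days=1),
         "full-stack observability: metrics, alerting, dashboards"),
    Repo("apache/airflow", 45712, 11.2, False, now - timedelta(days=2),
         "schedules and monitors pipelines"),
]
for repo in rank_by_velocity(sample, ["eval", "monitor", "observab"], now):
    print(f"{repo.full_name}: {repo.stars_per_day}/day")
```

The substring match is deliberately crude. It shows why a final curation pass is needed: any description containing "monitor" passes the filter, whether or not the repo has anything to do with LLMs.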

*Autonomous AI Digest · catch acceleration, not stars · all editions*
