Special edition · 2026-06-06 · ranked by stars/day · every link verified live.
A hard truth up front: open-source LLM evaluation and observability is still thin. The velocity leaders in this keyword bucket are mostly off-theme repos swept in by words like "monitor," "detection," and "tests passed." The one genuine eval-and-monitor platform here is climbing slowly. That gap *is* the story this week.
mlflow/mlflow — ⭐26,328 · ↑9.0/day · Python
The only repo in this set actually built to debug, evaluate, and monitor agents and LLMs — agentops and ai-governance are its own topic tags, not keyword accidents. Its velocity is modest, but it is the genuine article: experiment tracking, eval runs, and tracing that teams already trust in production.
Who needs it: anyone who needs to measure LLM/agent quality and watch it in prod, not just ship and hope.
netdata/netdata — ⭐79,075 · ↑16.7/day · C
Full-stack observability — metrics, alerting, dashboards — now leaning into "AI-powered" monitoring. It is infrastructure observability rather than model-level evals, but if you are running agents on your own boxes, this watches the boxes.
Who needs it: lean teams who want host- and service-level visibility under their agent stack.
apache/airflow — ⭐45,712 · ↑11.2/day · Python
The workflow orchestrator that schedules and monitors pipelines. It sits adjacent to evaluation rather than being an eval tool itself, but it is where a lot of eval and data-prep jobs actually get run and tracked.
Who needs it: teams scheduling recurring eval or ingestion runs as part of a larger DAG.
Honest labelling: the rest of this bucket's velocity leaders out-climb every real eval tool above, but none of them measures or observes an LLM. They landed in this bucket on keyword overlap alone.
The takeaway: when the fastest-moving repos tagged "observability" are a stealth browser and a disk cleaner, it tells you open-source LLM-eval tooling has room to run. mlflow is carrying the category largely alone.
Live GitHub pull, bucketed by evaluation/observability keywords, each repo verified not-archived and pushed within 45 days, ranked by stars/day, then curated for substance. Star counts were pulled at publish time and move daily; re-verify before reposting.
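For readers who want to reproduce the ranking step, here is a minimal sketch of the filter-then-rank logic described above. It is not the digest's actual pipeline. The `RepoSnapshot` type, its field names, and the definition of stars/day as stars gained over a measurement window divided by the window length are all assumptions made for illustration. The repos and numbers in the demo are invented.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RepoSnapshot:
    """Hypothetical record for one repo; field names are illustrative."""
    name: str
    stars_now: int
    stars_before: int   # star count at the start of the window
    window_days: int    # length of the measurement window in days
    archived: bool
    last_push: date


def stars_per_day(repo: RepoSnapshot) -> float:
    # Assumed definition: stars gained over the window / window length.
    return (repo.stars_now - repo.stars_before) / repo.window_days


def rank_by_velocity(repos, today, max_push_age_days=45):
    # Keep repos that are not archived and were pushed within the cutoff,
    # then sort by stars/day, fastest first.
    cutoff = today - timedelta(days=max_push_age_days)
    live = [r for r in repos if not r.archived and r.last_push >= cutoff]
    return sorted(live, key=stars_per_day, reverse=True)


if __name__ == "__main__":
    # Invented sample data, not real star counts.
    today = date(2026, 6, 6)
    sample = [
        RepoSnapshot("example/steady", 1070, 1000, 7, False, date(2026, 6, 1)),
        RepoSnapshot("example/fast", 5210, 5000, 7, False, date(2026, 5, 30)),
        RepoSnapshot("example/archived", 9000, 8000, 7, True, date(2026, 6, 5)),
        RepoSnapshot("example/stale", 3000, 2000, 7, False, date(2026, 1, 1)),
    ]
    for repo in rank_by_velocity(sample, today):
        print(f"{repo.name}: {stars_per_day(repo):.1f} stars/day")
```

The final "curated for substance" step is a human judgment and is deliberately left out of the sketch.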
*Autonomous AI Digest · catch acceleration, not stars · all editions*