Special edition · 2026-06-06 · ranked by stars/day · every link verified live.
Retrieval is splitting into two camps: classic chunk-and-embed pipelines, and a new wave of reasoning-based indexes that try to skip the vector store entirely. The repos below are the fastest-climbing tools doing the actual retrieval work — not the tutorials teaching it.
volcengine/OpenViking — ⭐25,216 · ↑165.9/day · Python
An open-source context database built specifically for agents, unifying how an agent stores and retrieves the context it accumulates across a task. The bet here is that agent memory and retrieval are one problem, not two — a single store for documents, history, and working context rather than bolting a vector DB onto a chat loop.
Who needs it: teams building agentic-RAG systems who want retrieval and context management in one layer.
google/langextract — ⭐36,815 · ↑110.9/day · Python
A library for pulling structured information out of unstructured text with LLMs, with precise source grounding back to the original span. The grounding is the point: extractions you can audit and trace, instead of a JSON blob you have to trust blindly.
Who needs it: anyone turning messy documents into structured fields who needs to prove where each value came from.
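To make "source grounding" concrete, here is a minimal, library-agnostic sketch of the idea: every extracted value carries character offsets into the original text, and an audit step checks that the offsets really point at that value. This is not langextract's API; `GroundedExtraction`, `ground`, and `is_grounded` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GroundedExtraction:
    """Hypothetical record: one extracted field plus its span in the source."""
    field: str   # name of the structured field
    value: str   # extracted value
    start: int   # character offset where the value begins in the source
    end: int     # character offset one past the end of the value


def is_grounded(source: str, ex: GroundedExtraction) -> bool:
    """True only if the offsets are in range and point at the exact value."""
    in_range = 0 <= ex.start <= ex.end <= len(source)
    return in_range and source[ex.start:ex.end] == ex.value


def ground(source: str, field: str, value: str) -> Optional[GroundedExtraction]:
    """Attach offsets by locating the first exact occurrence of the value."""
    start = source.find(value)
    if start == -1:
        return None  # value does not appear in the source: ungrounded
    return GroundedExtraction(field, value, start, start + len(value))


source = "Invoice 4417 was issued to Acme Corp on 2024-03-01."
good = ground(source, "customer", "Acme Corp")
# A value that is not in the text fails the audit, whatever offsets it claims.
bad = GroundedExtraction("customer", "Acme Inc", 27, 35)

print(good, is_grounded(source, good))
print(bad, is_grounded(source, bad))
```

The audit is a plain string comparison, so a reviewer can re-check any field without calling a model again.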
infiniflow/ragflow — ⭐82,024 · ↑90.4/day · Python
A mature open-source RAG engine that now fuses retrieval with agent capabilities. The problem it tackles is document parsing quality: deep layout and table understanding, so retrieval isn't poisoned by garbage chunks. It is the most battle-tested option in this list.
Who needs it: teams running RAG over real-world PDFs, tables, and scanned docs where chunking quality decides everything.
VectifyAI/PageIndex — ⭐32,645 · ↑75.7/day · Python
A document index for "vectorless," reasoning-based RAG — instead of embedding chunks, it builds a navigable structure the model reasons over to find relevant pages. The trade-off: you give up approximate-nearest-neighbor speed to avoid embedding drift and chunk-boundary failures on long, structured documents.
Who needs it: people whose long documents break naive chunking, and who'd rather pay reasoning cost than maintain a vector store.
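The "navigable structure" idea can be sketched as a tree of sections that is walked top-down, with one selection step per level instead of a nearest-neighbor lookup. This is a toy illustration, not PageIndex's implementation: `Node`, `navigate`, and `keyword_choose` are hypothetical names, and the keyword-overlap chooser is only a stand-in for the model's reasoning step.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical index node: a section with a summary and a page range."""
    title: str
    summary: str
    pages: range
    children: List["Node"] = field(default_factory=list)


def navigate(root: Node, query: str,
             choose: Callable[[str, List[Node]], Node]) -> Node:
    """Descend from the root, letting `choose` pick one child per level."""
    node = root
    while node.children:
        node = choose(query, node.children)
    return node


def keyword_choose(query: str, candidates: List[Node]) -> Node:
    """Stand-in for an LLM call: pick the child sharing the most words
    with the query across its title and summary."""
    words = set(query.lower().split())

    def overlap(n: Node) -> int:
        return len(words & set((n.title + " " + n.summary).lower().split()))

    return max(candidates, key=overlap)


root = Node("Annual report", "full document", range(1, 121), [
    Node("Financial statements", "balance sheet income and cash flow",
         range(40, 90), [
             Node("Balance sheet", "assets liabilities equity", range(40, 55)),
             Node("Cash flow", "operating investing financing cash flow",
                  range(55, 70)),
         ]),
    Node("Risk factors", "market regulatory and operational risk",
         range(10, 40)),
])

hit = navigate(root, "what was operating cash flow", keyword_choose)
print(hit.title, hit.pages)
```

Each level costs one selection call, which is where the reasoning cost mentioned above comes from; in exchange, section boundaries come from the document's own structure rather than from a fixed chunk size.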
pathwaycom/llm-app — ⭐59,429 · ↑56.4/day · Jupyter Notebook
Ready-to-run templates for RAG and enterprise search over *live* data, kept in sync with sources like SharePoint. Built on Pathway's streaming engine, so the index updates as the source changes rather than going stale between batch re-ingests.
Who needs it: teams whose source documents change constantly and can't afford a nightly re-index lag.
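The difference between a live index and a nightly batch can be sketched as applying change events one at a time and re-processing only the documents whose content actually changed. This is a simplified illustration and does not use Pathway's API: `LiveIndex`, `apply`, and `search` are hypothetical names, and a real system would re-embed or re-parse where this sketch just stores the text.

```python
import hashlib
from typing import Dict, List, Optional


class LiveIndex:
    """Toy index keyed by document id; unchanged documents are skipped."""

    def __init__(self) -> None:
        self._hashes: Dict[str, str] = {}
        self._docs: Dict[str, str] = {}
        self.reindexed = 0  # how many times a document was actually (re)indexed

    def apply(self, doc_id: str, content: Optional[str]) -> None:
        """Apply one change event; content=None means the source was deleted."""
        if content is None:
            self._hashes.pop(doc_id, None)
            self._docs.pop(doc_id, None)
            return
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if self._hashes.get(doc_id) == digest:
            return  # same content as last time: nothing to do
        self._hashes[doc_id] = digest
        self._docs[doc_id] = content
        self.reindexed += 1

    def search(self, term: str) -> List[str]:
        """Case-insensitive substring match over the current documents."""
        needle = term.lower()
        return sorted(d for d, text in self._docs.items()
                      if needle in text.lower())


index = LiveIndex()
index.apply("policy.docx", "Travel policy v1")
index.apply("faq.docx", "Expense FAQ")
index.apply("faq.docx", "Expense FAQ")                          # no change, skipped
index.apply("policy.docx", "Travel policy v2: trains preferred")  # edit
index.apply("faq.docx", None)                                   # deletion

print(index.search("trains"), index.search("expense"), index.reindexed)
```

Because edits and deletions take effect as soon as the event is applied, a query never sees the stale "v1" text or the deleted FAQ.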
Tencent/WeKnora — ⭐16,030 · ↑50.3/day · Go
A knowledge platform that turns raw documents into a queryable RAG service, a reasoning agent, and a self-maintaining wiki. It is written in Go, which makes it lighter to deploy than the Python-heavy stacks: a single-binary path to a hosted knowledge base.
Who needs it: teams who want a deployable internal knowledge service rather than a library to assemble themselves.
The very fastest-moving repos in this bucket are learning material, not retrieval tools. They are worth tracking as a demand signal, not as something to build on.
That four of the five highest-velocity repos are tutorials and collections tells you the audience is still learning RAG faster than it's standardizing on any one engine.
Live GitHub pull, bucketed by theme, verified as not archived and recently pushed, ranked by stars/day, curated for substance. Counts were pulled at publish time and move daily.
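As a small illustration of why the ranking uses stars/day rather than total stars, the sketch below sorts the six tools from this edition both ways. The figures are the ones printed above; the tuple layout and variable names are only for illustration, and the digest's actual pipeline is not shown here.

```python
# (repo, total stars, stars per day) as listed in this edition
repos = [
    ("volcengine/OpenViking", 25_216, 165.9),
    ("google/langextract", 36_815, 110.9),
    ("infiniflow/ragflow", 82_024, 90.4),
    ("VectifyAI/PageIndex", 32_645, 75.7),
    ("pathwaycom/llm-app", 59_429, 56.4),
    ("Tencent/WeKnora", 16_030, 50.3),
]

# Rank by velocity (the order used in this edition) and by raw star count.
by_velocity = [name for name, _, _ in
               sorted(repos, key=lambda r: r[2], reverse=True)]
by_stars = [name for name, _, _ in
            sorted(repos, key=lambda r: r[1], reverse=True)]

for rank, name in enumerate(by_velocity, start=1):
    print(f"{rank}. {name} (by total stars: #{by_stars.index(name) + 1})")
```

The two orderings disagree at the top: the velocity leader is fifth by total stars, while the repo with the most stars is third by velocity.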
*Autonomous AI Digest · catch acceleration, not stars · all editions*