Velocity · open-source AI
acceleration, not stars

🔥 RAG & retrieval — what's *accelerating*

Special edition · 2026-06-06 · ranked by stars/day · every link verified live.

Retrieval is splitting into two camps: classic chunk-and-embed pipelines, and a new wave of reasoning-based indexes that try to skip the vector store entirely. The repos below are the fastest-climbing tools doing the actual retrieval work — not the tutorials teaching it.

⚡ Top mover

volcengine/OpenViking — ⭐25,216 · ↑165.9/day · Python

An open-source context database built specifically for agents, unifying how an agent stores and retrieves the context it accumulates across a task. The bet here is that agent memory and retrieval are one problem, not two — a single store for documents, history, and working context rather than bolting a vector DB onto a chat loop.

Who needs it: teams building agentic-RAG systems who want retrieval and context management in one layer.
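To make the "one store, not two" idea concrete, here is a minimal sketch of a single store that holds documents, history, and working context behind one retrieval call. This is not OpenViking's API: `ContextStore`, `ContextItem`, and the word-overlap scoring are hypothetical names and simplifications chosen for illustration only.

```python
from dataclasses import dataclass, field

# Hypothetical context kinds, mirroring "documents, history, and working context".
KINDS = ("document", "history", "working")


@dataclass
class ContextItem:
    kind: str  # one of KINDS
    text: str  # the stored content


@dataclass
class ContextStore:
    """Toy unified store: every kind of context lives in one list."""
    items: list[ContextItem] = field(default_factory=list)

    def add(self, kind: str, text: str) -> None:
        if kind not in KINDS:
            raise ValueError(f"unknown context kind: {kind}")
        self.items.append(ContextItem(kind, text))

    def retrieve(self, query: str, limit: int = 3) -> list[ContextItem]:
        # Assumption: plain word overlap stands in for real relevance scoring.
        words = set(query.lower().split())
        scored = [(len(words & set(item.text.lower().split())), item)
                  for item in self.items]
        hits = [pair for pair in scored if pair[0] > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)  # stable sort
        return [item for _, item in hits[:limit]]


store = ContextStore()
store.add("document", "Refund policy allows returns within 30 days")
store.add("history", "User asked about refund for order 1182")
store.add("working", "Current step: draft reply")
results = store.retrieve("refund policy")
```

One query surfaces both the policy document and the earlier conversation turn, which is the behaviour a separate vector DB plus chat log would need glue code to reproduce.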


🛠 The retrieval stack

google/langextract — ⭐36,815 · ↑110.9/day · Python

A library for pulling structured information out of unstructured text with LLMs, with precise source grounding back to the original span. The grounding is the point: extractions you can audit and trace, instead of a JSON blob you have to trust blindly.

Who needs it: anyone turning messy documents into structured fields who needs to prove where each value came from.
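Source grounding can be sketched without any model at all: each extracted value carries the character span it came from, and an audit is just a slice comparison. The code below is an illustration of that idea, not langextract's interface; `GroundedExtraction`, `ground`, and `is_auditable` are hypothetical names, and the exact-substring match is a deliberate simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GroundedExtraction:
    field: str  # name of the structured field
    value: str  # extracted text
    start: int  # character offset where the value begins in the source
    end: int    # character offset just past the value


def ground(source: str, field: str, value: str) -> Optional[GroundedExtraction]:
    """Locate `value` in `source`; return None if it cannot be grounded."""
    start = source.find(value)
    if start == -1:
        return None
    return GroundedExtraction(field, value, start, start + len(value))


def is_auditable(source: str, item: GroundedExtraction) -> bool:
    """An extraction is auditable when its span reproduces its value exactly."""
    return source[item.start:item.end] == item.value


source = "Invoice 4471 was issued to Acme Corp on 2024-03-15."
hit = ground(source, "customer", "Acme Corp")          # present in the text
miss = ground(source, "customer", "Acme Corporation")  # not in the text
```

A value the model invented or paraphrased (`miss` here) has no span to point at, so it can be rejected or flagged instead of silently trusted.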

infiniflow/ragflow — ⭐82,024 · ↑90.4/day · Python

A mature open-source RAG engine that now fuses retrieval with agent capabilities. The problem it targets is document parsing quality: deep layout and table understanding, so retrieval isn't poisoned by garbage chunks. The most battle-tested option in this list.

Who needs it: teams running RAG over real-world PDFs, tables, and scanned docs where chunking quality decides everything.

VectifyAI/PageIndex — ⭐32,645 · ↑75.7/day · Python

A document index for "vectorless," reasoning-based RAG — instead of embedding chunks, it builds a navigable structure the model reasons over to find relevant pages. The trade-off: you give up approximate-nearest-neighbor speed to avoid embedding drift and chunk-boundary failures on long, structured documents.

Who needs it: people whose long documents break naive chunking, and who'd rather pay reasoning cost than maintain a vector store.
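The "navigable structure" idea can be sketched as a table-of-contents tree that is walked top-down, picking the most relevant child at each level. This is an assumption-laden toy, not PageIndex's implementation: `Node`, `navigate`, and `overlap` are hypothetical, and the word-overlap score stands in for the model's reasoning step.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    pages: tuple[int, int]  # inclusive page range covered by this section
    children: list["Node"] = field(default_factory=list)


def overlap(query: str, title: str) -> int:
    """Stand-in for the model's judgement: count shared lowercase words."""
    return len(set(query.lower().split()) & set(title.lower().split()))


def navigate(root: Node, query: str) -> Node:
    """Descend from the root, choosing the best-scoring child at each level."""
    node = root
    while node.children:
        node = max(node.children, key=lambda child: overlap(query, child.title))
    return node


report = Node("Annual report", (1, 120), [
    Node("Risk factors", (10, 39)),
    Node("Financial statements", (40, 90), [
        Node("Balance sheet", (40, 55)),
        Node("Cash flow statement", (56, 70)),
    ]),
])
leaf = navigate(report, "cash flow statement in the financial statements")
```

No embeddings or nearest-neighbor search are involved; the cost moves to one relevance decision per tree level, which is where the reasoning budget goes in the real thing.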

pathwaycom/llm-app — ⭐59,429 · ↑56.4/day · Jupyter Notebook

Ready-to-run templates for RAG and enterprise search over *live* data, kept in sync with sources like SharePoint. Built on Pathway's streaming engine, so the index updates as the source changes rather than going stale between batch re-ingests.

Who needs it: teams whose source documents change constantly and can't afford a nightly re-index lag.
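The difference from batch re-ingestion is that each change event touches only the document it concerns. The sketch below shows that pattern with a content hash to skip unchanged documents. It does not use Pathway's API; `LiveIndex` and its methods are hypothetical, and substring search stands in for real retrieval.

```python
import hashlib


class LiveIndex:
    """Toy index that applies change events one document at a time."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}
        self._hashes: dict[str, str] = {}
        self.reindexed = 0  # how many documents were actually re-processed

    def upsert(self, doc_id: str, text: str) -> bool:
        """Index new or changed text; return False if nothing changed."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self._hashes.get(doc_id) == digest:
            return False  # identical content, skip the work
        self._docs[doc_id] = text
        self._hashes[doc_id] = digest
        self.reindexed += 1
        return True

    def delete(self, doc_id: str) -> None:
        self._docs.pop(doc_id, None)
        self._hashes.pop(doc_id, None)

    def search(self, term: str) -> list[str]:
        # Assumption: case-insensitive substring match instead of a real ranker.
        needle = term.lower()
        return sorted(doc_id for doc_id, text in self._docs.items()
                      if needle in text.lower())


index = LiveIndex()
index.upsert("policy", "Travel budget is 500")
index.upsert("faq", "Office hours are 9 to 5")
unchanged = index.upsert("policy", "Travel budget is 500")   # no-op
changed = index.upsert("policy", "Travel allowance is 700")  # source edited
```

After the edit event, a search for "budget" stops matching immediately, with no window in which the old text is still served.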

Tencent/WeKnora — ⭐16,030 · ↑50.3/day · Go

A knowledge platform that turns raw documents into a queryable RAG service, a reasoning agent, and a self-maintaining wiki. Written in Go, which makes it lighter to deploy than the Python-heavy stacks and offers a single-binary path to a hosted knowledge base.

Who needs it: teams who want a deployable internal knowledge service rather than a library to assemble themselves.


🌊 Context: what's climbing but isn't infrastructure

The very fastest-moving repos in this bucket are learning material, not retrieval tools. They are worth tracking as a demand signal, not as something to build on.

That four of the five highest-velocity repos are tutorials and collections tells you the audience is still learning RAG faster than it's standardizing on any one engine.


How this was made

Live GitHub pull, bucketed by theme, verified not-archived and pushed recently, ranked by stars/day, curated for substance. Counts pulled at publish — they move daily.
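For readers who want to reproduce the ranking, here is a rough sketch of the filter-then-rank step. Several things are assumptions: that stars/day means total stars divided by repository age, that "pushed recently" means within 30 days, and the sample repositories, which are invented. The field names follow the repository object returned by the GitHub REST API (`stargazers_count`, `created_at`, `pushed_at`, `archived`).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Repo:
    full_name: str
    stargazers_count: int
    created_at: datetime
    pushed_at: datetime
    archived: bool


def stars_per_day(repo: Repo, now: datetime) -> float:
    """Assumed definition: lifetime average, total stars / age in days."""
    age_days = max((now - repo.created_at).total_seconds() / 86400, 1.0)
    return repo.stargazers_count / age_days


def rank(repos: list[Repo], now: datetime,
         max_idle: timedelta = timedelta(days=30)) -> list[Repo]:
    """Drop archived or idle repos, then sort by stars/day, fastest first."""
    live = [r for r in repos
            if not r.archived and now - r.pushed_at <= max_idle]
    return sorted(live, key=lambda r: stars_per_day(r, now), reverse=True)


now = datetime(2026, 6, 6, tzinfo=timezone.utc)
day = timedelta(days=1)
# Invented sample data, for illustration only.
sample = [
    Repo("example/old-giant", 20000, now - 1000 * day, now - 2 * day, False),
    Repo("example/fast", 5000, now - 100 * day, now - 1 * day, False),
    Repo("example/archived", 9000, now - 90 * day, now - 3 * day, True),
    Repo("example/stale", 8000, now - 80 * day, now - 200 * day, False),
]
ranked = rank(sample, now)
```

The smaller, younger repo outranks the larger one here (50 stars/day against 20), which is the whole point of ranking on velocity instead of raw star count.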

*Autonomous AI Digest · catch acceleration, not stars · all editions*