What is PageIndex and how is it different from traditional RAG?

PageIndex is a vectorless, reasoning-based RAG system. Instead of splitting a document into chunks and searching an embedding space, it builds a hierarchical tree (an LLM-optimized table of contents) and lets an LLM reason over the structure to pick relevant sections, then fetches only those pages' raw text.

Does PageIndex use vector embeddings or a vector database?

No. PageIndex uses zero embeddings and no vector database. Relevance is decided by LLM reasoning over the document hierarchy, not by numeric similarity scores.

How accurate is PageIndex?

The project's authors report 98.7% accuracy on the FinanceBench benchmark for financial document QA, achieved by a system built on PageIndex, and say this significantly outperforms traditional vector-based RAG on complex, structured documents.

When should I NOT use PageIndex?

For huge corpora of short, unstructured documents, or low-latency lookups where one extra LLM round trip is too expensive, classic vector search is usually cheaper and faster. PageIndex shines on long, structured, professional documents where reasoning beats similarity.

What file formats does PageIndex support?

PDFs and Markdown. PDFs are parsed into pages with token counts and a table of contents is detected or generated; Markdown is parsed directly from heading levels into a tree.

Is PageIndex explainable?

Yes. Because every retrieval is a traceable path through named sections with page ranges, you can audit exactly why the system navigated to a particular part of the document — useful for legal and financial use cases.

PageIndex: The Vectorless RAG That Reasons Through Documents Instead of Embedding Them

TL;DR

Traditional RAG splits documents into arbitrary chunks, embeds them, and retrieves by semantic similarity — but similarity ≠ relevance.
PageIndex is a vectorless, reasoning-based RAG system: it builds a hierarchical tree index (an LLM-friendly table of contents) and lets an LLM reason its way to the right section. Source: VectifyAI/PageIndex
No embeddings. No vector DB. No fixed-size chunks. No top-K cosine search.
The project's authors report 98.7% accuracy on FinanceBench, a hard financial-document QA benchmark, for a system built on PageIndex. Source: VectifyAI/PageIndex
The pipeline is two halves: ingestion (PDF/Markdown → tree JSON) and retrieval (LLM navigates the tree, then fetches only the pages it needs).

Why traditional RAG falls apart on long documents

You already know the standard RAG recipe: split a document into ~500-token chunks, embed every chunk into a vector, stuff them into a vector database, then at query time embed the question and pull the top-K nearest neighbors.
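To make that recipe concrete, here is a minimal sketch of top-K retrieval in plain Python. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and a sorted list stands in for the vector database; both are assumptions made purely for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by similarity to the question and keep the best k."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "This paper introduces a clinical trial of drug X",
    "Primary endpoint: mortality fell by 12 percent",
    "Funding was provided by the institute",
]
best = top_k("what were the clinical trial results", chunks, k=1)
print(best[0])
```

In this toy run the top hit is the introductory sentence: it shares the words "clinical" and "trial" with the question, while the sentence that actually reports a result shares none.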

It works fine on FAQs and blog posts. It quietly breaks on 10-Ks, contracts, regulatory filings, research papers, and technical manuals. Here's why:

Chunking destroys structure. A document is naturally hierarchical — chapters, sections, subsections. Splitting it into fixed windows shreds that hierarchy, and ideas that span several paragraphs get cut in half.
Similarity ≠ relevance. Vector search returns text that sounds like your question. A query about clinical-trial results can pull text from the introduction just because the vocabulary overlaps. Source: DEV Community
No cross-referencing. When a section says "see Appendix G for the full table," a cosine-similarity retriever has no idea how to follow that pointer. Source: PageIndex Blog
Infrastructure tax. You're now running an embedding model, a vector store, and an indexing pipeline — plus tuning chunk size and overlap forever.

⚠️ The core problem: What you actually want from retrieval is relevance, and relevance requires reasoning. Cosine distance can't reason.

That's the gap PageIndex is built to close.

How PageIndex works: reason, don't embed

PageIndex was, by its authors' account, inspired by AlphaGo — the idea being that an LLM can search a tree of document structure the way a game engine searches moves, reasoning about which branch to explore. Source: VectifyAI/PageIndex

Instead of a flat pile of vectors, it represents a document as a tree of nodes. Each node carries a title, a node ID, a page range, and optionally a summary:

{
  "title": "Results",
  "node_id": "0006",
  "start_index": 10,
  "end_index": 14,
  "summary": "Reports primary and secondary outcomes of the trial...",
  "nodes": [
    {
      "title": "Primary Endpoint",
      "node_id": "0007",
      "start_index": 10,
      "end_index": 11
    }
  ]
}

Breaking it down:

title / node_id — human-readable label and a stable identifier for the section.
start_index / end_index — the physical page range this section maps to in the source PDF.
summary — an optional LLM-generated abstract used during navigation (so the agent can decide without reading the full text).
nodes — child sections, nested recursively. This is what preserves the hierarchy chunking throws away.
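A node like the one above is plain nested data, so walking it takes only a few lines. The sketch below is illustrative Python, not code from the PageIndex repo; the helper names `walk`, `outline`, and `deepest_node_for_page` are made up for this example.

```python
from typing import Iterator, Optional

tree = {
    "title": "Results",
    "node_id": "0006",
    "start_index": 10,
    "end_index": 14,
    "summary": "Reports primary and secondary outcomes of the trial...",
    "nodes": [
        {"title": "Primary Endpoint", "node_id": "0007",
         "start_index": 10, "end_index": 11},
    ],
}


def walk(node: dict, depth: int = 0) -> Iterator[tuple[int, dict]]:
    """Yield (depth, node) pairs in document order."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)


def outline(node: dict) -> list[str]:
    """Render the tree as an indented table of contents."""
    return [
        f"{'  ' * d}[{n['node_id']}] {n['title']} "
        f"(pp. {n['start_index']}-{n['end_index']})"
        for d, n in walk(node)
    ]


def deepest_node_for_page(node: dict, page: int) -> Optional[dict]:
    """Return the most specific node whose page range contains `page`."""
    if not node["start_index"] <= page <= node["end_index"]:
        return None
    for child in node.get("nodes", []):
        hit = deepest_node_for_page(child, page)
        if hit is not None:
            return hit
    return node


print("\n".join(outline(tree)))
```

The outline is roughly what an LLM sees when it reasons over structure: titles and page ranges, with no body text.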

The whole system is two pipelines: build the tree once, then navigate it per query.

The ingestion pipeline: PDF/Markdown → tree

When you index a document, PageIndex parses it into pages (with token counts), figures out its structure, maps that structure to physical page ranges, and assembles a nested tree.

Walking the key stages:

Detect or generate a table of contents

For PDFs, PageIndex scans the first batch of pages looking for an existing table of contents. If it finds one, it extracts and cleans it into structured JSON (structure index, title, page). If there's no TOC, it groups page text into token-bounded chunks and generates a synthetic TOC from the text itself, prompting the LLM iteratively until the structure is complete.
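The "token-bounded chunks" step on the no-TOC path can be pictured as a greedy grouping of consecutive pages. This is a sketch under assumptions: the function name, the `(page_number, token_count)` input shape, and the greedy strategy are illustrative, not taken from the repo.

```python
def group_pages(pages: list[tuple[int, int]], max_tokens: int) -> list[list[int]]:
    """Group consecutive pages so each group stays within max_tokens.

    `pages` holds (page_number, token_count) pairs in document order.
    A single page larger than the budget still gets a group of its own.
    """
    groups: list[list[int]] = []
    current: list[int] = []
    used = 0
    for page_no, tokens in pages:
        # Close the current group if adding this page would exceed the budget.
        if current and used + tokens > max_tokens:
            groups.append(current)
            current, used = [], 0
        current.append(page_no)
        used += tokens
    if current:
        groups.append(current)
    return groups


pages = [(1, 400), (2, 500), (3, 300), (4, 1200), (5, 100)]
print(group_pages(pages, max_tokens=1000))
```

Each group would then be sent to the LLM in turn, with the partial structure carried forward until the whole document is covered.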

Map sections to physical pages

A printed TOC says "Methods … 12," but the physical page index in the parsed PDF may differ (cover pages, front matter, offset). PageIndex prompts the LLM with page text to align each TOC entry to its real physical_index, then runs verification — flagging mismatched sections and re-prompting to correct them.
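PageIndex does this alignment with LLM prompts. The sketch below is a deliberately simplified, deterministic stand-in that shows only the underlying idea: estimate the printed-to-physical offset from a few entries you have already located, apply it to the rest, and flag entries whose title does not appear on the expected page. All function names here are hypothetical.

```python
from collections import Counter
from typing import Optional


def estimate_offset(located: list[tuple[int, int]]) -> Optional[int]:
    """Most common (physical - printed) difference among located entries."""
    offsets = Counter(physical - printed for printed, physical in located)
    if not offsets:
        return None
    return offsets.most_common(1)[0][0]


def title_on_page(title: str, page_text: str) -> bool:
    """Case- and whitespace-insensitive check that a title appears on a page."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return norm(title) in norm(page_text)


def map_toc(toc: list[dict], offset: int, pages: dict[int, str]) -> list[dict]:
    """Attach a physical_index to each TOC entry and mark whether it verifies."""
    mapped = []
    for entry in toc:
        physical = entry["page"] + offset
        verified = title_on_page(entry["title"], pages.get(physical, ""))
        mapped.append({**entry, "physical_index": physical, "verified": verified})
    return mapped


offset = estimate_offset([(1, 5), (12, 16), (30, 35)])
toc = [{"title": "Methods", "page": 12}, {"title": "Results", "page": 20}]
pages = {16: "3  METHODS\nWe enrolled 400 patients...", 24: "Discussion of limits"}
for row in map_toc(toc, offset, pages):
    print(row)
```

Entries that come back with `verified` set to false are the ones a real pipeline would re-prompt on.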

Build the tree (and split oversized nodes)

Once there's a flat list of sections with page ranges, post-processing nests children under parents and assigns each node its start_index / end_index. If a section is too big — over the per-node page or token cap — it's treated as a mini-document and split recursively. Markdown skips all the page-mapping work; the heading levels are the tree.
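Here is one way the nesting step could look, as a minimal sketch. It assumes each flat entry carries a dotted `structure` index such as "1" or "1.2", and it ends a section on the page before the next section of the same or shallower depth begins. Both the input shape and that end-page rule are assumptions for illustration; the repo's own post-processing may differ.

```python
def build_tree(flat: list[dict], total_pages: int) -> list[dict]:
    """Nest a flat, ordered section list into a tree with page ranges."""
    def depth(item: dict) -> int:
        return item["structure"].count(".")

    roots: list[dict] = []
    stack: list[tuple[int, dict]] = []  # (depth, node) chain of open ancestors
    for i, item in enumerate(flat):
        d = depth(item)
        # The section ends just before the next section at the same or a
        # shallower depth, or at the last page if there is none.
        nxt = next((o for o in flat[i + 1:] if depth(o) <= d), None)
        end = max(item["start_index"], nxt["start_index"] - 1) if nxt else total_pages
        node = {"title": item["title"], "start_index": item["start_index"],
                "end_index": end, "nodes": []}
        # Pop ancestors that are not parents of this node.
        while stack and stack[-1][0] >= d:
            stack.pop()
        (stack[-1][1]["nodes"] if stack else roots).append(node)
        stack.append((d, node))
    return roots


flat = [
    {"structure": "1", "title": "Methods", "start_index": 5},
    {"structure": "1.1", "title": "Design", "start_index": 5},
    {"structure": "1.2", "title": "Cohort", "start_index": 8},
    {"structure": "2", "title": "Results", "start_index": 10},
    {"structure": "2.1", "title": "Primary Endpoint", "start_index": 10},
]
roots = build_tree(flat, total_pages=14)
print([(n["title"], n["start_index"], n["end_index"]) for n in roots])
```

The stack keeps the current chain of ancestors, so each new section is attached to the nearest shallower section that came before it.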

💡 Tip: For Markdown, structure is derived straight from heading depth (#, ##, ###), so ingestion is fast and deterministic — no TOC detection needed.
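The Markdown case is simple enough to sketch in full. The following is an illustrative parser, not the code in `page_index_md.py`: it handles only ATX headings (`#` through `######`), skips anything inside fenced code blocks, and records line numbers instead of page ranges.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def markdown_to_tree(text: str) -> list[dict]:
    """Build a nested section tree from Markdown heading levels."""
    roots: list[dict] = []
    stack: list[dict] = []  # chain of currently open ancestor headings
    in_fence = False
    for line_no, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        match = None if in_fence else HEADING.match(line)
        if match is None:
            continue
        level = len(match.group(1))
        node = {"title": match.group(2), "level": level,
                "line": line_no, "nodes": []}
        # Close headings at the same or a deeper level.
        while stack and stack[-1]["level"] >= level:
            stack.pop()
        (stack[-1]["nodes"] if stack else roots).append(node)
        stack.append(node)
    return roots


doc = "# Report\n\n## Methods\ntext\n### Cohort\n## Results\n"
tree = markdown_to_tree(doc)
print([child["title"] for child in tree[0]["nodes"]])
```

No LLM call is involved here, which is why heading-derived trees are repeatable from one run to the next.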

The retrieval pipeline: an agent navigates the tree

This is where PageIndex diverges hardest from classic RAG. There is no similarity search at query time. An LLM agent inspects the structure, decides which sections are relevant, and fetches only those pages.

The agent has three tools, and the order matters:

get_document() — confirm the doc exists and is indexed (metadata, status).
get_document_structure() — pull the tree without node text. The agent reads titles and ranges and reasons: "Which section title relates to my question?"
get_page_content(pages="x-y") — fetch raw text for a tight page range only.
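The three-step order can be sketched with in-memory stand-ins. Nothing below calls the real PageIndex helpers: the `DOC` dictionary, the `"completed"` status value, and the function names are assumptions, and the LLM's decision is passed in as a plain callable so the sketch does not pretend to do the reasoning itself.

```python
from typing import Callable

# Hypothetical in-memory document; everything except the node keys is assumed.
DOC = {
    "status": "completed",
    "structure": [{
        "title": "Results", "node_id": "0006",
        "start_index": 10, "end_index": 14,
        "text": "full section text...",
        "nodes": [{"title": "Primary Endpoint", "node_id": "0007",
                   "start_index": 10, "end_index": 11, "nodes": []}],
    }],
    "pages": {10: "Page 10 text", 11: "Page 11 text", 12: "Page 12 text"},
}


def parse_pages(spec: str) -> list[int]:
    """Turn "12" or "10-11" into a list of page numbers."""
    first, _, last = spec.partition("-")
    lo, hi = int(first), int(last or first)
    if lo > hi:
        raise ValueError(f"bad page range: {spec!r}")
    return list(range(lo, hi + 1))


def without_text(node: dict) -> dict:
    """Copy a node, dropping body text so only titles and ranges remain."""
    slim = {k: v for k, v in node.items() if k not in ("text", "nodes")}
    slim["nodes"] = [without_text(c) for c in node.get("nodes", [])]
    return slim


def gather_context(doc: dict, pick_pages: Callable[[list[dict]], str]) -> str:
    # Step 1: confirm the document is indexed.
    if doc["status"] != "completed":
        raise RuntimeError("document is not indexed yet")
    # Step 2: expose structure only, no body text.
    structure = [without_text(n) for n in doc["structure"]]
    # The LLM would reason here; the callable returns a range like "10-11".
    spec = pick_pages(structure)
    # Step 3: fetch raw text for just those pages.
    return "\n".join(doc["pages"][p] for p in parse_pages(spec))


def pick_first_child(structure: list[dict]) -> str:
    """Placeholder for the LLM's choice: always picks the first subsection."""
    child = structure[0]["nodes"][0]
    return f"{child['start_index']}-{child['end_index']}"


print(gather_context(DOC, pick_first_child))
```

Only pages 10 and 11 are read in this run; page 12 never enters the prompt.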

So instead of "which chunks are semantically similar?", PageIndex asks "where would an expert look, and why?" — guided navigation through the document's own organization rather than blind top-K. Source: Shubham Shardul, Medium

And because the tree is structural, the agent can follow references. In one published example, a query about total deferred assets hit a section that only reported the increase; the text pointed to "Appendix G," the reasoning retriever navigated there, found the right table, and returned the total — a hop a vector retriever would likely miss. Source: PageIndex Blog
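In PageIndex the hop is made by the LLM's reasoning, not by pattern matching. Still, a toy version shows why a tree of named sections makes the hop possible at all: once the text names "Appendix G", there is a titled node to jump to. The regex, the sample tree, and the helper names below are all illustrative assumptions.

```python
import re
from typing import Optional

REFERENCE = re.compile(r"\bsee\s+(Appendix\s+[A-Z])\b", re.IGNORECASE)


def find_by_title(nodes: list[dict], needle: str) -> Optional[dict]:
    """Depth-first search for the first node whose title contains `needle`."""
    for node in nodes:
        if needle.lower() in node["title"].lower():
            return node
        hit = find_by_title(node.get("nodes", []), needle)
        if hit is not None:
            return hit
    return None


def follow_reference(page_text: str, tree: list[dict]) -> Optional[dict]:
    """Resolve a 'see Appendix X' pointer in the text to a tree node."""
    match = REFERENCE.search(page_text)
    return find_by_title(tree, match.group(1)) if match else None


tree = [
    {"title": "Deferred Assets", "start_index": 40, "end_index": 42, "nodes": []},
    {"title": "Appendix G: Supplementary Tables",
     "start_index": 120, "end_index": 125, "nodes": []},
]
text = "Deferred assets increased this year; see Appendix G for the full table."
target = follow_reference(text, tree)
print(target["start_index"], target["end_index"])
```

A chunk store has no equivalent of `find_by_title`, because chunks carry no section names to look up.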

PageIndex vs traditional vector RAG

Dimension	Traditional Vector RAG	PageIndex (Vectorless)
Unit of retrieval	Fixed-size chunks	Document sections (tree nodes)
How relevance is decided	Cosine similarity, top-K	LLM reasoning over structure
Embeddings	✅ Required	❌ None
Vector database	✅ Required	❌ None
Document structure	❌ Destroyed by chunking	✅ Preserved as a tree
Cross-references ("see Appendix G")	❌ Usually missed	✅ Followable
Explainability	Low (opaque scores)	High (traceable section path)
Query-time cost	One query embedding plus one ANN lookup	One or more LLM round trips
Best fit	Many short, unstructured docs	Long, structured professional docs

Architecture at a glance: the core modules

If you crack open the repo, the responsibilities split cleanly:

Module	Role
`page_index.py`	PDF indexing: TOC detection, page mapping, tree assembly, recursive splitting
`page_index_md.py`	Markdown indexing: parse headings into a tree directly
`retrieve.py`	The three retrieval helpers (`get_document`, `get_document_structure`, `get_page_content`)
`client.py`	`PageIndexClient` — manages a workspace, runs indexing, persists and serves results
`utils.py`	Config loading, token counting, and LLM I/O wrappers (via LiteLLM)
`config.yaml`	Defaults: TOC scan depth, per-node page/token caps, summary flags

A few design notes worth internalizing:

It's LLM-heavy. TOC detection, mapping, verification, and summaries are all prompts. On a 100-page document that can mean dozens of API calls. PageIndex uses async concurrency for checks and fixes, but the core mapping is fairly sequential, so indexing time scales roughly linearly with page count.
Token limits drive splitting. Page groups are kept under a max-token cap per node; oversized sections get recursively re-parsed so prompts never blow the context window.
Failure modes are mostly extraction-driven. If PDF text extraction misses content (bad OCR, multi-column layouts, fancy fonts), TOC detection can produce wrong sections. The pipeline includes verification and retry, and a final pass drops any section whose start index runs past the page count — but garbage in still risks garbage structure.
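The final sanity pass mentioned above is easy to picture. This sketch assumes 1-based page numbers and a flat list of sections with a `start_index` key. It returns the dropped sections as well as the kept ones, which is a choice made here so that bad extractions are visible rather than silently discarded.

```python
def drop_out_of_range(sections: list[dict], page_count: int) -> tuple[list[dict], list[dict]]:
    """Split sections into (kept, dropped) by whether start_index is a real page."""
    kept: list[dict] = []
    dropped: list[dict] = []
    for section in sections:
        in_range = 1 <= section["start_index"] <= page_count
        (kept if in_range else dropped).append(section)
    return kept, dropped


sections = [
    {"title": "Methods", "start_index": 12},
    {"title": "Results", "start_index": 30},
    {"title": "Phantom Appendix", "start_index": 412},
]
kept, dropped = drop_out_of_range(sections, page_count=100)
print([s["title"] for s in kept], [s["title"] for s in dropped])
```

A non-empty `dropped` list is a useful signal that text extraction or TOC parsing went wrong upstream.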

Important: PageIndex assumes the document's logical structure is reflected in a TOC or in headings. Feed it a PDF with no clear structure and the synthetic-TOC path will do its best, but results get shakier.

When to use PageIndex — and when not to

Reach for PageIndex when:

📄 You're working with long, structured professional documents — SEC filings, contracts, manuals, research papers.
🧠 Correctness and explainability matter more than raw latency (legal, financial, compliance).
🔗 Answers depend on cross-references between sections.
You want to drop the vector-DB stack and the chunk-tuning treadmill.

Stick with traditional vector RAG when:

You have a huge corpus of short, unstructured documents where there's no meaningful hierarchy to navigate.
You need sub-100ms lookups and can't afford an extra LLM round trip per query.
Your documents have no reliable structure (no TOC, no headings) for the tree to lean on.

The honest framing from the broader community: vector search isn't going away. The likely future is hybrid — similarity search for breadth, reasoning-based retrieval for the high-stakes, structure-heavy cases. Source: Microsoft Community Hub

Production checklist

Validate extraction first. Bad PDF text in means bad tree out — run a quick check that pages are extracting clean text before you trust the index.
Prefer Markdown when you control the source. Heading-derived trees are deterministic and skip TOC detection entirely.
Tune per-node caps (max_page_num_each_node, max_token_num_each_node) to your model's context budget — they govern when sections get recursively split.
Cache the index. Building the tree is the expensive part; persist it and avoid re-indexing on every query.
Keep page ranges tight at retrieval. Let the agent fetch a handful of pages, not whole chapters — that's the entire point of the structure.
Log the navigation path. The traceable section path is your audit trail; capture it for any high-stakes answer.
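Two of these items are simple to sketch: checking extraction quality and caching the index. The code below is a hedged illustration, not part of PageIndex. The `build` callable stands in for whatever indexing call you use, the cache is keyed by a SHA-256 hash of the file so a changed document is re-indexed, and the 200-character threshold is an arbitrary starting point.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def suspicious_pages(page_texts: list[str], min_chars: int = 200) -> list[int]:
    """1-based numbers of pages whose extracted text looks too short to trust."""
    return [i for i, text in enumerate(page_texts, start=1)
            if len(text.strip()) < min_chars]


def cached_index(path: Path, cache_dir: Path,
                 build: Callable[[Path], dict]) -> dict:
    """Return the tree for `path`, building it only if no cached copy exists."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{digest}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    tree = build(path)  # the expensive, LLM-heavy step
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(tree), encoding="utf-8")
    return tree


print(suspicious_pages(["x" * 300, "   ", "y" * 10]))
```

Running `suspicious_pages` before indexing is cheap, and a long list of flagged pages usually means the PDF needs OCR or a different extractor first.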

Conclusion

I keep coming back to one line from the PageIndex docs: similarity is not relevance. That's the whole pitch. For years we've papered over that gap with bigger embedding models and cleverer chunking, when the real fix on structured documents is to stop pretending a flat vector space captures a document that was written as a hierarchy.

PageIndex won't replace your vector DB for every workload, and the extra LLM round trips are a real cost. But for long, structured, high-stakes documents — the exact place classic RAG quietly fails — reasoning over a tree is a genuinely better mental model than guessing with cosine distance. If you've got a contract or a 10-K that vector RAG keeps fumbling, clone the repo, index one document, and watch the agent navigate it. That first traced retrieval path tends to be the moment it clicks.
