Add PageIndexClient with agent-based retrieval via OpenAI Agents SDK by KylinMountain · Pull Request #125 · VectifyAI/PageIndex

KylinMountain · 2026-02-28T12:30:54Z

What this PR adds

The upstream library provides page_index() and md_to_tree() for building document tree structures, but has no retrieval or QA layer. This PR adds that layer.

New: `pageindex/retrieve.py` — 3 retrieval tool functions

Three functions that expose structured document access:

tool_get_document(documents, doc_id) — metadata (name, description, type, page count)
tool_get_document_structure(documents, doc_id) — full tree JSON without text (token-efficient)
tool_get_page_content(documents, doc_id, pages) — page text by range ("5-7", "3,8", "12")

Works with both PDF and Markdown documents: `pages` refers to page numbers for PDFs and to line numbers for Markdown.
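
To illustrate why the structure tool is described as token-efficient, here is a minimal sketch of stripping `text` from a tree before serializing it. The helper name `strip_text` and the sample tree are hypothetical; only the field names (`title`, `node_id`, `start_index`, `end_index`, `text`, `nodes`) are taken from output and code quoted in this thread, and the actual implementation in `pageindex/retrieve.py` may differ.

```python
import json


def strip_text(nodes):
    """Return a copy of the tree with every 'text' field removed.

    Hypothetical helper: it keeps titles, ids and page ranges so an agent
    can navigate the tree, but drops the (large) page text.
    """
    stripped = []
    for node in nodes:
        # Copy every field except the body text and the children.
        slim = {k: v for k, v in node.items() if k not in ("text", "nodes")}
        # Recurse into children, if any.
        if node.get("nodes"):
            slim["nodes"] = strip_text(node["nodes"])
        stripped.append(slim)
    return stripped


# Made-up sample tree for illustration only.
tree = [
    {
        "title": "Introduction",
        "node_id": "0000",
        "start_index": 1,
        "end_index": 2,
        "text": "full page text ...",
        "nodes": [
            {"title": "Background", "node_id": "0001", "text": "more text ..."}
        ],
    }
]

slim_tree = strip_text(tree)
print(json.dumps(slim_tree, indent=2))
```

The original tree is left untouched, so the page text stays available for a later `get_page_content` call.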

New: `pageindex/client.py` — `PageIndexClient`

High-level SDK client:

index(file_path) — index a PDF or Markdown file, returns doc_id
query_agent(doc_id, prompt, verbose=False) — runs an OpenAI Agents SDK agent that calls the 3 tools autonomously to answer the question
query(doc_id, prompt) / query_stream(doc_id, prompt) — convenience wrappers
workspace parameter for JSON-based persistence across sessions
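
The persistence idea can be sketched without the library. The class below is a hypothetical stand-in, not `PageIndexClient` itself: it assumes one JSON file per `doc_id` inside the workspace directory, which is a guess about the on-disk layout rather than something this PR description states.

```python
import json
import tempfile
from pathlib import Path


class TinyWorkspace:
    """Hypothetical stand-in for a JSON workspace (one file per doc_id)."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        # Reload whatever an earlier session saved.
        self.documents = {}
        for path in sorted(self.root.glob("*.json")):
            self.documents[path.stem] = json.loads(path.read_text(encoding="utf-8"))

    def save(self, doc_id, doc_info):
        """Keep the document in memory and write it to disk."""
        self.documents[doc_id] = doc_info
        target = self.root / f"{doc_id}.json"
        target.write_text(json.dumps(doc_info), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    first = TinyWorkspace(tmp)
    first.save("doc-1", {"doc_name": "deepseek-r1.pdf", "page_count": 86})

    # A second instance simulates restarting Python: nothing is re-indexed.
    second = TinyWorkspace(tmp)
    reloaded = second.documents["doc-1"]

print(reloaded["doc_name"], reloaded["page_count"])
```

Loading in the constructor means a restarted process can query an existing `doc_id` immediately, which is the behaviour the test plan below checks.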

Demo: OpenAI Agents SDK

The agent navigates the document structure itself — no manual retrieval logic needed:

query_agent(doc_id, "What are the conclusions?")
  Turn 1: get_document()            → confirms status and page count
  Turn 2: get_document_structure()  → reads tree to find relevant sections
  Turn 3: get_page_content("10-13") → fetches targeted page content
  Turn 4: synthesizes answer

With verbose=True, each tool call (name, args, result preview) is printed in real time.
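
As a rough illustration of the verbose output, the sketch below formats a single tool call as a call line plus a truncated result preview. The function `format_tool_call` and its `preview_len` parameter are hypothetical and do not come from the PR; the layout only mimics the `[Turn N]` lines shown in the test output later in this thread.

```python
import json


def format_tool_call(turn, name, args, result, preview_len=80):
    """Render one tool call as two log lines: the call and a result preview."""
    # Truncate long results so the log stays readable.
    if len(result) > preview_len:
        preview = result[:preview_len] + "..."
    else:
        preview = result
    return f"[Turn {turn}] → {name}({json.dumps(args)})\n         ← {preview}"


long_result = '[{"page": 10, "content": "' + "x" * 200 + '"}]'
line = format_tool_call(3, "get_page_content", {"pages": "10-11"}, long_result)
print(line)

short = format_tool_call(1, "get_document", {}, "ok")
print(short)
```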

Test plan

pip install openai-agents
python test_client.py — downloads DeepSeek-R1 PDF, indexes it, runs agent query
client.query_agent(doc_id, "...", verbose=True) — observe tool call sequence
Restart Python, reload PageIndexClient(workspace=...) — query works without re-indexing

KylinMountain · 2026-02-28T12:34:09Z

Test result:

python test_client.py                     
Loaded 2 document(s) from workspace.
============================================================
Step 1: Indexing PDF and inspecting tree structure
============================================================
Indexing PDF: tests/pdfs/deepseek-r1.pdf
Parsing PDF...
start find_toc_pages



no toc found
process_no_toc
start_index: 1
divide page_list to groups 5
start generate_toc_init
start generate_toc_continue
start generate_toc_continue
start generate_toc_continue
start generate_toc_continue
Document validation: 86 pages, max allowed index: 86
start verify_toc
check all items
accuracy: 70.77%
start fix_incorrect_toc
Fixing 19 incorrect results
start fix_incorrect_toc with 19 incorrect results
Fixing 13 incorrect results
start fix_incorrect_toc with 13 incorrect results
Fixing 12 incorrect results
start fix_incorrect_toc with 12 incorrect results


Indexing complete. Document ID: d1bdc439-abc5-48c1-bbaa-ffa78b3d5aa6

Document ID: d1bdc439-abc5-48c1-bbaa-ffa78b3d5aa6

Tree Structure:
[0000] Introduction  —  The partial document discusses the development of reasoning ...
[0001] DeepSeek-R1-Zero  —  The partial document discusses the development of reasoning ...
  [0002] Group Relative Policy Optimization  —  The partial document discusses the development of reasoning ...
  [0003] Reward Design  —  The partial document describes the GRPO optimization method ...
  [0004] Incentivize Reasoning Capability in LLMs  —  The partial document describes the training and development ...
[0005] DeepSeek-R1  —  The partial document describes the development and multi-sta...
  [0006] Model-based Rewards  —  The partial document describes the development and training ...
  [0007] Training Details  —  The partial document discusses the training and methodology ...
    [0008] Training Details of the First RL Stage  —  The partial document describes the training and evaluation p...
    [0009] Training Details of the Second RL Stage  —  The partial document discusses the training and evaluation o...
[0010] Experiment  —  The partial document discusses the training and evaluation o...
[0011] Ethics and Safety Statement  —  The partial document discusses the ethical and safety consid...
[0012] Conclusion, Limitation, and Future Work  —  The partial document discusses the ethical considerations, s...
[0013] Author List  —  The partial document discusses the challenges and advancemen...
[0014] Appendix  —  The partial document provides an overview of DeepSeek V3, an...
[0015] Background  —  The partial document provides an overview of DeepSeek V3, an...
  [0016] DeepSeek-V3  —  The partial document provides an overview of DeepSeek V3, an...
  [0017] Conventional Post-Training Paradigm  —  The partial document provides an overview of DeepSeek-V3, an...
  [0018] A Comparison of GRPO and PPO  —  The partial document discusses the strengths and limitations...
[0019] Training Details  —  The partial document describes a reinforcement learning (RL)...
  [0020] RL Infrastructure  —  The partial document describes a reinforcement learning (RL)...
  [0021] Reward Model Prompt  —  The partial document covers the following main points:

1. *...
  [0022] Data Recipe  —  The partial document provides an overview of Reinforcement L...
    [0023] RL Data  —  The partial document provides a detailed description of Rein...
    [0024] DeepSeek-R1 Cold Start  —  The partial document describes the development and evaluatio...
    [0025] 800K Supervised Data  —  The partial document covers examples of solving basic arithm...
    [0026] SFT Data Statistics  —  The partial document primarily discusses the use of DeepSeek...
    [0027] Examples of SFT Trajectories  —  The partial document discusses the design principles and met...
  [0028] Hyper-Parameters  —  The partial document covers two main sections. The first sec...
    [0029] Hyper-Parameters of DeepSeek-R1-Zero-Qwen-32B  —  The partial document covers two main sections. The first sec...
    [0030] Hyper-Parameters of SFT  —  The partial document covers the following main points:

1. *...
    [0031] Hyper-Parameters of Distillation  —  The partial document covers the following main points:

1. *...
    [0032] Training Cost  —  The partial document discusses the phenomenon of reward hack...
  [0033] Reward Hacking  —  The partial document discusses the main points related to it...
  [0034] Ablation Study of Language Consistency Reward  —  The partial document covers the following main points:

1. *...
[0035] Self-Evolution of DeepSeek-R1-Zero  —  The partial document covers the following main points:

1. *...
  [0036] Evolution of Reasoning Capability in DeepSeek-R1-Zero during Training  —  The partial document discusses the main points related to it...
  [0037] Evolution of Advanced Reasoning Behaviors in DeepSeek-R1-Zero during Training  —  The partial document provides detailed insights into the tra...
[0038] Evaluation of DeepSeek-R1  —  The partial document discusses the evaluation of reasoning b...
  [0039] Experiment Setup  —  The partial document covers the evaluation of the DeepSeek-R...
  [0040] Main Results  —  The partial document provides an evaluation of the performan...
  [0041] DeepSeek-R1 Safety Report  —  The partial document discusses the performance and safety ev...
    [0042] Risk Control System for DeepSeek-R1  —  The partial document focuses on the safety assessment of the...
    [0043] R1 Safety Evaluation on Standard Benchmarks  —  The partial document focuses on the safety assessment and ri...
    [0044] Safety Taxonomic Study of R1 on In-House Benchmark  —  The partial document provides a detailed comparison of the D...
    [0045] Multilingual Safety Performance  —  The partial document focuses on evaluating the multilingual ...
    [0046] Robustness against Jailbreaking  —  The partial document provides a comparative analysis of the ...
[0047] More Analysis  —  The partial document provides a comparative analysis of two ...
  [0048] Performance Comparison with DeepSeek-V3  —  The partial document provides a comparative analysis of two ...
  [0049] Generalization to Real-World Competitions  —  The partial document provides an analysis of the performance...
  [0050] Mathematical Capabilities Breakdown by Categories  —  The partial document discusses the performance and computati...
  [0051] An Analysis on CoT Length  —  The partial document discusses the performance and computati...
  [0052] Performance of Each Stage on Problems of Varying Difficulty  —  The partial document discusses the limitations of majority v...
[0053] DeepSeek-R1 Distillation  —  The partial document discusses the limitations of majority v...
  [0054] Distillation v.s. Reinforcement Learning  —  The partial document discusses the effectiveness of distilla...
[0055] Discussion  —  The partial document discusses the main points related to it...
  [0056] Key Findings  —  The partial document discusses the evaluation and advancemen...
  [0057] Unsuccessful Attempts  —  The partial document discusses advanced methods for improvin...
[0058] Related Work  —  The partial document discusses the main points related to it...
  [0059] Chain-of-thought Reasoning  —  The partial document provides an in-depth analysis of the pe...
  [0060] Scaling Inference-time Compute  —  The partial document discusses various evaluation benchmarks...
  [0061] Reinforcement Learning for Reasoning Enhancement  —  The partial document discusses two evaluation benchmarks for...
[0062] Open Weights, Code, and Data  —  The partial document discusses various evaluation benchmarks...
[0063] Evaluation Prompts and Settings  —  The partial document provides an overview of various evaluat...
[0064] References  —  The partial document primarily covers advancements and resea...

============================================================
Step 2: Document Metadata (get_document)
============================================================
{"doc_id": "d1bdc439-abc5-48c1-bbaa-ffa78b3d5aa6", "doc_name": "deepseek-r1.pdf", "doc_description": "A comprehensive document detailing the development, training, evaluation, and safety considerations of the DeepSeek-R1 model and its variants, focusing on enhancing reasoning capabilities in large language models through reinforcement learning, supervised fine-tuning, and distillation, while addressing challenges like reward hacking, language consistency, and ethical risks.", "type": "pdf", "status": "completed", "page_count": 86}

============================================================
Step 3: Agent Query (auto tool-use)
============================================================

Question: 'What are the main conclusions of this paper?'

Answer:
The main conclusions of the paper are:

1. **Reasoning Potential and Training Strategy**:
   - DeepSeek-R1 models demonstrate strong reasoning abilities, emerging organically during reinforcement learning (RL) phases without extensive reliance on human annotation.
   - The key to unlocking this reasoning potential lies in providing hard problems, a reliable verifier, and ample computational resources for RL, rather than large-scale human annotation.

2. **Future Potential**:
   - With advanced RL techniques, AI systems like DeepSeek-R1 show potential to surpass human capabilities in tasks effectively evaluated by verifiers.
   - Integrating tools such as search engines, calculators, or real-world validation tools could greatly enhance reasoning capabilities and solution accuracy.

3. **Limitations**:
   - Structural Outputs: Current suboptimal outputs and lack of tool usage leave room for improvement.
   - Token Efficiency: Issues like overthinking persist, requiring optimization.
   - Language Mixing: The model struggles with languages outside English and Chinese.
   - Prompt Sensitivity: Few-shot prompting degrades performance, thus zero-shot prompting is recommended.
   - Software Engineering: Limited RL application in software engineering tasks leads to poor benchmarking in this domain.
   - RL Challenges: Reward hacking and difficulties in defining reliable reward structures remain significant challenges.

4. **Research Opportunities**:
   - Addressing the limitations in structural outputs, token usage, language proficiency, and reward modeling will be key areas of focus in future iterations of the model.

============================================================
Step 4: Persistence — reload without re-indexing
============================================================
Loaded 5 document(s) from workspace.

[Turn 1] → get_document({})
         ← {"doc_id": "ab91fc22-820d-4ffa-b971-20d292aaf2a0", "doc_name": "deepseek-r1.pdf", "doc_description": "A comprehensive document detailing the development, training, evaluation, and safety measures of t...

[Turn 2] → get_document_structure({})
         ← [{"title": "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", "start_index": 1, "end_index": 1, "nodes": [{"title": "Abstract", "start_index": 1, "end_index": 1, "no...
[non-fatal] Tracing: request failed: _ssl.c:993: The handshake operation timed out

[Turn 3] → get_page_content({"pages":"10-11"})
         ← [{"page": 10, "content": "5. Ethics and Safety Statement\nWith the advancement in the reasoning capabilities of DeepSeek-R1, we deeply recognize\nthe potential ethical risks. For example, R1 can be su...
Answer from reloaded client:
The main conclusions of the document about DeepSeek-R1 can be summarized as follows:

1. **Emergent Reasoning via RL**: The reinforcement learning (RL)-based approach enables the emergence of sophisticated reasoning behaviors, such as self-reflection and verification, during training, highlighting the potential of RL to unlock substantial reasoning capabilities in pre-trained models without large-scale human annotation.

2. **Model Limitations**:
   - **Structural Output and Tool Use**: DeepSeek-R1's structural output capabilities are suboptimal, and it lacks the ability to leverage external tools (e.g., compilers or search engines) for improved performance.
   - **Token Efficiency**: Instances of excessive reasoning remain, suggesting the need for optimization, although the model dynamically allocates computation based on task complexity.
   - **Language Mixing**: The model is optimized for Chinese and English, leading to challenges in other languages.
   - **Prompt Sensitivity**: Few-shot prompting degrades performance, with zero-shot prompting being recommended.
   - **Software Engineering Tasks**: Limited progress in software engineering due to challenges in scaling RL processes for such tasks.

3. **Challenges of Pure RL**:
   - **Reward Hacking**: Reliable reward models are critical for effective RL in complex tasks, but constructing such models is inherently difficult for open-ended tasks like writing.

4. **Future Directions**:
   - Improve structural output, tool integration, and token efficiency.
   - Develop robust reward models to facilitate RL in more complex domains.
   - Explore tool-augmented reasoning to enhance practical performance across tasks.

Overall, the work underscores that RL methods like DeepSeek-R1 hold potential for surpassing human capabilities in tasks where reliable verifiers exist, though challenges in feedback mechanisms and complex task evaluation remain pivotal research areas.

Persistence verified. ✓

Copilot

Pull request overview

Adds a retrieval + QA layer on top of the existing PageIndex tree builders by introducing tool-style retrieval functions and a high-level PageIndexClient that uses the OpenAI Agents SDK to autonomously navigate document structure and fetch relevant page/line content for answering questions.

Changes:

Added pageindex/retrieve.py with 3 JSON-returning retrieval tools: document metadata, token-efficient structure, and page/line content retrieval.
Added pageindex/client.py implementing PageIndexClient (indexing, workspace persistence, and agent-driven querying).
Added a runnable demo script (test_client.py) and updated exports/dependencies (pageindex/__init__.py, requirements.txt).

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| `test_client.py` | Demo script that downloads a PDF, indexes it, and runs agent queries including workspace reload. |
| `requirements.txt` | Adds `openai-agents` dependency for the agent-based client. |
| `pageindex/utils.py` | Adds streaming helper and tree/node printing/mapping utilities. |
| `pageindex/retrieve.py` | Implements the 3 retrieval tool functions (metadata, structure, page/line content). |
| `pageindex/client.py` | Introduces `PageIndexClient` with indexing, persistence, and OpenAI Agents SDK integration. |
| `pageindex/__init__.py` | Exposes retrieval tools and `PageIndexClient` at package top-level. |

KylinMountain · 2026-03-16T10:17:29Z

Code review

Found 2 issues:

Unhandled KeyError when cached doc_id is stale — if demo_doc_id.txt exists but the corresponding workspace JSON has been deleted, client.documents[doc_id] raises an unhandled KeyError with no recovery path.

https://github.com/KylinMountain/PageIndex/blob/8a13890144e654ad02ed076781d56d0c46b5aba5/examples/openai_agents_demo.py#L141-L155

_parse_pages() silently returns an empty list for reversed ranges (e.g. "7-5") — range(7, 6) produces nothing, so the agent receives an empty result with no error or explanation, making it impossible to diagnose the problem.

https://github.com/KylinMountain/PageIndex/blob/8a13890144e654ad02ed076781d56d0c46b5aba5/pageindex/retrieve.py#L12-L22
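
One possible way to address the reversed-range issue is to reject such input explicitly instead of returning an empty list. The function below is a standalone sketch, not the PR's `_parse_pages()`: the name `parse_pages`, the error messages, and the choice to raise `ValueError` are all assumptions.

```python
def parse_pages(spec):
    """Expand a spec such as "5-7", "3,8" or "12" into a sorted list of ints.

    Raises ValueError for reversed ranges (e.g. "7-5") and for empty specs,
    so the caller can report the problem instead of getting an empty result.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(
                    f"reversed range {part!r}: start {start} is greater than end {end}"
                )
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if not pages:
        raise ValueError(f"no pages found in {spec!r}")
    return sorted(pages)


print(parse_pages("1-2,14-16"))

try:
    parse_pages("7-5")
except ValueError as exc:
    print(f"rejected: {exc}")
```

A tool wrapper could catch the `ValueError` and return the message as JSON, which would give the agent something to act on rather than an unexplained empty list.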

🤖 Generated with Claude Code


KylinMountain · 2026-03-20T09:20:09Z

Code review

Found 1 issue:

_get_md_page_content silently returns [] for almost all page range requests on Markdown documents. The function matches requested line numbers against exact line_num values of node headers. Since node headers are sparse (e.g. lines 1, 34, 52, ...), a request like pages="5-7" expands to {5, 6, 7} and will find nothing unless a header lands exactly on one of those lines — the caller receives an empty list with no error. This makes the Markdown path of get_page_content nearly unusable in practice.

PageIndex/pageindex/retrieve.py

Lines 58 to 79 in 72afe8b

    
    def _get_md_page_content(doc_info: dict, page_nums: list[int]) -> list[dict]:
        """
        For Markdown documents, 'pages' are line numbers.
        Find nodes whose line_num falls within the requested set and return their text.
        """
        page_set = set(page_nums)
        results = []
        seen = set()

        def _traverse(nodes):
            for node in nodes:
                ln = node.get('line_num')
                if ln and ln in page_set and ln not in seen:
                    seen.add(ln)
                    results.append({'page': ln, 'content': node.get('text', '')})
                if node.get('nodes'):
                    _traverse(node['nodes'])

        _traverse(doc_info.get('structure', []))
        results.sort(key=lambda x: x['page'])
        return results
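
One way the matching could be made less brittle is to treat each header as covering the lines up to the next header, and to return every node whose span contains a requested line. The sketch below is not the fix adopted in this PR. It assumes that a node's `text` holds the section body from its header to the next one, and the names `flatten` and `get_md_content_by_span` are hypothetical.

```python
def flatten(nodes):
    """Yield nodes in document order (pre-order traversal)."""
    for node in nodes:
        yield node
        yield from flatten(node.get("nodes") or [])


def get_md_content_by_span(structure, line_nums):
    """Return every node whose line span contains a requested line.

    A node's span runs from its own line_num to the line before the next
    header; the last node is treated as running to the end of the file.
    """
    requested = sorted(set(line_nums))
    headers = sorted(
        (n for n in flatten(structure) if n.get("line_num")),
        key=lambda n: n["line_num"],
    )
    results = []
    for i, node in enumerate(headers):
        start = node["line_num"]
        if i + 1 < len(headers):
            end = headers[i + 1]["line_num"] - 1
        else:
            end = float("inf")
        if any(start <= ln <= end for ln in requested):
            results.append({"page": start, "content": node.get("text", "")})
    return results


# Headers at lines 1, 34 and 52, as in the example in the review comment.
structure = [
    {"title": "Intro", "line_num": 1, "text": "intro body", "nodes": [
        {"title": "Setup", "line_num": 34, "text": "setup body"},
    ]},
    {"title": "Usage", "line_num": 52, "text": "usage body"},
]

hits = get_md_content_by_span(structure, [5, 6, 7])
print(hits)
```

With this approach a request for lines 5 to 7 resolves to the section that starts at line 1, instead of returning nothing because no header sits exactly on those lines.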

🤖 Generated with Claude Code


KylinMountain · 2026-03-20T09:25:29Z

New output, now showing the agent's reasoning before each tool call:

============================================================
Step 3: Agent Query (auto tool-use)
============================================================

Question: 'What reward design does DeepSeek-R1-Zero use, and why was it chosen over supervised fine-tuning?'

I’ll inspect the document metadata and structure to locate the relevant section on DeepSeek-R1-Zero.
[tool] get_document({})

[output] {"doc_id": "cceecde2-4655-48f5-a175-6127ce48c4f8", "doc_name": "deepseek-r1.pdf", "doc_description": "A comprehensive document detailing the development, training, evaluation, and safety of the DeepSe...

I’ll check the document structure to find the pages discussing DeepSeek-R1-Zero and its reward design.
[tool] get_document_structure({})

[output] [{"title": "Introduction", "start_index": 1, "end_index": 2, "node_id": "0000", "summary": "The partial document discusses the development of reasoning capabilities in large language models (LLMs) thr...

I’ll fetch the reward-design pages to quote the exact reward setup and motivation.
[tool] get_page_content({"pages":"3-4"})

[output] [{"page": 3, "content": "For each question 𝑞, GRPO samples a group of outputs {𝑜1,𝑜2,···,𝑜𝐺}from the old policy\n𝜋𝜃𝑜𝑙𝑑and then optimizes the policy model𝜋 𝜃by maximizing the following objective:\nJ𝐺𝑅𝑃...

I’ll fetch the introductory comparison with supervised fine-tuning to state why they skipped it.
[tool] get_page_content({"pages":"1-2,14-16"})

[output] [{"page": 1, "content": "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via\nReinforcement Learning\nDeepSeek-AI\nresearch@deepseek.com\nAbstract\nGeneral reasoning represents a long-standing...

DeepSeek-R1-Zero uses a **rule-based reward design** with two equally weighted parts:

- **Accuracy reward**: checks whether the final answer is correct using verifiable signals like exact answer matching for math or test-case execution for code.
- **Format reward**: rewards putting the reasoning and answer into the required tags, especially **`<think>...</think>`** and **`<answer>...</answer>`**.

The paper states this as:
**Reward = Reward_acc + Reward_format**, with the same weight for both.

They chose this instead of supervised fine-tuning first because they wanted **pure RL to let reasoning emerge without human-imposed reasoning patterns**. The paper says SFT can limit exploration since human demonstrations may be biased or suboptimal and often omit useful behaviors like **reflection** and **verification**. By skipping SFT, they hoped the model could discover **novel reasoning strategies** on its own.

They also explicitly avoided **neural reward models** for these reasoning tasks because they found them vulnerable to **reward hacking**, and they add extra **compute cost and pipeline complexity**.


KylinMountain changed the title ~~Replace retrieve() with 3-tool OpenAI Agents SDK agent~~ Add retrieve function Feb 28, 2026

KylinMountain changed the title ~~Add retrieve function~~ Add PageIndexClient with agent-based retrieval via OpenAI Agents SDK Feb 28, 2026

BukeLy requested a review from Copilot February 28, 2026 17:04

Copilot started reviewing on behalf of BukeLy February 28, 2026 17:05

Copilot AI reviewed Feb 28, 2026

KylinMountain mentioned this pull request Mar 10, 2026

How to retrieve or query? #25

Closed

KylinMountain added 5 commits March 20, 2026 19:17

Add PageIndexClient with retrieve, streaming support and litellm integration

7e87b2c

Add litellm dependency and update default config

c23407c

Add OpenAI agents demo example

2f92b26

Update README with Python API usage and agent demo section

4f7157c

Support separate retrieve_model configuration for index and retrieve

80581be

KylinMountain force-pushed the feat/retrieve branch from 38fffc6 to 80581be March 20, 2026 11:27

KylinMountain wants to merge 5 commits into VectifyAI:main from KylinMountain:feat/retrieve

2 participants
