Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
230 changes: 230 additions & 0 deletions IMPROVEMENTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,230 @@
# PageIndex: Bug Fixes & Improvements Report

## Overview

This document summarizes all changes made to the PageIndex repository. The work covers **12 bug fixes** (3 critical, 9 standard) and **6 feature improvements**, all verified with a new 45-test automated test suite.

No behavioral changes were made to the core LLM prompts or retrieval logic. All fixes target correctness, robustness, and developer experience.

---

## Critical Bug Fixes

### 1. Missing `import re` in `utils.py`

**Problem:** `utils.py` used `re.search()` and `re.finditer()` in `get_first_start_page_from_text()` and `get_last_start_page_from_text()` but never imported the `re` module. Any call to these functions would crash with `NameError: name 're' is not defined`.

**Fix:** Added `import re` to the top-level imports.

**Impact:** These functions are called during PDF page tag extraction. Without this fix, any PDF with page index tags would fail at runtime.

---

### 2. Variable shadowing in `fix_incorrect_toc()` corrupts results

**Problem:** In `fix_incorrect_toc()` > `process_and_check_item()`, the variable `list_index` (the TOC entry index) was overwritten inside a loop that iterates over page ranges:

```python
list_index = incorrect_item['list_index'] # correct TOC index
for page_index in range(prev_correct, next_correct+1):
list_index = page_index - start_index # OVERWRITES with page offset
```

The returned `list_index` would be wrong, causing the fix to update the wrong TOC entry silently.

**Fix:** Renamed the inner loop variable to `page_list_index`.

**Impact:** Without this fix, TOC correction could place fixes at wrong indices, producing silently corrupted output for any document where initial verification fails and the fix path is triggered.

---

### 3. `ChatGPT_API_with_finish_reason` returns wrong type on error

**Problem:** On success the function returns a tuple `(content, finish_reason)`, but on max retries exhaustion it returned just the string `"Error"`. Any caller that unpacks like `response, reason = ChatGPT_API_with_finish_reason(...)` would crash with `ValueError: not enough values to unpack`.

**Fix:** Changed to return `("Error", "error")` on failure.

**Impact:** Every call site that unpacks this return value (`extract_toc_content`, `toc_transformer`, `generate_toc_init`, `generate_toc_continue`) would crash if the API failed 10 times consecutively.

---

## Standard Bug Fixes

### 4. `chat_history` mutation across retries

**Problem:** Both `ChatGPT_API_with_finish_reason` and `ChatGPT_API` assigned `messages = chat_history` then appended to it. This mutated the caller's original list and accumulated duplicate messages across retries.

**Fix:** Changed to `messages = list(chat_history)` to create a shallow copy.

---

### 5. Infinite loop guard in `extract_toc_content` was non-functional

**Problem:** The guard `if len(chat_history) > 5` could never trigger because `chat_history` was recreated with exactly 2 items at the start of each loop iteration.

**Fix:** Added an external `attempt` counter that increments each iteration and breaks at `max_attempts = 10`.

---

### 6. `get_leaf_nodes` crashes on nodes without `nodes` key

**Problem:** `if not structure['nodes']` raises `KeyError` when a dict has no `nodes` key (common for leaf nodes in certain tree states).

**Fix:** Changed to `if not structure.get('nodes')`.

---

### 7. `extract_json` corrupts text containing the word "None"

**Problem:** `json_content.replace('None', 'null')` was a global string replace. A title like `"None of the Above"` would become `"null of the Above"`.

**Fix:** Replaced with context-aware regex that only substitutes `None` in JSON value positions (after `:`, `[`, `,`), not inside quoted strings.

---

### 8. `process_none_page_numbers` crashes on missing `page` key

**Problem:** `del item_copy['page']` and `del item['page']` would raise `KeyError` if the key doesn't exist.

**Fix:** Changed to `item_copy.pop('page', None)` and `item.pop('page', None)`.

---

### 9. `count_tokens` crashes when `model=None`

**Problem:** `tiktoken.encoding_for_model(None)` raises an error. Multiple call sites pass `model=None`.

**Fix:** Added `model = model or "gpt-4o"` as default fallback.

---

### 10. Bare `except:` in `page_index_md.py`

**Problem:** `except:` catches everything including `SystemExit` and `KeyboardInterrupt`, making it impossible to interrupt the process cleanly.

**Fix:** Changed to `except ImportError:`.

---

### 11. Duplicate imports

**Problem:** `import logging` appeared twice in `utils.py`; `import os` appeared twice in `page_index.py`.

**Fix:** Removed the duplicates.

---

### 12. `add_page_number_to_toc` called with list instead of string

**Problem:** `process_none_page_numbers` passed `page_contents` (a Python list) to `add_page_number_to_toc`, which expects a string for its `part` parameter. Python would convert the list to its repr `['...', '...']` in the f-string prompt, sending garbled text to the LLM. Also added a guard for empty/invalid `result` before indexing.

**Fix:** Changed to `''.join(page_contents)` and added `result and isinstance(result[0].get(...))` check.

---

## Feature Improvements

### 1. `pyproject.toml` for proper Python packaging

Added a standard `pyproject.toml` with:
- Package metadata (name, version, description, license, classifiers)
- Dependencies mirroring `requirements.txt` (with `>=` instead of `==` for flexibility)
- Optional `[dev]` dependencies (pytest, pytest-asyncio)
- Console script entry point: `run_pageindex` command
- Setuptools configuration to include both `pageindex/` package and root `run_pageindex.py`

Also added a `main()` function wrapper in `run_pageindex.py` so the console script entry point works.

**Benefit:** Users can `pip install .` or eventually `pip install pageindex` from PyPI.

---

### 2. Automated test suite (45 tests)

Created two test files:
- `tests/test_utils.py` (36 tests) -- covers `count_tokens`, `extract_json`, `get_leaf_nodes`, `get_first_start_page_from_text`, `get_last_start_page_from_text`, `sanitize_filename`, `list_to_tree`, `add_preface_if_needed`, `remove_fields`, `reorder_dict`, `convert_physical_index_to_int`, `convert_page_to_int`, `write_node_id`, `get_nodes`, `structure_to_list`
- `tests/test_page_index_md.py` (9 tests) -- covers `extract_nodes_from_markdown`, `extract_node_text_content`, `build_tree_from_nodes`, `clean_tree_for_output`

All 45 tests pass on Python 3.14 with pytest.

**Benefit:** Catches regressions automatically. Several of the bugs fixed above would have been caught immediately by these tests.

---

### 3. Exponential backoff with jitter for API retries

Replaced fixed `time.sleep(1)` in all three API functions (`ChatGPT_API`, `ChatGPT_API_with_finish_reason`, `ChatGPT_API_async`) with `_retry_delay(attempt)` that computes:

```
delay = min(base * 2^attempt, max_delay) + random(0, 1)
```

Delays: ~1.5s, ~2.7s, ~4.8s, ~8.5s, ... capped at ~60s.

**Benefit:** Better resilience to API rate limits. Avoids hammering the API with fixed 1-second retries under load.

---

### 4. Progress reporting for long-running processing

Added an optional `progress_callback(stage, message, **kwargs)` parameter to `page_index()` and to `config.yaml`. When provided, it is called at key stages:

| Stage | When |
|-------|------|
| `pages_loaded` | After PDF page extraction |
| `toc_detection` | Starting TOC detection |
| `verify_toc` | Verifying section positions |
| `build_tree` | Building the document tree |
| `node_ids` | Assigning node IDs |
| `summaries` | Generating node summaries |
| `description` | Generating document description |

Usage: `page_index(doc, progress_callback=lambda stage, msg, **kw: print(f"[{stage}] {msg}"))`

**Benefit:** Users processing large PDFs (which can take minutes with dozens of LLM calls) get visibility into progress.

---

### 5. Optional LLM response caching

Added in-memory response caching keyed on `(model, sha256(prompt + chat_history))`. Enable via:
- `page_index(doc, response_cache=True)` or
- `response_cache: true` in `config.yaml`

Cache is automatically cleaned up after `page_index_main()` returns. Also exposed `set_response_cache(True/False)` for manual control.

**Benefit:** Avoids redundant API calls during retries or re-processing of the same document. Reduces cost and latency during development.

---

### 6. `doc_description` works independently of node summaries

**Problem:** In `page_index_main()`, the `if_add_doc_description` check was nested inside the `if_add_node_summary == 'yes'` block. Setting `if_add_doc_description='yes'` with `if_add_node_summary='no'` would silently skip description generation.

**Fix:** Moved `if_add_doc_description` to a separate top-level check after the summary block. The result dict is now built first, then description is conditionally added.

**Benefit:** Users can generate document descriptions without the cost of per-node summaries.

---

## Files Changed

| File | Changes |
|------|---------|
| `pageindex/utils.py` | Bug fixes 1, 3, 4, 6, 7, 9, 11; Features 3, 5 |
| `pageindex/page_index.py` | Bug fixes 2, 5, 8, 11, 12; Features 4, 6 |
| `pageindex/page_index_md.py` | Bug fix 10 |
| `run_pageindex.py` | Feature 1 (added `main()` wrapper) |
| `pageindex/config.yaml` | Features 4, 5 (new config keys) |
| `pyproject.toml` | Feature 1 (new file) |
| `tests/__init__.py` | Feature 2 (new file) |
| `tests/test_utils.py` | Feature 2 (new file, 36 tests) |
| `tests/test_page_index_md.py` | Feature 2 (new file, 9 tests) |

## Test Results

```
45 passed, 0 failed (Python 3.14.0, pytest 9.0.2)
```

All imports verified, CLI `--help` verified, new features (backoff, caching, config) independently verified.
6 changes: 5 additions & 1 deletion pageindex/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -5,4 +5,8 @@ max_token_num_each_node: 20000
if_add_node_id: "yes"
if_add_node_summary: "yes"
if_add_doc_description: "no"
if_add_node_text: "no"
if_add_node_text: "no"
# Optional: progress_callback(stage, message, **kwargs) for long-running PDF processing
progress_callback: null
# Optional: set to true to cache LLM responses in memory (faster re-runs, lower API cost)
response_cache: false
55 changes: 39 additions & 16 deletions pageindex/page_index.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,10 +5,19 @@
import random
import re
from .utils import *
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def _progress_report(opt, stage, message, **kwargs):
"""Call optional progress_callback(stage, message, **kwargs) if set on opt."""
callback = getattr(opt, "progress_callback", None)
if callable(callback):
try:
callback(stage, message, **kwargs)
except Exception:
pass


################### check title in page #########################################################
async def check_title_appearance(item, page_list, start_index=1, model=None):
title=item['title']
Expand Down Expand Up @@ -677,11 +686,11 @@ def process_none_page_numbers(toc_items, page_list, start_index=1, model=None):
continue

item_copy = copy.deepcopy(item)
del item_copy['page']
result = add_page_number_to_toc(page_contents, item_copy, model)
if isinstance(result[0]['physical_index'], str) and result[0]['physical_index'].startswith('<physical_index'):
item_copy.pop('page', None)
result = add_page_number_to_toc(''.join(page_contents), item_copy, model)
if result and isinstance(result[0].get('physical_index'), str) and result[0]['physical_index'].startswith('<physical_index'):
item['physical_index'] = int(result[0]['physical_index'].split('_')[-1].rstrip('>').strip())
del item['page']
item.pop('page', None)

return toc_items

Expand Down Expand Up @@ -1022,6 +1031,7 @@ async def process_large_node_recursively(node, page_list, opt=None, logger=None)
return node

async def tree_parser(page_list, opt, doc=None, logger=None):
_progress_report(opt, "toc_detection", "Detecting table of contents...")
check_toc_result = check_toc(page_list, opt)
logger.info(check_toc_result)

Expand All @@ -1042,12 +1052,14 @@ async def tree_parser(page_list, opt, doc=None, logger=None):
opt=opt,
logger=logger)

_progress_report(opt, "verify_toc", "Verifying section positions...")
toc_with_page_number = add_preface_if_needed(toc_with_page_number)
toc_with_page_number = await check_title_appearance_in_start_concurrent(toc_with_page_number, page_list, model=opt.model, logger=logger)

# Filter out items with None physical_index before post_processings
valid_toc_items = [item for item in toc_with_page_number if item.get('physical_index') is not None]

_progress_report(opt, "build_tree", "Building document tree...", total_pages=len(page_list))
toc_tree = post_processing(valid_toc_items, len(page_list))
tasks = [
process_large_node_recursively(node, page_list, opt, logger=logger)
Expand All @@ -1059,6 +1071,16 @@ async def tree_parser(page_list, opt, doc=None, logger=None):


def page_index_main(doc, opt=None):
from pageindex.utils import set_response_cache
if opt is not None and getattr(opt, "response_cache", False):
set_response_cache(True)
try:
return _page_index_main_impl(doc, opt)
finally:
set_response_cache(False)


def _page_index_main_impl(doc, opt=None):
logger = JsonLogger(doc)

is_valid_pdf = (
Expand All @@ -1069,42 +1091,43 @@ def page_index_main(doc, opt=None):
raise ValueError("Unsupported input type. Expected a PDF file path or BytesIO object.")

print('Parsing PDF...')
_progress_report(opt, "pages_loaded", "Extracting pages...")
page_list = get_page_tokens(doc)
_progress_report(opt, "pages_loaded", f"Extracted {len(page_list)} pages", total_pages=len(page_list))

logger.info({'total_page_number': len(page_list)})
logger.info({'total_token': sum([page[1] for page in page_list])})

async def page_index_builder():
structure = await tree_parser(page_list, opt, doc=doc, logger=logger)
_progress_report(opt, "node_ids", "Assigning node IDs...")
if opt.if_add_node_id == 'yes':
write_node_id(structure)
if opt.if_add_node_text == 'yes':
add_node_text(structure, page_list)
if opt.if_add_node_summary == 'yes':
_progress_report(opt, "summaries", "Generating node summaries...")
if opt.if_add_node_text == 'no':
add_node_text(structure, page_list)
await generate_summaries_for_structure(structure, model=opt.model)
if opt.if_add_node_text == 'no':
remove_structure_text(structure)
if opt.if_add_doc_description == 'yes':
# Create a clean structure without unnecessary fields for description generation
clean_structure = create_clean_structure_for_description(structure)
doc_description = generate_doc_description(clean_structure, model=opt.model)
return {
'doc_name': get_pdf_name(doc),
'doc_description': doc_description,
'structure': structure,
}
return {
result = {
'doc_name': get_pdf_name(doc),
'structure': structure,
}
if opt.if_add_doc_description == 'yes':
_progress_report(opt, "description", "Generating document description...")
clean_structure = create_clean_structure_for_description(structure)
result['doc_description'] = generate_doc_description(clean_structure, model=opt.model)
return result

return asyncio.run(page_index_builder())


def page_index(doc, model=None, toc_check_page_num=None, max_page_num_each_node=None, max_token_num_each_node=None,
if_add_node_id=None, if_add_node_summary=None, if_add_doc_description=None, if_add_node_text=None):
if_add_node_id=None, if_add_node_summary=None, if_add_doc_description=None, if_add_node_text=None,
progress_callback=None, response_cache=None):

user_opt = {
arg: value for arg, value in locals().items()
Expand Down
2 changes: 1 addition & 1 deletion pageindex/page_index_md.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
import os
try:
from .utils import *
except:
except ImportError:
from utils import *

async def get_node_summary(node, summary_token_threshold=200, model=None):
Expand Down
Loading