06 — SDK Integration

Code is law. Every statement in this document is traced to a source file.

This document covers the three SDK integration surfaces for EdgeParse, plus the Rust-only batch API:

  1. Rust CLI: crates/edgeparse-cli/src/main.rs
  2. Python SDK: sdks/python/edgeparse/ + crates/edgeparse-python/src/lib.rs
  3. Node.js SDK: sdks/node/src/ + crates/edgeparse-node/src/lib.rs
  4. Batch API: crates/edgeparse-core/src/api/batch.rs

Integration surfaces overview

┌─────────────────────────────────────────────────────────┐
│                     edgeparse-core                      │
│  convert(path, config) → PdfDocument                    │
└──────────────────────────┬──────────────────────────────┘
                           │
           ┌───────────────┼────────────────┐
           │               │                │
    ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
    │  edgeparse  │ │  edgeparse  │ │  edgeparse  │
    │    -cli     │ │   -python   │ │    -node    │
    │  (Rust bin) │ │ (PyO3 ext.) │ │(NAPI-RS add)│
    └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
           │               │                │
      CLI flags       sdks/python/     sdks/node/
      (clap)          (maturin wrap)   (TS wrapper)

All three surfaces call the same edgeparse_core::convert() function and the same output renderers. Parameter names differ by convention (Rust: snake_case, Python: snake_case, TypeScript: camelCase) but map to identical ProcessingConfig fields.


1. Rust CLI

Source: crates/edgeparse-cli/src/main.rs
Binary name: edgeparse
Argument parser: clap 4 (derive macro)

Build & install

# Build
cargo build --release
# Binary: target/release/edgeparse

# Install to PATH
cargo install --path crates/edgeparse-cli

Basic usage

edgeparse [OPTIONS] <PDF_FILE>...

At least one <PDF_FILE> is required. Multiple files are processed in sequence.

Complete flag reference

All flags are defined in the Cli struct in main.rs. The mapping to ProcessingConfig fields is performed by build_config().

Core flags

| Flag | Type | Default | ProcessingConfig field | Description |
|---|---|---|---|---|
| <PDF_FILE>... | Vec<PathBuf> | | | Input PDF files (required) |
| -o, --output-dir <DIR> | Option<String> | | output_dir | Output directory; if omitted, writes next to input |
| -f, --format <FMT> | Option<String> | json | formats | Comma-separated output formats (see below) |
| -p, --password <PW> | Option<String> | | password | Password for encrypted PDFs |
| --pages <RANGE> | Option<String> | | pages | Page range, e.g. "1,3,5-7" |
| -q, --quiet | bool | false | quiet | Suppress log output |
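
The --pages value uses a compact range syntax. As a rough illustration of how a string such as "1,3,5-7" can be read, here is a minimal Python sketch. expand_page_range is a hypothetical helper, not part of EdgeParse, and it assumes 1-based, inclusive ranges, which the source files cited here do not confirm.

```python
def expand_page_range(spec: str) -> list[int]:
    """Expand a range string such as "1,3,5-7" into sorted page numbers.

    Hypothetical helper for illustration; assumes 1-based, inclusive ranges.
    """
    pages: set[int] = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas
        if "-" in token:
            start_text, end_text = token.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {token!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(token))
    return sorted(pages)


print(expand_page_range("1,3,5-7"))  # [1, 3, 5, 6, 7]
```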

Format values

--format accepts a comma-separated list. When omitted the default is json.

| Value | OutputFormat variant | Description |
|---|---|---|
| json | Json | Structured JSON with bounding boxes (default) |
| markdown | Markdown | Standard Markdown with GFM tables |
| markdown-with-html | MarkdownWithHtml | Markdown with HTML table fallback |
| markdown-with-images | MarkdownWithImages | Markdown with image references |
| html | Html | Full HTML5 document |
| text | Text | Plain UTF-8 text |
| pdf | Pdf | Not yet implemented; silently skipped |

Multiple formats: --format markdown,json
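
When the CLI is driven from a script, the argument list can be assembled programmatically. The sketch below is one way to do that in Python using only the core flags documented above; build_argv is a hypothetical helper, and it assumes the edgeparse binary is on PATH.

```python
from typing import Optional, Sequence


def build_argv(
    pdf_files: Sequence[str],
    formats: Sequence[str] = ("json",),
    output_dir: Optional[str] = None,
    pages: Optional[str] = None,
    quiet: bool = False,
) -> list[str]:
    """Assemble an argument list for the edgeparse binary (hypothetical helper)."""
    if not pdf_files:
        raise ValueError("at least one PDF file is required")
    # --format takes a single comma-separated value, e.g. "markdown,json".
    argv = ["edgeparse", "--format", ",".join(formats)]
    if output_dir is not None:
        argv += ["--output-dir", output_dir]
    if pages is not None:
        argv += ["--pages", pages]
    if quiet:
        argv.append("--quiet")
    # Input files go last, matching `edgeparse [OPTIONS] <PDF_FILE>...`.
    argv += list(pdf_files)
    return argv
```

The resulting list can be passed to subprocess.run(argv, check=True), which avoids shell quoting issues for file names that contain spaces.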

Layout & extraction flags

| Flag | Type | Default | ProcessingConfig field | Description |
|---|---|---|---|---|
| --reading-order <ALGO> | String | xycut | reading_order | xycut or off |
| --table-method <METHOD> | String | default | table_method | default (ruling lines) or cluster (borderless) |
| --keep-line-breaks | bool | false | keep_line_breaks | Preserve original line breaks |
| --use-struct-tree | bool | false | use_struct_tree | Use tagged PDF structure tree |
| --include-header-footer | bool | false | include_header_footer | Include headers/footers in output |
| --sanitize | bool | false | sanitize | Enable PII sanitisation |
| --replace-invalid-chars <CH> | String | " " | replace_invalid_chars | Replacement for invalid Unicode |

Image flags

| Flag | Type | Default | ProcessingConfig field | Description |
|---|---|---|---|---|
| --image-output <MODE> | String | external | image_output | off, embedded (base64), external (files) |
| --image-format <FMT> | String | png | image_format | png or jpeg |
| --image-dir <DIR> | Option<String> | | image_dir | Directory for extracted image files |

Page separator flags

| Flag | Type | Default | ProcessingConfig field |
|---|---|---|---|
| --markdown-page-separator <STR> | Option<String> | | markdown_page_separator |
| --text-page-separator <STR> | Option<String> | | text_page_separator |
| --html-page-separator <STR> | Option<String> | | html_page_separator |

Content safety flags

| Flag | Type | Default | Description |
|---|---|---|---|
| --content-safety-off <FLAGS> | Option<String> | | Disable filters: all, hidden-text, off-page, tiny, hidden-ocg |

Parsed by FilterConfig::apply_safety_off() in crates/edgeparse-core/src/api/filter.rs.

Hybrid backend flags

| Flag | Type | Default | ProcessingConfig field | Description |
|---|---|---|---|---|
| --hybrid <BACKEND> | String | off | hybrid | off or docling-fast |
| --hybrid-mode <MODE> | String | auto | hybrid_mode | auto or full |
| --hybrid-url <URL> | Option<String> | | hybrid_url | Hybrid service endpoint |
| --hybrid-timeout <MS> | u64 | 30000 | hybrid_timeout | Timeout in milliseconds |
| --hybrid-fallback | bool | false | hybrid_fallback | Fall back on hybrid error |

build_config() mapping

build_config() in main.rs maps every Cli field to its ProcessingConfig counterpart. The only non-trivial mappings are:

cli.format  →  formats: Vec<OutputFormat>   (split on ',', parsed per-token)
cli.table_method  →  "cluster" ⇒ Cluster, else Default
cli.reading_order →  "off" ⇒ Off, else XyCut
cli.image_output  →  "off" ⇒ Off | "embedded" ⇒ Embedded | else External
cli.image_format  →  "jpeg" ⇒ Jpeg | else Png
cli.hybrid        →  "docling-fast" ⇒ DoclingFast | else Off
cli.hybrid_mode   →  "full" ⇒ Full | else Auto
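
The "else" branches in the mapping above mean that only specific strings select a non-default variant. The Python sketch below mimics that pattern to show the practical consequence: if an unrecognised value reaches this mapping, it falls through to the default instead of raising an error. map_with_fallback and the tuple layout are illustrative assumptions, not the code in main.rs, and whether clap rejects unknown values before build_config() runs is not stated here.

```python
def map_with_fallback(value: str, known: dict[str, str], fallback: str) -> str:
    """Return the variant for a recognised string, otherwise the fallback."""
    return known.get(value, fallback)


# Variant names come from the mapping above; the (known, fallback) tuple
# layout is only an illustration, not how main.rs is written.
TABLE_METHOD = ({"cluster": "Cluster"}, "Default")
IMAGE_OUTPUT = ({"off": "Off", "embedded": "Embedded"}, "External")

print(map_with_fallback("cluster", *TABLE_METHOD))   # Cluster
print(map_with_fallback("clutser", *TABLE_METHOD))   # Default (typo is not rejected)
print(map_with_fallback("embedded", *IMAGE_OUTPUT))  # Embedded
```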

Output file naming

write_outputs() writes one file per format:

<output_dir>/<stem>.<ext>

Where <stem> is the input filename without extension and <ext> is:

| Format | Extension |
|---|---|
| json | .json |
| markdown* | .md |
| html | .html |
| text | .txt |
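
To predict where a given run will write its files, the naming rule can be reproduced in a few lines. This is a minimal Python sketch, not the write_outputs() implementation: output_path is a hypothetical helper, it reads "markdown*" as covering every markdown variant, and it follows the --output-dir description in writing next to the input when no directory is given.

```python
from pathlib import Path
from typing import Optional

# Extensions from the table above; markdown variants are handled separately.
EXTENSIONS = {"json": ".json", "html": ".html", "text": ".txt"}


def output_path(input_pdf: str, fmt: str, output_dir: Optional[str] = None) -> Path:
    """Compute <output_dir>/<stem>.<ext> for one input file and one format."""
    src = Path(input_pdf)
    if fmt.startswith("markdown"):
        ext = ".md"  # assumption: all markdown variants share this extension
    else:
        ext = EXTENSIONS[fmt]  # KeyError for formats without a listed extension
    target_dir = Path(output_dir) if output_dir is not None else src.parent
    return target_dir / f"{src.stem}{ext}"
```

For example, output_path("docs/report.pdf", "json") gives docs/report.json, and passing output_dir="out" moves the result to out/report.json.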

2. Python SDK

Package name: edgeparse
Requires: Python 3.9+
Source (pure Python): sdks/python/edgeparse/
Source (Rust extension): crates/edgeparse-python/src/lib.rs
Build tool: maturin ≥ 1.7
Module layout:

sdks/python/
├── pyproject.toml              # maturin build config, project metadata
└── edgeparse/
    ├── __init__.py             # Public API: convert(), convert_file()
    ├── _types.py               # String literal constants (FORMATS, etc.)
    ├── cli.py                  # argparse CLI entry point
    └── _edgeparse.so           # Native extension (built by maturin)

Installation

pip install edgeparse

# Or build from source:
cd sdks/python
pip install maturin
maturin develop --release

edgeparse.convert()

Defined in sdks/python/edgeparse/__init__.py.
Delegates to _edgeparse.convert() (PyO3 native, defined in crates/edgeparse-python/src/lib.rs).

def convert(
    input_path: str | Path,
    *,
    format: str = "markdown",
    pages: str | None = None,
    password: str | None = None,
    reading_order: str = "xycut",
    table_method: str = "default",
    image_output: str = "off",
) -> str

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| input_path | str \| Path | | Path to the PDF file |
| format | str | "markdown" | Output format: "markdown", "json", "html", "text" |
| pages | str \| None | None | Page range, e.g. "1,3,5-7" |
| password | str \| None | None | Password for encrypted PDFs |
| reading_order | str | "xycut" | "xycut" or "off" |
| table_method | str | "default" | "default" or "cluster" |
| image_output | str | "off" | "off", "embedded", or "external" |

Returns: The extracted content as a str.

Raises: RuntimeError (mapped from EdgePdfError via PyO3) on parse failure.

Example:

import edgeparse

# Markdown extraction
md = edgeparse.convert("report.pdf")

# JSON with bounding boxes
json_str = edgeparse.convert("report.pdf", format="json")

# Specific pages, cluster table detection
md = edgeparse.convert(
    "report.pdf",
    format="markdown",
    pages="1-10",
    table_method="cluster",
)

# Password-protected PDF
md = edgeparse.convert("secure.pdf", password="hunter2")
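
Because convert() returns a string and signals parse failures with RuntimeError, callers that want structured data usually wrap it. The sketch below shows one possible wrapper. convert_to_json is a hypothetical helper that takes the convert callable as an argument, so it can be exercised without the native extension installed. It makes no assumption about the JSON schema.

```python
import json
from typing import Any, Callable, Optional


def convert_to_json(
    convert: Callable[..., str],
    pdf_path: str,
    **options: Any,
) -> tuple[Optional[Any], Optional[str]]:
    """Run a convert-style callable and decode its JSON output.

    Returns (data, None) on success or (None, message) on failure.
    """
    try:
        raw = convert(pdf_path, format="json", **options)
    except RuntimeError as exc:  # documented failure type for parse errors
        return None, str(exc)
    return json.loads(raw), None
```

In real use the first argument would be edgeparse.convert, for example data, err = convert_to_json(edgeparse.convert, "report.pdf", pages="1-3").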

edgeparse.convert_file()

Defined in sdks/python/edgeparse/__init__.py.
Delegates to _edgeparse.convert_file() (PyO3 native).

def convert_file(
    input_path: str | Path,
    output_dir: str | Path = "output",
    *,
    format: str = "markdown",
    pages: str | None = None,
    password: str | None = None,
) -> str

Parameters: input_path, format, pages, and password behave as in convert(); output_dir sets the destination directory. Note that reading_order, table_method, and image_output are not exposed here (use convert() for full control).

Returns: Path to the created output file as a str.

Example:

out_path = edgeparse.convert_file(
    "report.pdf",
    output_dir="extracted/",
    format="markdown",
)
print(f"Wrote {out_path}")

PyO3 extension internals

Source: crates/edgeparse-python/src/lib.rs

The PyO3 layer is thin — it maps Python string arguments to ProcessingConfig enum fields and calls edgeparse_core::convert():

#[pyfunction]
fn convert(input_path, format, pages, password,
           reading_order, table_method, image_output)
  → PyResult<String>

Enum mapping:

format "markdown" → OutputFormat::Markdown
format "json"     → OutputFormat::Json
format "html"     → OutputFormat::Html
format "text"     → OutputFormat::Text

reading_order "xycut" → ReadingOrder::XyCut
reading_order "off"   → ReadingOrder::Off

table_method "cluster" → TableMethod::Cluster
table_method "default" → TableMethod::Default

image_output "embedded" → ImageOutput::Embedded
image_output "external" → ImageOutput::External
image_output "off"      → ImageOutput::Off

Python CLI

Source: sdks/python/edgeparse/cli.py
Entry point: edgeparse console script (registered in pyproject.toml)

edgeparse <input.pdf> [<input.pdf> ...] [-o OUTPUT_DIR] [-f FORMAT]
          [--pages RANGE] [-p PASSWORD]

| Flag | Default | Description |
|---|---|---|
| input (positional) | | One or more PDF files |
| -o, --output-dir | output | Output directory |
| -f, --format | markdown | Output format |
| --pages | | Page range |
| -p, --password | | PDF password |

Note: The Python CLI calls convert_file() internally (not convert()), so reading_order, table_method, and image_output are not exposed as CLI flags in the Python wrapper — use the Rust CLI for full flag coverage.

Valid value constants

sdks/python/edgeparse/_types.py exports string literal tuples:

FORMATS        = ("markdown", "json", "html", "text")
READING_ORDERS = ("xycut", "off")
TABLE_METHODS  = ("default", "cluster")
IMAGE_OUTPUTS  = ("off", "embedded", "external")
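
These tuples make it straightforward to validate user input before it reaches the native layer. The following is a minimal sketch of such a check. check_options is a hypothetical helper, and the tuples are repeated locally only so the snippet is self-contained (in real code they would be imported from edgeparse._types).

```python
FORMATS = ("markdown", "json", "html", "text")
READING_ORDERS = ("xycut", "off")
TABLE_METHODS = ("default", "cluster")
IMAGE_OUTPUTS = ("off", "embedded", "external")

_ALLOWED = {
    "format": FORMATS,
    "reading_order": READING_ORDERS,
    "table_method": TABLE_METHODS,
    "image_output": IMAGE_OUTPUTS,
}


def check_options(**options: str) -> None:
    """Raise ValueError for a value outside the documented tuples."""
    for name, value in options.items():
        allowed = _ALLOWED.get(name)
        # Options without a fixed value set (pages, password) are not checked.
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r}; expected one of {allowed}")
```

Calling check_options(table_method="clutser") raises ValueError, so a typo is caught up front.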

3. Node.js SDK

Package name: edgeparse
Requires: Node.js ≥ 18
Source (TypeScript wrapper): sdks/node/src/
Source (Rust addon): crates/edgeparse-node/src/lib.rs
Build tool: NAPI-RS + tsup

Package structure

sdks/node/
├── package.json               # edgeparse, optionalDependencies per platform
├── tsconfig.json
├── src/
│   ├── index.ts               # convert(), version() — public API
│   ├── types.ts               # ConvertOptions interface
│   └── cli.ts                 # node:util parseArgs CLI
├── npm/
│   ├── darwin-arm64/          # platform package stubs
│   ├── darwin-x64/
│   ├── linux-arm64-gnu/
│   ├── linux-x64-gnu/
│   └── win32-x64-msvc/
└── tests/
    └── convert.test.ts        # vitest tests

Platform packages (edgeparse-{platform}) are loaded at runtime by loadNative() in index.ts using process.platform/process.arch as the lookup key.
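
The lookup described above amounts to a table keyed by platform and architecture. The Python sketch below models that idea using the directory names listed under sdks/node/npm/. native_package is a hypothetical name, and the exact key format used by loadNative() in index.ts is an assumption.

```python
# Package suffixes mirror the directories under sdks/node/npm/.
_PACKAGES = {
    ("darwin", "arm64"): "edgeparse-darwin-arm64",
    ("darwin", "x64"): "edgeparse-darwin-x64",
    ("linux", "arm64"): "edgeparse-linux-arm64-gnu",
    ("linux", "x64"): "edgeparse-linux-x64-gnu",
    ("win32", "x64"): "edgeparse-win32-x64-msvc",
}


def native_package(platform: str, arch: str) -> str:
    """Map Node-style platform/arch strings to a platform package name."""
    try:
        return _PACKAGES[(platform, arch)]
    except KeyError:
        # Mirrors the documented behaviour of failing on unsupported platforms.
        raise RuntimeError(f"unsupported platform: {platform}-{arch}") from None
```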

Installation

npm install edgeparse
# or
yarn add edgeparse
# or
pnpm add edgeparse

The correct platform native addon is automatically selected via optionalDependencies in package.json.

convert()

Defined in sdks/node/src/index.ts:

import { convert } from 'edgeparse';

function convert(inputPath: string, options?: ConvertOptions): string

Returns: The extracted content as a string.

Throws: Error on unsupported platform or parse failure.

ConvertOptions

Defined in sdks/node/src/types.ts:

interface ConvertOptions {
  /** Output format: "markdown" | "json" | "html" | "text". Default: "markdown". */
  format?: string;
  /** Page range string, e.g. "1,3,5-7". */
  pages?: string;
  /** Password for encrypted PDFs. */
  password?: string;
  /** Reading order algorithm: "xycut" (default) or "off". */
  readingOrder?: string;
  /** Table detection method: "default" or "cluster". */
  tableMethod?: string;
  /** Image output mode: "off" (default), "embedded", or "external". */
  imageOutput?: string;
}

Note the camelCase naming convention (readingOrder, tableMethod, imageOutput) compared to the Python/Rust snake_case. index.ts maps these by writing an intermediate object with snake_case keys before passing to the native addon:

n.convert(inputPath, options ? {
  format:        options.format,
  pages:         options.pages,
  password:      options.password,
  reading_order: options.readingOrder,
  table_method:  options.tableMethod,
  image_output:  options.imageOutput,
} : undefined);
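
index.ts spells out each key by hand, as shown above. For comparison, the same camelCase to snake_case translation can be written generically. The Python sketch below is only an illustration of the naming rule, and both function names are hypothetical.

```python
import re
from typing import Any, Optional


def to_snake_case(name: str) -> str:
    """Insert "_" before each interior capital letter, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def to_native_options(options: Optional[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Rename camelCase keys to snake_case; pass None through unchanged."""
    if options is None:
        return None
    return {to_snake_case(key): value for key, value in options.items()}


print(to_native_options({"tableMethod": "cluster", "pages": "1-5"}))
# {'table_method': 'cluster', 'pages': '1-5'}
```

The explicit mapping in index.ts has the advantage that unknown keys are dropped rather than forwarded, which a generic rename like this one does not do.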

version()

import { version } from 'edgeparse';

function version(): string

Returns the edgeparse version string from the native addon.

Example usage

import { convert } from 'edgeparse';

// Markdown (default format)
const md = convert('report.pdf');

// JSON with bounding boxes
const json = convert('report.pdf', { format: 'json' });

// Cluster table detection, specific pages
const result = convert('report.pdf', {
  format: 'markdown',
  pages: '1-5',
  tableMethod: 'cluster',
  readingOrder: 'xycut',
});

// Password-protected
const secure = convert('secure.pdf', { password: 'hunter2' });

Node.js CLI

Source: sdks/node/src/cli.ts
Entry point: edgeparse binary (registered in package.json under bin.edgeparse)

npx edgeparse [options] <input.pdf>
# or after install:
edgeparse [options] <input.pdf>

| Flag | Short | Default | Description |
|---|---|---|---|
| --format | -f | markdown | Output format |
| --pages | -p | | Page range |
| --password | | | PDF password |
| --reading-order | | xycut | Reading order algorithm |
| --table-method | | default | Table detection method |
| --image-output | | off | Image output mode |
| --output | -o | stdout | Output file path |
| --version | -v | | Print version and exit |
| --help | -h | | Print help and exit |

Uses node:util.parseArgs. Output goes to stdout by default; use -o <path> to write to a file.

NAPI-RS extension internals

Source: crates/edgeparse-node/src/lib.rs

#[napi]
pub struct ConvertOptions {
  pub format: Option<String>,
  pub pages: Option<String>,
  pub password: Option<String>,
  pub reading_order: Option<String>,
  pub table_method: Option<String>,
  pub image_output: Option<String>,
}

#[napi]
pub fn convert(input_path: String, options: Option<ConvertOptions>)
  -> napi::Result<String>

All fields are Option<String> — missing fields fall back to ProcessingConfig defaults.


4. Batch API (edgeparse-core)

Source: crates/edgeparse-core/src/api/batch.rs

The batch API is a Rust-only library API (not yet exposed through the Python or Node.js SDKs). It provides progress-tracking bulk processing over a list of PDF files.

Key types

BatchRequest

pub struct BatchRequest {
    pub files: Vec<PathBuf>,
    pub config: ProcessingConfig,
    pub output_dir: Option<PathBuf>,
}

Builder pattern:

let req = BatchRequest::new(files, config)
    .with_output_dir(PathBuf::from("output/"));

BatchFileResult

pub struct BatchFileResult {
    pub input_path: PathBuf,
    pub success: bool,
    pub error: Option<String>,
    pub duration: Duration,
    pub page_count: Option<u32>,
}

BatchResult

pub struct BatchResult {
    pub files: Vec<BatchFileResult>,
    pub total_duration: Duration,
}

Helper methods:

| Method | Returns | Description |
|---|---|---|
| success_count() | usize | Number of successfully processed files |
| failure_count() | usize | Number of failed files |
| total_count() | usize | Total file count |
| avg_duration() | Duration | Average per-file processing time |
| summary() | String | Human-readable result summary |

process_batch()

pub fn process_batch<F>(
    request: &BatchRequest,
    process_fn: F,
) -> BatchResult
where
    F: FnMut(&Path, &ProcessingConfig) -> Result<u32, String>

Calls process_fn for each file in request.files sequentially, collecting per-file results and total elapsed time.

Example (Rust):

use edgeparse_core::{convert, api::{batch::*, config::ProcessingConfig}};
use std::path::Path;

// Assumes EdgePdfError implements std::error::Error so `?` can box it.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let files = collect_pdf_files(Path::new("docs/"))?;
    let config = ProcessingConfig::default();
    let req = BatchRequest::new(files, config);

    // The closure returns the page count on success or an error message.
    let result = process_batch(&req, |path, cfg| {
        let doc = convert(path, cfg).map_err(|e| e.to_string())?;
        Ok(doc.number_of_pages)
    });

    println!("{}", result.summary());
    for f in &result.files {
        if !f.success {
            eprintln!("{}: {}", f.input_path.display(), f.error.as_deref().unwrap_or("unknown"));
        }
    }
    Ok(())
}
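
Since the batch API is not exposed through the SDKs, Python callers who want similar bookkeeping have to build it themselves. The sketch below imitates the sequential loop and per-file result collection described above. FileResult, BatchOutcome, and run_batch are hypothetical names, durations are plain float seconds, and the process callable stands in for whatever conversion function is used.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FileResult:
    input_path: str
    success: bool
    error: Optional[str]
    duration: float  # seconds
    page_count: Optional[int]


@dataclass
class BatchOutcome:
    files: list[FileResult] = field(default_factory=list)
    total_duration: float = 0.0

    def success_count(self) -> int:
        return sum(1 for f in self.files if f.success)

    def failure_count(self) -> int:
        return len(self.files) - self.success_count()


def run_batch(paths: list[str], process: Callable[[str], int]) -> BatchOutcome:
    """Call process(path) for each file in order and record the outcome."""
    outcome = BatchOutcome()
    batch_start = time.perf_counter()
    for path in paths:  # strictly sequential, one file at a time
        start = time.perf_counter()
        try:
            pages, error = process(path), None
        except Exception as exc:  # a failure is recorded, not propagated
            pages, error = None, str(exc)
        outcome.files.append(
            FileResult(path, error is None, error, time.perf_counter() - start, pages)
        )
    outcome.total_duration = time.perf_counter() - batch_start
    return outcome
```

A failing file does not stop the loop; it appears in outcome.files with success set to False and the exception message in error.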

File collection helpers

// Non-recursive — PDF files in a single directory
pub fn collect_pdf_files(dir: &Path) -> Result<Vec<PathBuf>, EdgePdfError>

// Recursive — PDF files in all subdirectories
pub fn collect_pdf_files_recursive(dir: &Path) -> Result<Vec<PathBuf>, EdgePdfError>

Both return paths sorted alphabetically.
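
For SDK users who need the same file discovery outside Rust, an equivalent can be written with pathlib. This is a minimal sketch that folds both helpers into one hypothetical function with a recursive switch. Matching the .pdf extension case-insensitively is an assumption, not something confirmed by batch.rs.

```python
from pathlib import Path


def collect_pdf_files(directory: str, recursive: bool = False) -> list[Path]:
    """Return PDF paths under directory, sorted by path."""
    root = Path(directory)
    # rglob("*") walks all subdirectories; iterdir() stays in one directory.
    candidates = root.rglob("*") if recursive else root.iterdir()
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```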


5. Parameter compatibility matrix

| Parameter | Rust CLI flag | Python kwarg | Node.js ConvertOptions |
|---|---|---|---|
| Output format | --format | format= | format |
| Page range | --pages | pages= | pages |
| Password | --password | password= | password |
| Reading order | --reading-order | reading_order= | readingOrder |
| Table method | --table-method | table_method= | tableMethod |
| Image output | --image-output | image_output= | imageOutput |
| Output dir | --output-dir | (convert_file only) | (not exposed) |
| Sanitize | --sanitize | (not exposed) | (not exposed) |
| Struct tree | --use-struct-tree | (not exposed) | (not exposed) |
| Hybrid backend | --hybrid | (not exposed) | (not exposed) |
| Content safety | --content-safety-off | (not exposed) | (not exposed) |

The Python and Node.js SDKs expose the six most-used parameters. Advanced options (hybrid, safety filters, PII sanitisation, tagged PDF) are only available via the Rust CLI or by calling edgeparse_core directly.


6. Default values comparison

| Parameter | Rust CLI default | Python SDK default | Node.js SDK default |
|---|---|---|---|
| format | json | markdown | markdown |
| reading_order | xycut | xycut | xycut |
| table_method | default | default | default |
| image_output | external | off | off |

The Rust CLI defaults format to json (per build_config in main.rs when --format is omitted) and image_output to external. Both SDKs default to markdown and off respectively — optimised for in-memory programmatic use where callers usually want the full text and no image files.
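
When results from the SDK need to match what the CLI would produce, the differing defaults have to be passed explicitly. The sketch below captures the table above as data. cli_parity_kwargs is a hypothetical helper, and it covers only the four parameters listed.

```python
from typing import Any

# Defaults copied from the comparison table above.
SDK_DEFAULTS = {
    "format": "markdown",
    "reading_order": "xycut",
    "table_method": "default",
    "image_output": "off",
}
# The CLI differs from the SDKs in exactly two entries.
CLI_DEFAULTS = {**SDK_DEFAULTS, "format": "json", "image_output": "external"}


def cli_parity_kwargs(**overrides: Any) -> dict[str, Any]:
    """Keyword arguments that make an SDK call start from the CLI defaults."""
    return {**CLI_DEFAULTS, **overrides}
```

A call such as edgeparse.convert("report.pdf", **cli_parity_kwargs(pages="1-3")) would then request JSON output and external images unless overridden.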