feat: add lance_write_fragments for local fragment creation by jja725 · Pull Request #5 · lance-format/lance-c

jja725 · 2026-04-03T23:40:27Z

Summary

Adds lance_write_fragments(uri, stream, storage_opts) — a C/C++ API that writes an ArrowArrayStream to Lance fragment files at a given URI without committing a dataset manifest
Returns a JSON array of fragment metadata strings (freed with the existing lance_free_string()), which a separate Rust finalizer can deserialize and commit via CommitBuilder
Adds lance::write_fragments() C++ RAII wrapper in lance.hpp
Two new integration tests: round-trip write + JSON parse, and null-URI error path

Motivation

Enables efficient data ingestion from embedded/robotics C++ codebases (e.g. sensor pipelines) with minimal changes. The C++ process writes fragments locally; a separate Rust process finalizes them into a remote data lake:

// C++ robot/sensor process
ArrowArrayStream stream = ...;
const char* json = lance::write_fragments("file:///staging/robot.lance", &stream);
save_to_disk("fragments.json", json);
lance_free_string(json);

// Rust finalizer
let frags: Vec<Fragment> = serde_json::from_str(&json)?;
let txn = Transaction::new(0, Operation::Append { fragments: frags }, None);
CommitBuilder::new("s3://datalake/robot.lance").execute(txn).await?;
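
The lance_free_string(json) call in the C++ snippet implies that the string buffer is owned by the Rust side until it is handed back. A minimal sketch of that ownership handoff follows, assuming the usual CString::into_raw / CString::from_raw pairing; string_to_c and free_c_string are hypothetical names for illustration, not the functions in this PR.

```rust
use std::ffi::{c_char, CString};

/// Hypothetical producer: hands a heap-allocated, NUL-terminated string to C.
/// Returns NULL if the input contains an interior NUL byte.
pub fn string_to_c(s: &str) -> *mut c_char {
    match CString::new(s) {
        Ok(c) => c.into_raw(), // ownership moves to the caller
        Err(_) => std::ptr::null_mut(),
    }
}

/// Hypothetical counterpart of a `*_free_string` function.
///
/// # Safety
/// `ptr` must be NULL or a pointer previously returned by `string_to_c`
/// that has not been freed yet.
pub unsafe fn free_c_string(ptr: *mut c_char) {
    if !ptr.is_null() {
        // Rebuild the CString so that Rust's allocator releases the buffer.
        drop(unsafe { CString::from_raw(ptr) });
    }
}
```

The caller must never release such a pointer with free() or delete, because the buffer came from Rust's allocator.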

Test plan

cargo test — all 42 integration tests pass
cargo clippy --all-targets -- -D warnings — no warnings
test_write_fragments_returns_json — verifies JSON round-trips to Vec<Fragment> with correct row counts
test_write_fragments_null_uri_returns_null — verifies null safety and error reporting

jja725 · 2026-04-03T23:50:20Z

@vicaya do you mind checking this PR to see if it fits your use case?

vicaya

Thanks for working on this @jja725. Really appreciate your contributions to the ecosystem!

My main concern here is extra resource usage that's unnecessary for data logging/ingestion use cases.

vicaya · 2026-04-04T01:39:07Z

src/fragment_writer.rs

+        }
+    };
+
+    let json = serde_json::to_string(&fragments).map_err(|e| lance_core::Error::Internal {


Serializing to JSON seriously blows up memory usage, many-fold, for large arrays of fixed-size fp64 arrays. It looks like the only reason to do this is for testing?

I think you would need some fragment metadata for the sidecar. Or is that not needed, and the sidecar can just scan the whole directory and commit the fragments in a new transaction?

Yes, the sidecar can scan for *.lance files (each atomically renamed from a tmp file without the .lance suffix) and upload and/or commit the fragments. It can also use inotify as a more timely trigger. The scan is needed anyway to recover from process/node crashes and restarts.
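
The rename-then-scan scheme described above can be sketched with the Rust standard library alone. This is an illustration under assumptions, not code from this PR: publish_atomically, scan_completed, and the .tmp suffix are hypothetical, and the atomicity relies on the temporary file and the final file living in the same directory on a POSIX filesystem.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Writer side: write under a temporary name, then rename into place.
/// The `.lance` name only ever refers to a fully written file.
pub fn publish_atomically(dir: &Path, stem: &str, bytes: &[u8]) -> io::Result<PathBuf> {
    let tmp = dir.join(format!("{stem}.tmp"));
    let dst = dir.join(format!("{stem}.lance"));
    let mut file = fs::File::create(&tmp)?;
    file.write_all(bytes)?;
    file.sync_all()?; // flush data to disk before the file becomes visible
    fs::rename(&tmp, &dst)?;
    Ok(dst)
}

/// Sidecar side: list completed fragment files, ignoring temporary files.
pub fn scan_completed(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_lance = path.extension().and_then(|e| e.to_str()) == Some("lance");
        if path.is_file() && is_lance {
            found.push(path);
        }
    }
    found.sort(); // deterministic order for the commit step
    Ok(found)
}
```

Running the same scan at startup covers the crash/restart case; an inotify watch would only shorten the delay between a rename and the next commit.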

The Fragment struct being serialized here is pure metadata — it contains no actual data values. A typical serialized fragment looks like:

{"id": 123, "files": [{"path": "foobar.lance", "fields": [0], "column_indices": [], "file_major_version": 0, "file_minor_version": 3}], "deletion_file": null, "physical_rows": 1000}

Just file paths, field IDs, and row counts — typically a few hundred bytes per fragment. No fp64 arrays or column data are included in the serialization.

The fragment metadata lives in the lance manifest (protobuf), not in the data files themselves, so the JSON return is needed to pass this info to the commit step.
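
To put the size argument in numbers, the sketch below compares the serialized fragment quoted above with the raw size of the data it describes. The list width of 128 used in the comparison is an assumption chosen for illustration; the thread does not state the actual array width.

```rust
/// The serialized fragment quoted above, copied verbatim.
pub const FRAGMENT_JSON: &str = r#"{"id": 123, "files": [{"path": "foobar.lance", "fields": [0], "column_indices": [], "file_major_version": 0, "file_minor_version": 3}], "deletion_file": null, "physical_rows": 1000}"#;

/// Raw size in bytes of `rows` fixed-size lists holding `width` f64 values each.
pub fn raw_f64_bytes(rows: usize, width: usize) -> usize {
    rows * width * std::mem::size_of::<f64>()
}
```

With 1000 rows and an assumed width of 128, raw_f64_bytes(1000, 128) is 1,024,000 bytes, while FRAGMENT_JSON is a couple of hundred bytes, so the metadata is thousands of times smaller than the column data.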

Metadata-only is better, but it is still too intrusive to the calling process. I prefer not to dynamically allocate and deallocate memory, since that can fragment the heap and cause issues down the road. This metadata is not needed in normal operations. BTW, I also prefer passing an explicit schema for the fragment and failing fast if needed.

lgtm. Thanks!

BTW, which harness and model are you using for these PRs? Curious :)

Thanks for the review. It's just plain Claude Code, and it works pretty well for me with precise input.

vicaya · 2026-04-04T01:39:41Z

src/fragment_writer.rs

+        location: snafu::location!(),
+    })?;
+
+    let c_str = CString::new(json).map_err(|e| lance_core::Error::Internal {


This blows up memory even more.

vicaya · 2026-04-04T01:45:31Z

src/fragment_writer.rs

+        location: snafu::location!(),
+    })?;
+
+    Ok(c_str.into_raw())


What if there is an IO error, caused by a full disk, etc.?

IO errors (disk full, permission denied, etc.) from execute_uncommitted_stream are already propagated here: the ? on line 114 returns the lance_core::Error::IO variant, which flows back through ffi_try! → set_lance_error() → maps to LanceErrorCode::IoError in thread-local storage, and the function returns NULL.

The C caller detects this via the documented pattern:

const char* json = lance_write_fragments(uri, &stream, NULL);
if (!json) {
    // lance_last_error_code() == LanceErrorCode::IoError
    // lance_last_error_message() == "disk full" (or similar OS message)
}
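
The error flow described above (a failing call records a code and message in thread-local storage, then returns NULL) can be sketched as follows. All names here (ErrorCode, write_impl, write_or_null) are hypothetical stand-ins; the real ffi_try! macro and set_lance_error() in this repository may differ in detail.

```rust
use std::cell::RefCell;
use std::ffi::{c_char, CString};
use std::io;

#[repr(i32)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ErrorCode {
    Ok = 0,
    IoError = 1,
}

thread_local! {
    // One slot per thread, so concurrent callers never see each other's errors.
    static LAST_ERROR: RefCell<(ErrorCode, String)> =
        RefCell::new((ErrorCode::Ok, String::new()));
}

fn set_last_error(code: ErrorCode, msg: String) {
    LAST_ERROR.with(|e| *e.borrow_mut() = (code, msg));
}

pub fn last_error_code() -> ErrorCode {
    LAST_ERROR.with(|e| e.borrow().0)
}

pub fn last_error_message() -> String {
    LAST_ERROR.with(|e| e.borrow().1.clone())
}

/// Stand-in for the fallible write; `fail` simulates a full disk.
fn write_impl(fail: bool) -> io::Result<String> {
    if fail {
        Err(io::Error::new(io::ErrorKind::Other, "disk full"))
    } else {
        Ok("[]".to_string())
    }
}

/// Returns an owned C string on success, or NULL with the error recorded.
pub fn write_or_null(fail: bool) -> *mut c_char {
    match write_impl(fail) {
        Ok(json) => {
            set_last_error(ErrorCode::Ok, String::new());
            CString::new(json)
                .map(CString::into_raw)
                .unwrap_or(std::ptr::null_mut())
        }
        Err(e) => {
            set_last_error(ErrorCode::IoError, e.to_string());
            std::ptr::null_mut()
        }
    }
}
```

Because the slot is thread-local, the caller has to read the code and message on the same thread that made the failing call, and before making another call that may overwrite them.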

Exposes a C/C++ API for writing Arrow data to Lance fragment files without committing them to a dataset manifest. Enables efficient ingestion from embedded/robotics C++ codebases where a separate Rust finalizer later commits the fragments to a remote data lake.

- lance_write_fragments(uri, stream, storage_opts) writes an ArrowArrayStream to fragment files and returns a JSON array of fragment metadata (freed with lance_free_string)
- Rust finalizer deserializes the JSON and commits via CommitBuilder
- C++ wrapper lance::write_fragments() in lance.hpp
- Two new integration tests: round-trip and null-URI error path

- No dynamic allocation returned to the C++ caller; returns 0/-1 only
- Fragment metadata written as a JSON sidecar to <uri>/_fragments/<uuid>.json
- Requires an explicit ArrowSchema* parameter for fail-fast schema validation
- Rust finalizer reads sidecar files to commit via CommitBuilder
- Three tests: round-trip with sidecar, null-args error, schema mismatch
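
The revised contract in this commit (a 0/-1 status code, nothing for the caller to free, and a schema check before any data is written) can be sketched as below. Field is a hypothetical stand-in for an Arrow schema field, and write_with_schema is not the PR's function; the real API takes an ArrowSchema* and an ArrowArrayStream*.

```rust
/// Hypothetical stand-in for one Arrow schema field.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Field {
    pub name: String,
    pub data_type: String,
}

/// Compare the caller's declared schema with the stream's schema.
pub fn validate_schema(expected: &[Field], actual: &[Field]) -> Result<(), String> {
    if expected.len() != actual.len() {
        return Err(format!(
            "expected {} fields, stream has {}",
            expected.len(),
            actual.len()
        ));
    }
    for (i, (e, a)) in expected.iter().zip(actual).enumerate() {
        if e != a {
            return Err(format!(
                "field {i}: expected {}: {}, got {}: {}",
                e.name, e.data_type, a.name, a.data_type
            ));
        }
    }
    Ok(())
}

/// Status-code entry point: 0 on success, -1 on failure, nothing to free.
/// `wrote` stands in for the side effect of writing fragment files.
pub fn write_with_schema(expected: &[Field], actual: &[Field], wrote: &mut bool) -> i32 {
    if validate_schema(expected, actual).is_err() {
        return -1; // fail fast, before touching the stream or the disk
    }
    *wrote = true; // placeholder for the real write
    0
}
```

Checking the schema first means a mismatched producer fails on its first call instead of leaving fragment files that the finalizer later has to reject.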

- Remove sidecar JSON write; the Rust finalizer reconstructs Fragment metadata from .lance file footers instead
- Remove serde_json, object_store, uuid dependencies (no longer needed)
- Add robotics/embedded pipeline context to doc comments in both Rust source and C/C++ headers
- Tests verify data files are written under data/

Simulates the full ingestion pipeline:

1. C++ edge device writes sensor data via lance_write_fragments
2. Rust finalizer scans .lance files, reconstructs Fragment metadata from file footers (schema with field IDs, row counts, format version)
3. Commits fragments into a dataset via CommitBuilder
4. Verifies the committed dataset is readable with correct row count

vicaya suggested changes Apr 4, 2026


docs: add coding standards from upstream Lance AGENTS.md

1208dc7

Add cross-language binding guidelines, naming conventions, error handling, testing, and dependency management standards adapted from the main Lance project. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jja725 force-pushed the feat/fragment-apis branch from 1e57f28 to c554925 Compare April 5, 2026 04:06

jja725 and others added 7 commits April 5, 2026 22:17

docs: add AGENTS.md and CLAUDE.md for coding agent guidance

1edb13a

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

style: apply cargo fmt

85a0a39

fix: rustdoc invalid_rust_codeblocks warning in fragment_writer

4be919b

jja725 force-pushed the feat/fragment-apis branch from c554925 to 4be919b Compare April 6, 2026 05:17

jja725 requested a review from vicaya April 6, 2026 05:17

jja725 merged commit 5771d6f into main Apr 6, 2026

jja725 deleted the feat/fragment-apis branch April 6, 2026 05:18
