feat: add lance_write_fragments for local fragment creation #5

Merged
jja725 merged 8 commits into main from feat/fragment-apis
Apr 6, 2026

Collaborator

jja725 commented Apr 3, 2026

Summary

  • Adds lance_write_fragments(uri, stream, storage_opts) — a C/C++ API that writes an ArrowArrayStream to Lance fragment files at a given URI without committing a dataset manifest
  • Returns a JSON array of fragment metadata strings (freed with the existing lance_free_string()), which a separate Rust finalizer can deserialize and commit via CommitBuilder
  • Adds lance::write_fragments() C++ RAII wrapper in lance.hpp
  • Two new integration tests: round-trip write + JSON parse, and null-URI error path

Motivation

Enables efficient data ingestion from embedded/robotics C++ codebases (e.g. sensor pipelines) with minimal changes. The C++ process writes fragments locally; a separate Rust process finalizes them into a remote data lake:

// C++ robot/sensor process
ArrowArrayStream stream = ...;
const char* json = lance::write_fragments("file:///staging/robot.lance", &stream);
save_to_disk("fragments.json", json);
lance_free_string(json);

// Rust finalizer
let frags: Vec<Fragment> = serde_json::from_str(&json)?;
let txn = Transaction::new(0, Operation::Append { fragments: frags }, None);
CommitBuilder::new("s3://datalake/robot.lance").execute(txn).await?;

Test plan

  • cargo test — all 42 integration tests pass
  • cargo clippy --all-targets -- -D warnings — no warnings
  • test_write_fragments_returns_json — verifies JSON round-trips to Vec<Fragment> with correct row counts
  • test_write_fragments_null_uri_returns_null — verifies null safety and error reporting

Collaborator Author

jja725 commented Apr 3, 2026

@vicaya do you mind checking this PR to see if it fits your use case?

vicaya left a comment

Thanks for working on this @jja725. Really appreciate your contributions to the ecosystem!

My main concern here is extra resource usage that's unnecessary for data logging/ingestion use cases.

}
};

let json = serde_json::to_string(&fragments).map_err(|e| lance_core::Error::Internal {

Serializing to JSON blows up memory usage many-fold for large arrays of fixed-size fp64 arrays. It looks like the only reason to do this is for testing?

Collaborator Author

I think you would need some fragment metadata for the sidecar. Or is that not needed, and the sidecar can just scan the whole directory and commit the fragments in a new transaction?

Yes, the sidecar can scan for *.lance files (each is atomically renamed from a tmp file without the .lance suffix) and upload and/or commit the fragments. It can also use inotify as a more timely trigger. The scan is needed for process/node crash/restart anyway.
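
A minimal sketch of the write-then-rename and scan pattern described in this comment, using only the Rust standard library. The function names, the `.tmp` suffix for in-flight files, and the flat directory layout are assumptions for illustration; they are not taken from this PR or from Lance itself.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Writer side (hypothetical helper): write under a temporary name that does
/// not end in `.lance`, then rename. A rename within one filesystem is atomic
/// on POSIX, so a scanner never observes a half-written `.lance` file.
pub fn publish_fragment(dir: &Path, stem: &str, bytes: &[u8]) -> io::Result<PathBuf> {
    let tmp = dir.join(format!("{stem}.tmp")); // assumed tmp naming
    let dst = dir.join(format!("{stem}.lance"));
    let mut file = fs::File::create(&tmp)?;
    file.write_all(bytes)?;
    file.sync_all()?; // flush to disk before the file becomes visible
    fs::rename(&tmp, &dst)?;
    Ok(dst)
}

/// Sidecar side (hypothetical helper): list only completed `.lance` files.
/// Running this at startup also picks up files left behind by a crash.
pub fn scan_completed(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_file() && path.extension().map_or(false, |ext| ext == "lance") {
            found.push(path);
        }
    }
    found.sort(); // deterministic order for upload/commit
    Ok(found)
}
```

An inotify watch would only shorten the latency between the rename and the upload; the scan stays as the recovery path.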

Collaborator Author

The Fragment struct being serialized here is pure metadata — it contains no actual data values. A typical serialized fragment looks like:

{"id": 123, "files": [{"path": "foobar.lance", "fields": [0], "column_indices": [], "file_major_version": 0, "file_minor_version": 3}], "deletion_file": null, "physical_rows": 1000}

Just file paths, field IDs, and row counts — typically a few hundred bytes per fragment. No fp64 arrays or column data are included in the serialization.

The fragment metadata lives in the lance manifest (protobuf), not in the data files themselves, so the JSON return is needed to pass this info to the commit step.

Metadata-only is better, but still too intrusive to the calling process. I prefer not having to dynamically allocate and deallocate memory, which can fragment the heap and cause issues down the road. This metadata is not needed in normal operation. BTW, I also prefer passing an explicit schema for the fragment and failing fast if it does not match.
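
One way to picture the request is a status-code entry point that validates the declared schema before writing anything and returns nothing the caller has to free. The sketch below is illustrative only: `Field`, `SchemaError`, `check_schema`, and `write_checked` are hypothetical names, and the string type tags stand in for real Arrow types.

```rust
/// Hypothetical, simplified stand-in for an Arrow field (name + type tag).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Field {
    pub name: &'static str,
    pub type_name: &'static str,
}

/// Why a batch was rejected before any bytes were written.
#[derive(Debug, PartialEq, Eq)]
pub enum SchemaError {
    FieldCount { expected: usize, actual: usize },
    FieldMismatch { index: usize },
}

/// Compare the caller-declared schema with the schema of the incoming data.
pub fn check_schema(expected: &[Field], actual: &[Field]) -> Result<(), SchemaError> {
    if expected.len() != actual.len() {
        return Err(SchemaError::FieldCount {
            expected: expected.len(),
            actual: actual.len(),
        });
    }
    // Report the first field whose name or type differs.
    match expected.iter().zip(actual).position(|(e, a)| e != a) {
        Some(index) => Err(SchemaError::FieldMismatch { index }),
        None => Ok(()),
    }
}

/// Status-code entry point: 0 on success, -1 on failure.
/// Nothing is handed back that the caller would have to free.
pub fn write_checked(expected: &[Field], actual: &[Field]) -> i32 {
    match check_schema(expected, actual) {
        Ok(()) => 0, // a real writer would stream the data to disk here
        Err(_) => -1,
    }
}
```

Checking before the first write means a mismatched stream leaves no partial files behind for the sidecar to clean up.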

Collaborator Author

updated

lgtm. Thanks!

BTW, which harness and model are you using for these PRs? Curious :)

Collaborator Author

jja725 Apr 6, 2026

Thanks for the review. It's just plain Claude Code, and it works pretty well for me with precise input.

location: snafu::location!(),
})?;

let c_str = CString::new(json).map_err(|e| lance_core::Error::Internal {

This blows up memory even more.

location: snafu::location!(),
})?;

Ok(c_str.into_raw())

What if there is an I/O error, e.g. caused by a full disk?

Collaborator Author

I/O errors (disk full, permission denied, etc.) from execute_uncommitted_stream are already propagated here: the ? on line 114 returns the lance_core::Error::IO variant, which flows back through ffi_try! → set_lance_error() → LanceErrorCode::IoError in thread-local storage, and the function returns NULL.

The C caller detects this via the documented pattern:

const char* json = lance_write_fragments(uri, &stream, NULL);
if (!json) {
    // lance_last_error_code() == LanceErrorCode::IoError
    // lance_last_error_message() == "disk full" (or similar OS message)
}
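
For readers unfamiliar with the thread-local "last error" pattern this reply relies on, here is a rough standalone sketch in Rust. It is not the lance-c implementation: `ErrorCode`, its numeric values, `ffi_status`, and the accessor names are hypothetical, and it returns a status code rather than a pointer to keep the example self-contained.

```rust
use std::cell::RefCell;
use std::io;

/// Hypothetical error codes; the numeric values are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(i32)]
pub enum ErrorCode {
    Success = 0,
    IoError = 1,
}

thread_local! {
    // Each thread keeps its own last error, so concurrent callers do not race.
    static LAST_ERROR: RefCell<Option<(ErrorCode, String)>> = RefCell::new(None);
}

/// Code of the most recent failure on this thread, or `Success` if none.
pub fn last_error_code() -> ErrorCode {
    LAST_ERROR.with(|e| e.borrow().as_ref().map_or(ErrorCode::Success, |(code, _)| *code))
}

/// Message of the most recent failure on this thread, if any.
pub fn last_error_message() -> Option<String> {
    LAST_ERROR.with(|e| e.borrow().as_ref().map(|(_, msg)| msg.clone()))
}

/// Run a fallible operation at the FFI boundary: on success clear the stored
/// error and return 0; on failure record code and message and return -1.
pub fn ffi_status<F>(op: F) -> i32
where
    F: FnOnce() -> io::Result<()>,
{
    match op() {
        Ok(()) => {
            LAST_ERROR.with(|e| *e.borrow_mut() = None);
            0
        }
        Err(err) => {
            LAST_ERROR.with(|e| *e.borrow_mut() = Some((ErrorCode::IoError, err.to_string())));
            -1
        }
    }
}
```

The caller checks the return value first and only then queries the code and message, on the same thread that made the call.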

Add cross-language binding guidelines, naming conventions, error
handling, testing, and dependency management standards adapted from
the main Lance project.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jja725 force-pushed the feat/fragment-apis branch from 1e57f28 to c554925 April 5, 2026 04:06
jja725 and others added 7 commits April 5, 2026 22:17
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Exposes a C/C++ API for writing Arrow data to Lance fragment files
without committing them to a dataset manifest. Enables efficient
ingestion from embedded/robotics C++ codebases where a separate
Rust finalizer later commits the fragments to a remote data lake.

- lance_write_fragments(uri, stream, storage_opts) writes an
  ArrowArrayStream to fragment files and returns a JSON array of
  fragment metadata (freed with lance_free_string)
- Rust finalizer deserializes the JSON and commits via CommitBuilder
- C++ wrapper lance::write_fragments() in lance.hpp
- Two new integration tests: round-trip and null-URI error path

- No dynamic allocation returned to C++ caller; returns 0/-1 only
- Fragment metadata written as JSON sidecar to <uri>/_fragments/<uuid>.json
- Requires explicit ArrowSchema* parameter for fail-fast schema validation
- Rust finalizer reads sidecar files to commit via CommitBuilder
- Three tests: round-trip with sidecar, null-args error, schema mismatch

- Remove sidecar JSON write; the Rust finalizer reconstructs Fragment
  metadata from .lance file footers instead
- Remove serde_json, object_store, uuid dependencies (no longer needed)
- Add robotics/embedded pipeline context to doc comments in both
  Rust source and C/C++ headers
- Tests verify data files are written under data/

Simulates the full ingestion pipeline:
1. C++ edge device writes sensor data via lance_write_fragments
2. Rust finalizer scans .lance files, reconstructs Fragment metadata
   from file footers (schema with field IDs, row counts, format version)
3. Commits fragments into a dataset via CommitBuilder
4. Verifies the committed dataset is readable with correct row count
jja725 force-pushed the feat/fragment-apis branch from c554925 to 4be919b April 6, 2026 05:17
jja725 requested a review from vicaya April 6, 2026 05:17
jja725 merged commit 5771d6f into main Apr 6, 2026
jja725 deleted the feat/fragment-apis branch April 6, 2026 05:18

2 participants