feat: add lance_write_fragments for local fragment creation #5
Conversation
@vicaya do you mind checking this PR to see if it fits your use case?
src/fragment_writer.rs
}
};

let json = serde_json::to_string(&fragments).map_err(|e| lance_core::Error::Internal {
Serializing to JSON blows up memory usage many-fold for large arrays of fixed-size fp64 arrays. It looks like the only reason to do this is for testing?
I think you would need some fragment metadata for the sidecar. Or is that not needed, and the sidecar can just scan the whole directory and commit the fragments in a new transaction?
Yes, the sidecar can scan for *.lance files (each atomically renamed from a tmp file without the .lance suffix) and upload and/or commit the fragments. It can also use inotify as a more timely trigger. The scan is needed anyway to recover from a process/node crash or restart.
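The write-then-rename and directory-scan idea described above can be sketched with the Rust standard library alone. This is only an illustration under assumptions: the function names `publish_fragment` and `scan_finished`, the `.tmp` suffix, and the flat directory layout are hypothetical and are not taken from this PR or from Lance.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Write `bytes` to a temporary file that lacks the `.lance` suffix, then
/// rename it into place. A rename within one directory is atomic on POSIX
/// filesystems, so a scanner never sees a half-written `.lance` file.
/// (Hypothetical helper; names and the `.tmp` suffix are assumptions.)
pub fn publish_fragment(dir: &Path, stem: &str, bytes: &[u8]) -> io::Result<PathBuf> {
    let tmp = dir.join(format!("{}.tmp", stem));
    let dst = dir.join(format!("{}.lance", stem));
    let mut file = fs::File::create(&tmp)?;
    file.write_all(bytes)?;
    file.sync_all()?; // flush to disk before the file becomes visible
    fs::rename(&tmp, &dst)?;
    Ok(dst)
}

/// Return every finished `*.lance` file directly inside `dir`, sorted by path.
/// Temporary files are skipped because they do not carry the suffix yet.
pub fn scan_finished(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_lance = path.extension().and_then(|ext| ext.to_str()) == Some("lance");
        if path.is_file() && is_lance {
            found.push(path);
        }
    }
    found.sort();
    Ok(found)
}
```

A sidecar could call `scan_finished` once at startup to pick up files left behind by a crash, and then again on each inotify event or timer tick.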
The Fragment struct being serialized here is pure metadata: it contains no actual data values. A typical serialized fragment looks like:
{"id": 123, "files": [{"path": "foobar.lance", "fields": [0], "column_indices": [], "file_major_version": 0, "file_minor_version": 3}], "deletion_file": null, "physical_rows": 1000}
It holds just file paths, field IDs, and row counts, typically a few hundred bytes per fragment. No fp64 arrays or column data are included in the serialization.
The fragment metadata lives in the lance manifest (protobuf), not in the data files themselves, so the JSON return is needed to pass this info to the commit step.
Metadata only is better, but it is still too intrusive to the calling process. I prefer not having to dynamically allocate and deallocate memory, which can fragment the heap and cause issues down the road. This metadata is not needed in normal operation. BTW, I also prefer passing an explicit schema for the fragment and failing fast if needed.
lgtm. Thanks!
BTW, which harness and model are you using for these PRs? Curious :)
Thanks for the review. It's just plain Claude Code, and it works pretty well for me with precise input.
src/fragment_writer.rs
location: snafu::location!(),
})?;

let c_str = CString::new(json).map_err(|e| lance_core::Error::Internal {
src/fragment_writer.rs
location: snafu::location!(),
})?;

Ok(c_str.into_raw())
What if there is an I/O error, caused by a full disk, etc.?
IO errors (disk full, permission denied, etc.) from execute_uncommitted_stream are already propagated here: the ? on line 114 returns the lance_core::Error::IO variant, which flows back through ffi_try! → set_lance_error() → maps to LanceErrorCode::IoError in thread-local storage, and the function returns NULL.
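The propagation path described above (fallible call, thread-local error slot, NULL return) follows a common FFI pattern. Below is a minimal standard-library sketch of that pattern, not the PR's actual code: `ErrorCode`, `set_last_error`, `write_fragments_sketch`, and the simulated failure are all hypothetical stand-ins for `LanceErrorCode`, `set_lance_error()`, and `lance_write_fragments`.

```rust
use std::cell::RefCell;
use std::ffi::CString;
use std::os::raw::c_char;
use std::ptr;

/// Hypothetical error codes; the real `LanceErrorCode` enum may differ.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ErrorCode {
    Success = 0,
    InvalidArgument = 1,
    Io = 2,
}

thread_local! {
    // One error slot per thread, so concurrent callers do not clobber each other.
    static LAST_ERROR: RefCell<(ErrorCode, String)> =
        RefCell::new((ErrorCode::Success, String::new()));
}

fn set_last_error(code: ErrorCode, message: String) {
    LAST_ERROR.with(|slot| *slot.borrow_mut() = (code, message));
}

pub fn last_error_code() -> ErrorCode {
    LAST_ERROR.with(|slot| slot.borrow().0)
}

pub fn last_error_message() -> String {
    LAST_ERROR.with(|slot| slot.borrow().1.clone())
}

/// Stand-in for the fallible write; fails with an I/O error on request.
fn write_all(simulate_disk_full: bool) -> Result<String, std::io::Error> {
    if simulate_disk_full {
        return Err(std::io::Error::new(std::io::ErrorKind::Other, "disk full"));
    }
    Ok("[]".to_string())
}

/// FFI-style entry point: an owned C string on success, NULL on failure.
pub fn write_fragments_sketch(simulate_disk_full: bool) -> *mut c_char {
    match write_all(simulate_disk_full) {
        Ok(json) => match CString::new(json) {
            Ok(c_str) => {
                set_last_error(ErrorCode::Success, String::new());
                c_str.into_raw()
            }
            Err(e) => {
                // The string contained an interior NUL byte.
                set_last_error(ErrorCode::InvalidArgument, e.to_string());
                ptr::null_mut()
            }
        },
        Err(e) => {
            set_last_error(ErrorCode::Io, e.to_string());
            ptr::null_mut()
        }
    }
}

/// Reclaim a string previously returned by `write_fragments_sketch`.
pub unsafe fn free_string(s: *mut c_char) {
    if !s.is_null() {
        unsafe { drop(CString::from_raw(s)) };
    }
}
```

Because the slot is thread-local, the caller has to query the error on the same thread that made the failing call.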
The C caller detects this via the documented pattern:
const char* json = lance_write_fragments(uri, &stream, NULL);
if (!json) {
// lance_last_error_code() == LanceErrorCode::IoError
// lance_last_error_message() == "disk full" (or similar OS message)
}
Add cross-language binding guidelines, naming conventions, error handling, testing, and dependency management standards adapted from the main Lance project.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1e57f28 to c554925
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Exposes a C/C++ API for writing Arrow data to Lance fragment files without committing them to a dataset manifest. Enables efficient ingestion from embedded/robotics C++ codebases where a separate Rust finalizer later commits the fragments to a remote data lake.
- lance_write_fragments(uri, stream, storage_opts) writes an ArrowArrayStream to fragment files and returns a JSON array of fragment metadata (freed with lance_free_string)
- Rust finalizer deserializes the JSON and commits via CommitBuilder
- C++ wrapper lance::write_fragments() in lance.hpp
- Two new integration tests: round-trip and null-URI error path
- No dynamic allocation returned to C++ caller; returns 0/-1 only
- Fragment metadata written as JSON sidecar to <uri>/_fragments/<uuid>.json
- Requires explicit ArrowSchema* parameter for fail-fast schema validation
- Rust finalizer reads sidecar files to commit via CommitBuilder
- Three tests: round-trip with sidecar, null-args error, schema mismatch
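To illustrate the "explicit schema, fail fast, status code only" design in the commit above, here is a small sketch. It assumes a simplified `FieldSpec` type instead of the real `ArrowSchema*` from the Arrow C Data Interface; `validate_schema` and `write_with_schema` are hypothetical names, and the actual write is passed in as a closure.

```rust
/// Hypothetical, simplified stand-in for one Arrow field.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FieldSpec {
    pub name: String,
    pub data_type: String,
    pub nullable: bool,
}

impl FieldSpec {
    pub fn new(name: &str, data_type: &str, nullable: bool) -> Self {
        FieldSpec {
            name: name.to_string(),
            data_type: data_type.to_string(),
            nullable,
        }
    }
}

/// Compare the caller's declared schema with the stream's schema, field by field.
pub fn validate_schema(expected: &[FieldSpec], actual: &[FieldSpec]) -> Result<(), String> {
    if expected.len() != actual.len() {
        return Err(format!(
            "expected {} fields, stream has {}",
            expected.len(),
            actual.len()
        ));
    }
    for (index, (want, got)) in expected.iter().zip(actual.iter()).enumerate() {
        if want != got {
            return Err(format!(
                "field {} mismatch: expected {:?}, got {:?}",
                index, want, got
            ));
        }
    }
    Ok(())
}

/// Status-code entry point: 0 on success, -1 on failure. Nothing is allocated
/// for the caller to free, and `write` runs only after the schema check passes.
pub fn write_with_schema<F>(expected: &[FieldSpec], actual: &[FieldSpec], write: F) -> i32
where
    F: FnOnce() -> Result<(), String>,
{
    if validate_schema(expected, actual).is_err() {
        return -1; // fail fast: no bytes have been written yet
    }
    match write() {
        Ok(()) => 0,
        Err(_) => -1,
    }
}
```

Checking the schema before the first write means a mismatch leaves no partial fragment files behind.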
- Remove sidecar JSON write; Rust finalizer reconstructs Fragment metadata from .lance file footers instead
- Remove serde_json, object_store, uuid dependencies (no longer needed)
- Add robotics/embedded pipeline context to doc comments in both Rust source and C/C++ headers
- Tests verify data files are written under data/
Simulates the full ingestion pipeline:
1. C++ edge device writes sensor data via lance_write_fragments
2. Rust finalizer scans .lance files, reconstructs Fragment metadata from file footers (schema with field IDs, row counts, format version)
3. Commits fragments into a dataset via CommitBuilder
4. Verifies the committed dataset is readable with correct row count
c554925 to 4be919b
Summary
- lance_write_fragments(uri, stream, storage_opts): a C/C++ API that writes an ArrowArrayStream to Lance fragment files at a given URI without committing a dataset manifest
- Returns a JSON array of fragment metadata (freed with lance_free_string()), which a separate Rust finalizer can deserialize and commit via CommitBuilder
- lance::write_fragments() C++ RAII wrapper in lance.hpp
Motivation
Enables efficient data ingestion from embedded/robotics C++ codebases (e.g. sensor pipelines) with minimal changes. The C++ process writes fragments locally; a separate Rust process finalizes them into a remote data lake:
Test plan
- cargo test: all 42 integration tests pass
- cargo clippy --all-targets -- -D warnings: no warnings
- test_write_fragments_returns_json: verifies JSON round-trips to Vec<Fragment> with correct row counts
- test_write_fragments_null_uri_returns_null: verifies null safety and error reporting
Test Plan
Issues