feat(sinks): partition iceberg exports by day with conversation sort #91
Open
philcunliffe wants to merge 1 commit into
Lay out @hypaware/format-iceberg exports for an archive's job, not the cache's: partition by day(primaryTimestampColumn), and sort each day partition by the dataset's lookup columns (conversation_id-led) so a conversation lookup prunes row groups by min/max instead of needing a partition per conversation. The day grain is a writer-owned default, not the cache's conversation_id-identity cachePartitioning, which sets an unbounded ~1-file-per-conversation floor that compaction can't beat.

- Promote partitionSpecForDeclaration + validatePartitionSpecStability (and the declaration type) from src/core/cache/iceberg to a shared src/core/iceberg home, re-exported from src/core/index.js: they are core surface consumed by the registry, cache, plugin types, and now the export (LLP 0003).
- format-iceberg derives the day grain + sort order per dataset at commit time, creates the table with both, and rejects partition-spec drift on append (iceberg_partition_spec_drift). Emits hyp_partition_spec and hyp_sort_order on commit spans.
- Reframe maintenance compaction: available via icebergRewrite but not run in-daemon and not needed for a day grain (was "blocked by icebird").

Spec: LLP 0022 (rewritten from the abandoned cache-parity decision); xrefs in LLP 0014 and 0003.

Tests: 10 (derivation + drift through the real icebird write path) plus a passing iceberg_export_partitioned_local_fs smoke asserting the layout and hyp_partition_spec.

Clustering (icebird #22) and read pruning (#20/#21) require a published icebird containing commit 3edb15b; the package.json pin must move off 0.8.5 before those benefits land. The code degrades gracefully on 0.8.5: partitioning and drift work; the sort order is recorded but inert.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
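For reviewers who want a concrete picture of the layout described above, here is a minimal sketch of day partitioning plus a conversation-led sort, and of how min/max stats then narrow a conversation lookup. It is an illustration only: `dayOrdinal`, `layoutByDay`, and `filesForLookup` are hypothetical names, not functions from this PR or from icebird, and the in-memory "files" stand in for real data files.

```javascript
// Illustrative sketch only: none of these helpers are the PR's or icebird's API.
const MS_PER_DAY = 86_400_000;

// Day grain: whole days since 1970-01-01 (UTC), as in Iceberg's day transform.
function dayOrdinal(timestampMs) {
  return Math.floor(timestampMs / MS_PER_DAY);
}

// Group rows into one "file" per day, sort each file by the lookup column,
// and record that column's min/max the way file-level stats would.
function layoutByDay(rows, timestampColumn, sortColumn) {
  const byDay = new Map();
  for (const row of rows) {
    const day = dayOrdinal(row[timestampColumn]);
    if (!byDay.has(day)) byDay.set(day, []);
    byDay.get(day).push(row);
  }
  const files = [];
  for (const [day, dayRows] of byDay) {
    const sorted = [...dayRows].sort((a, b) =>
      a[sortColumn] < b[sortColumn] ? -1 : a[sortColumn] > b[sortColumn] ? 1 : 0);
    files.push({
      day,
      rows: sorted,
      min: sorted[0][sortColumn],
      max: sorted[sorted.length - 1][sortColumn],
    });
  }
  return files.sort((a, b) => a.day - b.day);
}

// A lookup only has to open files whose [min, max] range can contain the key.
function filesForLookup(files, key) {
  return files.filter((f) => f.min <= key && key <= f.max);
}

const rows = [
  { conversation_id: 'c2', message_created_at: Date.UTC(2024, 0, 1, 9) },
  { conversation_id: 'c1', message_created_at: Date.UTC(2024, 0, 1, 23) },
  { conversation_id: 'c9', message_created_at: Date.UTC(2024, 0, 2, 1) },
  { conversation_id: 'c7', message_created_at: Date.UTC(2024, 0, 2, 12) },
];
const files = layoutByDay(rows, 'message_created_at', 'conversation_id');
console.log(files.length); // 2 (4 rows across 2 days)
console.log(filesForLookup(files, 'c7').length); // 1 (only the second day's file)
```

The file count here tracks the number of days rather than the number of conversations, which is the trade the commit message argues for; the sort is what keeps each file's `conversation_id` range narrow enough to prune on.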
Summary
Lay out `@hypaware/format-iceberg` exports for an archive's job, not the cache's:
- Partition by `day(primaryTimestampColumn)`: a writer-owned default, derived independently of the cache's `cachePartitioning`. The cache partitions by `conversation_id:identity`, which sets an unbounded ~1-file-per-conversation floor that compaction can't beat. Day grain bounds file count by time (~dozens of multi-MB files/day) and prunes on time-range predicates.
- Sort each day partition by the dataset's lookup columns (`conversation_id`-led, from its declared identity columns). A conversation lookup then prunes data files / row groups by `conversation_id` min/max, preserving lookup speed without the file-count cost of partitioning on it.

Rationale, the measurement that drove it, and the abandoned "inherit `cachePartitioning`" decision it replaces are in LLP 0022.

The clustering (#22) and read-pruning (#20/#21) benefits require a published icebird containing commit `3edb15b` ("Scan pruning and sort-on-write"), currently on `~/workspace/icebird` master but not yet on npm (latest is `0.8.8`).
- Publish an icebird release containing `3edb15b` (e.g. `0.8.9` / `0.9.0`)
- Move the `package.json` pin off `0.8.5` to that version (shared-engine bump: the cache rides the same icebird and gains read-pruning for free)

This is not a code blocker for review/CI: the change degrades gracefully on the committed `0.8.5` pin. Partitioning and drift rejection work, and the sort order is recorded in metadata but inert (rows unsorted within files) until the pin moves. All tests and the new smoke pass on `0.8.5`.

What's in here
Docs
- `llp/0022` rewritten to the day-grain + sort decision; xrefs added to `llp/0014` (sinks) and `llp/0003` (the helper promotion is core surface).

Core
- Promote `partitionSpecForDeclaration` + `validatePartitionSpecStability` + the declaration type out of `src/core/cache/iceberg` to a shared `src/core/iceberg`, re-exported from `src/core/index.js`. Move + re-export; cache rewired; behavior identical.

format-iceberg
- `partitioning.js` derives the day grain (from `primaryTimestampColumn`) + sort order (from the dataset's identity columns) per dataset at commit time.
- `commit.js` creates the table with `partitionSpec` + `sortOrder` and rejects partition-spec drift on append (`iceberg_partition_spec_drift`).
- `table-format.js` emits `hyp_partition_spec` + `hyp_sort_order` on the commit span.
- `maintenance.js` compaction reframed: available via `icebergRewrite`, but not run in-daemon and not needed for a day grain (was "blocked by icebird").

Test plan
- `npm test`: 809 pass / 0 fail (10 new: `derivePartitioning` derivation + drift, exercised through the real icebird write path via `commitBatch` over a real local-fs BlobStore; covers day partition, conversation sort, 2-files-for-2-days, drift rejected, no false drift).
- `npm run smoke -- iceberg_export_partitioned_local_fs`: passes. Drives the production sink, asserts `day(message_created_at)` spec + `conversation_id`-led sort, 4 rows → 2 day-partition files, and `hyp_partition_spec` / `hyp_sort_order` on the span.
- `/ref-check`: 74 `@ref` annotations, 0 broken.

Note: pre-existing spool bug (not introduced here)
The existing `iceberg_export_local_fs` smoke fails in this environment because `storage.appendRows` → `flushTable({force:true})` → `readRows` yields 0 rows. Confirmed to reproduce on pristine master (changes stashed) and on icebird 0.8.5, i.e. independent of this work and of the icebird version. The new partitioned smoke sidesteps it by populating the cache via the direct write path. Worth a separate investigation.

🤖 Generated with Claude Code
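As a reading aid for the drift rejection listed under format-iceberg (an append is refused when the writer's derived partition spec no longer matches the table's), here is a rough sketch of that kind of check. Only the error code `iceberg_partition_spec_drift` comes from this PR; the spec shape (`sourceColumn`, `transform`) and the helpers `derivePartitionSpec`, `specKey`, and `assertNoPartitionSpecDrift` are assumptions made for illustration, not the signature of `validatePartitionSpecStability`.

```javascript
// Hypothetical shapes throughout; this is not the real validatePartitionSpecStability.

// Assumed: a declaration names its primary timestamp column, and the export
// partitions on the day grain of that column.
function derivePartitionSpec(declaration) {
  return [{ sourceColumn: declaration.primaryTimestampColumn, transform: 'day' }];
}

// Canonical string form of a spec, e.g. "day(message_created_at)".
function specKey(spec) {
  return spec.map((field) => `${field.transform}(${field.sourceColumn})`).join(',');
}

// Refuse to append when the table was created with a different spec.
function assertNoPartitionSpecDrift(existingSpec, derivedSpec) {
  const existing = specKey(existingSpec);
  const derived = specKey(derivedSpec);
  if (existing !== derived) {
    const err = new Error(
      `partition spec drift: table has [${existing}], writer derived [${derived}]`);
    err.code = 'iceberg_partition_spec_drift';
    throw err;
  }
}

const derivedSpec = derivePartitionSpec({ primaryTimestampColumn: 'message_created_at' });

// Same spec as the table: the append proceeds (no throw, so no false drift).
assertNoPartitionSpecDrift(
  [{ sourceColumn: 'message_created_at', transform: 'day' }], derivedSpec);

// Table created with the cache-style identity spec: the append is rejected.
let driftCode = null;
try {
  assertNoPartitionSpecDrift(
    [{ sourceColumn: 'conversation_id', transform: 'identity' }], derivedSpec);
} catch (err) {
  driftCode = err.code;
}
console.log(driftCode); // iceberg_partition_spec_drift
```

Comparing a canonical string keeps the sketch short; a real check would more likely compare field by field so the error can name the first differing field.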