feat(sinks): partition iceberg exports by day with conversation sort by philcunliffe · Pull Request #91 · hyparam/hypaware

philcunliffe · 2026-06-09T22:35:51Z

Summary

Lay out @hypaware/format-iceberg exports to suit an archive's job rather than the cache's:

Partition by day(primaryTimestampColumn) — a writer-owned default, derived independently of the cache's cachePartitioning. The cache partitions by conversation_id:identity, which sets an unbounded ~1-file-per-conversation floor that compaction can't beat. Day grain bounds file count by time (~dozens of multi-MB files/day) and prunes on time-range predicates.
Sort each day partition by the dataset's lookup columns (conversation_id-led, from its declared identity columns). A conversation lookup then prunes data files / row groups by conversation_id min/max — preserving lookup speed without the file-count cost of partitioning on it.
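The pruning argument above can be illustrated with a toy model. This is a sketch under stated assumptions, not the icebird implementation: `writeFiles` and `filesToScan` are hypothetical helpers, and "files" here are plain arrays carrying a min/max of conversation_id. It only shows why sorting within a day partition lets a lookup skip most files, while interleaved rows leave every file's range too wide to prune.

```javascript
// Illustrative only: models data-file pruning by conversation_id min/max.
// None of these names come from hypaware or icebird.

/** Split rows into fixed-size "files" and record min/max conversation_id per file. */
function writeFiles(rows, rowsPerFile) {
  const files = [];
  for (let i = 0; i < rows.length; i += rowsPerFile) {
    const chunk = rows.slice(i, i + rowsPerFile);
    const ids = chunk.map((r) => r.conversation_id).sort();
    files.push({ min: ids[0], max: ids[ids.length - 1], rows: chunk });
  }
  return files;
}

/** Files whose [min, max] range could contain the id; all others are pruned. */
function filesToScan(files, id) {
  return files.filter((f) => f.min <= id && id <= f.max);
}

// One day partition: 4 conversations, 3 messages each, arriving interleaved.
const dayRows = [];
for (let seq = 0; seq < 3; seq++) {
  for (const id of ['c1', 'c2', 'c3', 'c4']) {
    dayRows.push({ conversation_id: id, seq });
  }
}

// Same rows, sorted conversation_id-first before being split into files.
const sortedRows = [...dayRows].sort((a, b) => {
  if (a.conversation_id < b.conversation_id) return -1;
  if (a.conversation_id > b.conversation_id) return 1;
  return a.seq - b.seq;
});

const unsortedScan = filesToScan(writeFiles(dayRows, 3), 'c2').length;
const sortedScan = filesToScan(writeFiles(sortedRows, 3), 'c2').length;
console.log({ unsortedScan, sortedScan }); // { unsortedScan: 4, sortedScan: 1 }
```

In this toy case the unsorted layout forces a scan of all four files for one conversation, while the sorted layout touches one, without adding a partition per conversation.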

Rationale, the measurement that drove it, and the abandoned "inherit cachePartitioning" decision it replaces are in LLP 0022.

⚠️ Merge blocker — icebird pin

The clustering (#22) and read-pruning (#20/#21) benefits require a published icebird release containing commit 3edb15b ("Scan pruning and sort-on-write"). That commit is currently on master in the local checkout (~/workspace/icebird) but not yet on npm (latest is 0.8.8).

Publish icebird 3edb15b (e.g. 0.8.9 / 0.9.0)
Bump the package.json pin off 0.8.5 to that version (shared-engine bump — the cache rides the same icebird and gains read-pruning for free)

This is not a code blocker for review/CI: the change degrades gracefully on the committed 0.8.5 pin — partitioning and drift rejection work, and the sort order is recorded in metadata but inert (rows unsorted within files) until the pin moves. All tests and the new smoke pass on 0.8.5.

What's in here

Docs

llp/0022 rewritten to the day-grain + sort decision; xrefs added to llp/0014 (sinks) and llp/0003 (the helper promotion is core surface).

Core

Promote partitionSpecForDeclaration + validatePartitionSpecStability + the declaration type out of src/core/cache/iceberg to a shared src/core/iceberg, re-exported from src/core/index.js. Move + re-export; cache rewired; behavior identical.

format-iceberg

partitioning.js derives the day grain (from primaryTimestampColumn) + sort order (from the dataset's identity columns) per dataset at commit time.
commit.js creates the table with partitionSpec + sortOrder and rejects partition-spec drift on append (iceberg_partition_spec_drift).
table-format.js emits hyp_partition_spec + hyp_sort_order on the commit span.
maintenance.js compaction reframed: available via icebergRewrite, but not run in-daemon and not needed for a day grain (was "blocked by icebird").
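The derivation and drift check described in the format-iceberg list can be sketched roughly as follows. Everything here is an assumption made for illustration: the declaration shape, the `identityColumns` field, the `message_id` column, and the helper names `deriveLayout`, `specKey`, and `assertNoPartitionDrift` are hypothetical and are not the signatures in partitioning.js or commit.js. Only the error code `iceberg_partition_spec_drift` and the column names `message_created_at` and `conversation_id` come from this PR.

```javascript
// Hypothetical shapes; the real declaration type and spec format live in src/core/iceberg.

/** Derive a day partition spec and an identity-led sort order from a dataset declaration. */
function deriveLayout(declaration) {
  const { primaryTimestampColumn, identityColumns = [] } = declaration;
  if (!primaryTimestampColumn) {
    throw new Error('declaration has no primaryTimestampColumn');
  }
  return {
    partitionSpec: [{ source: primaryTimestampColumn, transform: 'day' }],
    sortOrder: identityColumns.map((column) => ({ column, direction: 'asc' })),
  };
}

/** Canonical string form so two specs can be compared field by field. */
function specKey(spec) {
  return spec.map((f) => `${f.transform}(${f.source})`).join(',');
}

/** Reject an append whose derived spec differs from the table's existing spec. */
function assertNoPartitionDrift(existingSpec, derivedSpec) {
  const existing = specKey(existingSpec);
  const derived = specKey(derivedSpec);
  if (existing !== derived) {
    const err = new Error(`partition spec drift: table has [${existing}], writer derived [${derived}]`);
    err.code = 'iceberg_partition_spec_drift';
    throw err;
  }
}

// Usage: message_id is a made-up second identity column.
const declaration = {
  primaryTimestampColumn: 'message_created_at',
  identityColumns: ['conversation_id', 'message_id'],
};
const layout = deriveLayout(declaration);
console.log(specKey(layout.partitionSpec)); // day(message_created_at)
```

Appending to a table that was created with, say, an identity(conversation_id) spec would throw with `err.code === 'iceberg_partition_spec_drift'`, while appending with an unchanged derivation passes silently.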

Test plan

npm test — 809 pass / 0 fail (10 new: derivePartitioning derivation + drift, exercised through the real icebird write path via commitBatch over a real local-fs BlobStore — day partition, conversation sort, 2-files-for-2-days, drift rejected, no false drift).
npm run smoke -- iceberg_export_partitioned_local_fs — passes: drives the production sink, asserts day(message_created_at) spec + conversation_id-led sort, 4 rows → 2 day-partition files, and hyp_partition_spec / hyp_sort_order on the span.
/ref-check — 74 @ref annotations, 0 broken.
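The "4 rows → 2 day-partition files" assertion follows from how a day transform buckets timestamps. A minimal sketch, assuming the Iceberg convention that a day value is the number of whole days since 1970-01-01 UTC; the sample rows and the helpers `dayOrdinal` and `groupByDay` are invented for illustration and are not the smoke test's fixtures.

```javascript
const MS_PER_DAY = 86400000;

/** Day transform: whole days since 1970-01-01 UTC. */
function dayOrdinal(isoTimestamp) {
  return Math.floor(Date.parse(isoTimestamp) / MS_PER_DAY);
}

/** Group rows into one bucket per day of the given timestamp column. */
function groupByDay(rows, timestampColumn) {
  const partitions = new Map();
  for (const row of rows) {
    const day = dayOrdinal(row[timestampColumn]);
    if (!partitions.has(day)) partitions.set(day, []);
    partitions.get(day).push(row);
  }
  return partitions;
}

// Invented sample data: two conversations, each spanning a UTC midnight.
const rows = [
  { conversation_id: 'c1', message_created_at: '2026-06-08T09:00:00Z' },
  { conversation_id: 'c2', message_created_at: '2026-06-08T23:59:59Z' },
  { conversation_id: 'c1', message_created_at: '2026-06-09T00:00:00Z' },
  { conversation_id: 'c2', message_created_at: '2026-06-09T12:30:00Z' },
];

const partitions = groupByDay(rows, 'message_created_at');
console.log(partitions.size); // 2
```

Both conversations appear in both partitions, so file count tracks the number of days rather than the number of conversations.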

Note: pre-existing spool bug (not introduced here)

The existing iceberg_export_local_fs smoke fails in this environment because storage.appendRows → flushTable({force:true}) → readRows yields 0 rows. Confirmed to reproduce on pristine master (changes stashed) and on icebird 0.8.5 — i.e. independent of this work and of the icebird version. The new partitioned smoke sidesteps it by populating the cache via the direct write path. Worth a separate investigation.

🤖 Generated with Claude Code

Lay out @hypaware/format-iceberg exports for an archive's job, not the cache's: partition by day(primaryTimestampColumn), a writer-owned default rather than the cache's conversation_id-identity cachePartitioning (which sets an unbounded ~1-file-per-conversation floor compaction can't beat), and sort each day partition by the dataset's lookup columns (conversation_id-led) so a conversation lookup prunes row groups by min/max instead of needing a partition per conversation.

- Promote partitionSpecForDeclaration + validatePartitionSpecStability (and the declaration type) from src/core/cache/iceberg to a shared src/core/iceberg home, re-exported from src/core/index.js: they are core surface consumed by the registry, cache, plugin types, and now the export (LLP 0003).
- format-iceberg derives the day grain + sort order per dataset at commit time, creates the table with both, and rejects partition-spec drift on append (iceberg_partition_spec_drift). Emits hyp_partition_spec and hyp_sort_order on commit spans.
- Reframe maintenance compaction: available via icebergRewrite but not run in-daemon and not needed for a day grain (was "blocked by icebird").

Spec: LLP 0022 (rewritten from the abandoned cache-parity decision); xrefs in LLP 0014 and 0003.

Tests: 10 (derivation + drift through the real icebird write path) plus a passing iceberg_export_partitioned_local_fs smoke asserting the layout and hyp_partition_spec.

Clustering (icebird #22) and read pruning (#20/#21) require a published icebird containing commit 3edb15b; the package.json pin must move off 0.8.5 before those benefits land. The code degrades gracefully on 0.8.5: partitioning and drift work; the sort order is recorded but inert.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

philcunliffe wants to merge 1 commit into master from worktree-iceberg-sink
