feat(sinks): partition iceberg exports by day with conversation sort #91

Open
philcunliffe wants to merge 1 commit into master from worktree-iceberg-sink

@philcunliffe

Contributor

Summary

Lay out @hypaware/format-iceberg exports for an archive's job, not the cache's:

  • Partition by day(primaryTimestampColumn) — a writer-owned default, derived independently of the cache's cachePartitioning. The cache partitions by conversation_id:identity, which sets an unbounded ~1-file-per-conversation floor that compaction can't beat. Day grain bounds file count by time (~dozens of multi-MB files/day) and prunes on time-range predicates.
  • Sort each day partition by the dataset's lookup columns (conversation_id-led, from its declared identity columns). A conversation lookup then prunes data files / row groups by conversation_id min/max — preserving lookup speed without the file-count cost of partitioning on it.
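
To make the pruning argument concrete, here is a minimal sketch of why sorting inside a day partition helps a conversation lookup. It uses a toy in-memory "file" with min/max statistics, not icebird's API; `writeFiles`, `filesToScan`, and the sample rows are all hypothetical names and data invented for illustration.

```javascript
// Illustrative only: toy rows and a toy "file" abstraction, not icebird's API.
// Splits rows into fixed-size files and records min/max conversation_id per file.
function writeFiles(rows, rowsPerFile) {
  const files = [];
  for (let i = 0; i < rows.length; i += rowsPerFile) {
    const chunk = rows.slice(i, i + rowsPerFile);
    const ids = chunk.map((r) => r.conversation_id).sort();
    files.push({ min: ids[0], max: ids[ids.length - 1], rows: chunk });
  }
  return files;
}

// A lookup only has to open files whose [min, max] range can contain the id.
function filesToScan(files, conversationId) {
  return files.filter((f) => f.min <= conversationId && conversationId <= f.max);
}

// One day partition with interleaved conversations (arrival order).
const arrival = ['c3', 'c1', 'c4', 'c2', 'c1', 'c3', 'c2', 'c4'].map(
  (conversation_id, seq) => ({ conversation_id, seq }),
);

// The same rows, sorted by the lookup column before writing.
const sorted = [...arrival].sort((a, b) =>
  a.conversation_id < b.conversation_id ? -1 : a.conversation_id > b.conversation_id ? 1 : 0,
);

const unsortedScan = filesToScan(writeFiles(arrival, 2), 'c2').length;
const sortedScan = filesToScan(writeFiles(sorted, 2), 'c2').length;
console.log({ unsortedScan, sortedScan }); // { unsortedScan: 4, sortedScan: 1 }
```

With unsorted rows every file's min/max range straddles `c2`, so nothing can be skipped. After sorting, the ranges stop overlapping and only one file qualifies, which is the effect the sort order is meant to buy without a partition per conversation.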

Rationale, the measurement that drove it, and the abandoned "inherit cachePartitioning" decision it replaces are in LLP 0022.

⚠️ Merge blocker — icebird pin

The clustering (#22) and read-pruning (#20/#21) benefits require a published icebird containing commit 3edb15b ("Scan pruning and sort-on-write"), currently on ~/workspace/icebird master but not yet on npm (latest is 0.8.8).

  • Publish icebird 3edb15b (e.g. 0.8.9 / 0.9.0)
  • Bump the package.json pin off 0.8.5 to that version (shared-engine bump — the cache rides the same icebird and gains read-pruning for free)

This is not a code blocker for review/CI: the change degrades gracefully on the committed 0.8.5 pin — partitioning and drift rejection work, and the sort order is recorded in metadata but inert (rows unsorted within files) until the pin moves. All tests and the new smoke pass on 0.8.5.

What's in here

Docs

  • llp/0022 rewritten to the day-grain + sort decision; xrefs added to llp/0014 (sinks) and llp/0003 (the helper promotion is core surface).

Core

  • Promote partitionSpecForDeclaration + validatePartitionSpecStability + the declaration type out of src/core/cache/iceberg to a shared src/core/iceberg, re-exported from src/core/index.js. Move + re-export; cache rewired; behavior identical.

format-iceberg

  • partitioning.js derives the day grain (from primaryTimestampColumn) + sort order (from the dataset's identity columns) per dataset at commit time.
  • commit.js creates the table with partitionSpec + sortOrder and rejects partition-spec drift on append (iceberg_partition_spec_drift).
  • table-format.js emits hyp_partition_spec + hyp_sort_order on the commit span.
  • maintenance.js compaction reframed: available via icebergRewrite, but not run in-daemon and not needed for a day grain (was "blocked by icebird").
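
As a rough illustration of the derive-then-validate flow described above, the sketch below derives a day-grain spec and an identity-led sort order from a declaration, then rejects an append whose derived spec differs from the table's. The declaration shape, the `identityColumns` field name, and the helpers `deriveLayout` and `assertNoDrift` are assumptions for this example; they are not the signatures of `derivePartitioning` or `validatePartitionSpecStability`.

```javascript
// Sketch only: the declaration shape and helper names are assumptions,
// not the signatures of derivePartitioning or validatePartitionSpecStability.
function deriveLayout(declaration) {
  return {
    partitionSpec: [
      { sourceColumn: declaration.primaryTimestampColumn, transform: 'day' },
    ],
    sortOrder: declaration.identityColumns.map((column) => ({
      column,
      direction: 'asc',
    })),
  };
}

// Two specs match when they have the same fields in the same order.
function specsEqual(a, b) {
  return (
    a.length === b.length &&
    a.every(
      (field, i) =>
        field.sourceColumn === b[i].sourceColumn &&
        field.transform === b[i].transform,
    )
  );
}

// Throws with the error code named in this PR when the specs differ.
function assertNoDrift(existingSpec, derivedSpec) {
  if (!specsEqual(existingSpec, derivedSpec)) {
    const err = new Error('derived partition spec differs from the table spec');
    err.code = 'iceberg_partition_spec_drift';
    throw err;
  }
}

const declaration = {
  primaryTimestampColumn: 'message_created_at',
  identityColumns: ['conversation_id', 'message_id'], // message_id is hypothetical
};
const layout = deriveLayout(declaration);

// Appending with an unchanged spec passes silently (no false drift).
assertNoDrift(layout.partitionSpec, layout.partitionSpec);

// A table created with a cache-style identity spec is rejected.
const cacheStyleSpec = [{ sourceColumn: 'conversation_id', transform: 'identity' }];
let driftCode = null;
try {
  assertNoDrift(cacheStyleSpec, layout.partitionSpec);
} catch (err) {
  driftCode = err.code;
}
console.log(driftCode); // iceberg_partition_spec_drift
```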

Test plan

  • npm test: 809 pass / 0 fail (10 new, covering derivePartitioning derivation + drift, exercised through the real icebird write path via commitBatch over a real local-fs BlobStore: day partition, conversation sort, 2-files-for-2-days, drift rejected, no false drift).
  • npm run smoke -- iceberg_export_partitioned_local_fs passes: drives the production sink, asserts day(message_created_at) spec + conversation_id-led sort, 4 rows → 2 day-partition files, and hyp_partition_spec / hyp_sort_order on the span.
  • /ref-check — 74 @ref annotations, 0 broken.
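
For readers unfamiliar with the day grain, the "4 rows → 2 day-partition files" assertion can be pictured with the sketch below. It assumes the standard Iceberg day transform (whole days since the Unix epoch, in UTC); the timestamps and the `dayOrdinal` helper are invented for illustration and are not taken from the smoke test or from icebird.

```javascript
// Sketch of day-grain bucketing: days since 1970-01-01 UTC.
// Sample timestamps are made up; they are not the smoke test's fixtures.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function dayOrdinal(isoTimestamp) {
  return Math.floor(Date.parse(isoTimestamp) / MS_PER_DAY);
}

const rows = [
  { conversation_id: 'c1', message_created_at: '2024-05-01T09:00:00Z' },
  { conversation_id: 'c2', message_created_at: '2024-05-01T23:59:59Z' },
  { conversation_id: 'c1', message_created_at: '2024-05-02T00:00:00Z' },
  { conversation_id: 'c3', message_created_at: '2024-05-02T12:00:00Z' },
];

// Group rows by day ordinal; each group would become one partition's file(s).
const partitions = new Map();
for (const row of rows) {
  const day = dayOrdinal(row.message_created_at);
  if (!partitions.has(day)) partitions.set(day, []);
  partitions.get(day).push(row);
}

console.log(partitions.size); // 2
```

The number of partitions depends only on how many distinct days the rows span, not on how many conversations they contain, which is why the file count is bounded by time.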

Note: pre-existing spool bug (not introduced here)

The existing iceberg_export_local_fs smoke fails in this environment because storage.appendRows, then flushTable({force:true}), then readRows yields 0 rows. Confirmed to reproduce on pristine master (changes stashed) and on icebird 0.8.5, i.e. independent of this work and of the icebird version. The new partitioned smoke sidesteps it by populating the cache via the direct write path. Worth a separate investigation.

🤖 Generated with Claude Code

Lay out @hypaware/format-iceberg exports for an archive's job, not the
cache's: partition by day(primaryTimestampColumn) — a writer-owned
default, not the cache's conversation_id-identity cachePartitioning,
which sets an unbounded ~1-file-per-conversation floor compaction can't
beat — and sort each day partition by the dataset's lookup columns
(conversation_id-led) so a conversation lookup prunes row groups by
min/max instead of needing a partition per conversation.

- Promote partitionSpecForDeclaration + validatePartitionSpecStability
  (and the declaration type) from src/core/cache/iceberg to a shared
  src/core/iceberg home, re-exported from src/core/index.js: they are
  core surface consumed by the registry, cache, plugin types, and now
  the export (LLP 0003).
- format-iceberg derives the day grain + sort order per dataset at
  commit time, creates the table with both, and rejects partition-spec
  drift on append (iceberg_partition_spec_drift). Emits hyp_partition_spec
  and hyp_sort_order on commit spans.
- Reframe maintenance compaction: available via icebergRewrite but not
  run in-daemon and not needed for a day grain (was "blocked by icebird").

Spec: LLP 0022 (rewritten from the abandoned cache-parity decision);
xrefs in LLP 0014 and 0003. Tests: 10 (derivation + drift through the
real icebird write path) plus a passing iceberg_export_partitioned_local_fs
smoke asserting the layout and hyp_partition_spec.

Clustering (icebird #22) and read pruning (#20/#21) require a published
icebird containing commit 3edb15b; the package.json pin must move off
0.8.5 before those benefits land. The code degrades gracefully on 0.8.5
— partitioning and drift work; the sort order is recorded but inert.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>