release 0.1.4: catalog discoverability + per-column profiles for all 249 specs #13

Merged
mprammer merged 38 commits into develop from mp/release-0.1.4 on May 17, 2026
@mprammer
Contributor

0.1.4 turns the catalog into something you can navigate without scrolling
docs/v1/datasets.md, and ships per-column profiles for every slug. New
sources.json fields (tags, showcase) and derived parquet metadata
(size buckets, shape traits) drive TUI facets, a /-focused search input,
and matching list_datasets --tag / --showcase / --size / --trait /
--view / --inspect flags. Per-column profiles now ship under
docs/v1/profiles/<slug>.json for all 249 specs, so a fresh clone can
render the TUI Columns pane and run list_datasets --inspect <slug>
without rebuilding anything locally.

Two things a reviewer will want to verify against the diff.
DatasetSpec.family and the --family flag on build / convert are
removed — each slug is invoked by name now (or --all for whole-catalog
passes). And fetch.verify_tls is a new optional field (default true)
for upstreams whose TLS certs have rotted; WDI is the only slug that
currently sets it to false, with payload integrity gated by
expected_sha256. Full details in CHANGELOG.md.

🤖 Generated with Claude Code

mprammer and others added 30 commits May 12, 2026 16:41
Faceted TUI side panel, view presets, per-column profiles, domain tags +
showcase tiers, and matching CLI flags so newcomers can find "interesting
data" without scrolling docs/datasets.md.

* discovery.py: closed vocabs (12 tags, 4 tiers, 5 size buckets, 6 trait
  flags), FilterState dataclass + apply_preset, view-preset registry,
  bucket_for_size, _is_variant_field, sparkline, format_column_line.
  Shared by browse, list_datasets, validate_manifest, docs, profile.
* sources.schema.json: optional tags + showcase on DatasetSpec (default
  [], backwards-compatible). validate_manifest cross-checks the closed
  vocabs and warns on empty tiers only when at least one tier is
  populated (so --strict stays clean on the uncurated manifest).
* profile.schema.json + scripts/pipeline/profile.py: new opt-in stage
  `python -m scripts.pipeline.profile <slug>` writes
  outputs/v1/<slug>/profile.json with per-dtype stats — numeric
  histogram + NDV; string NDV + top-5 (NDV <= 256, skipped on binary);
  bool T/F/null; date/timestamp 10-bucket histogram; list/map length
  stats; struct/variant emit null. Idempotent against parquet sha256.
* docs.py: _snapshot_for_slug now writes size_bucket + shape_traits per
  slug; high_cardinality_present backfilled from profile.json when
  present. _render_curated_picks prepends a tier-grouped block to
  datasets.md.
* list_datasets.py: --tag / --showcase / --size / --trait (with !
  negation) / --view / --inspect / --tags-help / --showcase-help. Reuses
  FilterState.
* browse.py: left-docked facet panel (7 SelectionList groups + tri-state
  RadioSet for shape traits); view-preset bar with keys 1-4; new
  sortable Tags / Showcase / Size columns; "N of M" counts header; right
  detail-pane Columns section with sparklines (lazy-loaded profile.json,
  cached). C clears facets; f focuses panel.
* text_whitespace_parse: per-column int/float type inference so
  uci-seeds emits numerics for the profile stage to histogram against.
* README: replaces "Pick any other dataset" with a Discover subsection.
* Skills: raincloud-profile, raincloud-discover (new); raincloud-list-
  datasets, raincloud-build (updated).

Curation (~249 tags + showcase entries) is a separate follow-up; the
scaffolding degrades gracefully empty (preset views show 0 hits,
curated-picks blocks read "No picks yet").

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
…lumns

Two unblockers for the overnight catalog-wide profile pass:

1. DuckDB was inferring DECIMAL types from the inlined `lo_f` / `hi_f`
   Python repr in the histogram-bucket SQL. For columns with a wide
   value range (e.g. 1694 / 0.27 in the `behavioral-risk-factor-...`
   dataset) the resulting DECIMAL(18,17) overflowed once multiplied
   by N_BUCKETS=10. Force `::DOUBLE` on every literal so the math
   stays in floating-point.

2. Some upstream CSVs (Kaggle exports of pandas DataFrames) ship an
   unnamed index column whose Arrow field has `name == ""`. The
   identifier quoter returned `""`, which DuckDB rejects as a
   zero-length delimited identifier. Skip those fields with a
   placeholder entry in `columns["__unnamed_column__"]`.
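
A minimal sketch of both unblockers, assuming the bucket SQL is assembled as a Python string. The helper names `quote_ident` and `bucket_sql` and the table name `t` are hypothetical; only `N_BUCKETS=10`, the `::DOUBLE` casts, and the skip-on-empty-name behaviour come from the description above.

```python
from typing import Optional

N_BUCKETS = 10  # bucket count named in the commit message


def quote_ident(name: str) -> Optional[str]:
    """Return a double-quoted SQL identifier, or None for an unnamed column."""
    if name == "":
        # An empty name would quote to "", which DuckDB rejects, so the
        # caller is expected to record a placeholder entry instead.
        return None
    return '"' + name.replace('"', '""') + '"'


def bucket_sql(column: str, lo: float, hi: float, table: str = "t") -> Optional[str]:
    """Build a histogram query that keeps every literal in floating point."""
    ident = quote_ident(column)
    if ident is None:
        return None
    lo_f, hi_f = repr(float(lo)), repr(float(hi))
    # Casting each inlined literal avoids DECIMAL inference on the reprs.
    width = f"(({hi_f}::DOUBLE - {lo_f}::DOUBLE) / {N_BUCKETS}::DOUBLE)"
    bucket = (
        f"LEAST(FLOOR(({ident}::DOUBLE - {lo_f}::DOUBLE) / {width}), "
        f"{N_BUCKETS - 1}::DOUBLE)"
    )
    return (
        f"SELECT {bucket} AS bucket, COUNT(*) AS n "
        f"FROM {table} WHERE {ident} IS NOT NULL GROUP BY 1 ORDER BY 1"
    )
```

The sketch does not guard against `lo == hi`; a real profiler would presumably special-case constant columns before building the query.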

Also adds `scripts/pipeline/overnight_profile.py`: driver that walks
every unprofiled slug, runs build → profile → promote_profiles, and
wipes outputs/v1/<slug>/ + outputs/raw_downloads/<slug>/ between runs.
Logs to outputs/_overnight.log (JSONL, per-slug status + secs + error).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Adds tracked profile.json mirrors for every slug the overnight driver
has successfully built + profiled so far. Mix of UCI / Kaggle /
Anthropic / Public BI / 4 already-built giants (finemath-4plus,
hacker-news, websight-v01, wikipedia-en).

Per-slug failures (Public BI bz2 truncations, 30-min timeouts on huge
workloads, a handful of profile-side bugs) will be retried after the
main pass.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Continued overnight pass. DOUBLE-cast fix in profile.py confirmed
working — slugs that hit the DECIMAL(18,17) overflow on the first
attempt now succeed (e.g. behavioral-risk-factor-surveillance-system).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
DuckDB doesn't implement `CAST(time AS TIMESTAMP)`, so the temporal
path crashed on standalone TIME columns (hit on bi-trainsuk1's
v_Section_WTT_Time). Route TIME through the string profile instead
— null_count + NDV + top-K of the rendered HH:MM:SS form is more
useful than a stack trace.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Continued overnight pass; TIME-column fix landed in cc16b9d. Driver
is roughly 50% through the candidate list.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Continued overnight pass.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Continued overnight pass.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Continued overnight pass.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Continued overnight pass.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Continued overnight pass.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
End of the overnight pass: 229 of 249 catalogue slugs now have
tracked profile.json files. Of the 20 remaining:

  17 - opted out (LARGE_BLOCKLIST: multi-hour builds incl. clickbench,
       jsonbench, wikipedia-structured-contents, openorca, laion-400m,
       slimpajama-6b, fineweb-sample-10bt, beir-msmarco, openlibrary-*,
       osm-germany-{nodes,relations}, stackoverflow-{posts,postlinks},
       nypd-complaints, ghcn-daily). Retry with a wider budget.
   2 - fixable retries running now (bi-cmsprovider, bi-trainsuk1).
   1 - UPSTREAM ROT: wdi.

WDI's fetch URL 301-redirects to databankfiles.worldbank.org, which
serves a TLS certificate that has expired (curl confirms). Marked in
the slug's fetch.notes with date + reproduction steps so a human
follow-up can chase an alternate export.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
bi-cmsprovider re-fetched cleanly (the earlier bz2 EOF was a transient
network truncation, not upstream rot). bi-trainsuk1 succeeded on the
TIME-column fallback added in cc16b9d.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Fresh clones need profile JSON for TUI sparklines without rebuilding everything; the mirror is idempotent so re-running is cheap and CI can audit drift via --check.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
autotag proposes tags for every slug from docs/v1/profiles/<slug>.json against the 13-entry TAG_VOCAB; _enrich_public_bi rewrites 45 templated Public BI descriptions with data-shape leads plus hand-curated Background notes.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
The `family` field was removed from the manifest/schema/TUI and the
view-preset bar shipped with two tiers (`encoding`, `stress`) selectable
from the `View` row rather than four tiers on number-key bindings, so the
entry now matches what actually shipped.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Refreshes the tracked snapshot with the 2-tier showcase header, the tags column, the 45 Public BI workload description rewrites, the WDI rot note, and the parquet/vortex row counts and file sizes from the overnight rebuild pass.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
…ck on --inspect

Adds --tag / --showcase / --size / --trait / --view / --inspect etc. for the catalog-discoverability pass, plus a fallback that lets --inspect read docs/v1/profiles/<slug>.json when the built profile isn't on disk — so a fresh clone can inspect any of the 230 slugs whose profile ships tracked.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Adds a /-focused search Input at the top of the main column with a small
query language: bare tokens match across slug/name/description/tags/columns/
license/handler/reader/fetch (substring, case-insensitive), field:value
clauses scope to one field with aliases (desc, tag, col, lic, ...), clauses
AND together, and unknown qualifiers fall through as bare tokens. The query
ANDs with existing facet state; parse errors are logged to
outputs/_browse_search_error.log and surfaced via notify() instead of
tearing down the TUI. Coerces every per-field value to str so explicit null
(e.g. license.notes: null) no longer regresses to None.
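
The query semantics described above could be sketched roughly as follows. This is an illustration under assumptions, not the browse.py implementation: `parse_query`, `matches`, and the record layout are hypothetical, and the alias table lists only the four aliases named in the message.

```python
FIELDS = ("slug", "name", "description", "tags", "columns",
          "license", "handler", "reader", "fetch")
ALIASES = {"desc": "description", "tag": "tags", "col": "columns", "lic": "license"}


def _as_text(value):
    """Coerce a field value to lowercase text; explicit null becomes ''."""
    if value is None:
        return ""
    if isinstance(value, (list, tuple)):
        return " ".join(str(v) for v in value).lower()
    return str(value).lower()


def parse_query(query):
    """Split a query into (field, needle) clauses; field None means any field."""
    clauses = []
    for token in query.split():
        qualifier, sep, value = token.partition(":")
        field = ALIASES.get(qualifier.lower(), qualifier.lower())
        if sep and value and field in FIELDS:
            clauses.append((field, value.lower()))
        else:
            # Unknown qualifiers fall through as bare tokens.
            clauses.append((None, token.lower()))
    return clauses


def matches(record, query):
    """True when every clause matches (clauses AND together)."""
    text = {f: _as_text(record.get(f)) for f in FIELDS}
    for field, needle in parse_query(query):
        haystacks = [text[field]] if field else list(text.values())
        if not any(needle in h for h in haystacks):
            return False
    return True
```

For example, `matches(record, "tag:enums iris")` requires `enums` in the tags field and `iris` anywhere in the record.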

Reworks Columns-modal rendering for Sinhala / CJK / Arabic overflow:
_render_len charges non-ASCII letters/marks at max(2, cell_len) while
keeping block glyphs / arrows / em-dashes at 1, with a matching
_truncate_render_cells. Pane width is derived from app.size.width and the
modal's CSS geometry (the modal widened to 95%, col-list shrunk to 24,
2-cell ellipsis reserve, 11-cell slack). Adds a multi-row block-glyph
histogram, lo/mid/hi x-axis ticks (date-prefix for ISO timestamps,
3-sig-fig for floats), and a proportional top-value bar chart capped at
36 cells so binary distributions don't run the pane. Drops the yellow
modal outlines now that $surface contrast against the dimmed body is
enough affordance.
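
To illustrate the width rule, here is a rough sketch that uses only the standard library. `unicodedata.east_asian_width` stands in for whatever cell-width helper the TUI actually uses, and the two-character ASCII ellipsis is an assumption chosen to match the 2-cell reserve; the real `_render_len` / `_truncate_render_cells` may differ.

```python
import unicodedata


def _cell_len(ch):
    """Stand-in for a terminal cell-width helper (assumption)."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def render_len(text):
    """Charge non-ASCII letters/marks at max(2, cell_len), everything else at 1."""
    total = 0
    for ch in text:
        category = unicodedata.category(ch)
        if ord(ch) > 127 and category[0] in ("L", "M"):
            total += max(2, _cell_len(ch))
        else:
            # Block glyphs, arrows, dashes, and ASCII stay at one cell.
            total += 1
    return total


def truncate_render_cells(text, budget, ellipsis=".."):
    """Cut text so its render length, ellipsis included, fits the budget."""
    if render_len(text) <= budget:
        return text
    keep = budget - render_len(ellipsis)
    out, used = [], 0
    for ch in text:
        width = render_len(ch)
        if used + width > keep:
            break
        out.append(ch)
        used += width
    return "".join(out) + ellipsis
```

Overcharging scripts whose marks combine into fewer cells errs on the side of truncating early rather than overflowing the pane, which seems to be the trade-off the commit is making.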

The modal's _profile_for() helper now falls back from the built
outputs/v{n}/<slug>/profile.json to the tracked
docs/v{n}/profiles/<slug>.json, so fresh clones see sparklines without
rebuilding. Tests cover the cell-width renderer across three pane sizes,
histogram / x-axis / top-value-bar geometry, and the full search-parser
matrix including the explicit-null regression.
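
The fallback order can be pictured with a small sketch. `profile_for` and its `root` / `version` parameters are hypothetical names; the two paths follow the layout given above.

```python
import json
from pathlib import Path


def profile_for(slug, root, version=1):
    """Load the built profile if present, else the tracked mirror, else None."""
    built = Path(root) / "outputs" / f"v{version}" / slug / "profile.json"
    tracked = Path(root) / "docs" / f"v{version}" / "profiles" / f"{slug}.json"
    for candidate in (built, tracked):
        if candidate.is_file():
            return json.loads(candidate.read_text(encoding="utf-8"))
    return None
```

Checking the built output first means a local rebuild always wins over the tracked copy.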

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Replace the old 12-tag subject-matter vocabulary (geospatial / nlp-text /
finance / etc.) with 13 data-kind tags grouped by content axis: string
content (urls, prose, enums, identifiers, code-strings), numeric content
(timestamps, embeddings, counts, monetary, measurements), and payload
structure (coordinates, binary-payload, nested-json). All 249/249 specs
now carry 1-3 inline tags emitted by the autotag classifier, and the
schema + discovery view presets are updated to match.

Rewrite 45 of 46 Public BI workload descriptions with a per-workload
data-shape lead (rows x cols, type-family mix, notable columns) plus a
hand-curated Background / Likely note grounded in actual column names
rather than the often-misleading workbook label (e.g. bi-romance is
Instagram posts, bi-physicians is CMS Medicare payments, bi-iglocations1
is US Census geographic codes).

Collapse SHOWCASE_TIERS from 4 editorial tiers to 2 (encoding + stress)
with 9 slugs across them; docs.py drops _KIND_BY_FAMILY and infers data
kind from parse.reader + transform.handler directly, and wraps
pq.ParquetFile in a try/except so a missing parquet falls through to
the snapshot. Also annotates WDI with a 2026-05-16 rot note for the
expired TLS cert on databankfiles.worldbank.org.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
The manifest no longer carries `family` (handled in the catalog metadata
commit), so this drops the now-dead surface area in build/convert/status/
spec: the `--family` argparse flags, the `[family]` per-spec print header
in build.py, and the `iter_datasets(family=...)` kwarg. Each slug is
invoked by name now.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Copy edits across README / AGENTS / SKILLS / skill SKILL.md files /
minimal_spec.json so the human-readable docs reflect the post-`--family`
pipeline and the data-kind tag / 2-tier showcase vocab. No behavior change;
mirrors the catalog overhaul + family-removal commits that landed alongside
this one.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
The prior `ndv <= 64 OR ratio <= 0.01` was too lax — on a 100M-row
dataset the ratio gate alone admits up to a million distinct string
values as an "enum", which is really a high-cardinality categorical.
The bare-fallback path also returned "enums" for any short string
column whose NDV was unknown, defaulting to enums on no evidence.

New rule requires BOTH small NDV (≤ 32) AND short mean_len (≤ 24),
with a looser high-row-count path gated on `ndv ≤ 256 AND ratio ≤
0.001 AND mean_len ≤ 24`. Short strings with unknown NDV now fall
through to no tag rather than defaulting to enums; short strings
with NDV > 256 are reclassified as identifiers. Slugs with at least
one column tagged "enums" drop from 181 to 165; "identifiers" rises
from 39 to 99; "prose" from 72 to 73. All 249 slugs still carry at
least one tag.
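
As a sketch, the new column-level rule might look like the function below. The name `classify_short_string` is hypothetical, `ratio` is assumed to mean `ndv / row_count`, and "short" is assumed to reuse the `mean_len <= 24` threshold; autotag.py may structure this differently.

```python
from typing import Optional


def classify_short_string(ndv: Optional[int], row_count: int,
                          mean_len: float) -> Optional[str]:
    """Return 'enums', 'identifiers', or None for a string column."""
    if mean_len > 24:
        return None  # not a short string; other tags are decided elsewhere
    if ndv is None:
        return None  # unknown NDV no longer defaults to enums
    ratio = ndv / row_count if row_count else 1.0
    if ndv <= 32:
        return "enums"
    if ndv <= 256 and ratio <= 0.001:
        return "enums"  # looser path for high-row-count columns
    if ndv > 256:
        return "identifiers"
    return None
```

Under the old ratio gate, a column with a million distinct values in 100M rows would have passed as an enum; here it returns `identifiers`.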

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Three related fixes to embeddings tag detection. profile.py now renders list
element dtype recursively (list<float32>, large_list<int64>, list<struct>)
and adds a dedicated fixed_size_list branch that records the constant
list_size as length stats — previously fixed_size_list columns profiled as
null, hiding glove-6b-100d's 100-dim float vector entirely. autotag.py reads
the richer dtype label to classify list<float|double> as embeddings rather
than nested-json. The slug-name fallback now uses a word-bounded regex
(\b(embeddings?|word vectors?|dense vectors?|glove|word2vec|fasttext|encoder
output)\b) so descriptions like "sensors embedded in an Air Quality device"
no longer trigger the embeddings tag; the bare "vector" substring is dropped
to avoid matching "attack vector" / "supply vector". Re-profiling the eleven
locally-built list-bearing slugs and re-running autotag moves the
embeddings-tagged set from 5 to 6 (finepdfs-en-test newly tagged via its
four list<float64> score columns); uci-air-quality stays untagged.
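
A quick sketch of the word-bounded fallback, using the pattern quoted above. The `re.IGNORECASE` flag and the `hints_embeddings` wrapper are assumptions added for illustration.

```python
import re

EMBEDDING_HINT = re.compile(
    r"\b(embeddings?|word vectors?|dense vectors?|glove|word2vec|fasttext|"
    r"encoder output)\b",
    re.IGNORECASE,  # assumption: the source does not state case handling
)


def hints_embeddings(text):
    """True when the text contains one of the embeddings hint phrases."""
    return EMBEDDING_HINT.search(text) is not None
```

With word boundaries, "embedded" no longer matches because the pattern needs the full word "embedding" or "embeddings", and a bare "vector" matches nothing on its own.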

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
The prior column-level tightening (ndv ≤ 32 AND mean_len ≤ 24) reduced
enums from 181 to 165, but residual breadth came from single class-label
string columns triggering the slug-level top-3 ranking — e.g. uci-iris
species, uci-wine-quality quality bucket, uci-sms-spam-collection ham/spam.
One class label does not make a dataset enum-shaped.

This adds a slug-level filter dropping 'enums' from the candidate set
unless >=2 columns qualify, matching the user's ">=50% of cols qualify"
concern with a simpler absolute floor. Also drops 'enums' from the
public_bi_merge handler fallback (per-workload BI slugs with profiles
get enums classified from their own column data). enums: 165 -> 128.
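
The absolute floor could be sketched like this. `slug_tag_candidates` is a hypothetical name, and ranking candidates by how many columns carry each tag is an assumption about how the top-3 ranking works.

```python
from collections import Counter


def slug_tag_candidates(column_tags, min_enum_columns=2, top_n=3):
    """Rank per-column tags for a slug, dropping 'enums' below the floor."""
    counts = Counter(tag for tag in column_tags if tag)
    if counts.get("enums", 0) < min_enum_columns:
        # One class-label column does not make a dataset enum-shaped.
        counts.pop("enums", None)
    return [tag for tag, _ in counts.most_common(top_n)]
```

A dataset with four measurement columns and one class label then yields `["measurements"]` rather than picking up `enums` from the label alone.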

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
After a successful profile run, scripts.pipeline.profile now calls
promote_profiles.promote() on the slugs it just processed so the tracked
docs/v1/profiles/ mirror stays in sync with outputs/v1/<slug>/profile.json
without humans remembering a second command. Idempotent — byte-identical
destinations are skipped, and a mirror-step exception is caught and
logged rather than re-raised so profile success isn't undone by a sync
glitch.

Adds a --no-promote opt-out for cases where the user is iterating and
doesn't want the diff churn in docs/v1/profiles/, and refreshes the
raincloud-profile skill card so agents see the new workflow plus the
tracked-mirror fallback path used by list_datasets --inspect and the TUI.
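
A minimal sketch of the auto-promote step, assuming plain file copies. The function names, directory parameters, and the choice to catch `OSError` are illustrative; the real `promote_profiles.promote()` signature is not shown in this PR text.

```python
import logging
import shutil
from pathlib import Path

log = logging.getLogger("promote_sketch")


def promote(slug, outputs_dir, mirror_dir):
    """Mirror one profile.json; return True only if the mirror was written."""
    src = Path(outputs_dir) / slug / "profile.json"
    dst = Path(mirror_dir) / f"{slug}.json"
    if dst.is_file() and dst.read_bytes() == src.read_bytes():
        return False  # byte-identical destination, nothing to do
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)
    return True


def promote_after_profile(slugs, outputs_dir, mirror_dir, no_promote=False):
    """Promote freshly profiled slugs without letting a sync error propagate."""
    written = []
    if no_promote:
        return written
    for slug in slugs:
        try:
            if promote(slug, outputs_dir, mirror_dir):
                written.append(slug)
        except OSError as exc:
            # Log and continue so a mirror glitch does not undo the profile run.
            log.warning("promote failed for %s: %s", slug, exc)
    return written
```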

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Adds an optional `fetch.verify_tls` boolean (default true) to the
manifest schema and wires it through fetch_http. When set to false on a
specific slug, the HTTPS download uses an SSL context with
check_hostname disabled and verify_mode set to CERT_NONE, and silences
the urllib3 InsecureRequestWarning for that fetch.

Intended as a documented escape hatch for upstreams whose certificates
have rotted while their payload is still served intact — integrity is
preserved via the existing expected_sha256 gate. Default behaviour is
unchanged: when verify_tls is unset or true, urllib uses its default
verifying context exactly as before.
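
For reviewers, the behaviour described above corresponds roughly to the following standard-library sketch. `ssl_context_for`, `sha256_matches`, and `fetch` are hypothetical names, not the functions in fetch_http.

```python
import hashlib
import ssl
import urllib.request


def ssl_context_for(verify_tls=True):
    """Default verifying context, or an unverified one when opted out."""
    ctx = ssl.create_default_context()
    if not verify_tls:
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def sha256_matches(payload, expected_sha256):
    """Compare the payload digest against the pinned hex digest."""
    return hashlib.sha256(payload).hexdigest() == expected_sha256.lower()


def fetch(url, expected_sha256, verify_tls=True):
    """Download a payload and fail closed when the digest does not match."""
    with urllib.request.urlopen(url, context=ssl_context_for(verify_tls)) as resp:
        payload = resp.read()
    if not sha256_matches(payload, expected_sha256):
        raise ValueError("sha256 mismatch for " + url)
    return payload
```

With verification off, the pinned digest is the only remaining integrity check, so a slug that sets `verify_tls` to false without an `expected_sha256` would have no protection at all.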

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
The World Bank's databank.worldbank.org redirects WDI_CSV.zip to
databankfiles.worldbank.org, which has been serving an expired TLS cert
since at least May 2026. The payload itself is intact, so we bypass
verification by setting fetch.verify_tls=false on the WDI spec and lock
fetch integrity to expected_sha256=1c8ccf64...3bec (280,478,558 bytes).

Built and profiled at 395,276 rows × 70 columns; the v1 docs snapshot
and the WDI row in docs/v1/datasets.md now reflect real row counts and
file sizes again instead of dashes.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Wider-budget overnight pass (--include-large --budget-mins 30 --max-slugs 5)
landed 4 of the 17 LARGE_BLOCKLIST slugs:

  - beir-msmarco           (9.4M rows, 10s)
  - bi-commongovernment    (templated description re-enriched from profile)
  - clickbench-hits        (100M rows, 6.5 min)
  - fineweb-sample-10bt    (14.9M rows, 7 min)

ghcn-daily timed out at 30 min during fetch (3.7 GB tar.gz). The remaining
12 blocklist slugs (jsonbench-bluesky-100m, laion-400m, openlibrary-*,
openorca, osm-germany-{nodes,relations}, slimpajama-6b, stackoverflow-{
posts,postlinks}, wikipedia-structured-contents, nypd-complaints) need a
wider budget than this pass; queue them for a future overnight run.

bi-commongovernment description rewritten from the previously-templated CWI
boilerplate to the data-shape-led form (US federal contract / grant awards
data on USAspending.gov, 141.1M rows × 56 columns). autotag re-run picked up
the four new profiles and updated five slug tag sets accordingly.

235/249 profiles tracked (was 231). Remaining 14: 12 LARGE_BLOCKLIST + WDI
already profiled + ghcn-daily timeout to retry.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
…, openlibrary-authors

Second targeted pass with --budget-mins 60 against the four most tractable
remaining LARGE_BLOCKLIST slugs. All four landed cleanly:

  - ghcn-daily              3.18B rows × 7 cols  (retry after r1 timeout)
  - laion-400m              2.82M rows × 8 cols
  - nypd-complaints         8.78M rows × 35 cols
  - openlibrary-authors    15.18M rows × 5 cols

239/249 profiles tracked. Remaining 10 are the heaviest of the heavy:
jsonbench-bluesky-100m, openlibrary-{editions,works}, openorca,
osm-germany-{nodes,relations}, slimpajama-6b, stackoverflow-{posts,
postlinks}, wikipedia-structured-contents. These need overnight budgets
in the 1.5-4h range per slug and remain queued for a future pass.

autotag re-run updated tags on the four new slugs to reflect their actual
column shape.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
mprammer and others added 8 commits May 17, 2026 00:26
The 0.1.4 entry was last refreshed against the catalog-discoverability
bundle on 2026-05-12; since then it has accumulated TUI search +
Columns-modal rendering refresh, per-slug verify_tls opt-in (WDI
re-enable), promote_profiles tooling + profile.py auto-promote,
list-element dtype rendering (incl. fixed_size_list — previously
silently skipped for glove-style embedding columns), autotag enums
tightening + embeddings false-positive fix, public BI workload
description rewrites (46 slugs), removal of DatasetSpec.family /
--family CLI flag / curate.py, and 10 newly-tracked profiles from the
LARGE_BLOCKLIST. Date bumped to 2026-05-17. Schema section gains the
new fetch.verify_tls field.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Overnight queue (--budget-mins 720) churned through the last 10
LARGE_BLOCKLIST slugs in 3h 32min wall-clock. All succeeded:

  - jsonbench-bluesky-100m       100M rows ×  1 col   (28 min)
  - openlibrary-editions          56M rows ×  5 cols  (58 min)
  - openlibrary-works             41M rows ×  5 cols  (16 min)
  - openorca                       4M rows ×  5 cols  (53 s)
  - osm-germany-nodes            433M rows ×  7 cols  (65 min)
  - osm-germany-relations        890k rows ×  5 cols  (26 s)
  - slimpajama-6b                5.5M rows ×  3 cols  (3 min)
  - stackoverflow-postlinks      6.5M rows ×  5 cols  (1 min)
  - stackoverflow-posts           58M rows × 16 cols  (9 min)
  - wikipedia-structured-contents 10M rows × 17 cols  (29 min)

Catalog reaches 249/249 tracked profiles — full coverage for the first
time. autotag picked up the 10 new profiles and refreshed tag sets
accordingly.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
…arity

- Update "Tracked profiles" line to reflect 249/249 coverage (was the
  partial-coverage status with a "10 multi-hour giants queued" remnant).
- Drop session-state phrasing: "was 49 before this release window",
  "blocklist trio", "newly-locked", "Skipped: ..." (which read as "skipped
  from what?" without prior context).
- Drop internal stat deltas (e.g. "181 → 128 slug-level enum tags") and
  file-path listings (e.g. "Affects build.py, convert.py, spec.iter_datasets,
  status.py, docs.py:_KIND_BY_FAMILY") — the rule and the user-visible
  effect carry the entry, not the dev-loop intermediate state.
- Move `fixed_size_list` from a feature mention to its own Fixed bullet
  (it's a bug fix, not a designed addition).
- Consolidate the three `sources.schema.json` field additions into one
  sub-bulleted entry in the Schema section.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
The sha256-based cache in profile.py locked in stale outputs from a
buggy pre-release version of the profiler — some string columns were
recorded as bare null even though the underlying parquet had populated
values. Add --force to bypass the cache when a re-profile is needed
without manually deleting profile.json.
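
The cache check and its `--force` bypass can be sketched as below. The `parquet_sha256` key and the function names are assumptions about how the profile records the digest.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_profile(parquet_path, profile_path, force=False):
    """Decide whether the profile stage should run for this parquet file."""
    profile_path = Path(profile_path)
    if force or not profile_path.is_file():
        return True
    try:
        cached = json.loads(profile_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return True  # unreadable cache entry, rebuild it
    # Hypothetical key: the digest of the parquet the profile was built from.
    return cached.get("parquet_sha256") != file_sha256(parquet_path)
```

A digest-keyed cache cannot notice that the profiler itself changed, which is why a manual override is needed after a profiler bug fix.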

Refresh 12 locally-buildable slugs (including the user-flagged
anthropic-economic-index, whose 6 string columns now carry the proper
null_count + ndv_approx + mean_length + top_values shape). The
profiles-with-at-least-one-null-value-column count drops from 44 to 43
on this branch; the remaining 43 are legitimate struct, all-null-typed,
or dictionary-encoded columns the profiler intentionally leaves null.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Python's `:.3g` switches to scientific notation as soon as the
exponent is ≥ 3, so `_format_axis_value` was painting histogram tick
labels like `1e+03`, `9.99e+03`, `1.23e+04` for counts and
measurements that the reader would naturally read as 1,000 / 9,990
/ 12,300. Floats in `[1, 100_000)` now render in standard form with
commas and ~3 sig figs after the leading digit (1.23, 12.3, 123,
1,234, 12,345). Outside that range the existing `:.3g` fallback
keeps very small / very large floats compact.
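
A sketch of the formatting rule, assuming fixed decimal places per magnitude band; the actual `_format_axis_value` may round differently at band edges.

```python
def format_axis_value(value):
    """Render a tick label without scientific notation in [1, 100_000)."""
    if 1 <= value < 100_000:
        if value < 10:
            return f"{value:.2f}"   # 1.23
        if value < 100:
            return f"{value:.1f}"   # 12.3
        return f"{value:,.0f}"      # 123, 1,234, 12,345
    # Outside the band, keep the compact general format.
    return f"{value:.3g}"
```

For comparison, `f"{1000.0:.3g}"` yields `1e+03`, while this sketch returns `1,000` for the same input.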

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
`_render_column_detail` was telling the user to run
`python -m scripts.pipeline.profile <slug>` for every column whose
profile entry was null — including struct, variant, and all-null
columns that `profile.py` intentionally returns null for at the
column-map level. Re-running profile.py would produce the same null,
making the hint actively misleading.

Distinguish the two cases at render time: a non-empty
`profile_columns` map means the slug WAS profiled and this column was
deliberately skipped — show a brief "no per-element distribution"
hint pointing the reader at the row-group stats above instead.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Autofix landed I001 (sort imports), E401 (split multi-imports), and
F401 (unused imports) across 14 files. Manual fixes for the six issues
that needed judgement: drop unused `range_cells` / `rc2` locals, lift
the mid-file `import pytest` in two test files to the top, replace a
string-form `"Region"` return annotation with no annotation, replace
a string-form `"Path"` parameter annotation with the real `Path` type
plus the matching import. CI lint clean; 183 tests pass.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
@mprammer mprammer marked this pull request as ready for review May 17, 2026 15:29
@mprammer mprammer merged commit e93ad8a into develop May 17, 2026
5 checks passed
@mprammer mprammer deleted the mp/release-0.1.4 branch May 17, 2026 16:38