release 0.1.4: catalog discoverability + per-column profiles for all 249 specs #13
Merged
Faceted TUI side panel, view presets, per-column profiles, domain tags + showcase tiers, and matching CLI flags so newcomers can find "interesting data" without scrolling docs/datasets.md.
* discovery.py: closed vocabs (12 tags, 4 tiers, 5 size buckets, 6 trait flags), FilterState dataclass + apply_preset, view-preset registry, bucket_for_size, _is_variant_field, sparkline, format_column_line. Shared by browse, list_datasets, validate_manifest, docs, profile.
* sources.schema.json: optional tags + showcase on DatasetSpec (default [], backwards-compatible). validate_manifest cross-checks the closed vocabs and warns on empty tiers only when at least one tier is populated (so --strict stays clean on the uncurated manifest).
* profile.schema.json + scripts/pipeline/profile.py: new opt-in stage `python -m scripts.pipeline.profile <slug>` writes outputs/v1/<slug>/profile.json with per-dtype stats: numeric histogram + NDV; string NDV + top-5 (NDV <= 256, skipped on binary); bool T/F/null; date/timestamp 10-bucket histogram; list/map length stats; struct/variant emit null. Idempotent against parquet sha256.
* docs.py: _snapshot_for_slug now writes size_bucket + shape_traits per slug; high_cardinality_present backfilled from profile.json when present. _render_curated_picks prepends a tier-grouped block to datasets.md.
* list_datasets.py: --tag / --showcase / --size / --trait (with ! negation) / --view / --inspect / --tags-help / --showcase-help. Reuses FilterState.
* browse.py: left-docked facet panel (7 SelectionList groups + tri-state RadioSet for shape traits); view-preset bar with keys 1-4; new sortable Tags / Showcase / Size columns; "N of M" counts header; right detail-pane Columns section with sparklines (lazy-loaded profile.json, cached). C clears facets; f focuses panel.
* text_whitespace_parse: per-column int/float type inference so uci-seeds emits numerics for the profile stage to histogram against.
* README: replaces "Pick any other dataset" with a Discover subsection.
* Skills: raincloud-profile, raincloud-discover (new); raincloud-list-datasets, raincloud-build (updated).
Curation (~249 tags + showcase entries) is a separate follow-up; the scaffolding degrades gracefully empty (preset views show 0 hits, curated-picks blocks read "No picks yet").
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
…lumns
Two unblockers for the overnight catalog-wide profile pass:
1. DuckDB was inferring DECIMAL types from the inlined `lo_f` / `hi_f` Python repr in the histogram-bucket SQL. For columns with a wide value range (e.g. 1694 / 0.27 in the `behavioral-risk-factor-...` dataset) the resulting DECIMAL(18,17) overflowed once multiplied by N_BUCKETS=10. Force `::DOUBLE` on every literal so the math stays in floating-point.
2. Some upstream CSVs (Kaggle exports of pandas DataFrames) ship an unnamed index column whose Arrow field has `name == ""`. The identifier quoter returned `""`, which DuckDB rejects as a zero-length delimited identifier. Skip those fields with a placeholder entry in `columns["__unnamed_column__"]`.
Also adds `scripts/pipeline/overnight_profile.py`: driver that walks every unprofiled slug, runs build → profile → promote_profiles, and wipes outputs/v1/<slug>/ + outputs/raw_downloads/<slug>/ between runs. Logs to outputs/_overnight.log (JSONL, per-slug status + secs + error).
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
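A minimal sketch of the two fixes described in this commit, not the actual `profile.py` code. The helper names `quote_ident` and `bucket_expr` and the exact shape of the bucket expression are assumptions for illustration; only the `::DOUBLE` casts, `N_BUCKETS=10`, and the empty-name skip come from the commit message.

```python
from typing import Optional

N_BUCKETS = 10  # bucket count named in the commit message


def quote_ident(name: str) -> Optional[str]:
    """Double-quote a SQL identifier (hypothetical helper).

    Returns None for the empty name, since "" is a zero-length
    delimited identifier that DuckDB rejects.
    """
    if name == "":
        return None
    return '"' + name.replace('"', '""') + '"'


def bucket_expr(column: str, lo: float, hi: float) -> Optional[str]:
    """Build a histogram-bucket SQL expression with every literal cast to DOUBLE.

    Assumes finite lo < hi; the caller would record a placeholder entry
    for columns where this returns None.
    """
    ident = quote_ident(column)
    if ident is None:
        return None
    # Parenthesise the repr so negative literals cast as a whole.
    lo_sql = f"({lo!r})::DOUBLE"
    hi_sql = f"({hi!r})::DOUBLE"
    return (
        f"floor(({ident}::DOUBLE - {lo_sql}) / ({hi_sql} - {lo_sql})"
        f" * {N_BUCKETS}::DOUBLE)"
    )


print(bucket_expr("value", 0.27, 1694.0))
```

Without the casts, a bare literal such as `0.27` can be typed as a narrow DECIMAL, which is the overflow the commit describes; casting each operand keeps the whole expression in floating point.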
Adds tracked profile.json mirrors for every slug the overnight driver has successfully built + profiled so far. Mix of UCI / Kaggle / Anthropic / Public BI / 4 already-built giants (finemath-4plus, hacker-news, websight-v01, wikipedia-en). Per-slug failures (Public BI bz2 truncations, 30-min timeouts on huge workloads, a handful of profile-side bugs) will be retried after the main pass. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Continued overnight pass. DOUBLE-cast fix in profile.py confirmed working — slugs that hit the DECIMAL(18,17) overflow on the first attempt now succeed (e.g. behavioral-risk-factor-surveillance-system). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
DuckDB doesn't implement `CAST(time AS TIMESTAMP)`, so the temporal path crashed on standalone TIME columns (hit on bi-trainsuk1's v_Section_WTT_Time). Route TIME through the string profile instead: null_count + NDV + top-K of the rendered HH:MM:SS form is more useful than a stack trace. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Continued overnight pass; TIME-column fix landed in cc16b9d. Driver is roughly 50% through the candidate list. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Continued overnight pass. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Continued overnight pass. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Continued overnight pass. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Continued overnight pass. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Continued overnight pass. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
End of the overnight pass: 229 of 249 catalogue slugs now have
tracked profile.json files. Of the 20 remaining:
17 - opted out (LARGE_BLOCKLIST: multi-hour builds incl. clickbench,
jsonbench, wikipedia-structured-contents, openorca, laion-400m,
slimpajama-6b, fineweb-sample-10bt, beir-msmarco, openlibrary-*,
osm-germany-{nodes,relations}, stackoverflow-{posts,postlinks},
nypd-complaints, ghcn-daily). Retry with a wider budget.
2 - fixable retries running now (bi-cmsprovider, bi-trainsuk1).
1 - UPSTREAM ROT: wdi.
WDI's fetch URL 301-redirects to databankfiles.worldbank.org, which
serves a TLS certificate that has expired (curl confirms). Marked in
the slug's fetch.notes with date + reproduction steps so a human
follow-up can chase an alternate export.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
bi-cmsprovider re-fetched cleanly (the earlier bz2 EOF was a transient network truncation, not upstream rot). bi-trainsuk1 succeeded on the TIME-column fallback added in cc16b9d. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Fresh clones need profile JSON for TUI sparklines without rebuilding everything; the mirror is idempotent so re-running is cheap and CI can audit drift via --check. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
autotag proposes tags for every slug from docs/v1/profiles/<slug>.json against the 13-entry TAG_VOCAB; _enrich_public_bi rewrites 45 templated Public BI descriptions with data-shape leads plus hand-curated Background notes. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
The `family` field was removed from the manifest/schema/TUI and the view-preset bar shipped with two tiers (`encoding`, `stress`) selectable from the `View` row rather than four tiers on number-key bindings, so the entry now matches what actually shipped. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Refreshes the tracked snapshot with the 2-tier showcase header, the tags column, the 45 Public BI workload description rewrites, the WDI rot note, and the parquet/vortex row counts and file sizes from the overnight rebuild pass. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
…ck on --inspect Adds --tag / --showcase / --size / --trait / --view / --inspect etc. for the catalog-discoverability pass, plus a fallback that lets --inspect read docs/v1/profiles/<slug>.json when the built profile isn't on disk — so a fresh clone can inspect any of the 230 slugs whose profile ships tracked. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Adds a /-focused search Input at the top of the main column with a small
query language: bare tokens match across slug/name/description/tags/columns/
license/handler/reader/fetch (substring, case-insensitive), field:value
clauses scope to one field with aliases (desc, tag, col, lic, ...), clauses
AND together, and unknown qualifiers fall through as bare tokens. The query
ANDs with existing facet state; parse errors are logged to
outputs/_browse_search_error.log and surfaced via notify() instead of
tearing down the TUI. Coerces every per-field value to str so explicit null
(e.g. license.notes: null) no longer regresses to None.
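The query language above can be sketched roughly as follows. This is an illustrative reimplementation, not the code in `browse.py`: the function names `parse_query` and `matches`, the flat record shape, and the choice to coerce `None` to an empty string are all assumptions. The field list and the four aliases come from the commit message (which elides further aliases with `...`).

```python
FIELDS = ("slug", "name", "description", "tags", "columns",
          "license", "handler", "reader", "fetch")
# Only the aliases named in the commit message; the real table may be longer.
ALIASES = {"desc": "description", "tag": "tags", "col": "columns", "lic": "license"}


def parse_query(query):
    """Split a query into (field, needle) clauses; field is None for bare tokens."""
    clauses = []
    for token in query.split():
        qualifier, sep, value = token.partition(":")
        field = ALIASES.get(qualifier.lower(), qualifier.lower())
        if sep and value and field in FIELDS:
            clauses.append((field, value.lower()))
        else:
            # Unknown qualifiers fall through as bare tokens.
            clauses.append((None, token.lower()))
    return clauses


def _text(value):
    """Coerce any per-field value to str; explicit null becomes ''."""
    if value is None:
        return ""
    if isinstance(value, (list, tuple)):
        return " ".join(_text(v) for v in value)
    return str(value)


def matches(record, clauses):
    """All clauses must match (AND); matching is case-insensitive substring."""
    for field, needle in clauses:
        fields = FIELDS if field is None else (field,)
        if not any(needle in _text(record.get(f)).lower() for f in fields):
            return False
    return True


record = {"slug": "uci-iris", "description": "Iris flowers",
          "tags": ["enums", "measurements"], "license": None}
print(matches(record, parse_query("tag:enums iris")))
```

In the TUI the result of a check like `matches` would additionally be ANDed with the facet state, which this sketch leaves out.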
Reworks Columns-modal rendering for Sinhala / CJK / Arabic overflow:
_render_len charges non-ASCII letters/marks at max(2, cell_len) while
keeping block glyphs / arrows / em-dashes at 1, with a matching
_truncate_render_cells. Pane width is derived from app.size.width and the
modal's CSS geometry (the modal widened to 95%, col-list shrunk to 24,
2-cell ellipsis reserve, 11-cell slack). Adds a multi-row block-glyph
histogram, lo/mid/hi x-axis ticks (date-prefix for ISO timestamps,
3-sig-fig for floats), and a proportional top-value bar chart capped at
36 cells so binary distributions don't run the pane. Drops the yellow
modal outlines now that $surface contrast against the dimmed body is
enough affordance.
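A rough sketch of the cell-width rule described above. The real `_render_len` presumably builds on a terminal cell-width function; here `unicodedata.east_asian_width` stands in for it, which is an assumption, as are the function names and the single-character ellipsis. The 2-cell reserve is the figure quoted in the commit message.

```python
import unicodedata


def _base_width(ch):
    """Stand-in for a terminal cell-width lookup (assumption, not the real one)."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def render_len(text):
    """Charge non-ASCII letters/marks at least 2 cells; everything else its base width."""
    total = 0
    for ch in text:
        is_letter_or_mark = unicodedata.category(ch)[0] in ("L", "M")
        if ord(ch) > 127 and is_letter_or_mark:
            total += max(2, _base_width(ch))
        else:
            # Block glyphs, arrows and dashes are symbols/punctuation: 1 cell here.
            total += _base_width(ch)
    return total


def truncate_render_cells(text, budget, ellipsis="\u2026", reserve=2):
    """Cut text so it fits in `budget` cells, keeping `reserve` cells for the ellipsis."""
    if render_len(text) <= budget:
        return text
    kept, used = [], 0
    for ch in text:
        width = render_len(ch)
        if used + width > budget - reserve:
            break
        kept.append(ch)
        used += width
    return "".join(kept) + ellipsis


print(render_len("abc"), render_len("\u65e5\u672c"), render_len("\u2588\u2192\u2014"))
```

Over-charging marks and non-Latin letters errs on the side of truncating early, which is the safer failure mode when a terminal renders complex scripts wider than the width tables predict.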
The modal's _profile_for() helper now falls back from the built
outputs/v{n}/<slug>/profile.json to the tracked
docs/v{n}/profiles/<slug>.json, so fresh clones see sparklines without
rebuilding. Tests cover the cell-width renderer across three pane sizes,
histogram / x-axis / top-value-bar geometry, and the full search-parser
matrix including the explicit-null regression.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
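The built-then-tracked fallback in the commit above could look roughly like this. It is a sketch under assumptions: the function name `profile_for`, its `root` and `version` parameters, and returning `None` when neither file exists are illustrative; the two path templates are the ones quoted in the commit message.

```python
import json
from pathlib import Path


def profile_for(slug, root=Path("."), version=1):
    """Return the parsed profile for a slug, preferring the locally built copy."""
    built = root / "outputs" / f"v{version}" / slug / "profile.json"
    tracked = root / "docs" / f"v{version}" / "profiles" / f"{slug}.json"
    for candidate in (built, tracked):
        if candidate.is_file():
            return json.loads(candidate.read_text(encoding="utf-8"))
    return None  # neither built nor tracked: caller shows no sparklines
```

Preferring the built file means a local rebuild always wins over the tracked mirror, while a fresh clone silently falls back to what ships in the repository.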
Replace the old 12-tag subject-matter vocabulary (geospatial / nlp-text / finance / etc.) with 13 data-kind tags grouped by content axis: string content (urls, prose, enums, identifiers, code-strings), numeric content (timestamps, embeddings, counts, monetary, measurements), and payload structure (coordinates, binary-payload, nested-json). All 249/249 specs now carry 1-3 inline tags emitted by the autotag classifier, and the schema + discovery view presets are updated to match. Rewrite 45 of 46 Public BI workload descriptions with a per-workload data-shape lead (rows x cols, type-family mix, notable columns) plus a hand-curated Background / Likely note grounded in actual column names rather than the often-misleading workbook label (e.g. bi-romance is Instagram posts, bi-physicians is CMS Medicare payments, bi-iglocations1 is US Census geographic codes). Collapse SHOWCASE_TIERS from 4 editorial tiers to 2 (encoding + stress) with 9 slugs across them; docs.py drops _KIND_BY_FAMILY and infers data kind from parse.reader + transform.handler directly, and wraps pq.ParquetFile in a try/except so a missing parquet falls through to the snapshot. Also annotates WDI with a 2026-05-16 rot note for the expired TLS cert on databankfiles.worldbank.org. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
The manifest no longer carries `family` (handled in the catalog metadata commit), so this drops the now-dead surface area in build/convert/status/ spec: the `--family` argparse flags, the `[family]` per-spec print header in build.py, and the `iter_datasets(family=...)` kwarg. Each slug is invoked by name now. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Copy edits across README / AGENTS / SKILLS / skill SKILL.md files / minimal_spec.json so the human-readable docs reflect the post-`--family` pipeline and the data-kind tag / 2-tier showcase vocab. No behavior change; mirrors the catalog overhaul + family-removal commits that landed alongside this one. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
The prior `ndv <= 64 OR ratio <= 0.01` was too lax — on a 100M-row dataset the ratio gate alone admits up to a million distinct string values as an "enum", which is really a high-cardinality categorical. The bare-fallback path also returned "enums" for any short string column whose NDV was unknown, defaulting to enums on no evidence. New rule requires BOTH small NDV (≤ 32) AND short mean_len (≤ 24), with a looser high-row-count path gated on `ndv ≤ 256 AND ratio ≤ 0.001 AND mean_len ≤ 24`. Short strings with unknown NDV now fall through to no tag rather than defaulting to enums; short strings with NDV > 256 are reclassified as identifiers. Slugs with at least one column tagged "enums" drop from 181 to 165; "identifiers" rises from 39 to 99; "prose" from 72 to 73. All 249 slugs still carry at least one tag. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
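The tightened column-level rule can be written out as a small decision function. This is a sketch from the thresholds in the commit message, not the `autotag.py` source; the function name, the argument names, and the treatment of long strings (returning no tag here) are assumptions.

```python
def classify_short_string(ndv, row_count, mean_len):
    """Return 'enums', 'identifiers', or None for a string column.

    ndv may be None when the distinct count is unknown.
    """
    if ndv is None:
        return None  # no evidence: no tag, rather than defaulting to enums
    if mean_len > 24:
        return None  # long strings are handled elsewhere (assumption)
    if ndv <= 32:
        return "enums"
    ratio = ndv / row_count if row_count else 1.0
    if ndv <= 256 and ratio <= 0.001:
        return "enums"  # looser path for high-row-count datasets
    if ndv > 256:
        return "identifiers"
    return None


# The old ratio gate (<= 0.01) would have admitted this column as an enum:
print(classify_short_string(ndv=1_000_000, row_count=100_000_000, mean_len=10))
```

The example at the end is the case from the commit message: a million distinct values on a 100M-row dataset passes a 1% ratio gate but is classified as identifiers under the new rule.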
Three related fixes to embeddings tag detection. profile.py now renders list element dtype recursively (list<float32>, large_list<int64>, list<struct>) and adds a dedicated fixed_size_list branch that records the constant list_size as length stats — previously fixed_size_list columns profiled as null, hiding glove-6b-100d's 100-dim float vector entirely. autotag.py reads the richer dtype label to classify list<float|double> as embeddings rather than nested-json. The slug-name fallback now uses a word-bounded regex (\b(embeddings?|word vectors?|dense vectors?|glove|word2vec|fasttext|encoder output)\b) so descriptions like "sensors embedded in an Air Quality device" no longer trigger the embeddings tag; the bare "vector" substring is dropped to avoid matching "attack vector" / "supply vector". Re-profiling the eleven locally-built list-bearing slugs and re-running autotag moves the embeddings-tagged set from 5 to 6 (finepdfs-en-test newly tagged via its four list<float64> score columns); uci-air-quality stays untagged. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
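The word-bounded pattern quoted above can be exercised directly. The regex is copied from the commit message; compiling it case-insensitively and the helper name `mentions_embeddings` are assumptions.

```python
import re

EMBEDDING_HINT = re.compile(
    r"\b(embeddings?|word vectors?|dense vectors?|glove|word2vec|fasttext|encoder output)\b",
    re.IGNORECASE,  # assumption: descriptions are matched case-insensitively
)


def mentions_embeddings(text):
    """True when the text contains one of the embedding keywords as a whole word."""
    return EMBEDDING_HINT.search(text) is not None


for sample in (
    "sensors embedded in an Air Quality device",
    "attack vector",
    "GloVe 6B 100d word vectors",
):
    print(sample, "->", mentions_embeddings(sample))
```

The `\b` anchors are what keep "embedded" from matching `embeddings?`, and dropping the bare "vector" alternative is what keeps "attack vector" out.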
The prior column-level tightening (ndv ≤ 32 AND mean_len ≤ 24) reduced enums from 181 to 165, but residual breadth came from single class-label string columns triggering the slug-level top-3 ranking — e.g. uci-iris species, uci-wine-quality quality bucket, uci-sms-spam-collection ham/spam. One class label does not make a dataset enum-shaped. This adds a slug-level filter dropping 'enums' from the candidate set unless >=2 columns qualify, matching the user's ">=50% of cols qualify" concern with a simpler absolute floor. Also drops 'enums' from the public_bi_merge handler fallback (per-workload BI slugs with profiles get enums classified from their own column data). enums: 165 -> 128. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
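A sketch of the slug-level floor described above. The commit only states the rule (drop `enums` unless at least two columns qualify) and that a top-3 ranking exists; ranking tags by column count, and the names `slug_level_tags`, `min_enum_columns`, and `top_n`, are assumptions.

```python
from collections import Counter


def slug_level_tags(column_tags, min_enum_columns=2, top_n=3):
    """Aggregate per-column tags into at most top_n slug-level tags."""
    counts = Counter(tag for tag in column_tags if tag is not None)
    if counts.get("enums", 0) < min_enum_columns:
        counts.pop("enums", None)  # one class label does not make a slug enum-shaped
    return [tag for tag, _ in counts.most_common(top_n)]


# Four measurement columns plus a single class-label column, as in uci-iris:
print(slug_level_tags(["measurements"] * 4 + ["enums"]))
```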
After a successful profile run, scripts.pipeline.profile now calls promote_profiles.promote() on the slugs it just processed so the tracked docs/v1/profiles/ mirror stays in sync with outputs/v1/<slug>/profile.json without humans remembering a second command. Idempotent — byte-identical destinations are skipped, and a mirror-step exception is caught and logged rather than re-raised so profile success isn't undone by a sync glitch. Adds a --no-promote opt-out for cases where the user is iterating and doesn't want the diff churn in docs/v1/profiles/, and refreshes the raincloud-profile skill card so agents see the new workflow plus the tracked-mirror fallback path used by list_datasets --inspect and the TUI. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
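The auto-promote behaviour could be sketched as below. The real `promote_profiles.promote()` signature is not shown in this PR, so the parameters here (`slugs`, `root`, `version`), the return value, and the wrapper name `promote_after_profile` are assumptions; the skip-if-byte-identical check and the catch-and-log behaviour follow the commit message.

```python
import filecmp
import logging
import shutil
from pathlib import Path

log = logging.getLogger("profile")


def promote(slugs, root=Path("."), version=1):
    """Copy built profiles into the tracked mirror; return the slugs actually copied."""
    copied = []
    for slug in slugs:
        src = root / "outputs" / f"v{version}" / slug / "profile.json"
        dst = root / "docs" / f"v{version}" / "profiles" / f"{slug}.json"
        if not src.is_file():
            continue
        if dst.is_file() and filecmp.cmp(src, dst, shallow=False):
            continue  # byte-identical destination: nothing to do
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
        copied.append(slug)
    return copied


def promote_after_profile(slugs, root=Path("."), no_promote=False):
    """Mirror step run after a successful profile; never raises."""
    if no_promote:
        return []
    try:
        return promote(slugs, root)
    except Exception:
        log.exception("profile mirror failed; profile outputs are unaffected")
        return []
```

Comparing bytes with `shallow=False` rather than timestamps is what makes a second run a no-op and keeps `docs/v1/profiles/` free of spurious diffs.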
Adds an optional `fetch.verify_tls` boolean (default true) to the manifest schema and wires it through fetch_http. When set to false on a specific slug, the HTTPS download uses an SSL context with check_hostname disabled and verify_mode set to CERT_NONE, and silences the urllib3 InsecureRequestWarning for that fetch. Intended as a documented escape hatch for upstreams whose certificates have rotted while their payload is still served intact — integrity is preserved via the existing expected_sha256 gate. Default behaviour is unchanged: when verify_tls is unset or true, urllib uses its default verifying context exactly as before. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
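A minimal sketch of the escape hatch, using only the standard library. The function names (`ssl_context_for`, `verify_sha256`, `fetch_checked`) are hypothetical and this is not the project's `fetch_http`; the context settings (`check_hostname` disabled, `verify_mode` set to `CERT_NONE`) and the `expected_sha256` gate are the ones the commit message describes.

```python
import hashlib
import ssl
import urllib.request


def ssl_context_for(verify_tls=True):
    """Default verifying context, or a non-verifying one when verify_tls is false."""
    ctx = ssl.create_default_context()
    if not verify_tls:
        # Order matters: check_hostname must be off before CERT_NONE is allowed.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def verify_sha256(payload, expected_sha256):
    """Raise if the payload does not match the pinned digest."""
    digest = hashlib.sha256(payload).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"sha256 mismatch: got {digest}, expected {expected_sha256}")


def fetch_checked(url, expected_sha256, verify_tls=True):
    """Download url and enforce the digest regardless of the TLS setting."""
    with urllib.request.urlopen(url, context=ssl_context_for(verify_tls)) as resp:
        payload = resp.read()
    verify_sha256(payload, expected_sha256)
    return payload
```

With verification off, the transport no longer authenticates the server, so the pinned digest is the only thing standing between the pipeline and a tampered payload; that is why the sketch checks it unconditionally.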
The World Bank's databank.worldbank.org redirects WDI_CSV.zip to databankfiles.worldbank.org, which has been serving an expired TLS cert since at least May 2026. The payload itself is intact, so we bypass verification by setting fetch.verify_tls=false on the WDI spec and lock fetch integrity to expected_sha256=1c8ccf64...3bec (280,478,558 bytes). Built and profiled at 395,276 rows × 70 columns; the v1 docs snapshot and the WDI row in docs/v1/datasets.md now reflect real row counts and file sizes again instead of dashes. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Wider-budget overnight pass (--include-large --budget-mins 30 --max-slugs 5)
landed 4 of the 17 LARGE_BLOCKLIST slugs:
- beir-msmarco (9.4M rows, 10s)
- bi-commongovernment (templated description re-enriched from profile)
- clickbench-hits (100M rows, 6.5 min)
- fineweb-sample-10bt (14.9M rows, 7 min)
ghcn-daily timed out at 30 min during fetch (3.7 GB tar.gz). The remaining
12 blocklist slugs (jsonbench-bluesky-100m, laion-400m, openlibrary-*,
openorca, osm-germany-{nodes,relations}, slimpajama-6b, stackoverflow-{
posts,postlinks}, wikipedia-structured-contents, nypd-complaints) need a
wider budget than this pass; queue them for a future overnight run.
bi-commongovernment description rewritten from the previously-templated CWI
boilerplate to the data-shape-led form (US federal contract / grant awards
data on USAspending.gov, 141.1M rows × 56 columns). autotag re-run picked up
the four new profiles and updated five slug tag sets accordingly.
235/249 profiles tracked (was 231). Remaining 14: 12 LARGE_BLOCKLIST + WDI
already profiled + ghcn-daily timeout to retry.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
…, openlibrary-authors
Second targeted pass with --budget-mins 60 against the four most tractable
remaining LARGE_BLOCKLIST slugs. All four landed cleanly:
- ghcn-daily 3.18B rows × 7 cols (retry after r1 timeout)
- laion-400m 2.82M rows × 8 cols
- nypd-complaints 8.78M rows × 35 cols
- openlibrary-authors 15.18M rows × 5 cols
239/249 profiles tracked. Remaining 10 are the heaviest of the heavy:
jsonbench-bluesky-100m, openlibrary-{editions,works}, openorca,
osm-germany-{nodes,relations}, slimpajama-6b, stackoverflow-{posts,
postlinks}, wikipedia-structured-contents. These need overnight budgets
in the 1.5-4h range per slug and remain queued for a future pass.
autotag re-run updated tags on the four new slugs to reflect their actual
column shape.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
The 0.1.4 entry was last refreshed against the catalog-discoverability bundle on 2026-05-12; since then it has accumulated TUI search + Columns-modal rendering refresh, per-slug verify_tls opt-in (WDI re-enable), promote_profiles tooling + profile.py auto-promote, list-element dtype rendering (incl. fixed_size_list — previously silently skipped for glove-style embedding columns), autotag enums tightening + embeddings false-positive fix, public BI workload description rewrites (46 slugs), removal of DatasetSpec.family / --family CLI flag / curate.py, and 10 newly-tracked profiles from the LARGE_BLOCKLIST. Date bumped to 2026-05-17. Schema section gains the new fetch.verify_tls field. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Overnight queue (--budget-mins 720) churned through the last 10 LARGE_BLOCKLIST slugs in 3h 32min wall-clock. All succeeded:
- jsonbench-bluesky-100m 100M rows × 1 col (28 min)
- openlibrary-editions 56M rows × 5 cols (58 min)
- openlibrary-works 41M rows × 5 cols (16 min)
- openorca 4M rows × 5 cols (53 s)
- osm-germany-nodes 433M rows × 7 cols (65 min)
- osm-germany-relations 890k rows × 5 cols (26 s)
- slimpajama-6b 5.5M rows × 3 cols (3 min)
- stackoverflow-postlinks 6.5M rows × 5 cols (1 min)
- stackoverflow-posts 58M rows × 16 cols (9 min)
- wikipedia-structured-contents 10M rows × 17 cols (29 min)
Catalog reaches 249/249 tracked profiles, full coverage for the first time. autotag picked up the 10 new profiles and refreshed tag sets accordingly.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
…arity
- Update "Tracked profiles" line to reflect 249/249 coverage (was the partial-coverage status with a "10 multi-hour giants queued" remnant).
- Drop session-state phrasing: "was 49 before this release window", "blocklist trio", "newly-locked", "Skipped: ..." (which read as "skipped from what?" without prior context).
- Drop internal stat deltas (e.g. "181 → 128 slug-level enum tags") and file-path listings (e.g. "Affects build.py, convert.py, spec.iter_datasets, status.py, docs.py:_KIND_BY_FAMILY"); the rule and the user-visible effect carry the entry, not the dev-loop intermediate state.
- Move `fixed_size_list` from a feature mention to its own Fixed bullet (it's a bug fix, not a designed addition).
- Consolidate the three `sources.schema.json` field additions into one sub-bulleted entry in the Schema section.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
The sha256-based cache in profile.py locked in stale outputs from a buggy pre-release version of the profiler — some string columns were recorded as bare null even though the underlying parquet had populated values. Add --force to bypass the cache when a re-profile is needed without manually deleting profile.json. Refresh 12 locally-buildable slugs (including the user-flagged anthropic-economic-index, whose 6 string columns now carry the proper null_count + ndv_approx + mean_length + top_values shape). The profiles-with-at-least-one-null-value-column count drops from 44 to 43 on this branch; the remaining 43 are legitimate struct, all-null-typed, or dictionary-encoded columns the profiler intentionally leaves null. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
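The cache check and its `--force` bypass might look roughly like this. The JSON key `parquet_sha256` and the function names are assumptions; the PR only states that the cache is keyed on the parquet sha256 and that `--force` skips it.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through sha256 so large parquet files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_profile(parquet_path, profile_path, force=False):
    """True when the profile is missing, stale, or a re-profile is forced."""
    if force or not profile_path.is_file():
        return True
    recorded = json.loads(profile_path.read_text(encoding="utf-8")).get("parquet_sha256")
    return recorded != sha256_of(parquet_path)
```

A hash-keyed cache cannot notice that the profiler itself changed, which is the situation the commit describes: the parquet was unchanged, so the stale output stayed valid until `--force` existed.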
Python's `:.3g` switches to scientific notation as soon as the exponent is ≥ 3, so `_format_axis_value` was painting histogram tick labels like `1e+03`, `9.99e+03`, `1.23e+04` for counts and measurements that the reader would naturally read as 1,000 / 9,990 / 12,300. Floats in `[1, 100_000)` now render in standard form with commas and ~3 sig figs after the leading digit (1.23, 12.3, 123, 1,234, 12,345). Outside that range the existing `:.3g` fallback keeps very small / very large floats compact. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
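The behaviour change can be reproduced with a few lines. This is a sketch of one way to get the outputs listed in the commit message, not the actual `_format_axis_value`; the function name `format_axis_value` and the handling of values that round up to 1,000 are assumptions.

```python
def format_axis_value(value):
    """Format a float tick label: plain digits with commas inside [1, 100_000)."""
    if 1 <= value < 100_000:
        if value >= 1000:
            return f"{value:,.0f}"
        text = f"{value:.3g}"
        if "e" in text:
            # e.g. 999.9 rounds to 1e+03 under :.3g; render it as 1,000 instead.
            text = f"{float(text):,.0f}"
        return text
    return f"{value:.3g}"  # compact fallback for very small / very large floats


for v in (1.23, 12.3, 123.0, 1234.0, 12345.0, 1e6):
    print(v, "->", format_axis_value(v))
```

The pitfall is that `:.3g` chooses scientific notation whenever the decimal exponent reaches the precision, so any value at or above 1000 loses its plain form even when it would fit comfortably in the tick label.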
`_render_column_detail` was telling the user to run `python -m scripts.pipeline.profile <slug>` for every column whose profile entry was null — including struct, variant, and all-null columns that `profile.py` intentionally returns null for at the column-map level. Re-running profile.py would produce the same null, making the hint actively misleading. Distinguish the two cases at render time: a non-empty `profile_columns` map means the slug WAS profiled and this column was deliberately skipped — show a brief "no per-element distribution" hint pointing the reader at the row-group stats above instead. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
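The render-time distinction amounts to a two-way check, sketched here. The function name `column_hint` and the exact hint wording are assumptions; the rule (a non-empty `profile_columns` map means the slug was profiled and a null entry was deliberate) is the one stated in the commit message.

```python
def column_hint(column_name, profile_columns):
    """Return the hint to show for a column, or None when a profile entry exists."""
    if not profile_columns:
        # Slug was never profiled: re-running the profiler will help.
        return "run `python -m scripts.pipeline.profile <slug>` to build a profile"
    if profile_columns.get(column_name) is None:
        # Slug was profiled; this column (struct, variant, all-null) was skipped on purpose.
        return "no per-element distribution; see the row-group stats above"
    return None


profiled = {"payload": None, "price": {"ndv_approx": 42}}
print(column_hint("payload", profiled))
print(column_hint("payload", {}))
```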
Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Autofix landed I001 (sort imports), E401 (split multi-imports), and F401 (unused imports) across 14 files. Manual fixes for the six issues that needed judgement: drop unused `range_cells` / `rc2` locals, lift the mid-file `import pytest` in two test files to the top, replace a string-form `"Region"` return annotation with no annotation, replace a string-form `"Path"` parameter annotation with the real `Path` type plus the matching import. CI lint clean; 183 tests pass. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
0.1.4 turns the catalog into something you can navigate without scrolling `docs/v1/datasets.md`, and ships per-column profiles for every slug. New `sources.json` fields (`tags`, `showcase`) and derived parquet metadata (size buckets, shape traits) drive TUI facets, a `/`-focused search input, and matching `list_datasets --tag/--showcase/--size/--trait/--view/--inspect` flags. Per-column profiles now ship under `docs/v1/profiles/<slug>.json` for all 249 specs, so a fresh clone can render the TUI Columns pane and run `list_datasets --inspect <slug>` without rebuilding anything locally.
Two things a reviewer will want to verify against the diff. `DatasetSpec.family` and the `--family` flag on `build`/`convert` are removed: each slug is invoked by name now (or `--all` for whole-catalog passes). And `fetch.verify_tls` is a new optional field (default `true`) for upstreams whose TLS certs have rotted; WDI is the only slug that currently sets it to `false`, with payload integrity gated by `expected_sha256`. Full details in `CHANGELOG.md`.
🤖 Generated with Claude Code