Merged
Commits
38 commits
3672335
release 0.1.4: catalog discoverability bundle
mprammer May 12, 2026
81a0657
profile: cast histogram float literals to DOUBLE; skip empty-named co…
mprammer May 16, 2026
0b0b2f4
profiles: checkpoint batch (49 → 71 slugs) from overnight pass
mprammer May 16, 2026
195dee7
profiles: checkpoint batch (71 → 90 slugs)
mprammer May 16, 2026
cc16b9d
profile: TIME columns fall back to string profile
mprammer May 16, 2026
a261a05
profiles: checkpoint batch (90 → 110 slugs)
mprammer May 16, 2026
f667c43
profiles: checkpoint batch (110 → 130 slugs)
mprammer May 16, 2026
6c0be45
profiles: checkpoint batch (130 → 150 slugs)
mprammer May 16, 2026
a7ec8a2
profiles: checkpoint batch (151 → ~170 slugs)
mprammer May 16, 2026
b1dc819
profiles: checkpoint batch (170 → ~190 slugs)
mprammer May 16, 2026
be82ed3
profiles: checkpoint batch (191 → ~210 slugs)
mprammer May 16, 2026
d06bd7c
profiles: final overnight batch (215 -> 229) + flag WDI as rotted
mprammer May 16, 2026
4979855
profiles: bi-cmsprovider + bi-trainsuk1 retry (231 / 249 tracked)
mprammer May 16, 2026
4abdfed
tooling: promote built profiles into docs/v1/profiles/ mirror
mprammer May 16, 2026
cfda114
catalog: add autotag + Public BI description-enrichment helpers
mprammer May 16, 2026
fed7559
changelog: drop family + 1-4 preset claims from 0.1.4 entry
mprammer May 16, 2026
f0ee3ec
docs: promote v1 snapshot to 2026-05-16 catalog state
mprammer May 16, 2026
7a47cb0
list_datasets: catalog discoverability flags + tracked-profile fallba…
mprammer May 16, 2026
b47fa88
tui: faceted search + Columns-modal rendering refresh
mprammer May 16, 2026
6fcb97c
catalog: data-kind tag vocab + Public BI enrichment + showcase reshape
mprammer May 16, 2026
85388c2
pipeline: drop --family CLI flag and iter_datasets kwarg
mprammer May 16, 2026
bab076f
docs: drop --family from examples; refresh skill cards for new vocab
mprammer May 16, 2026
82e7b74
autotag: tighten enums criteria (ndv ≤ 32 + short mean_len)
mprammer May 16, 2026
4426038
autotag: detect list<float> embeddings + drop "embedded" false-positive
mprammer May 16, 2026
a5b1b22
autotag: require ≥2 enum cols before tagging a slug enum-shaped
mprammer May 16, 2026
4a49c23
profile: auto-promote built profiles into tracked mirror
mprammer May 17, 2026
44a5f9a
fetch: per-slug verify_tls opt-in for expired-cert upstreams
mprammer May 17, 2026
35394e4
wdi: re-enable via verify_tls + sha256 lock; build + profile
mprammer May 17, 2026
869bfe4
profiles: blocklist overnight pass + bi-commongovernment enrichment
mprammer May 17, 2026
92ccfab
profiles: blocklist round 2 — ghcn-daily, laion-400m, nypd-complaints…
mprammer May 17, 2026
e7288e2
changelog: bring 0.1.4 entry up to date with shipped scope
mprammer May 17, 2026
0b26a80
profiles: blocklist round 3 — final 10 giants land; 249/249 tracked
mprammer May 17, 2026
d8efc98
changelog: refresh 0.1.4 for full catalog coverage and cold-reader cl…
mprammer May 17, 2026
d8e43cc
profile: add --force flag + refresh 12 stale pre-release profiles
mprammer May 17, 2026
f776dbc
tui: keep histogram axis labels in standard notation up to 100k
mprammer May 17, 2026
1955c15
tui: distinguish "no profile yet" from "column skipped by profile.py"
mprammer May 17, 2026
b0d2a8a
tui: drop "source of truth" sentence from skipped-column hint
mprammer May 17, 2026
9efb3c7
ruff: organize imports + drop unused locals + fix string annotations
mprammer May 17, 2026
2 changes: 1 addition & 1 deletion .agents/skills/README.md
@@ -19,7 +19,7 @@ Wrappers around `python -m scripts.pipeline.<module>`. Side-effecting ones set `
| `/raincloud-tighten-variant` | `scripts.pipeline.tighten_variant` | In-place JSON → VARIANT promotion. |
| `/raincloud-status` | `scripts.pipeline.status` | Per-slug filesystem state (raw / workdir / parquet / vortex / variant-pending). *(read-only, model-invocable.)* |
| `/raincloud-validate-manifest` | `scripts.pipeline.validate_manifest` | Static checks for `sources.json` — JSON Schema + handler-registry / slug-uniqueness / fetch-auth cross-checks. *(read-only, model-invocable.)* |
| `/raincloud-list-datasets` | `scripts.pipeline.list_datasets` | Filter/list slugs by family / handler / license / fetch-type / reader / vortex / regex. *(read-only, model-invocable.)* |
| `/raincloud-list-datasets` | `scripts.pipeline.list_datasets` | Filter/list slugs by handler / license / fetch-type / reader / vortex / tag / showcase / size / regex. *(read-only, model-invocable.)* |

## Procedural playbooks (model-invocable)

3 changes: 0 additions & 3 deletions .agents/skills/raincloud-add-dataset/SKILL.md
@@ -21,7 +21,6 @@ Steps:
"short_name": "My Dataset",
"full_name": "My Dataset (publisher attribution)",
"description": "One-line summary.",
"family": "direct",
"license": { "spdx": "CC0-1.0", "source_url": "...", "redistribution_permitted": true, "attribution_required": false },
"fetch": { "type": "http", "urls": ["https://..."], "auth": null },
"extract": { "type": "passthrough" },
@@ -32,8 +31,6 @@ Steps:
}
```

Pick `family` from existing values (`direct`, `kaggle-upstream`, `nyc-tlc`, `public-bi`, `uci`) — do not invent new ones without discussing.

3. **Validate the manifest.** Invoke `/raincloud-validate-manifest` — sub-second check that the new entry has the right shape, the handler resolves, the slug is unique, and `fetch.type`/`fetch.auth` agree. Catches typos before paying for a fetch.

4. **Run the first build.** Invoke `/raincloud-build <slug> --loose` (the `--loose` is essential when the row count is a guess). If `expect.rows` was wrong, update the manifest with the actual count once the build succeeds.
13 changes: 9 additions & 4 deletions .agents/skills/raincloud-build/SKILL.md
@@ -1,7 +1,7 @@
---
name: raincloud-build
description: Run the full Raincloud pipeline (fetch → extract → parse → transform → write → validate → convert) for one or more dataset slugs. Use when the user asks to build a dataset, rebuild a slug, or process a family.
argument-hint: <slug>... | --family <name> | --all [--loose] [--clean-workdir]
description: Run the full Raincloud pipeline (fetch → extract → parse → transform → write → validate → convert) for one or more dataset slugs. Use when the user asks to build a dataset, rebuild a slug, or process a batch.
argument-hint: <slug>... | --all [--loose] [--clean-workdir]
disable-model-invocation: true
allowed-tools: Bash(python -m scripts.pipeline.build *)
---
@@ -14,12 +14,11 @@ python -m scripts.pipeline.build $ARGUMENTS

Selection (at least one required):
- `<slug>...` — positional dataset slugs (any number)
- `--family <name>` — every dataset in a family (`direct`, `kaggle-upstream`, `nyc-tlc`, `public-bi`, `uci`)
- `--all` — every dataset in `sources.json`

Modifiers:
- `--loose` — downgrade `expect.rows` mismatches from errors to warnings. Use on the first build of a new slug before you know the exact row count.
- `--clean-workdir` — wipe `_workdir/<slug>/` after each successful build. Essential for whole-family runs (Public BI decompressed CSVs can hit ~100 GB).
- `--clean-workdir` — wipe `_workdir/<slug>/` after each successful build. Essential for large batch runs (Public BI decompressed CSVs can hit ~100 GB).

Before running:
- **Confirm with the user** before triggering anything non-trivial. JSONBench 100M ≈ 6 h, Wikipedia Structured Contents → 34 GB parquet, OSM Germany ~45 min per kind. Small (<100 MB) parquets are fine without asking. (See [AGENTS.md "Rebuilding is expensive"](../../context/AGENTS.md).)
@@ -29,3 +28,9 @@ Before running:
After a successful build, suggest running `/raincloud-docs` to regenerate derived docs.

Context: [SKILLS.md](../../context/SKILLS.md), [AGENTS.md](../../context/AGENTS.md).

## Note (0.1.4)

Per-column profiles are a separate opt-in stage:
`python -m scripts.pipeline.profile <slug>` after a build. Not part of the
default build pipeline. See `raincloud-profile` skill.
3 changes: 1 addition & 2 deletions .agents/skills/raincloud-convert/SKILL.md
@@ -1,7 +1,7 @@
---
name: raincloud-convert
description: Run only stage 7 — emit a sibling .vortex next to each opted-in parquet. Use when the user asks to (re)convert parquet files to Vortex format without rebuilding from raw bytes.
argument-hint: <slug>... | --family <name> | --all
argument-hint: <slug>... | --all
disable-model-invocation: true
allowed-tools: Bash(python -m scripts.pipeline.convert *)
---
@@ -14,7 +14,6 @@ python -m scripts.pipeline.convert $ARGUMENTS

Selection (at least one required):
- `<slug>...` — positional slugs
- `--family <name>` — every dataset in a family
- `--all` — every dataset in the manifest

Behavior:
37 changes: 37 additions & 0 deletions .agents/skills/raincloud-discover/SKILL.md
@@ -0,0 +1,37 @@
---
name: raincloud-discover
description: Use when the user wants to find "interesting" datasets — exposes the new tag / showcase / size / trait / view filters on list_datasets.
---

# raincloud-discover

Wraps `python -m scripts.pipeline.list_datasets` with the discoverability flags. The TUI (`python -m scripts.pipeline.browse`) is the interactive sibling.

## Closed vocab

- **Showcase tiers** (2): `encoding`, `stress`. Run `--showcase-help` for live counts.
- **Domain tags** (12): `geospatial`, `nlp-text`, `web-analytics`, `e-commerce`, `finance`, `social`, `scientific`, `healthcare`, `sports`, `transportation`, `government`, `benchmark`. Run `--tags-help` for live counts.
- **Size buckets** (5): `xs / s / m / l / xl` (file-size on disk).
- **Trait flags** (6): `has_nested`, `has_timestamp`, `has_variant`, `string_heavy`, `wide_row`, `high_cardinality_present`. Prefix with `!` to negate.

## Patterns

```bash
python -m scripts.pipeline.list_datasets --view encoding --long
python -m scripts.pipeline.list_datasets --tag geospatial --tag scientific
python -m scripts.pipeline.list_datasets --trait has_nested --size m
python -m scripts.pipeline.list_datasets --trait '!has_nested' --vortex
python -m scripts.pipeline.list_datasets --inspect clickbench-hits
python -m scripts.pipeline.list_datasets --tags-help
python -m scripts.pipeline.list_datasets --showcase-help
```

Filters AND across axes, OR within an axis. `--view` replaces other facet flags (preset is the entire facet spec).
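
The AND-across-axes, OR-within-an-axis rule can be sketched in Python as follows. This only illustrates the stated semantics and is not the `list_datasets` implementation: the `Dataset` class, the `matches` and `trait_hit` helpers, and the `demo-*` slugs are hypothetical, and treating repeated `--trait` values as OR is an assumption drawn from the sentence above.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    # Hypothetical stand-in for one catalog entry.
    slug: str
    tags: set = field(default_factory=set)
    size: str = "m"
    traits: set = field(default_factory=set)


def trait_hit(ds, trait):
    # A leading "!" negates the trait, mirroring `--trait '!has_nested'`.
    if trait.startswith("!"):
        return trait[1:] not in ds.traits
    return trait in ds.traits


def matches(ds, tags=(), sizes=(), traits=()):
    # An axis with no requested values imposes no constraint.
    tag_ok = not tags or any(t in ds.tags for t in tags)            # OR within axis
    size_ok = not sizes or ds.size in sizes                         # OR within axis
    trait_ok = not traits or any(trait_hit(ds, t) for t in traits)  # OR within axis
    return tag_ok and size_ok and trait_ok                          # AND across axes


catalog = [
    Dataset("demo-geo", {"geospatial"}, "m", {"has_nested"}),
    Dataset("demo-sci", {"scientific"}, "xl", set()),
    Dataset("demo-fin", {"finance"}, "m", {"has_timestamp"}),
]

# Equivalent in spirit to: --tag geospatial --tag scientific --size m
selected = [d.slug for d in catalog
            if matches(d, tags=("geospatial", "scientific"), sizes=("m",))]
print(selected)  # ['demo-geo']
```

`demo-sci` carries a requested tag but fails the size axis, so it is excluded; that is the AND across axes at work.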

## When to invoke

- "Find me datasets with X property" → use traits + tags + size filters
- "What's good for testing nested-type encoding?" → `--trait has_nested --vortex`
- "Show me a small dataset to iterate on" → `--size xs --size s`
- "Inspect this slug" → `--inspect <slug>`
- "What tags / tiers exist?" → `--tags-help` / `--showcase-help`
2 changes: 1 addition & 1 deletion .agents/skills/raincloud-docs/SKILL.md
@@ -25,7 +25,7 @@ python -m scripts.pipeline.docs && cp docs/datasets.md docs/handlers.md docs/sna

When to run:
- After a build (row counts, file sizes, and `snapshot.json` schema for that slug change).
- After manifest edits that affect `short_name` / `license` / `description` / `family` / `expect.rows` / `convert.vortex`.
- After manifest edits that affect `short_name` / `license` / `description` / `expect.rows` / `convert.vortex`.
- After adding, removing, or renaming a handler (`handlers.md` regenerates from the registry + manifest usage).
- After any schema-affecting change to a slug's transform handler — re-run `snapshot` so the TUI's fallback view picks up the new shape.

6 changes: 3 additions & 3 deletions .agents/skills/raincloud-large-build/SKILL.md
@@ -1,7 +1,7 @@
---
name: raincloud-large-build
description: Run a memory- or runtime-heavy build safely with memory caps, scratch redirection, nohup, and progress logging. Use for multi-hour or multi-GB builds (JSONBench 100M, Wikipedia Structured Contents, OSM Germany, Public BI families).
argument-hint: <slug> [--family <name> | --all] [--loose] [--clean-workdir]
description: Run a memory- or runtime-heavy build safely with memory caps, scratch redirection, nohup, and progress logging. Use for multi-hour or multi-GB builds (JSONBench 100M, Wikipedia Structured Contents, OSM Germany, Public BI batches).
argument-hint: <slug>... [--all] [--loose] [--clean-workdir]
disable-model-invocation: true
---

@@ -21,7 +21,7 @@ PYTHONUNBUFFERED=1 \

Flag rationale:
- `--loose` — first build of a new slug, before `expect.rows` is known. Downgrades row-count mismatches from errors to warnings.
- `--clean-workdir` — wipe `_workdir/<slug>/` after each successful build. Essential for whole-family runs (Public BI decompressed CSVs can hit ~100 GB).
- `--clean-workdir` — wipe `_workdir/<slug>/` after each successful build. Essential for large batch runs (Public BI decompressed CSVs can hit ~100 GB).
- `RAINCLOUD_DUCKDB_MEMORY_LIMIT` — caps DuckDB's working set; default (~80% of system RAM) can swap-thrash on heavily-nested VARIANT shredding. 96 GB is the tested ceiling for Open Food Facts.
- `RAINCLOUD_DUCKDB_TEMP_DIRECTORY` — point at a large volume; the system tempdir often runs out on big builds.
- `PYTHONUNBUFFERED=1` — log file flushes line-by-line so progress is inspectable mid-run.
13 changes: 8 additions & 5 deletions .agents/skills/raincloud-list-datasets/SKILL.md
@@ -1,7 +1,7 @@
---
name: raincloud-list-datasets
description: Filter and list datasets from sources.json without grepping a 313 KB JSON file. Use when the user asks "which slugs use handler X", "show me all UCI datasets", "what's gated behind Kaggle ToS", or any other catalog-shape question that's faster than reading docs/v1/datasets.md (55 KB) end to end.
argument-hint: [--family <f>] [--handler <h>] [--license <spdx>] [--fetch-type <t>] [--reader <r>] [--vortex|--no-vortex] [--kaggle-tos] [--grep <pattern>] [--long|--json|--count]
description: Filter and list datasets from sources.json without grepping a 545 KB JSON file. Use when the user asks "which slugs use handler X", "show me all UCI datasets", "what's gated behind Kaggle ToS", or any other catalog-shape question that's faster than reading docs/v1/datasets.md end to end.
argument-hint: [--handler <h>] [--license <spdx>] [--fetch-type <t>] [--reader <r>] [--vortex|--no-vortex] [--kaggle-tos] [--grep <pattern>] [--long|--json|--count]
allowed-tools: Bash(python -m scripts.pipeline.list_datasets *)
---

@@ -17,8 +17,7 @@ Filters (compose with AND):

| Flag | Filter |
|---|---|
| `--family <f>` | `family` ∈ `direct`, `kaggle-upstream`, `nyc-tlc`, `public-bi`, `uci` |
| `--handler <h>` | `transform.handler` exact match (e.g. `tighten_types`, `glove_split`) |
| `--handler <h>` | `transform.handler` exact match (e.g. `tighten_types`, `glove_split`, `uci_default`) |
| `--license <spdx>` | `license.spdx` exact match (e.g. `CC0-1.0`, `Apache-2.0`) |
| `--fetch-type <t>` | `fetch.type` ∈ `http`, `kaggle`, `huggingface`, `custom` |
| `--reader <r>` | `parse.reader` ∈ `csv`, `parquet`, `jsonl`, `xml`, `pbf`, `custom` |
@@ -28,7 +27,7 @@ Filters (compose with AND):

Output modes (default = one slug per line):

- `--long` — wide table with slug, family, handler, fetch type, reader, license, row count, vortex flag.
- `--long` — wide table with slug, handler, fetch type, reader, license, row count, vortex flag.
- `--json` — one JSON object per matching dataset (pipe into `jq` for further filtering).
- `--count` — just the count of matches.

@@ -51,3 +50,7 @@ python -m scripts.pipeline.list_datasets --grep '\bgeo' --long
Pair with `/raincloud-status <slug>` to check filesystem state of any returned slug, and `/raincloud-validate-manifest` after editing the manifest based on findings.

Context: [SKILLS.md](../../context/SKILLS.md), [sources.schema.md](../../context/sources.schema.md), [`docs/v1/datasets.md`](../../../docs/v1/datasets.md) for full-row metadata.

## New discovery axes (0.1.4)

`--showcase`, `--tag`, `--size`, `--trait`, `--view`, `--inspect`, `--tags-help`, `--showcase-help`. See the `raincloud-discover` skill for the full vocabulary and patterns.
46 changes: 46 additions & 0 deletions .agents/skills/raincloud-profile/SKILL.md
@@ -0,0 +1,46 @@
---
name: raincloud-profile
description: Use when the user asks to compute or refresh per-column statistics for a raincloud dataset — produces `outputs/v1/<slug>/profile.json` for the TUI's detail pane and `list_datasets --inspect`.
---

# raincloud-profile

Wraps `python -m scripts.pipeline.profile`. Opt-in stage; off the default
build path. Idempotent against parquet sha256.
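
One way "idempotent against parquet sha256" could work is sketched below. It is a guess at the mechanism, not the actual `profile.py` logic: the `parquet_sha256` key and the `file_sha256` / `needs_profile` helpers are hypothetical, and the source does not say where the hash is recorded.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large parquets never load whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_profile(parquet: Path, profile: Path) -> bool:
    """Return True when the profile is missing or was built from other bytes."""
    if not profile.exists():
        return True
    # Hypothetical key: assumes the profile records the hash of its input.
    recorded = json.loads(profile.read_text()).get("parquet_sha256")
    return recorded != file_sha256(parquet)
```

Under this assumption, a rerun on an unchanged parquet is a no-op, while a rebuild that changes the file's bytes triggers a fresh profile.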

## When to invoke

- "Generate profiles for X / for everything that's built"
- "Refresh the profile after I rebuilt slug X"
- "Show me per-column statistics" → run this first, then `list_datasets --inspect <slug>`

## Patterns

```bash
python -m scripts.pipeline.profile <slug> # one
python -m scripts.pipeline.profile --all # every built parquet
python -m scripts.pipeline.profile --sample-rows 1000000 <slug> # cap for huge slugs
```

Per-dtype stats:
- **Numeric**: histogram + NDV + min/max/mean
- **String / binary**: NDV; top-5 when NDV ≤ 256 (binary skips top-5 — bytes don't render usefully)
- **Bool**: T/F/null counts
- **Date/Timestamp**: range + 10-bucket histogram
- **List/Map**: length min/max/mean
- **Struct / variant**: skipped (emits null at column-map level)
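
The string rule in the list above (NDV always, top-5 only when NDV ≤ 256) can be sketched in plain Python. This is a minimal illustration, not the code in `profile.py`: the function name, its parameters, and the output keys `ndv` and `top` are hypothetical.

```python
from collections import Counter


def profile_string_column(values, top_k=5, ndv_cap=256):
    """Return NDV, plus the top-k values when the column is low-cardinality."""
    counts = Counter(v for v in values if v is not None)  # nulls are not values
    profile = {"ndv": len(counts)}
    if profile["ndv"] <= ndv_cap:
        # most_common sorts by count, descending.
        profile["top"] = counts.most_common(top_k)
    return profile


print(profile_string_column(["a", "b", "a", None]))
# {'ndv': 2, 'top': [('a', 2), ('b', 1)]}
```

Capping on NDV keeps the profile small for high-cardinality columns, where a top-5 list says little about the distribution.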

Profiles live at `outputs/v1/<slug>/profile.json` and are read by:
- The TUI's right-pane Columns section (`python -m scripts.pipeline.browse`)
- The CLI's `--inspect <slug>` rendering
- `docs.py` for backfilling `shape_traits.high_cardinality_present` into `snapshot.json`

The tracked mirror at `docs/v1/profiles/<slug>.json` is the fallback path that
fresh clones ship — both `list_datasets --inspect` and the TUI's Columns pane
read it when no built `outputs/v1/<slug>/profile.json` exists locally. The
`profile` stage auto-runs `python -m scripts.pipeline.promote_profiles` at the
end of a successful run (copies built profile.json into the tracked mirror,
byte-identical mirror skipped), so the tracked snapshot stays in sync without
a manual step. Pass `--no-promote` to suppress that auto-step while iterating;
invoke `promote_profiles` explicitly (or with `--check`) to re-sync or audit
the mirror after manual edits.
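
The "byte-identical mirror skipped" behaviour can be sketched as a copy-if-different helper. This is an illustration under assumptions, not the `promote_profiles` source: `promote_profile` and its return convention are hypothetical.

```python
import shutil
from pathlib import Path


def promote_profile(built: Path, mirror: Path) -> bool:
    """Copy `built` over `mirror` unless the two are already byte-identical.

    Returns True when a copy happened, False when it was skipped.
    """
    if mirror.exists() and mirror.read_bytes() == built.read_bytes():
        return False  # mirror already in sync; leave it untouched
    mirror.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(built, mirror)
    return True
```

Skipping identical files avoids rewriting mirror files that have not changed, which is useful when the mirror is tracked in version control.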
3 changes: 1 addition & 2 deletions .agents/skills/raincloud-status/SKILL.md
@@ -1,7 +1,7 @@
---
name: raincloud-status
description: Report per-dataset state (raw / workdir / parquet / vortex / variant-pending) across the manifest. Use when the user asks what's downloaded, what's built, what's missing, what needs re-tightening, or to triage which slugs still need work.
argument-hint: [<slug>...] [--family <name>] [--fast] [--missing-only] [--json]
argument-hint: [<slug>...] [--fast] [--missing-only] [--json]
allowed-tools: Bash(python -m scripts.pipeline.status *)
---

@@ -23,7 +23,6 @@ Walks the manifest and reports per-slug filesystem state in five columns:

Selection (default: every slug in the manifest):
- `<slug>...` — positional slugs
- `--family <name>` — every dataset in a family
- `--all` — explicit "every dataset" (the default, kept for parity with `/raincloud-build`)

Modifiers:
2 changes: 1 addition & 1 deletion .agents/skills/raincloud-validate-manifest/SKILL.md
@@ -30,7 +30,7 @@ Modifiers:
Exit code: `0` on success (warnings allowed), `1` on errors.

When to invoke:
- After editing `sources.json` (especially handler renames, family changes, slug additions).
- After editing `sources.json` (especially handler renames, license changes, slug additions).
- Before `/raincloud-build` on a fresh slug — catches typo'd handler names without paying for a fetch.
- As the read-only counterpart to `/raincloud-status`: `/raincloud-status` reports filesystem state, `/raincloud-validate-manifest` reports manifest correctness.

4 changes: 2 additions & 2 deletions AGENTS.md
@@ -25,13 +25,13 @@ Validates `sources.json` against [`sources.schema.json`](sources.schema.json) (D
For catalog queries that would otherwise require grepping the ~545 KB `sources.json` (or scrolling ~158 KB of [`docs/v1/datasets.md`](docs/v1/datasets.md)):

```bash
python -m scripts.pipeline.list_datasets --family uci --count
python -m scripts.pipeline.list_datasets --handler uci_default --count
python -m scripts.pipeline.list_datasets --handler tighten_types --long
python -m scripts.pipeline.list_datasets --fetch-type kaggle --kaggle-tos
python -m scripts.pipeline.list_datasets --grep '\bgeo' --long
```

Filters compose with AND across `--family`, `--handler`, `--license`, `--fetch-type`, `--reader`, `--vortex` / `--no-vortex`, `--kaggle-tos`, `--grep`. Output modes: default (one slug per line), `--long` (wide table), `--json` (jq-friendly), `--count`.
Filters compose with AND across `--handler`, `--license`, `--fetch-type`, `--reader`, `--vortex` / `--no-vortex`, `--kaggle-tos`, `--grep`. Output modes: default (one slug per line), `--long` (wide table), `--json` (jq-friendly), `--count`.

If the user wants to *browse* interactively rather than query, point them at `python -m scripts.pipeline.browse` (read-only Textual TUI over the same data; requires `uv sync --extra tui --inexact`). It's a human-facing tool — don't try to run it from an agent context, since it won't render and will hang waiting for keystrokes.
