spiraldb · mprammer · May 17, 2026 · May 12, 2026 · May 16, 2026 · May 16, 2026
diff --git a/.agents/skills/README.md b/.agents/skills/README.md
@@ -19,7 +19,7 @@ Wrappers around `python -m scripts.pipeline.<module>`. Side-effecting ones set `
 | `/raincloud-tighten-variant` | `scripts.pipeline.tighten_variant` | In-place JSON → VARIANT promotion. |
 | `/raincloud-status` | `scripts.pipeline.status` | Per-slug filesystem state (raw / workdir / parquet / vortex / variant-pending). *(read-only, model-invocable.)* |
 | `/raincloud-validate-manifest` | `scripts.pipeline.validate_manifest` | Static checks for `sources.json` — JSON Schema + handler-registry / slug-uniqueness / fetch-auth cross-checks. *(read-only, model-invocable.)* |
-| `/raincloud-list-datasets` | `scripts.pipeline.list_datasets` | Filter/list slugs by family / handler / license / fetch-type / reader / vortex / regex. *(read-only, model-invocable.)* |
+| `/raincloud-list-datasets` | `scripts.pipeline.list_datasets` | Filter/list slugs by handler / license / fetch-type / reader / vortex / tag / showcase / size / trait / regex. *(read-only, model-invocable.)* |
 
 ## Procedural playbooks (model-invocable)
 

diff --git a/.agents/skills/raincloud-add-dataset/SKILL.md b/.agents/skills/raincloud-add-dataset/SKILL.md
@@ -21,7 +21,6 @@ Steps:
      "short_name": "My Dataset",
      "full_name": "My Dataset (publisher attribution)",
      "description": "One-line summary.",
-     "family": "direct",
      "license": { "spdx": "CC0-1.0", "source_url": "...", "redistribution_permitted": true, "attribution_required": false },
      "fetch":     { "type": "http", "urls": ["https://..."], "auth": null },
      "extract":   { "type": "passthrough" },
@@ -32,8 +31,6 @@ Steps:
    }
    ```
 
-   Pick `family` from existing values (`direct`, `kaggle-upstream`, `nyc-tlc`, `public-bi`, `uci`) — do not invent new ones without discussing.
-
 3. **Validate the manifest.** Invoke `/raincloud-validate-manifest` — sub-second check that the new entry has the right shape, the handler resolves, the slug is unique, and `fetch.type`/`fetch.auth` agree. Catches typos before paying for a fetch.
 
 4. **Run the first build.** Invoke `/raincloud-build <slug> --loose` (the `--loose` is essential when the row count is a guess). If `expect.rows` was wrong, update the manifest with the actual count once the build succeeds.

diff --git a/.agents/skills/raincloud-build/SKILL.md b/.agents/skills/raincloud-build/SKILL.md
@@ -1,7 +1,7 @@
 ---
 name: raincloud-build
-description: Run the full Raincloud pipeline (fetch → extract → parse → transform → write → validate → convert) for one or more dataset slugs. Use when the user asks to build a dataset, rebuild a slug, or process a family.
-argument-hint: <slug>... | --family <name> | --all  [--loose] [--clean-workdir]
+description: Run the full Raincloud pipeline (fetch → extract → parse → transform → write → validate → convert) for one or more dataset slugs. Use when the user asks to build a dataset, rebuild a slug, or process a batch.
+argument-hint: <slug>... | --all  [--loose] [--clean-workdir]
 disable-model-invocation: true
 allowed-tools: Bash(python -m scripts.pipeline.build *)
 ---
@@ -14,12 +14,11 @@ python -m scripts.pipeline.build $ARGUMENTS
 
 Selection (at least one required):
 - `<slug>...` — positional dataset slugs (any number)
-- `--family <name>` — every dataset in a family (`direct`, `kaggle-upstream`, `nyc-tlc`, `public-bi`, `uci`)
 - `--all` — every dataset in `sources.json`
 
 Modifiers:
 - `--loose` — downgrade `expect.rows` mismatches from errors to warnings. Use on the first build of a new slug before you know the exact row count.
-- `--clean-workdir` — wipe `_workdir/<slug>/` after each successful build. Essential for whole-family runs (Public BI decompressed CSVs can hit ~100 GB).
+- `--clean-workdir` — wipe `_workdir/<slug>/` after each successful build. Essential for large batch runs (Public BI decompressed CSVs can hit ~100 GB).
 
 Before running:
 - **Confirm with the user** before triggering anything non-trivial. JSONBench 100M ≈ 6 h, Wikipedia Structured Contents → 34 GB parquet, OSM Germany ~45 min per kind. Small (<100 MB) parquets are fine without asking. (See [AGENTS.md "Rebuilding is expensive"](../../context/AGENTS.md).)
@@ -29,3 +28,9 @@ Before running:
 After a successful build, suggest running `/raincloud-docs` to regenerate derived docs.
 
 Context: [SKILLS.md](../../context/SKILLS.md), [AGENTS.md](../../context/AGENTS.md).
+
+## Note (0.1.4)
+
+Per-column profiles are a separate opt-in stage: run
+`python -m scripts.pipeline.profile <slug>` after a build. It is not part of the
+default build pipeline. See the `raincloud-profile` skill.
diff --git a/.agents/skills/raincloud-convert/SKILL.md b/.agents/skills/raincloud-convert/SKILL.md
@@ -1,7 +1,7 @@
 ---
 name: raincloud-convert
 description: Run only stage 7 — emit a sibling .vortex next to each opted-in parquet. Use when the user asks to (re)convert parquet files to Vortex format without rebuilding from raw bytes.
-argument-hint: <slug>... | --family <name> | --all
+argument-hint: <slug>... | --all
 disable-model-invocation: true
 allowed-tools: Bash(python -m scripts.pipeline.convert *)
 ---
@@ -14,7 +14,6 @@ python -m scripts.pipeline.convert $ARGUMENTS
 
 Selection (at least one required):
 - `<slug>...` — positional slugs
-- `--family <name>` — every dataset in a family
 - `--all` — every dataset in the manifest
 
 Behavior:

diff --git a/.agents/skills/raincloud-discover/SKILL.md b/.agents/skills/raincloud-discover/SKILL.md
@@ -0,0 +1,37 @@
+---
+name: raincloud-discover
+description: Use when the user wants to find "interesting" datasets — exposes the new tag / showcase / size / trait / view filters on list_datasets.
+---
+
+# raincloud-discover
+
+Wraps `python -m scripts.pipeline.list_datasets` with the discoverability flags. The TUI (`python -m scripts.pipeline.browse`) is the interactive sibling.
+
+## Closed vocab
+
+- **Showcase tiers** (2): `encoding`, `stress`. Run `--showcase-help` for live counts.
+- **Domain tags** (12): `geospatial`, `nlp-text`, `web-analytics`, `e-commerce`, `finance`, `social`, `scientific`, `healthcare`, `sports`, `transportation`, `government`, `benchmark`. Run `--tags-help` for live counts.
+- **Size buckets** (5): `xs / s / m / l / xl` (file-size on disk).
+- **Trait flags** (6): `has_nested`, `has_timestamp`, `has_variant`, `string_heavy`, `wide_row`, `high_cardinality_present`. Prefix with `!` to negate.
+
+## Patterns
+
+```bash
+python -m scripts.pipeline.list_datasets --view encoding --long
+python -m scripts.pipeline.list_datasets --tag geospatial --tag scientific
+python -m scripts.pipeline.list_datasets --trait has_nested --size m
+python -m scripts.pipeline.list_datasets --trait '!has_nested' --vortex
+python -m scripts.pipeline.list_datasets --inspect clickbench-hits
+python -m scripts.pipeline.list_datasets --tags-help
+python -m scripts.pipeline.list_datasets --showcase-help
+```
+
+Filters combine with AND across axes and with OR within an axis. `--view` replaces the other facet flags (the preset is the entire facet spec).
+
+## When to invoke
+
+- "Find me datasets with X property" → use traits + tags + size filters
+- "What's good for testing nested-type encoding?" → `--trait has_nested --vortex`
+- "Show me a small dataset to iterate on" → `--size xs --size s`
+- "Inspect this slug" → `--inspect <slug>`
+- "What tags / tiers exist?" → `--tags-help` / `--showcase-help`
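The "AND across axes, OR within an axis" rule in the new `raincloud-discover` skill can be sketched in a few lines of Python. This is only an illustration of the semantics as the skill text describes them, not the code in `scripts.pipeline.list_datasets`: the helper names `axis_matches` and `dataset_matches`, the in-memory dataset records, and the slugs `alpha` / `beta` / `gamma` are all hypothetical, and treating a `!`-prefixed trait as "matches when the trait is absent" is an assumption drawn from the "Prefix with `!` to negate" note.

```python
def axis_matches(values, requested):
    """OR within one axis: a single matching requested value is enough."""
    for want in requested:
        if want.startswith("!"):
            # Assumed negation rule: "!x" matches when x is absent.
            if want[1:] not in values:
                return True
        elif want in values:
            return True
    return False


def dataset_matches(dataset, filters):
    """AND across axes: every axis that carries a filter must match."""
    return all(
        axis_matches(dataset.get(axis, set()), requested)
        for axis, requested in filters.items()
        if requested  # an axis with no requested values imposes no constraint
    )


# Hypothetical catalog records, keyed by made-up slugs.
datasets = {
    "alpha": {"tag": {"geospatial"}, "size": {"m"}, "trait": {"has_nested"}},
    "beta": {"tag": {"scientific"}, "size": {"xl"}, "trait": {"has_timestamp"}},
    "gamma": {"tag": {"finance"}, "size": {"m"}, "trait": set()},
}

# Roughly: --tag geospatial --tag scientific --size m
filters = {"tag": ["geospatial", "scientific"], "size": ["m"]}
selected = sorted(s for s, d in datasets.items() if dataset_matches(d, filters))
print(selected)
```

Under these assumptions, repeating a flag (two `--tag` values) widens the match, while adding a flag on a different axis (`--size`) narrows it.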
diff --git a/.agents/skills/raincloud-docs/SKILL.md b/.agents/skills/raincloud-docs/SKILL.md
@@ -25,7 +25,7 @@ python -m scripts.pipeline.docs && cp docs/datasets.md docs/handlers.md docs/sna
 
 When to run:
 - After a build (row counts, file sizes, and `snapshot.json` schema for that slug change).
-- After manifest edits that affect `short_name` / `license` / `description` / `family` / `expect.rows` / `convert.vortex`.
+- After manifest edits that affect `short_name` / `license` / `description` / `expect.rows` / `convert.vortex`.
 - After adding, removing, or renaming a handler (`handlers.md` regenerates from the registry + manifest usage).
 - After any schema-affecting change to a slug's transform handler — re-run `snapshot` so the TUI's fallback view picks up the new shape.
 

diff --git a/.agents/skills/raincloud-large-build/SKILL.md b/.agents/skills/raincloud-large-build/SKILL.md
@@ -1,7 +1,7 @@
 ---
 name: raincloud-large-build
-description: Run a memory- or runtime-heavy build safely with memory caps, scratch redirection, nohup, and progress logging. Use for multi-hour or multi-GB builds (JSONBench 100M, Wikipedia Structured Contents, OSM Germany, Public BI families).
-argument-hint: <slug> [--family <name> | --all] [--loose] [--clean-workdir]
+description: Run a memory- or runtime-heavy build safely with memory caps, scratch redirection, nohup, and progress logging. Use for multi-hour or multi-GB builds (JSONBench 100M, Wikipedia Structured Contents, OSM Germany, Public BI batches).
+argument-hint: <slug>... [--all] [--loose] [--clean-workdir]
 disable-model-invocation: true
 ---
 
@@ -21,7 +21,7 @@ PYTHONUNBUFFERED=1 \
 
 Flag rationale:
 - `--loose` — first build of a new slug, before `expect.rows` is known. Downgrades row-count mismatches from errors to warnings.
-- `--clean-workdir` — wipe `_workdir/<slug>/` after each successful build. Essential for whole-family runs (Public BI decompressed CSVs can hit ~100 GB).
+- `--clean-workdir` — wipe `_workdir/<slug>/` after each successful build. Essential for large batch runs (Public BI decompressed CSVs can hit ~100 GB).
 - `RAINCLOUD_DUCKDB_MEMORY_LIMIT` — caps DuckDB's working set; default (~80% of system RAM) can swap-thrash on heavily-nested VARIANT shredding. 96 GB is the tested ceiling for Open Food Facts.
 - `RAINCLOUD_DUCKDB_TEMP_DIRECTORY` — point at a large volume; the system tempdir often runs out on big builds.
 - `PYTHONUNBUFFERED=1` — log file flushes line-by-line so progress is inspectable mid-run.

diff --git a/.agents/skills/raincloud-list-datasets/SKILL.md b/.agents/skills/raincloud-list-datasets/SKILL.md
@@ -1,7 +1,7 @@
 ---
 name: raincloud-list-datasets
-description: Filter and list datasets from sources.json without grepping a 313 KB JSON file. Use when the user asks "which slugs use handler X", "show me all UCI datasets", "what's gated behind Kaggle ToS", or any other catalog-shape question that's faster than reading docs/v1/datasets.md (55 KB) end to end.
-argument-hint: [--family <f>] [--handler <h>] [--license <spdx>] [--fetch-type <t>] [--reader <r>] [--vortex|--no-vortex] [--kaggle-tos] [--grep <pattern>] [--long|--json|--count]
+description: Filter and list datasets from sources.json without grepping a 545 KB JSON file. Use when the user asks "which slugs use handler X", "show me all UCI datasets", "what's gated behind Kaggle ToS", or any other catalog-shape question that's faster than reading docs/v1/datasets.md end to end.
+argument-hint: [--handler <h>] [--license <spdx>] [--fetch-type <t>] [--reader <r>] [--vortex|--no-vortex] [--kaggle-tos] [--grep <pattern>] [--long|--json|--count]
 allowed-tools: Bash(python -m scripts.pipeline.list_datasets *)
 ---
 
@@ -17,8 +17,7 @@ Filters (compose with AND):
 
 | Flag | Filter |
 |---|---|
-| `--family <f>` | `family` ∈ `direct`, `kaggle-upstream`, `nyc-tlc`, `public-bi`, `uci` |
-| `--handler <h>` | `transform.handler` exact match (e.g. `tighten_types`, `glove_split`) |
+| `--handler <h>` | `transform.handler` exact match (e.g. `tighten_types`, `glove_split`, `uci_default`) |
 | `--license <spdx>` | `license.spdx` exact match (e.g. `CC0-1.0`, `Apache-2.0`) |
 | `--fetch-type <t>` | `fetch.type` ∈ `http`, `kaggle`, `huggingface`, `custom` |
 | `--reader <r>` | `parse.reader` ∈ `csv`, `parquet`, `jsonl`, `xml`, `pbf`, `custom` |
@@ -28,7 +27,7 @@ Filters (compose with AND):
 
 Output modes (default = one slug per line):
 
-- `--long` — wide table with slug, family, handler, fetch type, reader, license, row count, vortex flag.
+- `--long` — wide table with slug, handler, fetch type, reader, license, row count, vortex flag.
 - `--json` — one JSON object per matching dataset (pipe into `jq` for further filtering).
 - `--count` — just the count of matches.
 
@@ -51,3 +50,7 @@ python -m scripts.pipeline.list_datasets --grep '\bgeo' --long
 Pair with `/raincloud-status <slug>` to check filesystem state of any returned slug, and `/raincloud-validate-manifest` after editing the manifest based on findings.
 
 Context: [SKILLS.md](../../context/SKILLS.md), [sources.schema.md](../../context/sources.schema.md), [`docs/v1/datasets.md`](../../../docs/v1/datasets.md) for full-row metadata.
+
+## New discovery axes (0.1.4)
+
+`--showcase`, `--tag`, `--size`, `--trait`, `--view`, `--inspect`, `--tags-help`, `--showcase-help`. See the `raincloud-discover` skill for the full vocabulary and patterns.
diff --git a/.agents/skills/raincloud-profile/SKILL.md b/.agents/skills/raincloud-profile/SKILL.md
@@ -0,0 +1,46 @@
+---
+name: raincloud-profile
+description: Use when the user asks to compute or refresh per-column statistics for a raincloud dataset — produces `outputs/v1/<slug>/profile.json` for the TUI's detail pane and `list_datasets --inspect`.
+---
+
+# raincloud-profile
+
+Wraps `python -m scripts.pipeline.profile`. Opt-in stage; off the default
+build path. Idempotent against parquet sha256.
+
+## When to invoke
+
+- "Generate profiles for X / for everything that's built"
+- "Refresh the profile after I rebuilt slug X"
+- "Show me per-column statistics" → run this first, then `list_datasets --inspect <slug>`
+
+## Patterns
+
+```bash
+python -m scripts.pipeline.profile <slug>                          # one
+python -m scripts.pipeline.profile --all                           # every built parquet
+python -m scripts.pipeline.profile --sample-rows 1000000 <slug>    # cap for huge slugs
+```
+
+Per-dtype stats:
+- **Numeric**: histogram + NDV + min/max/mean
+- **String / binary**: NDV; top-5 when NDV ≤ 256 (binary skips top-5 — bytes don't render usefully)
+- **Bool**: T/F/null counts
+- **Date/Timestamp**: range + 10-bucket histogram
+- **List/Map**: length min/max/mean
+- **Struct / variant**: skipped (emits null at column-map level)
+
+Profiles live at `outputs/v1/<slug>/profile.json` and are read by:
+- The TUI's right-pane Columns section (`python -m scripts.pipeline.browse`)
+- The CLI's `--inspect <slug>` rendering
+- `docs.py` for backfilling `shape_traits.high_cardinality_present` into `snapshot.json`
+
+The tracked mirror at `docs/v1/profiles/<slug>.json` is the fallback path that
+fresh clones ship: both `list_datasets --inspect` and the TUI's Columns pane
+read it when no built `outputs/v1/<slug>/profile.json` exists locally. The
+`profile` stage auto-runs `python -m scripts.pipeline.promote_profiles` at the
+end of a successful run (it copies each built profile.json into the tracked
+mirror and skips mirrors that are already byte-identical), so the tracked
+snapshot stays in sync without a manual step. Pass `--no-promote` to suppress
+that auto-step while iterating; invoke `promote_profiles` explicitly (or with
+`--check`) to re-sync or audit the mirror after manual edits.
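The "copy unless byte-identical" behavior described for `promote_profiles` could look roughly like the sketch below. It is a guess at the shape of the logic, not the module's actual code: the function names `sha256_of` and `promote`, the boolean return value, and the `demo-slug` paths created in a temporary directory are all hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of the file's full contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def promote(built: Path, mirror: Path) -> bool:
    """Copy built -> mirror unless the mirror already holds identical bytes.

    Returns True when a copy happened, False when it was skipped.
    """
    if mirror.exists() and sha256_of(mirror) == sha256_of(built):
        return False
    mirror.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(built, mirror)
    return True


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Layout mirrors the paths named in the skill text; the slug is made up.
    built = root / "outputs" / "v1" / "demo-slug" / "profile.json"
    mirror = root / "docs" / "v1" / "profiles" / "demo-slug.json"
    built.parent.mkdir(parents=True)
    built.write_text('{"columns": {}}')

    first = promote(built, mirror)   # mirror missing, so the file is copied
    second = promote(built, mirror)  # mirror now identical, so it is skipped
    print(first, second)
```

Comparing digests rather than modification times keeps the tracked mirror stable across fresh clones, where timestamps carry no meaning.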
diff --git a/.agents/skills/raincloud-status/SKILL.md b/.agents/skills/raincloud-status/SKILL.md
@@ -1,7 +1,7 @@
 ---
 name: raincloud-status
 description: Report per-dataset state (raw / workdir / parquet / vortex / variant-pending) across the manifest. Use when the user asks what's downloaded, what's built, what's missing, what needs re-tightening, or to triage which slugs still need work.
-argument-hint: [<slug>...] [--family <name>] [--fast] [--missing-only] [--json]
+argument-hint: [<slug>...] [--fast] [--missing-only] [--json]
 allowed-tools: Bash(python -m scripts.pipeline.status *)
 ---
 
@@ -23,7 +23,6 @@ Walks the manifest and reports per-slug filesystem state in five columns:
 
 Selection (default: every slug in the manifest):
 - `<slug>...` — positional slugs
-- `--family <name>` — every dataset in a family
 - `--all` — explicit "every dataset" (the default, kept for parity with `/raincloud-build`)
 
 Modifiers:

diff --git a/.agents/skills/raincloud-validate-manifest/SKILL.md b/.agents/skills/raincloud-validate-manifest/SKILL.md
@@ -30,7 +30,7 @@ Modifiers:
 Exit code: `0` on success (warnings allowed), `1` on errors.
 
 When to invoke:
-- After editing `sources.json` (especially handler renames, family changes, slug additions).
+- After editing `sources.json` (especially handler renames, license changes, slug additions).
 - Before `/raincloud-build` on a fresh slug — catches typo'd handler names without paying for a fetch.
 - As the read-only counterpart to `/raincloud-status`: `/raincloud-status` reports filesystem state, `/raincloud-validate-manifest` reports manifest correctness.
 

diff --git a/AGENTS.md b/AGENTS.md
@@ -25,13 +25,13 @@ Validates `sources.json` against [`sources.schema.json`](sources.schema.json) (D
 For catalog queries that would otherwise require grepping the ~545 KB `sources.json` (or scrolling ~158 KB of [`docs/v1/datasets.md`](docs/v1/datasets.md)):
 
 ```bash
-python -m scripts.pipeline.list_datasets --family uci --count
+python -m scripts.pipeline.list_datasets --handler uci_default --count
 python -m scripts.pipeline.list_datasets --handler tighten_types --long
 python -m scripts.pipeline.list_datasets --fetch-type kaggle --kaggle-tos
 python -m scripts.pipeline.list_datasets --grep '\bgeo' --long
 ```
 
-Filters compose with AND across `--family`, `--handler`, `--license`, `--fetch-type`, `--reader`, `--vortex` / `--no-vortex`, `--kaggle-tos`, `--grep`. Output modes: default (one slug per line), `--long` (wide table), `--json` (jq-friendly), `--count`.
+Filters compose with AND across `--handler`, `--license`, `--fetch-type`, `--reader`, `--vortex` / `--no-vortex`, `--kaggle-tos`, `--grep`. Output modes: default (one slug per line), `--long` (wide table), `--json` (jq-friendly), `--count`.
 
 If the user wants to *browse* interactively rather than query, point them at `python -m scripts.pipeline.browse` (read-only Textual TUI over the same data; requires `uv sync --extra tui --inexact`). It's a human-facing tool — don't try to run it from an agent context, since it won't render and will hang waiting for keystrokes.