diff --git a/.ai/skills/audit-skill-md/SKILL.md b/.ai/skills/audit-skill-md/SKILL.md
new file mode 100644
index 000000000..30e1a90fd
--- /dev/null
+++ b/.ai/skills/audit-skill-md/SKILL.md
@@ -0,0 +1,284 @@

---
name: audit-skill-md
description: Audit the user-facing skill at skills/datafusion_python/SKILL.md against the current public Python API. Find new APIs that should be documented, stale mentions of removed/renamed APIs, examples that drifted from current idiomatic style, and places that need a "requires datafusion-python NN or newer" note. Run after upstream syncs and before each release.
argument-hint: [scope] (e.g., "session-context", "dataframe", "expr", "functions", "patterns", "pitfalls", "version-notes", "all")
---

# Audit `skills/datafusion_python/SKILL.md`

You are auditing the user-facing skill at
[`skills/datafusion_python/SKILL.md`](../../skills/datafusion_python/SKILL.md)
against the current state of the Python API. The skill is the source of truth
for how AI coding assistants are taught to write `datafusion-python` code, so
it must match what the project actually ships. This skill identifies gaps
caused by upstream syncs, refactors, or renames, and (if asked) applies the
edits directly to `SKILL.md`.

The skill is most usefully run **after** the `check-upstream` step of an
upstream sync (see `dev/release/upstream-sync.md`) — once any new APIs are
exposed, this skill makes sure they get documented.

## What the skill covers

The user-facing `SKILL.md` documents these public surfaces. This list is not
exhaustive — if a new top-level area is added (e.g., a new `Catalog` API
exposed at the package root), include it.

| Surface | Module | Sections in SKILL.md |
|---|---|---|
| `SessionContext` | `python/datafusion/context.py` | "Data Loading" |
| `DataFrame` | `python/datafusion/dataframe.py` | "DataFrame Operations Quick Reference", "Executing and Collecting Results", "Idiomatic Patterns" |
| `Expr` | `python/datafusion/expr.py` | "Expression Building", "Common Pitfalls" |
| `functions` | `python/datafusion/functions.py` | "Available Functions (Categorized)", scattered uses throughout |
| Top-level helpers (`col`, `lit`, `WindowFrame`, ...) | `python/datafusion/__init__.py` | "Import Conventions", "Core Abstractions" |

## Scope argument

The user may specify a scope via `$ARGUMENTS` to limit the audit. If no scope
is given or `all` is specified, audit every area.

| Scope | Audit target |
|---|---|
| `session-context` | `SessionContext` methods and the "Data Loading" section |
| `dataframe` | `DataFrame` methods and the operations / executing / patterns sections |
| `expr` | `Expr` methods/operators and the "Expression Building" section |
| `functions` | `functions.py` `__all__` and the "Available Functions (Categorized)" section |
| `patterns` | "Idiomatic Patterns" section — confirm patterns still match recommended style |
| `pitfalls` | "Common Pitfalls" — confirm each pitfall still reproduces, drop ones fixed upstream |
| `version-notes` | Cross-check version annotations (see below) |
| `all` | Everything above |

## Inputs to read

Before producing the report:

1. `skills/datafusion_python/SKILL.md` — the document being audited.
2. The relevant Python module(s) for the chosen scope. Public surface is the
   `__all__` list (where defined) plus `class` and `def` symbols not prefixed
   with `_` (see the sketch after this list).
3. `Cargo.toml` (root) for the current `datafusion-python` version — read
   the `version` field under `[workspace.package]` (format `NN.0.0`). The
   major version always matches the upstream `datafusion` crate, so a
   single `datafusion-python` version expresses both.
   `python/datafusion/__init__.py`'s `__version__` is the same value
   exposed at runtime.
4. Recent commits touching the relevant module(s) for context on what
   changed since the last sync:
   ```bash
   git log --oneline -- python/datafusion/dataframe.py | head -20
   ```
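The "public surface" notion from item 2 can be computed mechanically. A
minimal sketch, assuming the package is importable in the audit environment;
`public_surface` is a hypothetical helper written for this audit, not part of
the project:

```python
# Sketch only: compute the public surface described in item 2 -- the module's
# __all__ where defined, plus class/def symbols not prefixed with "_".
# `public_surface` is a hypothetical helper, not project code.
import inspect
from types import ModuleType


def public_surface(module: ModuleType) -> set[str]:
    """__all__ (where defined) plus public class and function names."""
    names = set(getattr(module, "__all__", []))
    names |= {
        name
        for name, obj in vars(module).items()
        if not name.startswith("_")
        and (inspect.isclass(obj) or inspect.isroutine(obj))
    }
    return names


if __name__ == "__main__":
    from datafusion import functions

    print(sorted(public_surface(functions)))
```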
## What to look for

Walk through each scoped area and flag four kinds of issues.

### 1. New APIs not mentioned

For each public symbol in the module's `__all__` (or each public class
method), check whether it appears anywhere in `SKILL.md`. A symbol is
"covered" if it shows up in:

- A code block (the strongest signal — it's demonstrated).
- The "Available Functions (Categorized)" list.
- The SQL-to-DataFrame Reference table.

**Decide whether each missing symbol deserves an entry.** Not every public
symbol belongs in `SKILL.md` — the skill is curated for the patterns users
hit daily, not an exhaustive API reference. Use these heuristics:

- **Add it** if it replaces or supersedes something already in the skill
  (e.g., a new operation that is the idiomatic alternative to a documented
  workaround).
- **Add it** if it fits a category already present (a new aggregate function
  goes in the aggregate list; a new join type goes in the joining section).
- **Add it** if it changes how a documented pattern should be written.
- **Skip it** if it is genuinely niche / advanced / experimental.
- **Skip it** if it is internal plumbing exposed for FFI but not user-facing.

When you flag a missing symbol, include a one-line proposed insertion point
(which section / which table row) so a reviewer can decide quickly.

### 2. Stale mentions

For each function name, method name, or import shown in `SKILL.md`, verify it
still exists in the current API:

- Function names mentioned in prose or in the categorized list should appear
  in `python/datafusion/functions.py`'s `__all__`.
- Method calls in code blocks should resolve against the current class.
- Imports (`from datafusion import ...`) should succeed against the current
  `__init__.py`.

A quick way to check that the imports still resolve:

```bash
python -c "from datafusion import SessionContext, col, lit; from datafusion import functions as F; print('ok')"
```

For each stale mention, propose either:
- a rename to the current name, or
- removal if the API is gone with no replacement.
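A rough mechanical first pass over the categorized list is possible before
the manual checks above. A sketch, assuming the repository layout shown
earlier; the backtick heuristic over-matches (methods, keywords, and
parameter names will surface too), so the output is a triage list, not a
verdict:

```python
# Sketch only: flag backticked lowercase identifiers in SKILL.md that do not
# resolve in datafusion.functions.__all__. Heuristic output needs manual
# triage; prose mentions and method calls still need the checks above.
import re
from pathlib import Path

from datafusion import functions

skill_text = Path("skills/datafusion_python/SKILL.md").read_text()
mentioned = set(re.findall(r"`([a-z][a-z0-9_]*)`", skill_text))

for name in sorted(mentioned - set(functions.__all__)):
    print(f"not in functions.__all__ (triage): {name}")
```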
### 3. Examples that drifted from idiomatic style

The skill teaches a Pythonic style: prefer plain strings to `col(...)` when a
column reference is all you need; prefer raw Python values to `lit(...)`
where auto-wrapping applies. Recent refactors (see the `make-pythonic`
skill) keep moving more functions toward accepting native types.

For each code example in `SKILL.md`, check:

- Does it use `lit(value)` where a raw value would work? Comparison RHS,
  arithmetic with a column, etc. all auto-wrap. (Reserve `lit()` for the
  cases listed in pitfall #2.)
- Does it use `col("name")` where a plain string would work? `select(...)`,
  `aggregate([keys], ...)`, `sort(...)`, `sort_by(...)` all accept plain
  name strings.
- Do `functions.py` calls match the current pythonic signature for that
  function? If `make-pythonic` recently changed a signature (e.g.,
  `repeat(string, n: Expr | int)`), the example should pass `3` rather than
  `lit(3)`.
- Does any example use a deprecated or removed parameter name?

For drift, propose the updated snippet. If the change is purely stylistic
and the older form still works, mark the suggestion as **non-blocking**.

### 4. Missing or stale version notes

When an API depends on a specific version, the skill should say so —
otherwise an agent referencing the skill in an older project will write
code that fails at import or at runtime.

`datafusion-python` shares its major version number with the upstream
`datafusion` crate (e.g., `datafusion-python 53.x` tracks upstream
`datafusion 53`). Always express version requirements in terms of
`datafusion-python` only — there is no need to call out upstream and
package versions separately.

Add a version note when:

- A method or function shown in the skill was added in a specific release
  (e.g., a new `DataFrame` method that didn't exist before 53).
- A breaking change altered behavior in a specific release (signature
  change, default-value change, new required argument).
- A pitfall was fixed in a specific release. Either annotate the pitfall
  block with "fixed in datafusion-python NN, kept here for users on older
  versions" or remove it once the supported floor moves past that version.

Format for version notes (inline, italicized):

```markdown
*Requires datafusion-python 53 or newer.*
```

For each missing/stale version note, propose the exact line and where it
belongs.

## How to discover changes since the last audit

If the user supplies a previous version or commit SHA where the audit was
last run, diff against it:

```bash
# Public-API-relevant changes since <SHA>
git log --oneline <SHA>..HEAD -- python/datafusion/

# Which signatures actually moved
git diff <SHA>..HEAD -- python/datafusion/functions.py | grep '^[+-]def '
```

If no prior audit point is given, fall back to "since the last upstream
sync" by inspecting commits that touch `Cargo.toml`'s `datafusion` pin:

```bash
git log --oneline -- Cargo.toml | grep -i datafusion | head -5
```

## Output Format

Produce a report grouped by scope. Each finding is one bullet with a
proposed action, so a maintainer can review the list quickly and apply
edits in order.

````
## SKILL.md Audit (scope: <scope>)

Audited against:
- skills/datafusion_python/SKILL.md @ <commit>
- datafusion-python <version>

### New APIs to cover
- `DataFrame.foo()` — added in datafusion-python 53. Insert in "DataFrame Operations Quick Reference" under <subsection>.
  Proposed snippet:
  ```python
  df.foo(...)
  ```

### Stale mentions
- "old_function_name" referenced in the categorized list (line N) — renamed to "new_function_name". Replace.

### Drifted examples
- "Filtering" section, `df.filter(col("a") > lit(10))` — drop `lit(10)`, auto-wrap applies. (non-blocking)
- "Aggregation" section, `df.aggregate([col("region")], ...)` — pass `"region"` as a plain string per "Projection" guidance.

### Version notes
- `DataFrame.foo()` block needs *Requires datafusion-python 53 or newer.*
- "Common Pitfalls" #N — fixed in datafusion-python 53; remove the pitfall and update the SQL-to-DataFrame row to no longer flag the workaround.

### No-change confirmed
- `SessionContext` data-loading section — all entries match current API.
````
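The `datafusion-python <version>` line in the report header can be filled
from the root `Cargo.toml`, per "Inputs to read" item 3. A minimal sketch,
assuming Python 3.11+ for the standard-library `tomllib` (older interpreters
can use the third-party `tomli` package the same way):

```python
# Sketch only: pull the datafusion-python version for the report header from
# the root Cargo.toml ("Inputs to read", item 3). Assumes Python 3.11+.
import tomllib
from pathlib import Path

manifest = tomllib.loads(Path("Cargo.toml").read_text())
version = manifest["workspace"]["package"]["version"]  # e.g. "54.0.0"
major = version.split(".")[0]
print(f"datafusion-python {major} ({version})")
```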
+``` + +If asked to apply the changes, edit `skills/datafusion_python/SKILL.md` +directly with `Edit` tool calls, one finding at a time, and re-run the +relevant doctest sanity check at the end: + +```bash +pytest --doctest-modules python/datafusion -q +``` + +## What NOT to flag + +- **Internal helpers / underscored names.** Private symbols are not part of + the user-facing surface. +- **Functions intentionally omitted.** Niche / advanced APIs (custom + catalogs, raw FFI plumbing, low-level execution plan accessors) live in + the API reference, not the skill. If an omission was deliberate and a + comment / commit explains why, leave it out. +- **Style nits inside explanatory prose.** The skill mixes example code and + prose; only enforce the pythonic style on actual code blocks. +- **Function-by-function coverage of every `functions.py` symbol.** The + "Available Functions (Categorized)" list is curated by category, not + exhaustive. Adding a single new aggregate to the aggregate list is + enough — the user follows the pointer to the API reference for the rest. + +## Coordination with other skills + +- Run `/check-upstream` first to expose any missing upstream APIs into the + Python layer. Without that, this skill cannot recommend documenting + something that is not yet exposed. +- Run `/make-pythonic` before this skill if a Pythonic-signature pass is + planned for a release — that way this skill can update examples to the + final signature in one shot rather than churning them twice. +- The order during an upstream sync (PR 3 of `dev/release/upstream-sync.md`) + is therefore: `/check-upstream` → `/make-pythonic` (optional) → + `/audit-skill-md`. diff --git a/dev/release/README.md b/dev/release/README.md index 4833be55a..1b02ca576 100644 --- a/dev/release/README.md +++ b/dev/release/README.md @@ -33,6 +33,12 @@ release branch without blocking ongoing development in the `main` branch. We can cherry-pick commits from the `main` branch into `branch-53` as needed and then create new patch releases from that branch. +## Upstream Sync + +Between releases the `main` branch is periodically synced to a newer upstream `apache/datafusion` version. This is +broken into a three-PR workflow (bump + fix breakage, consolidate transitive deps, fill API and documentation gaps). +See [`upstream-sync.md`](upstream-sync.md) for the full process. + ## Detailed Guide ### Pre-requisites @@ -53,6 +59,9 @@ You will also need access to the [datafusion](https://test.pypi.org/project/data Before creating a new release: - We need to ensure that the main branch does not have any GitHub dependencies +- Confirm the upstream sync workflow in [`upstream-sync.md`](upstream-sync.md) has been completed for this release cycle + (crate bump + breakage fixes, transitive dependency consolidation, and the `/check-upstream` and `/audit-skill-md` + passes). Any gaps surfaced by those skills should land before the release branch is cut. - a PR should be created and merged to update the major version number of the project - A new release branch should be created, such as `branch-53` - It is best to push this branch to the apache repository rather than a personal fork in case patch releases are required. 
diff --git a/dev/release/upstream-sync.md b/dev/release/upstream-sync.md
new file mode 100644
index 000000000..dc05b4195
--- /dev/null
+++ b/dev/release/upstream-sync.md
@@ -0,0 +1,170 @@

# Upstream Sync Process

This document describes how to sync `datafusion-python` to a new version of the
upstream `apache/datafusion` Rust crates. This is a recurring task: between
official releases the `main` branch tracks DataFusion via crates.io or GitHub
dependencies, and we periodically bump those dependencies to pick up new
features and bug fixes.

The work is broken into **three sequential PRs** rather than landing as one
large change. Splitting reviews along these lines keeps each PR focused, makes
breakage easier to bisect, and lets reviewers concentrate on one concern at a
time.

## PR 1: Bump DataFusion crate dependencies and fix breakage

**Goal:** update the upstream `datafusion` crate version and make the project
build, test, and lint cleanly against it.

1. In the root `Cargo.toml`, update:
   - `[workspace.package].version` to the new major (the `datafusion-python`
     major tracks the upstream `datafusion` major, so a 53→54 bump moves
     this from `53.0.0` to `54.0.0`), and
   - every `datafusion` / `datafusion-*` entry in `[workspace.dependencies]`
     to the same new major.

   Per-crate manifests under `crates/` inherit these pins via
   `workspace = true` and need no edit.
2. Update `Cargo.lock` for the datafusion family only — leave unrelated
   transitives at their current pins so PR 2 can address them deliberately.
   List every `datafusion-*` workspace dependency with `-p`:

   ```bash
   cargo update \
     -p datafusion \
     -p datafusion-substrait \
     -p datafusion-proto \
     -p datafusion-ffi \
     -p datafusion-catalog \
     -p datafusion-common \
     -p datafusion-functions-aggregate \
     -p datafusion-functions-window \
     -p datafusion-expr
   ```

   Or pin exact versions with `--precise`, one crate at a time:

   ```bash
   cargo update -p datafusion --precise 54.0.0
   # repeat for each datafusion-* sibling
   ```

   A bare `cargo update` would refresh every transitive crate and blur the
   diff between PR 1 and PR 2.
3. Run the standard build and test commands and address compilation errors,
   API renames, signature changes, and behavior changes:
   - `cargo build`
   - `cargo test`
   - `pytest`
   - `pre-commit run --all-files`
4. Fix only what's needed to restore green CI. Resist the urge to bundle
   unrelated cleanups — those belong in their own PR.
5. If a breaking change in upstream requires a user-facing API change in
   `datafusion-python`, add the `api change` label and document the change
   in the PR description so it surfaces in the changelog.

**Reference PRs:** [#1311](https://github.com/apache/datafusion-python/pull/1311)
(DF51), [#1337](https://github.com/apache/datafusion-python/pull/1337) (DF52).

## PR 2: Consolidate transitive dependencies

**Goal:** after the upstream bump, the dependency tree may have multiple
versions of the same transitive crate (for example, two `arrow` versions, two
`object_store` versions). Reconcile these so we ship a single coherent set.

1. Inspect the lockfile for duplicates (see the sketch after this list):
   ```bash
   cargo tree --duplicates
   ```
2. For each duplicate that matters (Arrow, `object_store`, `parquet`,
   `tokio`, `arrow-flight`, etc.), update our direct dependency declarations
   in `Cargo.toml` to versions compatible with what upstream DataFusion now
   pulls in. The goal is one version of each ecosystem-critical crate.
3. Re-run `cargo update`, then the full test matrix. Some duplicates are
   benign (small leaf crates with no FFI surface) and can be left alone if
   reconciliation would force a much larger change. Use judgment.
4. If consolidating forces a behavioral change visible to users (for example,
   a newer `pyarrow`-compatible Arrow version), call it out in the PR
   description.
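To make the duplicate list easier to review, the `cargo tree --duplicates`
output can be summarized as one line per crate. A sketch, not part of the
repo's tooling; the `--prefix none` flag and the `name vX.Y.Z` line shape are
assumptions about cargo's output format:

```python
# Sketch only: summarize `cargo tree --duplicates` as "crate: versions" so
# the reconciliation targets in step 2 are visible at a glance. Assumes
# cargo prints "name vX.Y.Z" lines under --prefix none; adjust as needed.
import re
import subprocess
from collections import defaultdict

out = subprocess.run(
    ["cargo", "tree", "--duplicates", "--prefix", "none"],
    capture_output=True,
    text=True,
    check=True,
).stdout

versions: defaultdict[str, set[str]] = defaultdict(set)
for line in out.splitlines():
    match = re.match(r"([A-Za-z0-9_-]+) v(\S+)", line)
    if match:
        versions[match.group(1)].add(match.group(2))

for crate, vers in sorted(versions.items()):
    if len(vers) > 1:  # dependents listed with a single version are noise
        print(f"{crate}: {', '.join(sorted(vers))}")
```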
Keeping this work separate from PR 1 means PR 1 stays a "make it compile"
review and PR 2 stays a "tidy the dependency graph" review.

## PR 3: Fill API and documentation gaps

**Goal:** with the upstream version locked in, identify new APIs that landed
upstream and decide whether to expose them, and update agent-facing
documentation so it still matches the surface we ship.

1. Run the `check-upstream` skill (`.ai/skills/check-upstream/SKILL.md`) to
   diff the upstream Rust API against what's exposed in
   `python/datafusion/`. The skill covers scalar/aggregate/window/table
   functions, `DataFrame` methods, `SessionContext` methods, and FFI types.
   Invoke it from the assistant with `/check-upstream` (optionally scoped to
   one area, e.g. `/check-upstream scalar functions`).
2. For each gap, decide whether to:
   - Expose it now (small, obvious additions can land in this PR).
   - File a tracking issue (anything non-trivial — separate PR per feature
     keeps reviews focused).
   - Skip it (internal-only or already covered by an existing API; record
     the decision in the "Evaluated and not requiring exposure" sections of
     the skill so future runs don't re-flag it).
3. (Optional) Run the `make-pythonic` skill
   (`.ai/skills/make-pythonic/SKILL.md`) over any newly exposed APIs to
   align signatures with the project's Pythonic style (accepting plain
   strings for column names, raw Python values where auto-wrapping
   applies, etc.). Invoke it from the assistant with `/make-pythonic`.
   Running this *before* the audit step means examples in `SKILL.md` get
   updated to the final signature in one pass instead of churning twice.
   Larger reshapes still belong in their own PR.
4. Run the `audit-skill-md` skill (`.ai/skills/audit-skill-md/SKILL.md`) to
   cross-reference the user-facing skill at
   [`skills/datafusion_python/SKILL.md`](../../skills/datafusion_python/SKILL.md)
   against the current public API. The skill flags stale function names,
   missing newly exposed APIs, examples that drifted from idiomatic style,
   and missing version notes. Invoke it from the assistant with
   `/audit-skill-md` (optionally scoped, e.g. `/audit-skill-md dataframe`).
   Apply the resulting edits to `SKILL.md` and to the relevant RST pages
   under `docs/source/user-guide/common-operations/`.
5. If new aggregate or window functions were exposed in step 2, also update:
   - `docs/source/user-guide/common-operations/aggregations.rst`
   - `docs/source/user-guide/common-operations/windows.rst`

## Why three PRs

- **Bisectable.** If a regression appears, `git bisect` lands on the
  responsible PR (compile fix, dependency consolidation, or API addition)
  rather than a single mega-commit.
- **Reviewable.** Each PR has a single concern. Reviewers reading PR 1 don't
  need to also reason about whether new APIs are well-named.
- **Skippable.** Some upstream syncs are pure version bumps with no new APIs
  worth exposing. PR 3 can be empty or merged as a no-op if the audit comes
  back clean.

## Related documents

- [`README.md`](README.md) — the broader release process (this sync work
  feeds into the next official release).
+- [`.ai/skills/check-upstream/SKILL.md`](../../.ai/skills/check-upstream/SKILL.md) + — API coverage audit. +- [`skills/datafusion_python/SKILL.md`](../../skills/datafusion_python/SKILL.md) + — user-facing agent guide kept in sync via PR 3.