
[Leaderboard] Altimate Code (GPT-5.5 + Claude Sonnet 4.6 AutoContext) — Pass@1 0.6893 #53

Open: sahrizvi wants to merge 1 commit into ucbepic:main from sahrizvi:submit/altimate-code-gpt5.5-2026-05-28


@sahrizvi sahrizvi commented May 27, 2026

Altimate Code — Leaderboard Submission (R27)

Agent name: Altimate Code
Project page: altimate.sh

| Component | Model | Role |
| --- | --- | --- |
| Trial-time backbone | GPT-5.5 (Azure AI Foundry, deployment alias gpt-5-chat, Chat Completions API) | Answers each of the 270 trials |
| AutoContext author | Claude Sonnet 4.6 (Google Vertex AI) | One-shot per dataset: produces a schema-orientation document (joins, encodings, format quirks, sampled rows). GT-firewalled (no ground_truth.csv read path). |

Hints: Yes (db_description_withhint.txt injected into the user prompt; AutoContext document also dropped into each trial workspace)
Trials: 5 per query
Consensus: K=3 sub-trials per trial; the most frequent (modal) answer wins
Total trials: 270 (54 queries across 12 datasets × 5 trials per query)

Prior submission

This supersedes our earlier submission #44 (Altimate Code + Claude Sonnet 4.6, 0.6040 stratified, 2026-05-10). The core harness is unchanged. Three things differ from #44: the trial-time backbone is now GPT-5.5 instead of Claude Sonnet 4.6, each trial adds a K=3 consensus pass, and the AutoContext document (authored by Sonnet 4.6) is new.

Result

The 270 trial answers were scored against two validator versions of the benchmark — the version we ran against, and the current upstream main at submission time.

| Metric | Trial-time validators (9031c68ad) | Latest validators (634cd61ad) |
| --- | --- | --- |
| Stratified Pass@1 (leaderboard metric) | 0.6893 | 0.6893 |
| Micro Pass@1 (passes / trials) | 0.7444 (201 / 270) | 0.7444 (201 / 270) |
| Total trials | 270 | 270 |

Validator-version note. Trials ran against vendor/DataAgentBench at commit 9031c68ad. Upstream subsequently merged 18 validate.py updates and one ground-truth correction (music_brainz_20k/q2: Amazon Music → iTunes) for the 12 datasets covered. Re-applying the latest validators (commit 634cd61ad) to the same 270 saved answers produced an identical pass/fail distribution. DB content for our 12 datasets was unchanged between the two commits.

Per-dataset stratified Pass@1

| Dataset | Pass@1 | Passes / Trials |
| --- | --- | --- |
| agnews | 1.000 | 20/20 |
| bookreview | 1.000 | 15/15 |
| stockindex | 1.000 | 15/15 |
| yelp | 0.971 | 34/35 |
| crmarenapro | 0.831 | 54/65 |
| googlelocal | 0.750 | 15/20 |
| stockmarket | 0.720 | 18/25 |
| music_brainz_20k | 0.667 | 10/15 |
| DEPS_DEV_V1 | 0.500 | 5/10 |
| GITHUB_REPOS | 0.500 | 10/20 |
| PANCANCER_ATLAS | 0.333 | 5/15 |
| PATENTS | 0.000 | 0/15 |
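
For reviewers who want to check the headline numbers, the sketch below recomputes both metrics from the per-dataset counts above. It assumes "stratified" means the unweighted mean of per-dataset pass rates, which reproduces the reported 0.6893; the benchmark's own scorer may define it differently. The function names are illustrative and are not part of the harness.

```python
# Passes and trial counts copied from the per-dataset table in this PR.
RESULTS = {
    "agnews": (20, 20),
    "bookreview": (15, 15),
    "stockindex": (15, 15),
    "yelp": (34, 35),
    "crmarenapro": (54, 65),
    "googlelocal": (15, 20),
    "stockmarket": (18, 25),
    "music_brainz_20k": (10, 15),
    "DEPS_DEV_V1": (5, 10),
    "GITHUB_REPOS": (10, 20),
    "PANCANCER_ATLAS": (5, 15),
    "PATENTS": (0, 15),
}


def micro_pass_at_1(results):
    """Total passes divided by total trials, pooled over all datasets."""
    passes = sum(p for p, _ in results.values())
    trials = sum(t for _, t in results.values())
    return passes / trials


def stratified_pass_at_1(results):
    """Assumption: unweighted mean of per-dataset pass rates."""
    rates = [p / t for p, t in results.values()]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    print(f"micro      {micro_pass_at_1(RESULTS):.4f}")
    print(f"stratified {stratified_pass_at_1(RESULTS):.4f}")
```

The gap between the two numbers comes from weighting: PATENTS (0/15) counts as one twelfth of the stratified score but only 15 of 270 trials in the micro score.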

Architecture

A heterogeneous-warehouse data agent built on top of altimate-code, an open-source TypeScript agent runtime, that:

  1. Reads each dataset's db_description_withhint.txt (injected into the user prompt) for cross-DB join keys, term codes, and output-format guidance.
  2. Reads an AutoContext document (per-dataset, authored by Claude Sonnet 4.6 once before any trials run) for schema-orientation notes — column annotations, verified join keys, sample-row encodings, NULL semantics, and entity-resolution caveats. Sonnet authors this with access to the warehouse schema and 5 sampled rows per table; no ground_truth.csv access path exists in the author code (dab_bench/auto_context.py).
  3. Uses native data tools (schema_index, schema_search, schema_inspect, sql_execute, warehouse_list, fuzzy_match, explore_dataset) to introspect schemas and run queries against PostgreSQL, SQLite, DuckDB, and MongoDB.
  4. Validates output shape before commit via a validate_shape CLI that compares the draft ANSWER against format_hint.txt for row/field/separator compliance (no ground_truth.csv access).
  5. Aggregates K=3 sub-trials per trial by exact string match on the ANSWER file. The most frequent answer wins; ties break to the answer seen first.
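
The consensus rule in step 5 can be sketched as follows. This is a minimal illustration, not the harness code: the function name is hypothetical, and it assumes answers are compared as raw strings with no whitespace normalization and that "first" means sub-trial order.

```python
from collections import Counter
from typing import Sequence


def consensus_answer(sub_answers: Sequence[str]) -> str:
    """Return the most frequent answer; ties go to the earliest sub-trial."""
    if not sub_answers:
        raise ValueError("need at least one sub-trial answer")
    counts = Counter(sub_answers)
    top = max(counts.values())
    # Walk in sub-trial order so the first answer reaching the top count wins.
    for answer in sub_answers:
        if counts[answer] == top:
            return answer
    raise AssertionError("unreachable: some answer always has the top count")


if __name__ == "__main__":
    print(consensus_answer(["42", "41", "42"]))  # two of three agree
    print(consensus_answer(["a", "b", "c"]))     # three-way tie
```

With K=3, a three-way disagreement falls back to the first sub-trial's answer, so in that case consensus behaves like a single run.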

Reproducibility

The submission JSON in this PR contains 270 records of {dataset, query, run, answer} — one per (dataset, query, trial). Per-trial event logs (events.jsonl), result.json, and stderr.log for each sub-trial are available as an out-of-band trace bundle on request (~475 MB).
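
A quick structural check of the submission file might look like the sketch below. It assumes the file is a top-level JSON array of objects with the four keys listed above; the function name and the checks are illustrative and are not the leaderboard's own validation.

```python
import json
import sys
from collections import Counter

REQUIRED_KEYS = {"dataset", "query", "run", "answer"}


def check_submission(records, expected_total=270):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(records) != expected_total:
        problems.append(f"expected {expected_total} records, got {len(records)}")
    seen = Counter()
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            problems.append(f"record {i}: missing keys {sorted(missing)}")
            continue
        seen[(rec["dataset"], rec["query"], rec["run"])] += 1
    for key, n in seen.items():
        if n > 1:
            problems.append(f"duplicate record for {key} ({n} copies)")
    return problems


if __name__ == "__main__":
    # Usage: python check_submission.py <path-to-submission.json>
    with open(sys.argv[1], encoding="utf-8") as fh:
        for problem in check_submission(json.load(fh)):
            print(problem)
```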

Reproduce locally with:

# Prereqs: postgres on :55432, mongodb on :57017, an Azure AI Foundry deployment
# named "gpt-5-chat" fronted by scripts_python/azure_foundry_proxy.py (rewrites
# max_tokens → max_completion_tokens), and a Google Vertex project for the
# Sonnet 4.6 AutoContext author pass.

GOOGLE_CLOUD_PROJECT=<your-gcp-project> GOOGLE_CLOUD_LOCATION=us-east5 \
AZURE_FOUNDRY_BASE_URL=http://localhost:9997/v1 \
AZURE_FOUNDRY_API_KEY=<your-azure-foundry-key> \
AZURE_FOUNDRY_MODELS=gpt-5-chat \
PG_HOST=127.0.0.1 PG_PORT=55432 PG_USER=postgres PG_PASSWORD=postgres \
MONGO_URI="mongodb://127.0.0.1:57017/" \
uv run python scripts_python/run_benchmark.py \
  --dab-root vendor/DataAgentBench \
  --datasets agnews bookreview crmarenapro DEPS_DEV_V1 GITHUB_REPOS googlelocal \
             music_brainz_20k PANCANCER_ATLAS PATENTS stockindex stockmarket yelp \
  --trials 5 --concurrency 5 --consensus-k 3 \
  --profile bash --runtime altimate \
  --model azure-foundry/gpt-5-chat \
  --max-turns 75 --timeout-sec 2000 --yolo --prepare-external \
  --autocontext --autocontext-model claude-sonnet-4-6 \
  --experiments-dir baseline_runs --run-name run27_azure_gpt55
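
The prerequisite comment mentions a proxy that rewrites max_tokens to max_completion_tokens. The core of such a rewrite could look like the sketch below. It is not the contents of scripts_python/azure_foundry_proxy.py: the function name is hypothetical, and keeping an explicit max_completion_tokens when both fields are present is an assumption.

```python
from typing import Any


def rewrite_chat_request(payload: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a Chat Completions request body with the legacy
    max_tokens field renamed to max_completion_tokens."""
    rewritten = dict(payload)  # shallow copy; the caller's dict is untouched
    if "max_tokens" in rewritten:
        value = rewritten.pop("max_tokens")
        # Assumption: if the caller already set max_completion_tokens, keep it.
        rewritten.setdefault("max_completion_tokens", value)
    return rewritten


if __name__ == "__main__":
    body = {"model": "gpt-5-chat", "messages": [], "max_tokens": 1024}
    print(rewrite_chat_request(body))
```

A real proxy would apply this to the JSON body of each incoming request before forwarding it to the Azure endpoint.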

Limitations disclosed for completeness

  • PATENTS (0/15): every PATENTS trial produced a well-formed CSV answer but matched a different subset of CPC codes than the reference set. Failure mode is query-interpretation (EMA initialization convention and CPC hierarchy-level definition are under-specified in the question), not format or harness. We chose not to add per-dataset hand-tuning.
  • PANCANCER_ATLAS (5/15): concentrated on q2/q3 mis-grouping (descriptive vs coded histology column). The AutoContext Operational Rule for histology was authored unconditionally ("use icd_o_3_histology NOT histological_type"), which fails q2/q3 whose GT expects descriptive labels. A planned next-iteration fix updates the AutoContext author prompt to emit conditional rules.
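
To illustrate why the EMA initialization convention mentioned in the PATENTS bullet matters, the sketch below computes an exponential moving average under two common seeding conventions. The series, the smoothing factor, and both conventions are illustrative assumptions; nothing here states which convention the benchmark's reference answers use.

```python
def ema(values, alpha, seed_with_first=True):
    """Exponential moving average of a numeric series.

    seed_with_first=True  -> first output equals the first observation.
    seed_with_first=False -> the average starts from 0.0 before the first
                             observation.
    """
    smoothed = []
    prev = None
    for i, x in enumerate(values):
        if i == 0 and seed_with_first:
            prev = float(x)
        else:
            base = 0.0 if prev is None else prev
            prev = alpha * x + (1 - alpha) * base
        smoothed.append(prev)
    return smoothed


if __name__ == "__main__":
    series = [10, 20, 30]  # hypothetical yearly counts
    print(ema(series, alpha=0.5, seed_with_first=True))
    print(ema(series, alpha=0.5, seed_with_first=False))
```

On short series the two conventions give noticeably different values, so any ranking or threshold applied to the smoothed numbers can select a different set of items.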

🤖 PR description drafted with Claude Code

@sahrizvi
Author

Missed earlier, attaching the full per-trial trace bundle for review: dab-submission-r27-2026-05-28.zip

dab-submission-r27-2026-05-28.zip (15 MB compressed → 475 MB extracted) is a .zip wrapper around a .tar.xz, because GitHub accepts .zip but not .tar.xz directly. Extract with:

  unzip dab-submission-r27-2026-05-28.zip
  tar -xJf dab-submission-r27-2026-05-28.tar.xz

Contents:

  • agent_description.md — same as in the PR body
  • submission.json — identical to leaderboard_submissions/altimate-code_gpt-5.5_n5.json in the PR diff
  • rescore.json — per-trial pass/fail under upstream HEAD 634cd61ad
  • r27_traces/trials/<dataset>_query<N>_trial<M>/ — for each of the 270 trials:
    • Top-level: ANSWER, result.json, consensus.json (the K=3 consensus output)
    • sub0/, sub1/, sub2/: per-sub events.jsonl (full agent action log), ANSWER, stderr.log
    • Per-sub workspace/ is excluded (50 MB each; it contains the materialized DBs the agent queried and is recoverable via --prepare-external).
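
For navigating the extracted bundle, the trial directory names can be parsed back into their parts. The sketch below assumes the `<dataset>_query<N>_trial<M>` pattern shown above, with N and M as decimal integers; the helper names are hypothetical. Dataset names themselves contain underscores (for example music_brainz_20k), so the pattern anchors on the trailing `_query…_trial…` suffix.

```python
import re
from pathlib import Path
from typing import NamedTuple

# Greedy dataset group, anchored on the trailing _query<N>_trial<M> suffix.
TRIAL_DIR = re.compile(r"^(?P<dataset>.+)_query(?P<query>\d+)_trial(?P<trial>\d+)$")


class TrialId(NamedTuple):
    dataset: str
    query: int
    trial: int


def parse_trial_dir(name: str) -> TrialId:
    """Split a trial directory name into (dataset, query, trial)."""
    match = TRIAL_DIR.match(name)
    if match is None:
        raise ValueError(f"not a trial directory name: {name!r}")
    return TrialId(match["dataset"], int(match["query"]), int(match["trial"]))


def list_trials(root: Path) -> list[TrialId]:
    """Parse every subdirectory of root, e.g. r27_traces/trials."""
    return sorted(parse_trial_dir(p.name) for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    print(parse_trial_dir("music_brainz_20k_query2_trial4"))
```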
