[Leaderboard] Altimate Code (GPT-5.5 + Claude Sonnet 4.6 AutoContext) — Pass@1 0.6893 by sahrizvi · Pull Request #53 · ucbepic/DataAgentBench

sahrizvi · 2026-05-27T20:14:12Z

Altimate Code — Leaderboard Submission (R27)

Agent name: Altimate Code
Project page: altimate.sh

Component	Model	Role
Trial-time backbone	GPT-5.5 (Azure AI Foundry, deployment alias `gpt-5-chat`, Chat Completions API)	Answers each of the 270 trials
AutoContext author	Claude Sonnet 4.6 (Google Vertex AI)	One-shot per dataset — produces a schema-orientation document (joins, encodings, format quirks, sampled rows). GT-firewalled (no `ground_truth.csv` read path).

Hints: Yes (db_description_withhint.txt injected into the user prompt; AutoContext document also dropped into each trial workspace)
Trials: 5 per query
Consensus: K=3 sub-trials per trial; the most frequent answer wins
Total trials: 270 (54 queries across 12 datasets × 5 trials per query)

Prior submission

This supersedes our earlier submission #44 (Altimate Code + Claude Sonnet 4.6, 0.6040 stratified, 2026-05-10). The harness is the same; the trial-time backbone has been swapped from Claude Sonnet 4.6 to GPT-5.5, and a K=3 consensus pass has been added per trial. AutoContext (authored by Sonnet 4.6) is new in this submission relative to #44.

Result

The 270 trial answers were scored against two validator versions of the benchmark — the version we ran against, and the current upstream main at submission time.

Metric	Trial-time validators (`9031c68ad`)	Latest validators (`634cd61ad`)
Stratified Pass@1 (leaderboard metric)	0.6893	0.6893
Micro Pass@1 (passes / trials)	0.7444 (201 / 270)	0.7444 (201 / 270)
Total trials	270	270

Validator-version note. Trials ran against vendor/DataAgentBench at commit 9031c68ad. Upstream subsequently merged 18 validate.py updates and one ground-truth correction (music_brainz_20k/q2: Amazon Music → iTunes) for the 12 datasets covered. Re-applying the latest validators (commit 634cd61ad) against the same 270 saved answers produced an identical pass/fail distribution. DB content for our 12 datasets was unchanged between the two commits.

Per-dataset stratified Pass@1

Dataset	Pass@1	Trials
agnews	1.000	20/20
bookreview	1.000	15/15
stockindex	1.000	15/15
yelp	0.971	34/35
crmarenapro	0.831	54/65
googlelocal	0.750	15/20
stockmarket	0.720	18/25
music_brainz_20k	0.667	10/15
DEPS_DEV_V1	0.500	5/10
GITHUB_REPOS	0.500	10/20
PANCANCER_ATLAS	0.333	5/15
PATENTS	0.000	0/15
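
As a cross-check, both headline figures can be recomputed from the table above. The sketch below assumes that stratified Pass@1 is the unweighted mean of the per-dataset pass rates and that micro Pass@1 is total passes over total trials. That reading reproduces the reported numbers, but the benchmark's own scorer remains authoritative.

```python
# (passes, trials) per dataset, copied from the table above.
PER_DATASET = {
    "agnews": (20, 20),
    "bookreview": (15, 15),
    "stockindex": (15, 15),
    "yelp": (34, 35),
    "crmarenapro": (54, 65),
    "googlelocal": (15, 20),
    "stockmarket": (18, 25),
    "music_brainz_20k": (10, 15),
    "DEPS_DEV_V1": (5, 10),
    "GITHUB_REPOS": (10, 20),
    "PANCANCER_ATLAS": (5, 15),
    "PATENTS": (0, 15),
}


def micro_pass_at_1(results):
    """Total passes divided by total trials, so larger datasets weigh more."""
    passes = sum(p for p, _ in results.values())
    trials = sum(t for _, t in results.values())
    return passes / trials


def stratified_pass_at_1(results):
    """Unweighted mean of per-dataset pass rates (assumed definition)."""
    rates = [p / t for p, t in results.values()]
    return sum(rates) / len(rates)


print(f"micro      = {micro_pass_at_1(PER_DATASET):.4f}")
print(f"stratified = {stratified_pass_at_1(PER_DATASET):.4f}")
```

The gap between the two figures comes from weighting: crmarenapro and yelp account for 100 of the 270 trials and both score above 0.8, which lifts the micro figure relative to the per-dataset mean.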

Architecture

A heterogeneous-warehouse data agent built on top of altimate-code, an open-source TypeScript agent runtime, that:

- Reads each dataset's db_description_withhint.txt (injected into the user prompt) for cross-DB join keys, term codes, and output-format guidance.
- Reads an AutoContext document (per-dataset, authored by Claude Sonnet 4.6 once before any trials run) for schema-orientation notes: column annotations, verified join keys, sample-row encodings, NULL semantics, and entity-resolution caveats. Sonnet authors this with access to the warehouse schema and 5 sampled rows per table; no ground_truth.csv access path exists in the author code (dab_bench/auto_context.py).
- Uses native data tools (schema_index, schema_search, schema_inspect, sql_execute, warehouse_list, fuzzy_match, explore_dataset) to introspect schemas and run queries against PostgreSQL, SQLite, DuckDB, and MongoDB.
- Validates output shape before commit via a validate_shape CLI that compares the draft ANSWER against format_hint.txt for row/field/separator compliance (no ground_truth.csv access).
- Aggregates K=3 sub-trials per trial via exact-majority on the ANSWER file string. Top count wins; ties break to the first.
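
The aggregation step can be sketched as follows. This is an illustration rather than the harness code: the function name `consensus_answer` is hypothetical, and it assumes "ties break to the first" means the earliest sub-trial in order. Answers are compared as exact strings with no normalization, as described above.

```python
from collections import Counter


def consensus_answer(sub_answers):
    """Return the most frequent answer string; ties go to the earliest sub-trial."""
    if not sub_answers:
        raise ValueError("need at least one sub-trial answer")
    counts = Counter(sub_answers)
    top = max(counts.values())
    # Scan in sub-trial order so the first answer holding the top count wins.
    for answer in sub_answers:
        if counts[answer] == top:
            return answer


print(consensus_answer(["42", "41", "42"]))  # two of three agree
print(consensus_answer(["a", "b", "c"]))     # three-way tie falls back to sub-trial 0
```

With K=3, a three-way disagreement reduces to the first sub-trial's answer, so the consensus pass only changes the outcome when at least two sub-trials agree.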

Reproducibility

The submission JSON in this PR contains 270 records of {dataset, query, run, answer} — one per (dataset, query, trial). Per-trial event logs (events.jsonl), result.json, and stderr.log for each sub-trial are available as an out-of-band trace bundle on request (~475 MB).
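
A quick shape check on the submission records might look like the sketch below. It assumes the records have already been loaded as a list of dicts with the four keys named above; the function name `check_submission` and the value types are assumptions, not part of the benchmark tooling.

```python
from collections import Counter

REQUIRED_KEYS = {"dataset", "query", "run", "answer"}


def check_submission(records, trials_per_query=5):
    """Return a list of problems; an empty list means the shape is consistent."""
    problems = []
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - set(rec)
        if missing:
            problems.append(f"record {i}: missing keys {sorted(missing)}")
    # Every (dataset, query) pair should appear once per trial.
    runs_per_query = Counter((rec.get("dataset"), rec.get("query")) for rec in records)
    for (dataset, query), n in runs_per_query.items():
        if n != trials_per_query:
            problems.append(f"{dataset}/{query}: {n} records, expected {trials_per_query}")
    return problems
```

For this submission the check would be expected to see 270 records spread over 54 distinct (dataset, query) pairs.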

Reproduce locally with:

# Prereqs: postgres on :55432, mongodb on :57017, an Azure AI Foundry deployment
# named "gpt-5-chat" fronted by scripts_python/azure_foundry_proxy.py (rewrites
# max_tokens → max_completion_tokens), and a Google Vertex project for the
# Sonnet 4.6 AutoContext author pass.

GOOGLE_CLOUD_PROJECT=<your-gcp-project> GOOGLE_CLOUD_LOCATION=us-east5 \
AZURE_FOUNDRY_BASE_URL=http://localhost:9997/v1 \
AZURE_FOUNDRY_API_KEY=<your-azure-foundry-key> \
AZURE_FOUNDRY_MODELS=gpt-5-chat \
PG_HOST=127.0.0.1 PG_PORT=55432 PG_USER=postgres PG_PASSWORD=postgres \
MONGO_URI="mongodb://127.0.0.1:57017/" \
uv run python scripts_python/run_benchmark.py \
  --dab-root vendor/DataAgentBench \
  --datasets agnews bookreview crmarenapro DEPS_DEV_V1 GITHUB_REPOS googlelocal \
             music_brainz_20k PANCANCER_ATLAS PATENTS stockindex stockmarket yelp \
  --trials 5 --concurrency 5 --consensus-k 3 \
  --profile bash --runtime altimate \
  --model azure-foundry/gpt-5-chat \
  --max-turns 75 --timeout-sec 2000 --yolo --prepare-external \
  --autocontext --autocontext-model claude-sonnet-4-6 \
  --experiments-dir baseline_runs --run-name run27_azure_gpt55
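
The prerequisites mention a proxy that rewrites max_tokens to max_completion_tokens. The core of such a rewrite could look like the sketch below. It is not the contents of scripts_python/azure_foundry_proxy.py; the function name `rewrite_token_limit` is hypothetical, and the choice to keep an existing max_completion_tokens value is an assumption.

```python
def rewrite_token_limit(payload):
    """Return a copy of a Chat Completions request body with max_tokens renamed."""
    rewritten = dict(payload)  # shallow copy so the caller's dict is left untouched
    if "max_tokens" in rewritten:
        limit = rewritten.pop("max_tokens")
        # Assumption: an explicit max_completion_tokens from the caller takes precedence.
        rewritten.setdefault("max_completion_tokens", limit)
    return rewritten


print(rewrite_token_limit({"model": "gpt-5-chat", "max_tokens": 512}))
```

A proxy built around this would apply the function to each request body before forwarding it to the deployment and pass the response back unchanged.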

Limitations disclosed for completeness

PATENTS (0/15): every PATENTS trial produced a well-formed CSV answer but matched a different subset of CPC codes than the reference set. Failure mode is query-interpretation (EMA initialization convention and CPC hierarchy-level definition are under-specified in the question), not format or harness. We chose not to add per-dataset hand-tuning.
PANCANCER_ATLAS (5/15): failures are concentrated on q2/q3, where answers were grouped by the coded histology column instead of the descriptive one. The AutoContext Operational Rule for histology was authored unconditionally ("use icd_o_3_histology NOT histological_type"), which fails q2/q3 because their GT expects descriptive labels. A planned next-iteration fix updates the AutoContext author prompt to emit conditional rules.

🤖 PR description drafted with Claude Code
Edits: Formatting

sahrizvi · 2026-05-28T06:04:07Z

Missed earlier, attaching the full per-trial trace bundle for review: dab-submission-r27-2026-05-28.zip

dab-submission-r27-2026-05-28.zip (15 MB compressed → 475 MB extracted) is a .zip wrapper around a .tar.xz (GitHub accepts .zip but not .tar.xz directly). Extract with:

  unzip dab-submission-r27-2026-05-28.zip
  tar -xJf dab-submission-r27-2026-05-28.tar.xz

Contents:

agent_description.md — same as in the PR body
submission.json — identical to leaderboard_submissions/altimate-code_gpt-5.5_n5.json in the PR diff
rescore.json — per-trial pass/fail under upstream HEAD 634cd61ad
r27_traces/trials/<dataset>_query<N>_trial<M>/ — for each of the 270 trials:
- Top-level: ANSWER, result.json, consensus.json (the K=3 consensus output)
- sub0/, sub1/, sub2/: per-sub events.jsonl (full agent action log), ANSWER, stderr.log
- Per-sub workspace/ is excluded (50 MB each; it contains the materialized DBs the agent queried and is recoverable via --prepare-external).
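
After extraction, the layout listed above can be checked with a short script. The sketch below takes the file names from the list as given and assumes every trial directory carries all three sub-trial folders; the function name `missing_files` is hypothetical.

```python
from pathlib import Path

TOP_LEVEL = ("ANSWER", "result.json", "consensus.json")
PER_SUB = ("events.jsonl", "ANSWER", "stderr.log")
SUBS = ("sub0", "sub1", "sub2")


def missing_files(trials_root):
    """Return relative paths that the described layout expects but that are absent."""
    root = Path(trials_root)
    missing = []
    for trial in sorted(p for p in root.iterdir() if p.is_dir()):
        expected = [trial / name for name in TOP_LEVEL]
        expected += [trial / sub / name for sub in SUBS for name in PER_SUB]
        missing += [p.relative_to(root).as_posix() for p in expected if not p.exists()]
    return missing
```

Calling `missing_files("r27_traces/trials")` from the extraction directory should return an empty list if the bundle matches the description.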

Submit Altimate Code (GPT-5.5) — 270 trials, 0.6893 stratified Pass@1

3b97f76

sahrizvi wants to merge 1 commit into ucbepic:main from sahrizvi:submit/altimate-code-gpt5.5-2026-05-28
