[Leaderboard] Altimate Code (GPT-5.5 + Claude Sonnet 4.6 AutoContext) — Pass@1 0.6893 #53
Open
sahrizvi wants to merge 1 commit into
Missed earlier, attaching the full per-trial trace bundle for review: dab-submission-r27-2026-05-28.zip
Contents:
Altimate Code — Leaderboard Submission (R27)
Agent name: Altimate Code
Project page: altimate.sh
[...] gpt-5-chat, Chat Completions API)
[...] ground_truth.csv read path).
Hints: Yes (db_description_withhint.txt injected into the user prompt; AutoContext document also dropped into each trial workspace)
Trials: 5 per query
Consensus: K=3 sub-trials per trial, top-of-modal-answer wins
Total trials: 270 (54 queries across 12 datasets × 5 trials per query)
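The K=3 consensus step can be sketched as a majority vote over sub-trial answers. This is a minimal illustration, not the harness code: the function name `modal_answer`, the whitespace normalization, and the first-occurrence tie-break are all assumptions, since the PR does not spell out how ties or formatting differences are handled.

```python
from collections import Counter


def modal_answer(sub_answers):
    """Return the most frequent answer among a trial's sub-trial outputs.

    Assumptions (not stated in the PR): answers are compared after
    stripping surrounding whitespace, and ties go to the answer that
    appeared first.
    """
    if not sub_answers:
        raise ValueError("need at least one sub-trial answer")
    normalized = [a.strip() for a in sub_answers]
    counts = Counter(normalized)
    best = max(counts.values())
    # Walk in original order so ties resolve to the earliest answer.
    for answer in normalized:
        if counts[answer] == best:
            return answer


# Three sub-trials, two of which agree once whitespace is stripped.
trial_answer = modal_answer(["42", "42 ", "41"])
```

With three sub-trials, a single divergent answer is outvoted; when all three disagree, this sketch falls back to the first sub-trial.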
Prior submission
This supersedes our earlier submission #44 (Altimate Code + Claude Sonnet 4.6, 0.6040 stratified, 2026-05-10). The harness is the same; the trial-time backbone has been swapped from Claude Sonnet 4.6 to GPT-5.5, and a K=3 consensus pass has been added per trial. AutoContext (authored by Sonnet 4.6) is new in this submission relative to #44.
Result
The 270 trial answers were scored against two validator versions of the benchmark: the version we ran against, and the current upstream main at submission time.
[...] 9031c68ad) [...]
[...] 634cd61ad) [...]
Validator-version note. Trials ran against vendor/DataAgentBench at commit 9031c68ad. Upstream subsequently merged 18 validate.py updates and one ground-truth correction (music_brainz_20k/q2: Amazon Music → iTunes) for the 12 datasets covered. Re-applying the latest validators (commit 634cd61ad) against the same 270 saved answers produced an identical pass/fail distribution. DB content for our 12 datasets was unchanged between the two commits.
Per-dataset stratified Pass@1
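One plausible reading of "stratified Pass@1" is a macro-average: average the pass rate over trials for each query, average those within each dataset, then average the dataset scores. The sketch below follows that reading. The aggregation order is an assumption, and the `passed` field is hypothetical: the submission records carry only dataset, query, run, and answer, and the pass/fail flag would come from running the validators.

```python
from collections import defaultdict


def stratified_pass_at_1(records):
    """Compute per-dataset and overall Pass@1 from scored trial records.

    Each record is a dict with keys dataset, query, run, and passed
    (bool). `passed` is a hypothetical field added after validation.
    """
    # Step 1: pass rate per (dataset, query) across its trials.
    per_query = defaultdict(list)
    for r in records:
        per_query[(r["dataset"], r["query"])].append(1.0 if r["passed"] else 0.0)

    # Step 2: mean of query-level pass rates within each dataset.
    per_dataset = defaultdict(list)
    for (dataset, _query), outcomes in per_query.items():
        per_dataset[dataset].append(sum(outcomes) / len(outcomes))
    dataset_scores = {d: sum(v) / len(v) for d, v in per_dataset.items()}

    # Step 3: unweighted mean over datasets (the assumed "stratified" score).
    overall = sum(dataset_scores.values()) / len(dataset_scores)
    return dataset_scores, overall


example = [
    {"dataset": "A", "query": "q1", "run": 0, "passed": True},
    {"dataset": "A", "query": "q1", "run": 1, "passed": False},
    {"dataset": "A", "query": "q2", "run": 0, "passed": True},
    {"dataset": "A", "query": "q2", "run": 1, "passed": True},
    {"dataset": "B", "query": "q1", "run": 0, "passed": False},
    {"dataset": "B", "query": "q1", "run": 1, "passed": False},
]
scores, overall = stratified_pass_at_1(example)
```

Under this definition a dataset with few queries counts as much as one with many, so the overall score can differ from the raw fraction of passing trials.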
Architecture
A heterogeneous-warehouse data agent built on top of altimate-code, an open-source TypeScript agent runtime, that:
[...] db_description_withhint.txt (injected into the user prompt) for cross-DB join keys, term codes, and output-format guidance.
[...] ground_truth.csv access path exists in the author code (dab_bench/auto_context.py).
[...] schema_index, schema_search, schema_inspect, sql_execute, warehouse_list, fuzzy_match, explore_dataset) to introspect schemas and run queries against PostgreSQL, SQLite, DuckDB, and MongoDB.
[...] validate_shape CLI that compares the draft ANSWER against format_hint.txt for row/field/separator compliance (no ground_truth.csv access).
Reproducibility
The submission JSON in this PR contains 270 records of {dataset, query, run, answer}, one per (dataset, query, trial). Per-trial event logs (events.jsonl), result.json, and stderr.log for each sub-trial are available as an out-of-band trace bundle on request (~475 MB).
Reproduce locally with:
Limitations disclosed for completeness
[...] icd_o_3_histology NOT histological_type"), which fails q2/q3 whose GT expects descriptive labels. A planned next-iteration fix updates the AutoContext author prompt to emit conditional rules.
🤖 PR description drafted with Claude Code