Coding Agent Bench

Reproducible benchmarks for coding agents and models using Harbor

Leaderboards
- SWE-Bench Verified (pass@1, N=500)
- SWE-Bench Pro - Ansible Tasks (pass@1, N=96)
CLI Utility
- Prerequisites
- Run a Benchmark
Harbor Command Examples
- Prerequisites
- Table
Deploy models with vLLM
SWE-Bench Acceleration
- Use accelerated images for SWE-bench-verified
- Pre-pull base images
Run with OpenShift
- Run Tasks in OpenShift (Orchestrate Locally)
- Run Tasks and Orchestrate in OpenShift
WIP

Leaderboards

✨ Check out our Coding Agent Leaderboard on HuggingFace ✨

SWE-Bench Verified (pass@1, N=500)

| Model | Harness | Score | Cost |
|---|---|---|---|
| Sonnet 4.6 | Claude Code | 79.6% | N/A |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | Pi | 65.0% | $51† |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | Qwen Code | 63.8% | $37† |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | Claude Code | 63.2% | $48† |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | OpenClaw | 58.8% | $67† |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | OpenCode | 54.8% | $67† |

SWE-Bench Pro - Ansible Tasks (pass@1, N=96)

| Model | Harness | Score | Cost |
|---|---|---|---|
| Sonnet 4.6 | Claude Code | 50.0% | $184 |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | Pi | 47.9% | $13† |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | Claude Code | 45.6% | $10† |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | Qwen Code | 43.8% | $9† |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | OpenClaw | 40.6% | $9† |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | OpenCode | 37.5% | $11† |

More coming soon...

† Cost estimates for OSS models are calculated as $4 per A100 GPU hour × agent benchmark duration.
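As a rough illustration of the cost formula in the footnote, the sketch below multiplies a duration in GPU hours by the $4 hourly rate. The function name estimate_cost and the 12.75-hour input are illustrative assumptions, not values taken from the benchmark logs; the sketch also assumes the duration is already expressed in total A100 GPU hours.

```shell
# Estimate the cost of a run as (hourly GPU rate) x (GPU hours used).
# Assumption: the duration is already expressed in total A100 GPU hours.
estimate_cost() {
    gpu_hours="$1"
    rate_per_hour="${2:-4}"   # $4 per A100 GPU hour, per the footnote
    awk -v h="$gpu_hours" -v r="$rate_per_hour" 'BEGIN { printf "%.2f\n", h * r }'
}

# Hypothetical example: a run that used 12.75 GPU hours.
estimate_cost 12.75   # prints 51.00
```

Passing a second argument overrides the hourly rate, which may be useful if your GPU pricing differs.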

CLI Utility

The CLI utility helps you configure and run benchmark jobs with Harbor for self-hosted models. It automatically constructs and runs the Harbor job command for your specified benchmark, agent, and self-hosted model.

Prerequisites

Install dependencies with uv
```
uv sync
```
Set up a vLLM server, or another Anthropic- and OpenAI-compatible server
Select a benchmark from among the options in Harbor Hub

Run a Benchmark

The following is the minimal configuration needed to run a job with the CLI:

uv run coding-agent-bench \
    --agent <agent> \
    --dataset <benchmark-name> \
    --model-name <model-name> \
    --server-url <server-url>

For example, to run scale-ai/swe-bench-pro with Claude Code against a self-hosted model:

uv run coding-agent-bench \
    --agent claude-code \
    --dataset scale-ai/swe-bench-pro \
    --model-name my-model \
    --server-url http://my.server.url

To preview the Harbor command that would be run for a given set of arguments without actually running the job, add the --dry-run flag.

Note

Additional configuration options are available; run uv run coding-agent-bench --help to see them.

Harbor Command Examples

Prerequisites

Install Harbor
Set up a vLLM server, or another Anthropic- and OpenAI-compatible server
Set your benchmark in your environment from among the options in Harbor Hub, e.g.:
```
export BENCHMARK='swe-bench/swe-bench-verified'
```
If you need to filter tasks in your benchmark by name, add the -i flag with your glob pattern to your harbor run command, e.g. -i "*ansible*"
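To get a feel for which tasks a pattern such as "*ansible*" would select, the sketch below tests a few names against a glob using the shell's own matching. It assumes Harbor's -i filter behaves like shell globbing, which is not confirmed here; the helper task_matches and the task names are hypothetical.

```shell
# Return success when a task name matches a shell-style glob pattern.
# Assumption: Harbor's -i filter behaves like shell globbing; check the Harbor docs.
task_matches() {
    name="$1"
    pattern="$2"
    case "$name" in
        $pattern) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical task names, used only to illustrate the pattern.
for task in ansible-task-001 django-task-002 my-ansible-fix; do
    if task_matches "$task" "*ansible*"; then
        echo "selected: $task"
    else
        echo "skipped:  $task"
    fi
done
```

Keep the pattern quoted on the command line so that your shell passes it through unchanged instead of expanding it against files in the current directory.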

Table

| Harness | Model Server | Example | Status |
|---|---|---|---|
| Claude Code | vLLM | Link | Validated |
| Codex | vLLM | Link | Testing |
| OpenClaw | vLLM | Link | Validated |
| OpenCode | vLLM | Link | Validated |
| Pi | vLLM | Link | Validated |
| Qwen Code | vLLM | Link | Validated |
| Claude Code | VertexAI | Link | Validated |

Note

To use with a locally hosted model (e.g. llama.cpp) use a vLLM example and set SERVER_URL=http://host.docker.internal:<server-port>
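A minimal sketch of the note above, assuming the local server listens on port 8080 (a placeholder; use whatever port your server actually binds to). SERVER_PORT is a helper variable introduced here for illustration and is not read by Harbor.

```shell
# Build a SERVER_URL that lets benchmark containers reach a server on the host.
# Assumption: the local server listens on port 8080; change SERVER_PORT to match yours.
SERVER_PORT="${SERVER_PORT:-8080}"
export SERVER_URL="http://host.docker.internal:${SERVER_PORT}"
echo "SERVER_URL=$SERVER_URL"
```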

Claude Code vLLM

Set the following variables in your environment:

export SERVER_URL=
export MODEL_NAME=

Then run:

harbor run --agent claude-code -d $BENCHMARK \
    --ae ANTHROPIC_BASE_URL=$SERVER_URL \
    --ae ANTHROPIC_API_KEY='sk-no-key-required' \
    --ae ANTHROPIC_MODEL=$MODEL_NAME \
    --ae ANTHROPIC_DEFAULT_OPUS_MODEL=$MODEL_NAME \
    --ae ANTHROPIC_DEFAULT_SONNET_MODEL=$MODEL_NAME \
    --ae ANTHROPIC_DEFAULT_HAIKU_MODEL=$MODEL_NAME
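Because the exports in these examples are left blank, an unset variable can silently produce an invalid command. One possible safeguard, sketched below, is to check the variables before calling harbor run. The helper require_vars is hypothetical, and the values shown are placeholders.

```shell
# Fail early if any of the named environment variables is unset or empty.
require_vars() {
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing required variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Placeholder values for illustration only; substitute your own.
export BENCHMARK='swe-bench/swe-bench-verified'
export SERVER_URL='http://my.server.url'
export MODEL_NAME='my-model'

if require_vars BENCHMARK SERVER_URL MODEL_NAME; then
    echo "environment looks complete"
fi
```

The same check can be reused for the other harnesses by passing the variable names each one needs.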

Codex vLLM

Set the following variables in your environment:

export SERVER_URL=
export MODEL_NAME=

Use the utility script to create the config.toml file:

uv run scripts/codex_config_toml.py $MODEL_NAME $SERVER_URL

Then run:

harbor run --agent codex -d $BENCHMARK \
    -m vllm/$MODEL_NAME \
    --ae CODEX_HOME=/root/.codex/ \
    --mounts-json '[ { "type": "bind", "source":"/path/to/config.toml", "target": "/root/.codex/config.toml" } ]'

OpenClaw vLLM

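Set the following variables in your environment, replacing the example values with your own. The run command reads the dataset from a local path, so also set DATASET_DIR to the directory that contains your downloaded swe-bench-verified tasks: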

export MODEL_NAME='qwen3.6-35b'
export SERVER_URL='http://qwen36-35b-qwen36-35b.apps.ocp-beta-test.nerc.mghpcc.org'
export OPENAI_BASE_URL=$SERVER_URL/v1
export OPENAI_API_KEY='NONE'

Then run:

harbor run --agent openclaw -p $DATASET_DIR/swe-bench-verified \
    -m openai/$MODEL_NAME \
    --agent-kwarg thinking=off \
    --n-concurrent 8

OpenCode vLLM

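Set the following variables in your environment. The run command reads the dataset from a local path, so also set DATASET_DIR to the directory that contains your downloaded swe-bench-verified tasks: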

export MODEL_NAME=

Set the content of your OpenCode config in your environment. Remember to replace <server-url> with your vLLM server URL and <model-name> with your served model name:

export OPENCODE_CONFIG_CONTENT='{"$schema":"https://opencode.ai/config.json","model":"vllm/<model-name>","provider":{"vllm":{"npm":"@ai-sdk/openai-compatible","name":"vLLM","options":{"baseURL":"<server-url>"},"models":{"<model-name>":{"name":"<model-name>","limit":{"context":196500,"output":65500}}}}}}'

Then run:

harbor run --agent opencode -p $DATASET_DIR/swe-bench-verified \
    -m vllm/$MODEL_NAME \
    --ae "OPENCODE_CONFIG_CONTENT=$OPENCODE_CONFIG_CONTENT"
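Editing the placeholders in the OpenCode config by hand is easy to get wrong, since <model-name> appears three times. As one alternative, the sketch below fills them in with sed from two shell variables. The variable OPENCODE_CONFIG_TEMPLATE and the values my-model and http://my.server.url are placeholders introduced for illustration.

```shell
# Placeholder values for illustration only; substitute your own.
SERVER_URL='http://my.server.url'
MODEL_NAME='my-model'

# The same JSON as in the export above, kept as a template with both placeholders intact.
OPENCODE_CONFIG_TEMPLATE='{"$schema":"https://opencode.ai/config.json","model":"vllm/<model-name>","provider":{"vllm":{"npm":"@ai-sdk/openai-compatible","name":"vLLM","options":{"baseURL":"<server-url>"},"models":{"<model-name>":{"name":"<model-name>","limit":{"context":196500,"output":65500}}}}}}'

# Substitute the placeholders. "|" is the sed delimiter so slashes in the URL are safe.
OPENCODE_CONFIG_CONTENT=$(printf '%s' "$OPENCODE_CONFIG_TEMPLATE" |
    sed -e "s|<server-url>|$SERVER_URL|g" -e "s|<model-name>|$MODEL_NAME|g")
export OPENCODE_CONFIG_CONTENT
echo "$OPENCODE_CONFIG_CONTENT"
```

This simple substitution would misbehave if a value contained a "|" or "&" character, so it is only suitable for ordinary URLs and model names.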

Pi vLLM

Set the following variables in your environment:

export MODEL_NAME=

Create a models.json file with your vLLM server information. Replace <server-url> with your vLLM server URL and <model-name> with your served model name:

export PI_MODELS_JSON='{ "providers": { "vllm": { "baseUrl": "<server-url>", "api": "openai-completions", "apiKey": "NONE", "models": [{ "id": "<model-name>", "name": "<model-name>", "contextWindow": 262000 }] } } }'
echo "$PI_MODELS_JSON" > models.json

Then run:

harbor run --agent pi -d $BENCHMARK \
    -m vllm/$MODEL_NAME \
    --ae PI_OFFLINE=1 \
    --ae PI_CODING_AGENT_DIR=/root/.pi/agent \
    --mounts-json '[ { "type": "bind", "source":"/path/to/models.json", "target": "/root/.pi/agent/models.json" } ]'
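The bind mount source in the command above is an absolute path. Since models.json is written to the current directory, one way to avoid typing the path by hand is to build the --mounts-json value from the working directory, as sketched below. MODELS_JSON_PATH and MOUNTS_JSON are helper variables introduced here for illustration.

```shell
# Resolve models.json to an absolute path, since the bind mount source is absolute.
MODELS_JSON_PATH="$(pwd)/models.json"

# Build the --mounts-json value with the resolved path filled in.
MOUNTS_JSON=$(printf '[ { "type": "bind", "source": "%s", "target": "/root/.pi/agent/models.json" } ]' "$MODELS_JSON_PATH")
echo "$MOUNTS_JSON"
```

The result can then be passed as --mounts-json "$MOUNTS_JSON". The sketch does not escape special characters, so it assumes the working directory path contains no double quotes or backslashes.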

Qwen Code vLLM

Set the following variables in your environment:

export MODEL_NAME=
export SERVER_URL=
export OPENAI_BASE_URL=$SERVER_URL/v1
export OPENAI_API_KEY='NONE'

Then run, with DATASET_PATTERN set to a glob pattern that selects the tasks to include (e.g. "*ansible*"):

harbor run --agent qwen-coder -d $BENCHMARK \
    -i "$DATASET_PATTERN" \
    -m $MODEL_NAME

Claude Code VertexAI

Set the following variables in your environment:

export CLOUD_ML_REGION=
export ANTHROPIC_VERTEX_PROJECT_ID=
export ANTHROPIC_MODEL=

Then run:

harbor run --agent claude-code -d $BENCHMARK \
    --ae CLAUDE_CODE_USE_VERTEX=1 \
    --ae CLOUD_ML_REGION=$CLOUD_ML_REGION \
    --ae ANTHROPIC_VERTEX_PROJECT_ID=$ANTHROPIC_VERTEX_PROJECT_ID \
    --ae ANTHROPIC_MODEL=$ANTHROPIC_MODEL \
    --ae GOOGLE_APPLICATION_CREDENTIALS='/app/.config/gcloud/application_default_credentials.json' \
    --mounts-json '["~/.config/gcloud/application_default_credentials.json:/app/.config/gcloud/application_default_credentials.json"]'

Deploy models with vLLM

Check out deploy/qwen-all-in-one.yml for a sample vLLM deployment of RedHatAI/Qwen3.6-35B-A3B-NVFP4.

Apply to your cluster by running:

oc apply -f deploy/qwen-all-in-one.yml

SWE-Bench Acceleration

Use accelerated images for SWE-bench-verified

Download the SWE-Bench-Verified tasks

harbor download swe-bench/swe-bench-verified

Replace images with the accelerated ones from Epoch AI

uv run scripts/replace_swe_bench_images.py <path-to-dataset>

Pre-pull base images

Download the dataset

harbor download <dataset>

Pull all the base images

uv run scripts/pull_images.py <path-to-dataset>
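To see which base images a downloaded dataset refers to before pulling them, something like the sketch below could be used. It is not the implementation of scripts/pull_images.py, which may work differently; it assumes each task directory contains a Dockerfile somewhere beneath it, and the helper list_base_images and the demo tasks are hypothetical.

```shell
# List the unique base images referenced by FROM lines in a dataset's Dockerfiles.
# Assumption: each task directory contains a Dockerfile somewhere beneath it.
list_base_images() {
    dataset_dir="$1"
    find "$dataset_dir" -type f -name Dockerfile -exec awk '
        toupper($1) == "FROM" {
            # Skip options such as --platform=... and print the image reference.
            for (i = 2; i <= NF; i++) {
                if ($i !~ /^--/) { print $i; break }
            }
        }
    ' {} + | sort -u
}

# Demo on a throwaway directory with two hypothetical tasks.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/task-a/environment" "$demo_dir/task-b/environment"
printf 'FROM python:3.11-slim\nRUN true\n' > "$demo_dir/task-a/environment/Dockerfile"
printf 'FROM --platform=linux/amd64 ubuntu:22.04\n' > "$demo_dir/task-b/environment/Dockerfile"
list_base_images "$demo_dir"
rm -rf "$demo_dir"
```

The resulting list can be reviewed by hand or fed to your container runtime's pull command.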

Run with OpenShift

Run Tasks in OpenShift (Orchestrate Locally)

Log in to your cluster and select a project:

oc login --token=<token> --server=<server>
oc project <project>

Create ServiceAccounts and RoleBindings to run tasks:

oc apply -f deploy/harbor-task-sa.yml

Then in your harbor command, add the flag:

--environment-import-path coding_agent_bench.harbor_envs.openshift:OpenshiftEnvironment

Run Tasks and Orchestrate in OpenShift

Log in to your cluster and select a project:

oc login --token=<token> --server=<server>
oc project <project>

Create ServiceAccounts and RoleBindings to run tasks and orchestrate:

oc apply -f deploy/harbor-task-sa.yml
oc apply -f deploy/harbor-orchestrator-sa.yml

Create a MinIO deployment to store your job results:

oc apply -f deploy/harbor-minio.yml

Using the CLI, start a job with the --remote flag enabled and set --environment openshift, e.g.:

uv run coding-agent-bench \
    --agent claude-code \
    --dataset scale-ai/swe-bench-pro \
    --model-name my-model \
    --server-url http://my.server.url \
    --remote \
    --environment openshift

WIP

Run with Podman

Requires podman on PATH with a running Podman machine.

In your harbor command, add the flag:

--environment-import-path coding_agent_bench.harbor_envs.podman:PodmanEnvironment

Run with Gemini and Gemini CLI

export GOOGLE_CLOUD_PROJECT="<your-project>"

harbor run --agent gemini-cli -d $BENCHMARK \
    -m $MODEL_NAME

Run with vLLM and Gemini CLI

harbor run --agent gemini-cli -d $BENCHMARK \
    --ae GOOGLE_GEMINI_BASE_URL=$SERVER_URL \
    --ae GEMINI_MODEL=$MODEL_NAME \
    -m $MODEL_NAME
