Reproducible benchmarks for coding agents and models using Harbor
- Leaderboards
- CLI Utility
- Harbor Command Examples
- Deploy models with vLLM
- SWE-Bench Acceleration
- Run with Openshift
- WIP
| Model | Harness | Score | Cost |
|---|---|---|---|
| Sonnet 4.6 | Claude Code | 79.6% | N/A |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | Pi | 65.0% | $51† |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | Qwen Code | 63.8% | $37† |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | Claude Code | 63.2% | $48† |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | OpenClaw | 58.8% | $67† |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | OpenCode | 54.8% | $67† |
| Model | Harness | Score | Cost |
|---|---|---|---|
| Sonnet 4.6 | Claude Code | 50.0% | $184 |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | Pi | 47.9% | $13† |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | Claude Code | 45.6% | $10† |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | Qwen Code | 43.8% | $9† |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | OpenClaw | 40.6% | $9† |
| RedHatAI/Qwen3.6-35B-A3B-NVFP4 | OpenCode | 37.5% | $11† |
More coming soon...
† Cost estimates for OSS models are calculated as $4 per A100 GPU-hour × agent benchmark duration.
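As a rough illustration of the footnote's formula, the sketch below multiplies an hourly GPU rate by the benchmark duration. The function name `estimate_cost` and the `gpu_count` parameter are assumptions made for illustration (the footnote does not state how many GPUs were used); only the $4 per A100 GPU-hour rate comes from the note above.

```python
def estimate_cost(
    duration_hours: float,
    gpu_count: int = 1,               # assumption: the footnote does not give a GPU count
    rate_per_gpu_hour: float = 4.0,   # $4 per A100 GPU-hour, as stated in the footnote
) -> float:
    """Return an estimated cost in dollars: rate x GPUs x hours."""
    if duration_hours < 0:
        raise ValueError("duration_hours must be non-negative")
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return rate_per_gpu_hour * gpu_count * duration_hours


if __name__ == "__main__":
    # Hypothetical 2.5-hour benchmark run on a single GPU.
    print(f"${estimate_cost(2.5):.0f}")  # prints $10
```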
The CLI utility helps you configure and run benchmark jobs with Harbor for self-hosted models. It automatically constructs and runs the Harbor job command for your specified benchmark, agent, and self-hosted model.
- Install dependencies with uv:
  uv sync
- Set up a vLLM server, or another Anthropic- or OpenAI-compatible server
- Select a benchmark from among the options in Harbor Hub
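Before starting a job, it can help to confirm which model names the server actually serves. The sketch below assumes the server implements the OpenAI-style `GET /v1/models` endpoint (vLLM's OpenAI-compatible server does; an Anthropic-only server may not). The helper names `served_model_ids` and `fetch_served_model_ids` are hypothetical and not part of this repository.

```python
import json
from urllib.request import urlopen


def served_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [model["id"] for model in payload.get("data", [])]


def fetch_served_model_ids(server_url: str, timeout: float = 10.0) -> list[str]:
    """Query <server_url>/v1/models and return the ids the server reports."""
    url = f"{server_url.rstrip('/')}/v1/models"
    with urlopen(url, timeout=timeout) as response:
        return served_model_ids(json.load(response))


if __name__ == "__main__":
    # Placeholder URL from the examples in this document; replace with your own.
    print(fetch_served_model_ids("http://my.server.url"))
```

The id reported here is typically the value to pass as the model name.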
The following is the minimal configuration needed to run a job with the CLI:
uv run coding-agent-bench \
--agent <agent> \
--dataset <benchmark-name> \
--model-name <model-name> \
--server-url <server-url>
For example, to run scale-ai/swe-bench-pro with Claude Code against a self-hosted model:
uv run coding-agent-bench \
--agent claude-code \
--dataset scale-ai/swe-bench-pro \
--model-name my-model \
--server-url http://my.server.url
To preview the Harbor command that would be run for a given set of arguments without actually running the job, add the --dry-run flag.
Note
Additional configuration options are available; run uv run coding-agent-bench --help to see them.
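To make the idea of constructing a Harbor command concrete, here is a minimal sketch of how such a command could be assembled for the Claude Code harness. It mirrors the flags used in the Claude Code with vLLM example in this document. `build_harbor_command` is a hypothetical helper, not the CLI's actual implementation, and the real utility may add or change flags.

```python
import shlex


def build_harbor_command(dataset: str, model_name: str, server_url: str) -> list[str]:
    """Assemble a `harbor run` argument list for Claude Code against a self-hosted model."""
    # Environment variables passed to the agent via repeated --ae flags,
    # copied from the Claude Code + vLLM example in this document.
    agent_env = {
        "ANTHROPIC_BASE_URL": server_url,
        "ANTHROPIC_API_KEY": "sk-no-key-required",
        "ANTHROPIC_MODEL": model_name,
        "ANTHROPIC_DEFAULT_OPUS_MODEL": model_name,
        "ANTHROPIC_DEFAULT_SONNET_MODEL": model_name,
        "ANTHROPIC_DEFAULT_HAIKU_MODEL": model_name,
    }
    command = ["harbor", "run", "--agent", "claude-code", "-d", dataset]
    for key, value in agent_env.items():
        command += ["--ae", f"{key}={value}"]
    return command


if __name__ == "__main__":
    # Print a shell-safe preview instead of running the job.
    preview = build_harbor_command("scale-ai/swe-bench-pro", "my-model", "http://my.server.url")
    print(shlex.join(preview))
```

Building the command as a list and quoting it only for display avoids shell-escaping mistakes when a value contains spaces or quotes.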
- Install Harbor
- Set up a vLLM server, or another Anthropic- or OpenAI-compatible server
- Set your benchmark in your environment from among the options in Harbor Hub, e.g.:
  export BENCHMARK='swe-bench/swe-bench-verified'
- If you need to filter tasks in your benchmark by name, add the -i flag with your glob pattern to your harbor run command, e.g. -i "*ansible*"
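To preview which tasks a pattern such as `*ansible*` would select, Python's `fnmatch` module gives a reasonable approximation of glob matching. Whether Harbor's `-i` flag follows exactly the same rules (for example, around case sensitivity) is an assumption, and the task names below are made up for illustration.

```python
from fnmatch import fnmatchcase


def filter_tasks(task_names: list[str], pattern: str) -> list[str]:
    """Return the task names that match a shell-style glob pattern (case-sensitive)."""
    return [name for name in task_names if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    # Hypothetical task names, not taken from any real dataset listing.
    tasks = ["ansible__ansible-1234", "django__django-5678", "astropy__astropy-42"]
    print(filter_tasks(tasks, "*ansible*"))  # prints ['ansible__ansible-1234']
```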
| Harness | Model Server | Example | Status |
|---|---|---|---|
| Claude Code | vLLM | Link | Validated |
| Codex | vLLM | Link | Testing |
| OpenClaw | vLLM | Link | Validated |
| OpenCode | vLLM | Link | Validated |
| Pi | vLLM | Link | Validated |
| Qwen Code | vLLM | Link | Validated |
| Claude Code | VertexAI | Link | Validated |
Note
To use a locally hosted model (e.g. llama.cpp), follow a vLLM example and set SERVER_URL=http://host.docker.internal:<server-port>
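The reason for `host.docker.internal` is that `localhost` inside a task container refers to the container itself rather than to your machine. A small sketch of that rewrite follows; `to_container_url` is a hypothetical helper, and the set of hostnames treated as local is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: these hostnames all mean "the machine running the model server".
LOCAL_HOSTS = {"localhost", "127.0.0.1", "0.0.0.0"}


def to_container_url(server_url: str) -> str:
    """Rewrite a local server URL so it is reachable from inside a container."""
    parts = urlsplit(server_url)
    if parts.hostname not in LOCAL_HOSTS:
        return server_url  # remote servers need no rewriting
    netloc = "host.docker.internal"
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


if __name__ == "__main__":
    print(to_container_url("http://localhost:8000"))  # prints http://host.docker.internal:8000
```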
Set the following variables in your environment:
export SERVER_URL=
export MODEL_NAME=
Then run:
harbor run --agent claude-code -d $BENCHMARK \
--ae ANTHROPIC_BASE_URL=$SERVER_URL \
--ae ANTHROPIC_API_KEY='sk-no-key-required' \
--ae ANTHROPIC_MODEL=$MODEL_NAME \
--ae ANTHROPIC_DEFAULT_OPUS_MODEL=$MODEL_NAME \
--ae ANTHROPIC_DEFAULT_SONNET_MODEL=$MODEL_NAME \
--ae ANTHROPIC_DEFAULT_HAIKU_MODEL=$MODEL_NAME
Set the following variables in your environment:
export SERVER_URL=
export MODEL_NAME=
Use the utility script to create the config.toml file:
uv run scripts/codex_config_toml.py $MODEL_NAME $SERVER_URL
Then run:
harbor run --agent codex -d $BENCHMARK \
-m vllm/$MODEL_NAME \
--ae CODEX_HOME=/root/.codex/ \
--mounts-json '[ { "type": "bind", "source":"/path/to/config.toml", "target": "/root/.codex/config.toml" } ]'
Set the following variables in your environment:
export MODEL_NAME='qwen3.6-35b'
export SERVER_URL='http://qwen36-35b-qwen36-35b.apps.ocp-beta-test.nerc.mghpcc.org'
export OPENAI_BASE_URL=$SERVER_URL/v1
export OPENAI_API_KEY='NONE'
Then run:
harbor run --agent openclaw -p $DATASET_DIR/swe-bench-verified \
-m openai/$MODEL_NAME \
--agent-kwarg thinking=off \
--n-concurrent 8
Set the following variables in your environment:
export MODEL_NAME=
Set the content of your OpenCode config in your environment. Remember to replace <server-url> with your vLLM server URL and <model-name> with your served model name:
export OPENCODE_CONFIG_CONTENT='{"$schema":"https://opencode.ai/config.json","model":"vllm/<model-name>","provider":{"vllm":{"npm":"@ai-sdk/openai-compatible","name":"vLLM","options":{"baseURL":"<server-url>"},"models":{"<model-name>":{"name":"<model-name>","limit":{"context":196500,"output":65500}}}}}}'
Then run:
harbor run --agent opencode -p $DATASET_DIR/swe-bench-verified \
-m vllm/$MODEL_NAME \
--ae "OPENCODE_CONFIG_CONTENT=$OPENCODE_CONFIG_CONTENT"
Set the following variables in your environment:
export MODEL_NAME=
Create a models.json file with your vLLM server information:
export PI_MODELS_JSON='{ "providers": { "vllm": { "baseUrl": "<server-url>", "api": "openai-completions", "apiKey": "NONE", "models": [{ "id": "<model-name>", "name": "<model-name>", "contextWindow": 262000 }] } } }'
echo "$PI_MODELS_JSON" > models.json
Then run:
harbor run --agent pi -d $BENCHMARK \
-m vllm/$MODEL_NAME \
--ae PI_OFFLINE=1 \
--ae PI_CODING_AGENT_DIR=/root/.pi/agent \
--mounts-json '[ { "type": "bind", "source":"/path/to/models.json", "target": "/root/.pi/agent/models.json" } ]'
Set the following variables in your environment:
export MODEL_NAME=
export SERVER_URL=
export OPENAI_BASE_URL=$SERVER_URL/v1
export OPENAI_API_KEY='NONE'
Then run:
harbor run --agent qwen-coder -d $BENCHMARK \
-i $DATASET_PATTERN \
-m $MODEL_NAME
Set the following variables in your environment:
export CLOUD_ML_REGION=
export ANTHROPIC_VERTEX_PROJECT_ID=
export ANTHROPIC_MODEL=
Then run:
harbor run --agent claude-code -d $BENCHMARK \
--ae CLAUDE_CODE_USE_VERTEX=1 \
--ae CLOUD_ML_REGION=$CLOUD_ML_REGION \
--ae ANTHROPIC_VERTEX_PROJECT_ID=$ANTHROPIC_VERTEX_PROJECT_ID \
--ae ANTHROPIC_MODEL=$ANTHROPIC_MODEL \
--ae GOOGLE_APPLICATION_CREDENTIALS='/app/.config/gcloud/application_default_credentials.json' \
--mounts-json '["~/.config/gcloud/application_default_credentials.json:/app/.config/gcloud/application_default_credentials.json"]'
Check out deploy/qwen-all-in-one.yml for a sample vLLM deployment of RedHatAI/Qwen3.6-35B-A3B-NVFP4.
Apply to your cluster by running:
oc apply -f deploy/qwen-all-in-one.yml
- Download the SWE-Bench-Verified tasks:
  harbor download swe-bench/swe-bench-verified
- Replace images with the accelerated ones from Epoch AI:
  uv run scripts/replace_swe_bench_images.py <path-to-dataset>
- Download the dataset:
  harbor download <dataset>
- Pull all the base images:
  uv run scripts/pull_images.py <path-to-dataset>
Log in to your cluster and select a project:
oc login --token=<token> --server=<server>
oc project <project>
Create ServiceAccounts and RoleBindings to run tasks:
oc apply -f deploy/harbor-task-sa.yml
Then, in your harbor command, add the flag:
--environment-import-path coding_agent_bench.harbor_envs.openshift:OpenshiftEnvironment
Log in to your cluster and select a project:
oc login --token=<token> --server=<server>
oc project <project>
Create ServiceAccounts and RoleBindings to run and orchestrate tasks:
oc apply -f deploy/harbor-task-sa.yml
oc apply -f deploy/harbor-orchestrator-sa.yml
Create a MinIO deployment to store your job results:
oc apply -f deploy/harbor-minio.yml
Using the CLI, start a job with the --remote flag enabled and set --environment openshift, e.g.:
uv run coding-agent-bench \
--agent claude-code \
--dataset scale-ai/swe-bench-pro \
--model-name my-model \
--server-url http://my.server.url \
--remote \
--environment openshift
Requires podman on PATH with a running Podman machine.
In your harbor command, add the flag:
--environment-import-path coding_agent_bench.harbor_envs.podman:PodmanEnvironment
export GOOGLE_CLOUD_PROJECT="<your-project>"
harbor run --agent gemini-cli -d $BENCHMARK \
-m $MODEL_NAME
harbor run --agent gemini-cli -d $BENCHMARK \
--ae GOOGLE_GEMINI_BASE_URL=$SERVER_URL \
--ae GEMINI_MODEL=$MODEL_NAME \
-m $MODEL_NAME