redhat-et/coding_agent_bench

Coding Agent Bench

Reproducible benchmarks for coding agents and models using Harbor

Leaderboards

SWE-Bench Verified (pass@1, N=500)

Model | Harness | Score | Cost
Sonnet 4.6 | Claude Code | 79.6% | N/A
RedHatAI/Qwen3.6-35B-A3B-NVFP4 | Pi | 65.0% | $51
RedHatAI/Qwen3.6-35B-A3B-NVFP4 | Qwen Code | 63.8% | $37
RedHatAI/Qwen3.6-35B-A3B-NVFP4 | Claude Code | 63.2% | $48
RedHatAI/Qwen3.6-35B-A3B-NVFP4 | OpenClaw | 58.8% | $67
RedHatAI/Qwen3.6-35B-A3B-NVFP4 | OpenCode | 54.8% | $67

SWE-Bench Pro - Ansible Tasks (pass@1, N=96)

Model | Harness | Score | Cost
Sonnet 4.6 | Claude Code | 50.0% | $184
RedHatAI/Qwen3.6-35B-A3B-NVFP4 | Pi | 47.9% | $13
RedHatAI/Qwen3.6-35B-A3B-NVFP4 | Claude Code | 45.6% | $10
RedHatAI/Qwen3.6-35B-A3B-NVFP4 | Qwen Code | 43.8% | $9
RedHatAI/Qwen3.6-35B-A3B-NVFP4 | OpenClaw | 40.6% | $9
RedHatAI/Qwen3.6-35B-A3B-NVFP4 | OpenCode | 37.5% | $11

More coming soon...

- Cost estimates for OSS models are calculated as $4 per A100 GPU-hour × agent benchmark duration (in hours).
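
As a worked example of the cost formula above, here is a minimal sketch. The function name and the `num_gpus` parameter are assumptions made for illustration; the note only states the $4 per A100 GPU-hour rate and that it is multiplied by the benchmark duration. The 12.75-hour run is hypothetical, not a measured duration.

```python
A100_RATE_USD_PER_GPU_HOUR = 4.0  # rate stated in the note above


def estimate_cost_usd(duration_hours: float, num_gpus: int = 1) -> float:
    """Estimate benchmark cost as rate x GPU-hours.

    num_gpus is an assumption: the note only mentions "agent benchmark
    duration", so a single GPU is the default here.
    """
    if duration_hours < 0 or num_gpus < 1:
        raise ValueError("duration must be non-negative and num_gpus >= 1")
    return A100_RATE_USD_PER_GPU_HOUR * duration_hours * num_gpus


if __name__ == "__main__":
    # Hypothetical run that occupies one A100 for 12.75 hours.
    print(f"${estimate_cost_usd(12.75):.0f}")
```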

CLI Utility

The CLI utility helps you configure and run benchmark jobs with Harbor for self-hosted models. It automatically constructs and runs the Harbor job command for your specified benchmark, agent, and self-hosted model.

Prerequisites

  • Install dependencies with uv

    uv sync
  • Set up a vLLM server, or another Anthropic- and OpenAI-compatible server

  • Select a benchmark from among the options in Harbor Hub
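
Before starting a job, it can help to confirm that the server from the prerequisites is reachable and serves the model you expect. The sketch below assumes the server exposes the OpenAI-style `GET /v1/models` endpoint (vLLM's OpenAI-compatible server does); the function names are illustrative and not part of this repository.

```python
import json
import urllib.request


def models_url(server_url: str) -> str:
    """Build the model-listing URL for an OpenAI-compatible server."""
    return server_url.rstrip("/") + "/v1/models"


def served_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def check_server(server_url: str, model_name: str, timeout: float = 10.0) -> bool:
    """Return True if model_name is listed by the server at server_url."""
    with urllib.request.urlopen(models_url(server_url), timeout=timeout) as resp:
        payload = json.load(resp)
    return model_name in served_model_ids(payload)
```

For example, `check_server("http://my.server.url", "my-model")` would return `True` only if the server lists a model whose ID is exactly `my-model`.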

Run a Benchmark

The following is the minimal configuration needed to run a job with the CLI:

uv run coding-agent-bench \
    --agent <agent> \
    --dataset <benchmark-name> \
    --model-name <model-name> \
    --server-url <server-url>

For example, to run scale-ai/swe-bench-pro with Claude Code against a self-hosted model:

uv run coding-agent-bench \
    --agent claude-code \
    --dataset scale-ai/swe-bench-pro \
    --model-name my-model \
    --server-url http://my.server.url

To preview the Harbor command that would be run for a given set of arguments without actually running the job, add the --dry-run flag.

Note

Additional configuration options are available; run uv run coding-agent-bench --help to see them.

Harbor Command Examples

Prerequisites

  • Install Harbor

  • Set up a vLLM server, or another Anthropic- and OpenAI-compatible server

  • Set your benchmark in your environment from among the options in Harbor Hub, e.g.:

    export BENCHMARK='swe-bench/swe-bench-verified'
  • If you need to filter tasks in your benchmark by name, add the -i flag with your glob pattern to your harbor run command, e.g. -i "*ansible*"
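
To preview which tasks a glob such as "*ansible*" would select, the sketch below applies shell-style matching to a list of task names. It assumes Harbor's -i filter behaves like standard shell globbing, which is not confirmed here, and the task names are hypothetical.

```python
from fnmatch import fnmatchcase


def filter_tasks(task_names: list[str], pattern: str) -> list[str]:
    """Keep task names that match a shell-style glob (case-sensitive)."""
    return [name for name in task_names if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    # Hypothetical task names, for illustration only.
    tasks = [
        "instance_ansible__ansible-0001",
        "instance_django__django-0002",
        "instance_ansible__ansible-0003",
    ]
    for name in filter_tasks(tasks, "*ansible*"):
        print(name)
```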

Table

Harness | Model Server | Example | Status
Claude Code | vLLM | Claude Code vLLM | Validated
Codex | vLLM | Codex vLLM | Testing
OpenClaw | vLLM | OpenClaw vLLM | Validated
OpenCode | vLLM | OpenCode vLLM | Validated
Pi | vLLM | Pi vLLM | Validated
Qwen Code | vLLM | Qwen Code vLLM | Validated
Claude Code | VertexAI | Claude Code VertexAI | Validated

Note

To use a locally hosted model (e.g. llama.cpp), follow a vLLM example and set SERVER_URL=http://host.docker.internal:<server-port>

Claude Code vLLM

Set the following variables in your environment:

export SERVER_URL=
export MODEL_NAME=

Then run:

harbor run --agent claude-code -d $BENCHMARK \
    --ae ANTHROPIC_BASE_URL=$SERVER_URL \
    --ae ANTHROPIC_API_KEY='sk-no-key-required' \
    --ae ANTHROPIC_MODEL=$MODEL_NAME \
    --ae ANTHROPIC_DEFAULT_OPUS_MODEL=$MODEL_NAME \
    --ae ANTHROPIC_DEFAULT_SONNET_MODEL=$MODEL_NAME \
    --ae ANTHROPIC_DEFAULT_HAIKU_MODEL=$MODEL_NAME
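
The command above can also be assembled programmatically, which is roughly what a wrapper CLI has to do. This is a minimal sketch that uses only the flags and variables shown above; it is not the implementation of coding-agent-bench, and the function name is an assumption.

```python
import shlex


def claude_code_command(benchmark: str, server_url: str, model_name: str) -> list[str]:
    """Build the argv for the `harbor run` invocation shown above."""
    agent_env = {
        "ANTHROPIC_BASE_URL": server_url,
        "ANTHROPIC_API_KEY": "sk-no-key-required",
        "ANTHROPIC_MODEL": model_name,
        "ANTHROPIC_DEFAULT_OPUS_MODEL": model_name,
        "ANTHROPIC_DEFAULT_SONNET_MODEL": model_name,
        "ANTHROPIC_DEFAULT_HAIKU_MODEL": model_name,
    }
    cmd = ["harbor", "run", "--agent", "claude-code", "-d", benchmark]
    for key, value in agent_env.items():
        # Each agent environment variable is passed as its own --ae KEY=VALUE pair.
        cmd += ["--ae", f"{key}={value}"]
    return cmd


if __name__ == "__main__":
    argv = claude_code_command(
        "swe-bench/swe-bench-verified", "http://my.server.url", "my-model"
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Building a list of arguments rather than a single string avoids quoting mistakes when a value contains spaces or shell metacharacters.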

Codex vLLM

Set the following variables in your environment:

export SERVER_URL=
export MODEL_NAME=

Use the utility script to create the config.toml file:

uv run scripts/codex_config_toml.py $MODEL_NAME $SERVER_URL

Then run:

harbor run --agent codex -d $BENCHMARK \
    -m vllm/$MODEL_NAME \
    --ae CODEX_HOME=/root/.codex/ \
    --mounts-json '[ { "type": "bind", "source":"/path/to/config.toml", "target": "/root/.codex/config.toml" } ]'
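
The source path in a bind mount generally needs to be absolute on the host. The helper below is a small sketch that resolves a local file to an absolute path and prints a --mounts-json value in the same shape as the one above; the function name is hypothetical.

```python
import json
from pathlib import Path


def bind_mounts_json(source: str, target: str) -> str:
    """Return a --mounts-json value with one bind mount.

    The source is expanded (~) and resolved to an absolute host path.
    """
    source_path = Path(source).expanduser().resolve()
    mounts = [{"type": "bind", "source": str(source_path), "target": target}]
    return json.dumps(mounts)


if __name__ == "__main__":
    print(bind_mounts_json("config.toml", "/root/.codex/config.toml"))
```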

OpenClaw vLLM

Set the following variables in your environment:

export MODEL_NAME='qwen3.6-35b'
export SERVER_URL='http://qwen36-35b-qwen36-35b.apps.ocp-beta-test.nerc.mghpcc.org'
export OPENAI_BASE_URL=$SERVER_URL/v1
export OPENAI_API_KEY='NONE'

Then run, with DATASET_DIR set to the directory that holds your downloaded datasets:

harbor run --agent openclaw -p $DATASET_DIR/swe-bench-verified \
    -m openai/$MODEL_NAME \
    --agent-kwarg thinking=off \
    --n-concurrent 8

OpenCode vLLM

Set the following variables in your environment:

export MODEL_NAME=

Set the content of your OpenCode config in your environment. Remember to replace <server-url> with your vLLM server URL and <model-name> with your served model name:

export OPENCODE_CONFIG_CONTENT='{"$schema":"https://opencode.ai/config.json","model":"vllm/<model-name>","provider":{"vllm":{"npm":"@ai-sdk/openai-compatible","name":"vLLM","options":{"baseURL":"<server-url>"},"models":{"<model-name>":{"name":"<model-name>","limit":{"context":196500,"output":65500}}}}}}'

Then run, with DATASET_DIR set to the directory that holds your downloaded datasets:

harbor run --agent opencode -p $DATASET_DIR/swe-bench-verified \
    -m vllm/$MODEL_NAME \
    --ae "OPENCODE_CONFIG_CONTENT=$OPENCODE_CONFIG_CONTENT"
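
Editing the single-line JSON by hand is error-prone, so one option is to generate it. The sketch below reproduces the same structure and limits as the OPENCODE_CONFIG_CONTENT value above; the function name and its parameters are assumptions for illustration.

```python
import json


def opencode_config(model_name: str, server_url: str,
                    context: int = 196500, output: int = 65500) -> str:
    """Render the OpenCode config shown above as compact JSON."""
    config = {
        "$schema": "https://opencode.ai/config.json",
        "model": f"vllm/{model_name}",
        "provider": {
            "vllm": {
                "npm": "@ai-sdk/openai-compatible",
                "name": "vLLM",
                "options": {"baseURL": server_url},
                "models": {
                    model_name: {
                        "name": model_name,
                        "limit": {"context": context, "output": output},
                    }
                },
            }
        },
    }
    # Compact separators keep the value on one line for use in `export`.
    return json.dumps(config, separators=(",", ":"))


if __name__ == "__main__":
    print(opencode_config("my-model", "http://my.server.url"))
```

The printed value can be assigned to OPENCODE_CONFIG_CONTENT, wrapped in single quotes so the shell leaves `$schema` untouched.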

Pi vLLM

Set the following variables in your environment:

export MODEL_NAME=

Create a models.json file with your vLLM server information. Replace <server-url> with your vLLM server URL and <model-name> with your served model name:

export PI_MODELS_JSON='{ "providers": { "vllm": { "baseUrl": "<server-url>", "api": "openai-completions", "apiKey": "NONE", "models": [{ "id": "<model-name>", "name": "<model-name>", "contextWindow": 262000 }] } } }'
echo "$PI_MODELS_JSON" > models.json

Then run:

harbor run --agent pi -d $BENCHMARK \
    -m vllm/$MODEL_NAME \
    --ae PI_OFFLINE=1 \
    --ae PI_CODING_AGENT_DIR=/root/.pi/agent \
    --mounts-json '[ { "type": "bind", "source":"/path/to/models.json", "target": "/root/.pi/agent/models.json" } ]'
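
As an alternative to echoing a JSON string, models.json can be written from Python. This sketch mirrors the structure of PI_MODELS_JSON above; using the served model name for both `id` and `name` is an assumption, and the function names are hypothetical.

```python
import json
from pathlib import Path


def pi_models_config(model_name: str, server_url: str,
                     context_window: int = 262000) -> dict:
    """Build the Pi provider config with the same keys as PI_MODELS_JSON."""
    return {
        "providers": {
            "vllm": {
                "baseUrl": server_url,
                "api": "openai-completions",
                "apiKey": "NONE",
                "models": [
                    {
                        # Assumption: id and name both use the served model name.
                        "id": model_name,
                        "name": model_name,
                        "contextWindow": context_window,
                    }
                ],
            }
        }
    }


def write_pi_models_json(path: str, model_name: str, server_url: str) -> Path:
    """Write the config to disk and return the absolute path for the bind mount."""
    target = Path(path).expanduser().resolve()
    target.write_text(json.dumps(pi_models_config(model_name, server_url), indent=2))
    return target
```

The path returned by `write_pi_models_json` can be used as the `source` of the bind mount in the command above.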

Qwen Code vLLM

Set the following variables in your environment:

export MODEL_NAME=
export SERVER_URL=
export OPENAI_BASE_URL=$SERVER_URL/v1
export OPENAI_API_KEY='NONE'

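Then run, with DATASET_PATTERN set to a task-name glob such as "*ansible*" (omit the -i line to run every task):

harbor run --agent qwen-coder -d $BENCHMARK \
    -i "$DATASET_PATTERN" \
    -m $MODEL_NAME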

Claude Code VertexAI

Set the following variables in your environment:

export CLOUD_ML_REGION=
export ANTHROPIC_VERTEX_PROJECT_ID=
export ANTHROPIC_MODEL=

Then run:

harbor run --agent claude-code -d $BENCHMARK \
    --ae CLAUDE_CODE_USE_VERTEX=1 \
    --ae CLOUD_ML_REGION=$CLOUD_ML_REGION \
    --ae ANTHROPIC_VERTEX_PROJECT_ID=$ANTHROPIC_VERTEX_PROJECT_ID \
    --ae ANTHROPIC_MODEL=$ANTHROPIC_MODEL \
    --ae GOOGLE_APPLICATION_CREDENTIALS='/app/.config/gcloud/application_default_credentials.json' \
    --mounts-json '["~/.config/gcloud/application_default_credentials.json:/app/.config/gcloud/application_default_credentials.json"]'

Deploy models with vLLM

Check out deploy/qwen-all-in-one.yml for a sample vLLM deployment of RedHatAI/Qwen3.6-35B-A3B-NVFP4.

Apply to your cluster by running:

oc apply -f deploy/qwen-all-in-one.yml

SWE-Bench Acceleration

Use accelerated images for SWE-bench-verified

  1. Download the SWE-Bench-Verified tasks:

    harbor download swe-bench/swe-bench-verified
  2. Replace the images with the accelerated ones from Epoch AI:

    uv run scripts/replace_swe_bench_images.py <path-to-dataset>

Pre-pull base images

  1. Download the dataset:

    harbor download <dataset>
  2. Pull all the base images:

    uv run scripts/pull_images.py <path-to-dataset>
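
To see which base images a pre-pull step would need, one approach is to scan the dataset's Dockerfiles for FROM lines. The sketch below is not the contents of scripts/pull_images.py; it assumes each task ships a file named Dockerfile somewhere under the dataset directory, and the function names are hypothetical.

```python
from pathlib import Path


def base_images(dockerfile_text: str) -> list[str]:
    """Return external base images named by FROM lines, in order, deduplicated."""
    images: list[str] = []
    stage_aliases: set[str] = set()
    for line in dockerfile_text.splitlines():
        parts = line.strip().split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Drop optional flags such as --platform=linux/amd64.
        refs = [p for p in parts[1:] if not p.startswith("--")]
        if not refs:
            continue
        image = refs[0]
        # Record "AS <name>" so later "FROM <name>" stages are not treated as images.
        if len(refs) >= 3 and refs[1].upper() == "AS":
            stage_aliases.add(refs[2])
        if image not in stage_aliases and image != "scratch" and image not in images:
            images.append(image)
    return images


def dataset_base_images(dataset_dir: str) -> list[str]:
    """Collect base images from every Dockerfile under dataset_dir."""
    found: list[str] = []
    for dockerfile in sorted(Path(dataset_dir).rglob("Dockerfile")):
        for image in base_images(dockerfile.read_text()):
            if image not in found:
                found.append(image)
    return found
```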

Run with OpenShift

Run Tasks in OpenShift (Orchestrate Locally)

Log in to your cluster and select a project:

oc login --token=<token> --server=<server>
oc project <project>

Create ServiceAccounts and RoleBindings to run tasks:

oc apply -f deploy/harbor-task-sa.yml

Then in your harbor command, add the flag:

--environment-import-path coding_agent_bench.harbor_envs.openshift:OpenshiftEnvironment

Run Tasks and Orchestrate in OpenShift

Log in to your cluster and select a project:

oc login --token=<token> --server=<server>
oc project <project>

Create ServiceAccounts and RoleBindings to run tasks and orchestrate:

oc apply -f deploy/harbor-task-sa.yml
oc apply -f deploy/harbor-orchestrator-sa.yml

Create a MinIO deployment to store your job results:

oc apply -f deploy/harbor-minio.yml

Using the CLI, start a job with the --remote flag enabled and set --environment openshift, e.g.:

uv run coding-agent-bench \
    --agent claude-code \
    --dataset scale-ai/swe-bench-pro \
    --model-name my-model \
    --server-url http://my.server.url \
    --remote \
    --environment openshift

WIP

Run with Podman

Requires podman on PATH with a running Podman machine.

In your harbor command, add the flag:

--environment-import-path coding_agent_bench.harbor_envs.podman:PodmanEnvironment

Run with Gemini and Gemini CLI

export GOOGLE_CLOUD_PROJECT="<your-project>"

harbor run --agent gemini-cli -d $BENCHMARK \
    -m $MODEL_NAME

Run with vLLM and Gemini CLI

harbor run --agent gemini-cli -d $BENCHMARK \
    --ae GOOGLE_GEMINI_BASE_URL=$SERVER_URL \
    --ae GEMINI_MODEL=$MODEL_NAME \
    -m $MODEL_NAME
