Install Python 3.9+ from python.org. On Linux, you may need python3 instead of python.
Run from the repo root directory, and install the package first:
pip install -e .
Create a .env file from the template:
cp .env.example .env
# Edit .env with your actual API keys
- Double-check your API keys in .env
- Ensure the endpoint URLs are correct (no trailing slashes, correct resource names)
- Verify your Azure subscription has access to the Model Router and baseline model deployments
| Endpoint | URL format |
|---|---|
| Model Router | https://<resource>.services.ai.azure.com/models |
| Azure OpenAI (baseline/judge) | https://<resource>.openai.azure.com |
Some newer models (e.g. GPT-5) require max_completion_tokens instead of max_tokens. The client handles this automatically — just use max_tokens in your config and it will be mapped correctly.
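The mapping described above can be pictured with a minimal sketch. The helper name `map_token_param` and the `NEW_STYLE_PREFIXES` tuple are hypothetical, introduced only for illustration; the real client may detect affected models differently.

```python
# Hypothetical illustration of the max_tokens -> max_completion_tokens mapping.
# The prefix list is an assumption, not the client's actual detection logic.
NEW_STYLE_PREFIXES = ("gpt-5",)


def map_token_param(model: str, params: dict) -> dict:
    """Return a copy of params, renaming max_tokens when the model requires it."""
    mapped = dict(params)  # never mutate the caller's config
    if model.startswith(NEW_STYLE_PREFIXES) and "max_tokens" in mapped:
        mapped["max_completion_tokens"] = mapped.pop("max_tokens")
    return mapped


print(map_token_param("gpt-5", {"max_tokens": 1024}))
print(map_token_param("gpt-4o", {"max_tokens": 1024}))
```

Under these assumptions, the first call returns a dict keyed by `max_completion_tokens`, while the second leaves `max_tokens` unchanged.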
Some models only accept temperature=1 (default). If you get a temperature error, remove it from your config:
endpoints:
  baseline:
    parameters:
      max_tokens: 1024
      # Don't set temperature for gpt-5
You're hitting rate limits. Options:
- Reduce concurrency: lower max_parallel_requests in your config
- Wait and retry: the built-in retry with exponential backoff handles transient 429s
- Use --resume: split the run across multiple sessions
- Request higher limits: contact Azure support for your deployment
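For readers unfamiliar with the retry behavior mentioned in the options above, here is a rough sketch of exponential backoff with jitter. It is not the tool's built-in implementation: `RateLimitError`, `call_with_backoff`, and the default delays are all assumptions chosen for illustration.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for whatever exception your client raises on HTTP 429."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

With these defaults the waits are roughly 1 s, 2 s, 4 s, 8 s, and 16 s before the final attempt. The `sleep` parameter exists only so the function can be exercised without real delays.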
- Check max_parallel_requests: increase it (e.g. 10) if your endpoint supports it
- Use configs/large_scale.yaml for production-scale runs
- The judge phase takes 4 API calls per prompt; disable it for speed:
judge.enabled: false
Rough estimates at default concurrency:
| Phase | Time |
|---|---|
| Eval (5 parallel) | ~35 minutes |
| Judge (3 parallel) | ~110 minutes |
| Total | ~2.5 hours |
See how-to-resume-and-scale.md for tuning.
- Ensure the judge model is capable (gpt-5 or gpt-4o recommended)
- Check the judge prompt templates in configs/judge_prompts/; they should match your evaluation criteria
- With small samples (< 30), variance is expected
The judge evaluated the same pair twice with swapped ordering and gave different winners. This means the judge's preference was based on position (first vs second) rather than actual quality. The result is recorded as a tie to avoid bias.
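The swap-and-compare logic described above might look roughly like the following. This is a sketch under assumptions: the positional labels ("first", "second", "tie") and the function names are hypothetical, and the tool's actual judge output format may differ.

```python
def to_model(verdict: str, order: tuple) -> str:
    """Translate a positional verdict ('first', 'second', 'tie') into a model label."""
    if verdict == "tie":
        return "tie"
    return order[0] if verdict == "first" else order[1]


def reconcile(verdict_router_first: str, verdict_baseline_first: str) -> str:
    """Combine two judge passes over the same pair shown in opposite orders."""
    winner_a = to_model(verdict_router_first, ("router", "baseline"))
    winner_b = to_model(verdict_baseline_first, ("baseline", "router"))
    # Agreement across both orderings is trusted; disagreement suggests
    # position bias, so the pair is recorded as a tie.
    return winner_a if winner_a == winner_b else "tie"


print(reconcile("first", "second"))  # both passes favor the router
print(reconcile("first", "first"))   # judge picked position 1 both times
```

In this sketch the first call yields "router" and the second yields "tie", because picking the first position in both orderings means the judge named a different model each time.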
Yes. Edit configs/default.yaml:
judge:
  endpoint:
    deployment_name: "gpt-4o"  # or any capable model
Ensure the evaluation completed successfully. If it was interrupted, use --resume to finish it.
The matplotlib package is required. Install it:
pip install matplotlib
python scripts/export_results.py results/my-run --format csv --output results.csv
Install the Foundry extras:
pip install -e ".[foundry]"
Requires azure-ai-projects>=2.1.0,<3.0 and azure-identity>=1.15.
Run az login before running the Foundry eval:
az login
# If you have multiple tenants:
az login --tenant YOUR_TENANT_ID
Add it to your .env file. The format is:
AZURE_AI_PROJECT_ENDPOINT=https://<resource>.services.ai.azure.com/api/projects/<project-name>
You can find this URL in the Microsoft Foundry portal under Project settings → Overview.
This error means the SDK version doesn't support stored_evaluator. As of v2.1.0, the valid testing criteria types are:
- score_model: LLM-based grading
- python: code-based custom evaluator (used for cost/latency graders)
- label_model: classification
- string_check: regex/exact match
- text_similarity: embedding-based
The tool uses score_model for quality graders and python for cost/latency graders. No evaluator registration is needed.
Use jsonl instead. The file_content type was removed in azure-ai-projects v2.1.0. The correct data source format is:
data_source={"type": "jsonl", "source": {"type": "file_id", "id": file_id}}
The result_counts field from the SDK is a typed object, not a plain dict. Convert it with model_dump() before iterating:
counts = run_result.result_counts.model_dump()  # → {"passed": 4, "failed": 6, ...}
The live API returns results as a list of grader dicts, not a dict keyed by grader name:
{"results": [{"name": "quality_absolute_router", "score": 4.0, "passed": true}, ...]}
If you see this, your report.py may be expecting the older dict format. The current code handles both formats.
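If you maintain a customized report script, one way to accept both shapes is to normalize them into a single dict first. The sketch below is illustrative only: `normalize_results` is a hypothetical name, and it assumes each list entry carries a "name" key as in the sample payload above.

```python
def normalize_results(results):
    """Return {grader_name: grader_dict} from either the list or the dict format."""
    if isinstance(results, dict):
        # Older format: already keyed by grader name.
        return dict(results)
    # Newer format: a list of grader dicts, each with a "name" field.
    return {item["name"]: item for item in results}


live = [{"name": "quality_absolute_router", "score": 4.0, "passed": True}]
legacy = {"quality_absolute_router": {"score": 4.0, "passed": True}}

print(normalize_results(live)["quality_absolute_router"]["score"])
print(normalize_results(legacy)["quality_absolute_router"]["score"])
```

Both calls print 4.0, so downstream code can look graders up by name regardless of which format the API returned.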
This happens when the baseline model returns an empty response for a prompt. An empty response is scored 1/5 on all quality dimensions (Poor: incomplete, not useful). This is a data issue — the baseline model failed to generate output — not a bug in the scoring.
To investigate, check the foundry_input.jsonl file for empty baseline_response fields.
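A short script can do this check for you. The sketch below assumes each line of foundry_input.jsonl is a JSON object with a top-level baseline_response key; if your file nests that field differently, adjust the lookup. The function name is hypothetical.

```python
import json


def find_empty_baseline(path):
    """Return 1-based line numbers whose baseline_response is missing or blank."""
    empty = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # skip blank lines between records
            record = json.loads(line)
            # Assumption: the response is stored at the top level of each record.
            if not str(record.get("baseline_response") or "").strip():
                empty.append(lineno)
    return empty
```

Calling `find_empty_baseline("foundry_input.jsonl")` from the directory that contains the file returns the affected line numbers, which you can then match against the low-scoring prompts.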
The score_model graders call the judge model 3 times per prompt (router absolute, baseline absolute, pairwise). For 10 prompts, that's 30 LLM calls plus 20 Python grader executions. Typical time is 1–3 minutes.
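To estimate the workload for a different dataset size, the same arithmetic can be written as a small helper. The function name is hypothetical, and the figure of two Python graders per prompt is inferred from the 20 executions for 10 prompts quoted above (presumably cost and latency).

```python
def grader_workload(num_prompts, llm_calls_per_prompt=3, python_graders_per_prompt=2):
    """Estimate how many judge calls and Python grader runs an eval will trigger."""
    return {
        "llm_calls": num_prompts * llm_calls_per_prompt,
        "python_grader_runs": num_prompts * python_graders_per_prompt,
    }


print(grader_workload(10))  # → {'llm_calls': 30, 'python_grader_runs': 20}
```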
If it's much slower:
- Check your endpoint's rate limits and quotas
- The AZURE_AI_MODEL_DEPLOYMENT_NAME model must have sufficient capacity
- Try --skip-quality to run only cost/latency graders (instant)
python scripts/cross_validate.py
This compares results/full-eval/results.json against results/foundry-eval/results.json and prints a correlation table. Both should agree on quality direction, cost savings (within ~5%), and latency trends.
Open the CSV in Excel or Google Sheets.
Use forward slashes or raw strings in config files:
dataset: "datasets/my_prompts.jsonl"  # Forward slashes work on all platforms
Yes. Setup scripts are provided for both:
# macOS / Linux
bash scripts/setup.sh
bash scripts/demo.sh
# Windows
.\scripts\setup.ps1
.\scripts\demo.ps1