feat: regression guard, agent card, recommend command, external datasets by himmi-01 · Pull Request #16 · Corbell-AI/evalmonkey

himmi-01 · 2026-06-02T06:35:23Z

feat(1B): Regression Guard: automatically detects score drops after every run-benchmark; the new 'evalmonkey guard' command exits with code 1 for CI/CD gating. Configurable via the EVAL_REGRESSION_THRESHOLD env var (default: 5 points).
feat(1C): Agent Card: 'evalmonkey report' generates shareable Markdown with a shields.io badge, a per-scenario score table, and a production reliability metric. Paste the badge directly into your README.
feat(1A): 'evalmonkey recommend': reads agent_type from evalmonkey.yaml and shows a curated benchmark suite for it (research_agent, coding_agent, rag_agent, customer_support, voice_agent, safety_agent, general).
feat(phase2): External and Private Datasets
- '--dataset my_cases.jsonl' flag for local JSONL/JSON/CSV files
- 'hf::org/dataset' prefix to load any HuggingFace dataset directly
- 'confident-ai::dataset_id' / 'braintrust::ref' / 'langsmith::id' prefixes to use EvalMonkey as chaos+scoring harness on top of your existing eval platform datasets (API key from .env)
- Generic REST harness via 'private_benchmarks' block in evalmonkey.yaml
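
To make the dataset prefixes above concrete, here is a minimal sketch of how a `--dataset` value could be routed to a source. It is not the code in this PR: `DatasetSpec`, `parse_dataset_spec`, and the `"local"` label are hypothetical names, and only the prefixes and file extensions come from the list above.

```python
from pathlib import Path
from typing import NamedTuple

# Prefixes and extensions named in the PR description; everything else is assumed.
KNOWN_PREFIXES = ("hf", "confident-ai", "braintrust", "langsmith")
LOCAL_SUFFIXES = {".jsonl", ".json", ".csv"}


class DatasetSpec(NamedTuple):
    """Hypothetical parsed form of a --dataset value."""
    source: str  # one of KNOWN_PREFIXES, or "local" for a file path
    ref: str     # dataset id / reference, or the local path


def parse_dataset_spec(spec: str) -> DatasetSpec:
    """Split 'prefix::ref' values; treat anything else as a local file."""
    prefix, sep, rest = spec.partition("::")
    if sep and prefix in KNOWN_PREFIXES:
        if not rest:
            raise ValueError(f"missing dataset reference after '{prefix}::'")
        return DatasetSpec(prefix, rest)
    if Path(spec).suffix.lower() in LOCAL_SUFFIXES:
        return DatasetSpec("local", spec)
    raise ValueError(f"unrecognised dataset spec: {spec!r}")
```

For example, `parse_dataset_spec("hf::org/dataset")` yields the source `hf` with reference `org/dataset`, while `parse_dataset_spec("my_cases.jsonl")` is treated as a local file.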
docs: add sample agent benchmark cards to assets/agent_cards/
tests: 53 new unit tests (all mocked, no network), 149 total passing
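
For reviewers, the regression guard behaviour described in feat(1B) can be sketched roughly as follows. This is an illustration under assumptions, not the implementation in this PR: the function names are hypothetical, scores are assumed to be per-scenario numbers keyed by scenario name, and a drop is assumed to count as a regression only when it is strictly greater than the threshold. Only the EVAL_REGRESSION_THRESHOLD variable, its default of 5, and the exit code 1 come from the description.

```python
import os
from typing import Mapping, Optional


def regression_threshold(env: Optional[Mapping[str, str]] = None) -> float:
    """Read EVAL_REGRESSION_THRESHOLD, falling back to the documented default of 5."""
    env = os.environ if env is None else env
    return float(env.get("EVAL_REGRESSION_THRESHOLD", "5"))


def find_regressions(
    previous: Mapping[str, float],
    current: Mapping[str, float],
    threshold: float,
) -> dict:
    """Return {scenario: drop} for scenarios whose score fell by more than threshold.

    Assumption: scenarios missing from the previous run are ignored.
    """
    return {
        name: previous[name] - score
        for name, score in current.items()
        if name in previous and previous[name] - score > threshold
    }


def guard_exit_code(
    previous: Mapping[str, float],
    current: Mapping[str, float],
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Return 1 when any regression is found (for CI/CD gating), else 0."""
    drops = find_regressions(previous, current, regression_threshold(env))
    return 1 if drops else 0
```

A CI step would pass the returned value to `sys.exit`, so a drop larger than the threshold fails the pipeline.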

himmi-01 force-pushed the feat/regression-guard-agent-card-private-datasets branch from 0ad9aab to 251c296 on June 2, 2026 06:38

himmi-01 closed this Jun 2, 2026

himmi-01 wants to merge 1 commit into Corbell-AI:main from himmi-01:feat/regression-guard-agent-card-private-datasets
