feat: regression guard, agent card, recommend command, external datasets #16

Closed
himmi-01 wants to merge 1 commit into Corbell-AI:main from himmi-01:feat/regression-guard-agent-card-private-datasets

himmi-01 (Contributor) commented Jun 2, 2026

  • feat(1B): Regression Guard. Automatically detects score drops after every 'run-benchmark'. The new 'evalmonkey guard' command exits with code 1 when a regression is found, so it can gate CI/CD pipelines. The threshold is configurable via the EVAL_REGRESSION_THRESHOLD env var (default: 5 points).

  • feat(1C): Agent Card. 'evalmonkey report' generates a shareable Markdown card with a shields.io badge, a per-scenario score table, and a production reliability metric. The badge can be pasted directly into a README.

  • feat(1A): 'evalmonkey recommend' reads agent_type from evalmonkey.yaml and shows a curated benchmark suite for that type (research_agent, coding_agent, rag_agent, customer_support, voice_agent, safety_agent, general).

  • feat(phase2): External and Private Datasets

    • '--dataset my_cases.jsonl' flag for local JSONL/JSON/CSV files
    • 'hf::org/dataset' prefix to load any Hugging Face dataset directly
    • 'confident-ai::dataset_id' / 'braintrust::ref' / 'langsmith::id' prefixes to use EvalMonkey as a chaos and scoring harness on top of datasets from your existing eval platform (API key read from .env)
    • Generic REST harness via a 'private_benchmarks' block in evalmonkey.yaml
  • docs: add sample agent benchmark cards to assets/agent_cards/

  • tests: 53 new unit tests (all mocked, no network); 149 tests passing in total
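
For reviewers, the regression guard described above can be sketched roughly as follows. This is only an illustration of the stated behavior (threshold from EVAL_REGRESSION_THRESHOLD, default 5 points, exit code 1 on a regression), not the code in this PR. The function names, the fallback on an unparsable value, and the choice of "drop strictly greater than the threshold" are all assumptions.

```python
import os

DEFAULT_THRESHOLD = 5.0  # points; the default stated in the PR description


def regression_threshold(env=None):
    """Read EVAL_REGRESSION_THRESHOLD, falling back to the 5-point default.

    Hypothetical helper: falling back on a missing or non-numeric value
    is an assumption, not confirmed behavior.
    """
    env = os.environ if env is None else env
    raw = env.get("EVAL_REGRESSION_THRESHOLD", "")
    try:
        return float(raw)
    except ValueError:
        return DEFAULT_THRESHOLD


def guard_exit_code(previous_score, current_score, threshold):
    """Return 1 if the score fell by more than `threshold` points, else 0.

    Assumption: a drop exactly equal to the threshold does not fail.
    """
    drop = previous_score - current_score
    return 1 if drop > threshold else 0
```

A CI step could then end with something like `sys.exit(guard_exit_code(previous, current, regression_threshold()))`, where `previous` and `current` come from wherever run history is stored.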

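
The dataset references listed in the description ('hf::', 'confident-ai::', 'braintrust::', 'langsmith::', and local JSONL/JSON/CSV paths) could be routed with a small parser along these lines. This is a minimal sketch under assumptions: `parse_dataset_ref`, the `"local"` source label, and the error handling are hypothetical and not taken from the PR.

```python
from pathlib import PurePosixPath

# Prefixes named in the PR description.
KNOWN_PREFIXES = ("hf", "confident-ai", "braintrust", "langsmith")
# File types named for the --dataset flag.
LOCAL_SUFFIXES = {".jsonl", ".json", ".csv"}


def parse_dataset_ref(ref):
    """Split a dataset reference into (source, identifier).

    Hypothetical helper: 'hf::org/dataset' -> ('hf', 'org/dataset'),
    'my_cases.jsonl' -> ('local', 'my_cases.jsonl').
    """
    prefix, sep, rest = ref.partition("::")
    if sep and prefix in KNOWN_PREFIXES:
        if not rest:
            raise ValueError(f"missing identifier after '{prefix}::'")
        return prefix, rest
    if PurePosixPath(ref).suffix.lower() in LOCAL_SUFFIXES:
        return "local", ref
    raise ValueError(f"unrecognized dataset reference: {ref!r}")
```

Keeping the prefix table in one place means a new platform only needs a new entry plus its loader, and unknown references fail early instead of being treated as file paths.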
himmi-01 force-pushed the feat/regression-guard-agent-card-private-datasets branch from 0ad9aab to 251c296 on June 2, 2026 06:38
himmi-01 closed this on Jun 2, 2026