feat(trend): multi-run score trend analyzer — detect behavioral drift across sequential eval runs #913
## Summary
AgentV's compare command does excellent pairwise A/B comparison between two runs. What's missing: detecting whether scores are trending up or down across N sequential runs.
A 10-run regression where each run scores slightly lower than the previous is completely invisible to pairwise comparison — but it's exactly the failure mode that matters in production agent deployment.
## Concrete scenario
You're running weekly eval sweeps on `evals/code-review.yaml` against `claude-sonnet`. Each week's results live in `.agentv/results/runs/`. After 8 weeks:

```
week-1: 0.92   week-2: 0.91   week-3: 0.89   week-4: 0.88
week-5: 0.86   week-6: 0.85   week-7: 0.83   week-8: 0.82
```

`agentv compare week-7/index.jsonl week-8/index.jsonl` → tiny delta, within threshold → no alert.
There's no command that says: "over these 8 runs, slope = -0.014/run, direction: DEGRADING, regression detected."
## Proposed: `agentv trend` command
```
agentv trend --last 10                      # most recent N runs
agentv trend --last 10 --eval code-review   # filter by eval
agentv trend run-01/ run-02/ run-03/        # explicit paths
```

Example output:

```
Trend Analysis — 8 runs (2026-03-01 → 2026-03-22)
Eval: code-review | Target: claude-sonnet
Slope: -0.014/run
Direction: DEGRADING ⚠️
Verdict: REGRESSION DETECTED (threshold: ±0.01/run)
```
## Implementation sketch
The data is all there — `ResultManifestRecord.score`, `.timestamp`, `.eval_id`, and `.target` are already in the JSONL manifest files. A `TrendAnalyzer` would:

- Load N run directories sorted by timestamp
- Compute mean score per run (already done in `results summary`)
- OLS via `statistics.linearRegression` (stdlib, zero new deps)
- Return `{ slope, direction, anyRegression }` + rich table output
CI gate integration:

```
agentv trend --last 10 --threshold 0.01 || exit 1
```

This is the proactive version of compare's reactive A/B check. Pairwise comparison tells you "did this run get worse than the last run?" Trend analysis tells you "has this agent been getting progressively worse for 10 runs?"
## Related
This pattern applies directly to AgentV Studio's quality gates (#788) — a trend gate is more robust than a single-run threshold gate.
The cross-session behavioral drift gap is documented across 65+ independent agent tools in: PDR in Production v2.6