docs(evaluation): revamp evals documentation for new eval system by KarthikAvinashFI · Pull Request #648 · future-agi/docs

KarthikAvinashFI · 2026-05-08T09:16:06Z

Summary

Brings the evaluation docs in line with the post-revamp platform. Replaces the old four-method taxonomy (LLM as Judge / Deterministic / Statistical Metric / LLM as Ranker) with what the UI actually shows today: Agents, LLM-As-A-Judge, and Code. Adds new concept and feature pages for things that were undocumented (composite evals, versioning, ground truth, error localization, test playground, data injection, output types in their new label-based form, MCP connectors). Rewrites the trace and simulation eval guides around the actual Tasks and Create a Simulation flows.

Linear: TH-4638, TH-4934

What changed

Concepts (under evaluation/concepts/)

Rewritten: eval-types, eval-templates, eval-results, judge-models, understanding-evaluation.
New: output-types, data-injection, composite-evals, versioning, mcp-connectors.

Features (under evaluation/features/)

Rewritten: custom, evaluate.
New: test-playground, error-localization, ground-truth, mcp-connectors.
Minor: custom-models (added trace projects to the surfaces list).

Cookbooks (under cookbook/evaluation/)

New: eval-with-mcp-connectors, an end-to-end CRM lookup example.

Surface-specific eval guides (outside evaluation/)

observe/features/evals rewritten around the Tasks flow (Basic Info / Evaluations / Filters / Scheduling) and the Historical data / New incoming data run modes.

quickstart/running-evals-in-simulation aligned with the 4-step Create a Simulation wizard (Add simulation details, Choose Scenario(s), Select Evaluations, Summary) and updated mapping fields.

Navigation

src/lib/navigation.ts updated to include the new concept, feature, and cookbook pages in the sidebar.

Removed

eval-groups.mdx and all references. The Groups feature is no longer reachable from the main UI navigation.

Style guide compliance

All concept pages start with ## About; no UI walkthrough screenshots in concept pages.
All feature pages have one screenshot placeholder per major step.
No em-dashes, no marketing language, no bold headings.
Internal terms (AgentLoop, Falcon AI Loop, Temporal, Celery, RestrictedPython, nsjail, VLLM, internal class names) do not appear in any doc.
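The rules above could be checked mechanically before review. The following is a minimal sketch, not the tooling used for this PR: `check_page`, `BANNED_TERMS`, and the frontmatter handling are hypothetical names and assumptions, and only the term list comes from the description above (the "internal class names" item cannot be enumerated here and is left out).

```python
import re

# Terms copied from the style-guide list above; the checker itself is hypothetical.
BANNED_TERMS = ["AgentLoop", "Falcon AI Loop", "Temporal", "Celery",
                "RestrictedPython", "nsjail", "VLLM"]

# A Markdown heading whose whole title is wrapped in bold markers.
BOLD_HEADING = re.compile(r"^#{1,6}\s+\*\*.+\*\*\s*$")


def _body_without_frontmatter(lines):
    """Return the lines after a leading '---' frontmatter block, if one exists."""
    if lines and lines[0].strip() == "---":
        for index in range(1, len(lines)):
            if lines[index].strip() == "---":
                return lines[index + 1:]
    return lines


def check_page(text, is_concept=False):
    """Return a list of style problems found in one page's source text."""
    problems = []
    lines = text.splitlines()

    if is_concept:
        body = _body_without_frontmatter(lines)
        first = next((ln.strip() for ln in body if ln.strip()), "")
        if first != "## About":
            problems.append("concept page does not start with '## About'")

    for number, line in enumerate(lines, start=1):
        if "\u2014" in line:  # em-dash character
            problems.append(f"line {number}: em-dash")
        if BOLD_HEADING.match(line):
            problems.append(f"line {number}: bold heading")
        for term in BANNED_TERMS:
            # Case-sensitive on purpose, so ordinary words such as
            # "temporal" in lowercase are not flagged.
            if re.search(rf"\b{re.escape(term)}\b", line):
                problems.append(f"line {number}: internal term {term!r}")
    return problems
```

Running `check_page(source, is_concept=True)` on each concept page and `check_page(source)` on the rest would return an empty list for a compliant page. A capitalized "Temporal" at the start of a sentence would still be flagged, so results need a manual read.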

Verification

pnpm audit-links: 0 broken nav links, 0 broken content links.
Every concrete UI claim cross-checked against the live frontend (EvalCreatePage.jsx, EvalPickerConfigFull.jsx, TestPlayground.jsx, etc.) and backend (model_hub/types.py, evaluations/engine/instance.py, ee/evals/llm/agent_evaluator/).

Test plan

pnpm audit-links passes.
pnpm build passes.
Spot-check each new and rewritten page in pnpm dev.
Replace screenshot placeholders before un-drafting.

Aligns the evaluation docs with the post-revamp platform: three eval types (Agents / LLM-As-A-Judge / Code), three output types (Pass/fail / Scoring / Choices), composite templates, versioning, ground truth, error localization, and updated apply flows for datasets, trace projects (now via Tasks), and simulation.

Concepts (rewritten / new):
- eval-types: 3-type taxonomy matching the create-page tabs
- eval-templates: built-in vs custom, single vs composite, versioning
- eval-results: result formats per output type
- judge-models: Turing models + bring-your-own
- understanding-evaluation: surfaces and how it all fits
- output-types (new): Pass/fail, Scoring (label-based), Choices
- data-injection (new): the six Context options
- composite-evals (new): aggregation functions and child axis
- versioning (new): Set as Default, Restore Version, pinning

Features (rewritten / new):
- custom: full create flow for all 3 types with field reference
- evaluate: dataset apply flow + SDK
- test-playground (new): four source modes, AI generate
- error-localization (new): toggle, run lifecycle, SDK
- ground-truth (new): upload, mapping, embedding statuses

Surface-specific updates:
- observe/features/evals: rewritten around the Tasks page flow (Basic Info / Evaluations / Filters / Scheduling)
- quickstart/running-evals-in-simulation: aligned with the 4-step Create a Simulation wizard

Eval Groups was removed from docs as the feature is no longer exposed in the main UI navigation.

TH-4638


Cleaned up auto-generated descriptions and parameter tables across 90 built-in eval reference pages. Removed truncated description suffixes, replaced placeholder parameter descriptions (e.g. "The output.") with concrete type and value-shape information.

…k (TH-4934)
- Concept: how connectors plug into Agent-mode evals, what the judge sees, cost and latency.
- Feature: UI walkthrough for attaching connectors to an eval, troubleshooting table.
- Cookbook: end-to-end example using a CRM MCP server to verify support replies.
- Nav: register all three under Evaluation / Cookbooks.

KarthikAvinashFI added 6 commits May 8, 2026 14:45

docs(evaluation): generate doc pages for all 75 missing built-in evals

1d038b5

Adds reference pages for built-in evals that were missing documentation (deterministic, statistical, and agent-mode templates). Also fixes Detect Hallucination input requirement.

docs(evaluation): list all 75 newly-added built-in evals in catalog page

af5e7f8

Adds rows for the freshly-generated reference pages so users can find them from the Built-in Evals catalog.

fix(docs): escape angle brackets in F-Beta description to fix MDX parse error

2e04ac5

fix(docs): escape <= in latency-check description to fix MDX parse error

9d16800
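The two fixes above address the same failure mode: MDX reads a bare `<` in prose as the start of a JSX tag, so a description containing a comparison such as `<=` fails to parse. A minimal sketch of one way to escape generated descriptions, assuming HTML entities are acceptable in the rendered page. `escape_for_mdx` is a hypothetical helper, not the change made in these commits, and the sample strings in the usage note are invented for illustration.

```python
import re

# Inline code spans (`...`) are captured so their contents can be left untouched.
_CODE_SPAN = re.compile(r"(`[^`]*`)")


def escape_for_mdx(description):
    """Escape angle brackets in prose so MDX does not parse them as JSX."""
    # With one capturing group, re.split alternates between prose (even
    # indexes) and code spans (odd indexes).
    parts = _CODE_SPAN.split(description)
    escaped = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            escaped.append(part)  # code span: keep verbatim
        else:
            escaped.append(part.replace("<", "&lt;").replace(">", "&gt;"))
    return "".join(escaped)
```

For example, `escape_for_mdx("Passes when latency <= 500 ms")` returns `"Passes when latency &lt;= 500 ms"`, while text inside backticks is returned unchanged. The function is idempotent because the escaped output contains no angle brackets.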

KarthikAvinashFI marked this pull request as ready for review May 11, 2026 07:33

KarthikAvinashFI wants to merge 7 commits into dev from karthikavinash/th-4638-evals-revamp-doc
