Integration-Automation · JE-Chen · Jun 19, 2026
diff --git a/README.md b/README.md
@@ -13,6 +13,7 @@
 
 ## Table of Contents
 
+- [What's new (2026-06-19) — Agent Trajectory Evaluation](#whats-new-2026-06-19--agent-trajectory-evaluation)
 - [What's new (2026-06-19) — Approval Testing (Golden-Master Baselines)](#whats-new-2026-06-19--approval-testing-golden-master-baselines)
 - [What's new (2026-06-19) — Network Egress Allowlist Guard](#whats-new-2026-06-19--network-egress-allowlist-guard)
 - [What's new (2026-06-19) — Just-In-Time Credential Leases](#whats-new-2026-06-19--just-in-time-credential-leases)
@@ -88,6 +89,12 @@
 
 ---
 
+## What's new (2026-06-19) — Agent Trajectory Evaluation
+
+Score an agent run against a rubric. Full reference: [`docs/source/Eng/doc/new_features/v36_features_doc.rst`](docs/source/Eng/doc/new_features/v36_features_doc.rst).
+
+- **`evaluate_trajectory`** (`AC_evaluate_trajectory`, `ac_evaluate_trajectory`): scores a recorded trajectory (ordered `{action, args, observation}` steps) against a declarative rubric with the keys `required_actions` (+`ordered`), `forbidden_actions`, `max_steps`, and `success_contains`. Returns `{passed, score, steps, checks}`, where `score` is the fraction of applicable checks that passed and each entry in `checks` reports one expectation, so a failure pinpoints what was violated. The result is a deterministic, dependency-free signal for agent regression testing; the rubric is plain data, so it can live in JSON action files and travel over MCP.
+
 ## What's new (2026-06-19) — Approval Testing (Golden-Master Baselines)
 
 Lock outputs against a human-approved baseline. Full reference: [`docs/source/Eng/doc/new_features/v35_features_doc.rst`](docs/source/Eng/doc/new_features/v35_features_doc.rst).

diff --git a/README/README_zh-CN.md b/README/README_zh-CN.md
@@ -12,6 +12,7 @@
 
 ## 目录
 
+- [本次更新 (2026-06-19) — Agent 轨迹评估](#本次更新-2026-06-19--agent-轨迹评估)
 - [本次更新 (2026-06-19) — 核准式测试(Golden-Master 基准)](#本次更新-2026-06-19--核准式测试golden-master-基准)
 - [本次更新 (2026-06-19) — 网络出口允许清单守卫](#本次更新-2026-06-19--网络出口允许清单守卫)
 - [本次更新 (2026-06-19) — 即时凭证租约](#本次更新-2026-06-19--即时凭证租约)
@@ -87,6 +88,12 @@
 
 ---
 
+## 本次更新 (2026-06-19) — Agent 轨迹评估
+
+依评分标准为 agent 运行评分。完整参考:[`docs/source/Zh/doc/new_features/v36_features_doc.rst`](../docs/source/Zh/doc/new_features/v36_features_doc.rst)。
+
+- **`evaluate_trajectory`**(`AC_evaluate_trajectory`、`ac_evaluate_trajectory`):依声明式评分标准 —— `required_actions`(+`ordered`)、`forbidden_actions`、`max_steps`、`success_contains` —— 为一次记录的轨迹(有序 `{action, args, observation}` 步骤)评分。返回 `{passed, score, steps, checks}`,其中 `score` 为通过的适用检查占比,每个 `check` 精准指出被违反的期望。为 agent 回归测试提供确定性、无依赖的信号;rubric 为纯数据,可存于 JSON action 文件并经 MCP 传递。
+
 ## 本次更新 (2026-06-19) — 核准式测试(Golden-Master 基准)
 
 将输出锁定到人工核准的基准。完整参考:[`docs/source/Zh/doc/new_features/v35_features_doc.rst`](../docs/source/Zh/doc/new_features/v35_features_doc.rst)。

diff --git a/README/README_zh-TW.md b/README/README_zh-TW.md
@@ -12,6 +12,7 @@
 
 ## 目錄
 
+- [本次更新 (2026-06-19) — Agent 軌跡評估](#本次更新-2026-06-19--agent-軌跡評估)
 - [本次更新 (2026-06-19) — 核准式測試(Golden-Master 基準)](#本次更新-2026-06-19--核准式測試golden-master-基準)
 - [本次更新 (2026-06-19) — 網路出口允許清單守衛](#本次更新-2026-06-19--網路出口允許清單守衛)
 - [本次更新 (2026-06-19) — 即時憑證租約](#本次更新-2026-06-19--即時憑證租約)
@@ -87,6 +88,12 @@
 
 ---
 
+## 本次更新 (2026-06-19) — Agent 軌跡評估
+
+依評分標準為 agent 執行評分。完整參考:[`docs/source/Zh/doc/new_features/v36_features_doc.rst`](../docs/source/Zh/doc/new_features/v36_features_doc.rst)。
+
+- **`evaluate_trajectory`**(`AC_evaluate_trajectory`、`ac_evaluate_trajectory`):依宣告式評分標準 —— `required_actions`(+`ordered`)、`forbidden_actions`、`max_steps`、`success_contains` —— 為一次記錄的軌跡(有序 `{action, args, observation}` 步驟)評分。回傳 `{passed, score, steps, checks}`,其中 `score` 為通過的適用檢查佔比,每個 `check` 精準指出被違反的期望。為 agent 回歸測試提供確定性、無相依的訊號;rubric 為純資料,可存於 JSON action 檔並經 MCP 傳遞。
+
 ## 本次更新 (2026-06-19) — 核准式測試(Golden-Master 基準)
 
 將輸出鎖定到人工核准的基準。完整參考:[`docs/source/Zh/doc/new_features/v35_features_doc.rst`](../docs/source/Zh/doc/new_features/v35_features_doc.rst)。

diff --git a/docs/source/Eng/doc/new_features/v36_features_doc.rst b/docs/source/Eng/doc/new_features/v36_features_doc.rst
@@ -0,0 +1,56 @@
+Agent Trajectory Evaluation
+===========================
+
+As automations hand control to LLM agents, "did it still work?" becomes "did
+the agent take an acceptable path?". :func:`evaluate_trajectory` scores a
+recorded run against a declarative **rubric**, giving a deterministic,
+dependency-free signal for agent regression testing.
+
+A *trajectory* is the ordered list of steps a run took — each a dict with at
+least an ``"action"`` name and optionally ``"args"`` / ``"observation"``. The
+*rubric* is plain data (so it lives happily in a JSON action file or arrives
+over MCP):
+
+================================ ===================================================
+Rubric key                       Meaning
+================================ ===================================================
+``required_actions``             Actions that must all appear.
+``ordered``                      With the above, also require that relative order.
+``forbidden_actions``            Actions that must never appear.
+``max_steps``                    Upper bound on trajectory length.
+``success_contains``             Substring that must appear in some observation.
+================================ ===================================================
+
+Headless API
+------------
+
+.. code-block:: python
+
+    from je_auto_control import evaluate_trajectory
+
+    trajectory = [
+        {"action": "AC_focus_window", "observation": "focused"},
+        {"action": "AC_type_text", "observation": "typed"},
+        {"action": "AC_click_mouse", "observation": "Saved successfully"},
+    ]
+    result = evaluate_trajectory(trajectory, {
+        "required_actions": ["AC_type_text", "AC_click_mouse"],
+        "forbidden_actions": ["AC_kill_process"],
+        "max_steps": 10,
+        "success_contains": "Saved",
+    })
+    assert result["passed"]            # every applicable check passed
+    print(result["score"], result["checks"])
+
+``score`` is the fraction of applicable checks that passed; ``passed`` is true
+only when all pass; an empty rubric trivially passes. Each entry in ``checks``
+is ``{name, passed, detail}`` so a failure pinpoints the violated expectation.
+
+Executor command
+----------------
+
+``AC_evaluate_trajectory`` takes ``trajectory`` and ``rubric`` (each a JSON
+string from the visual builder, or already-decoded data from a JSON action file
+/ MCP) and returns ``{passed, score, steps, checks}``. The same operation is
+exposed as the MCP tool ``ac_evaluate_trajectory`` and as a Script Builder
+command under **Agent**.
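
The `trajectory_eval/trajectory_eval.py` module that holds the scoring logic does not appear in the portion of the diff shown here, so the rubric semantics can only be read from the documentation above. The block below is a minimal sketch of those documented semantics, not the project's implementation. The name `sketch_evaluate_trajectory`, the helper `_is_subsequence`, the check names, the `detail` strings, and the reading of `steps` as the step count are all assumptions made for illustration.

```python
from typing import Any, Dict, List


def _is_subsequence(wanted: List[str], actions: List[str]) -> bool:
    """True if `wanted` occurs in `actions` in the same relative order."""
    remaining = iter(actions)
    # `name in remaining` consumes the iterator up to the first match,
    # so each later name can only match a later step.
    return all(name in remaining for name in wanted)


def sketch_evaluate_trajectory(trajectory: List[Dict[str, Any]],
                               rubric: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for evaluate_trajectory (not the shipped code)."""
    actions = [step.get("action") for step in trajectory]
    observations = [str(step.get("observation", "")) for step in trajectory]
    checks: List[Dict[str, Any]] = []

    def record(name: str, passed: bool, detail: str) -> None:
        checks.append({"name": name, "passed": passed, "detail": detail})

    # Only keys present in the rubric produce a check ("applicable checks").
    required = rubric.get("required_actions")
    if required:
        missing = [name for name in required if name not in actions]
        record("required_actions", not missing, f"missing: {missing}")
        if rubric.get("ordered"):
            record("ordered", _is_subsequence(required, actions),
                   f"expected relative order: {required}")
    forbidden = rubric.get("forbidden_actions")
    if forbidden:
        seen = [name for name in forbidden if name in actions]
        record("forbidden_actions", not seen, f"seen: {seen}")
    if "max_steps" in rubric:
        record("max_steps", len(actions) <= rubric["max_steps"],
               f"{len(actions)} steps, limit {rubric['max_steps']}")
    if "success_contains" in rubric:
        needle = rubric["success_contains"]
        record("success_contains",
               any(needle in text for text in observations),
               f"looking for {needle!r}")

    passed_count = sum(1 for check in checks if check["passed"])
    # Assumption: an empty rubric scores 1.0, matching "trivially passes".
    score = passed_count / len(checks) if checks else 1.0
    return {"passed": passed_count == len(checks), "score": score,
            "steps": len(actions), "checks": checks}


# A run that contains both required actions, but in the wrong order.
run = [
    {"action": "AC_focus_window", "observation": "focused"},
    {"action": "AC_click_mouse", "observation": "Saved successfully"},
    {"action": "AC_type_text", "observation": "typed"},
]
report = sketch_evaluate_trajectory(run, {
    "required_actions": ["AC_type_text", "AC_click_mouse"],
    "ordered": True,
    "max_steps": 10,
    "success_contains": "Saved",
})
failed = [check["name"] for check in report["checks"] if not check["passed"]]
print(report["score"], failed)  # prints: 0.75 ['ordered']
```

Under these assumptions, four checks apply and only `ordered` fails, which is the kind of regression a presence-only rubric would miss. Whether the real evaluator counts `ordered` as its own check or folds it into `required_actions` would change the score, so the docs may be worth tightening on that point.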
diff --git a/docs/source/Eng/eng_index.rst b/docs/source/Eng/eng_index.rst
@@ -58,6 +58,7 @@ Comprehensive guides for all AutoControl features.
    doc/new_features/v33_features_doc
    doc/new_features/v34_features_doc
    doc/new_features/v35_features_doc
+   doc/new_features/v36_features_doc
    doc/ocr_backends/ocr_backends_doc
    doc/observability/observability_doc
    doc/operations_layer/operations_layer_doc

diff --git a/docs/source/Zh/doc/new_features/v36_features_doc.rst b/docs/source/Zh/doc/new_features/v36_features_doc.rst
@@ -0,0 +1,52 @@
+Agent 軌跡評估
+==============
+
+當自動化把控制權交給 LLM agent,「它還能運作嗎?」就變成「agent 是否走了可接受的路
+徑?」。:func:`evaluate_trajectory` 依宣告式\ **評分標準(rubric)**\ 為一次記錄的執行評
+分,為 agent 回歸測試提供確定性、無相依的訊號。
+
+*軌跡(trajectory)*\ 是該次執行所採取步驟的有序清單 —— 每步是一個至少含 ``"action"``
+名稱、可選含 ``"args"`` / ``"observation"`` 的 dict。*評分標準*\ 為純資料(因此可自在地
+存於 JSON action 檔或經 MCP 傳入):
+
+================================ ===================================================
+Rubric 鍵                        意義
+================================ ===================================================
+``required_actions``             必須全部出現的動作。
+``ordered``                      搭配上者,還要求其相對順序。
+``forbidden_actions``            絕不可出現的動作。
+``max_steps``                    軌跡長度上限。
+``success_contains``             必須出現在某個 observation 中的子字串。
+================================ ===================================================
+
+無頭 API
+--------
+
+.. code-block:: python
+
+    from je_auto_control import evaluate_trajectory
+
+    trajectory = [
+        {"action": "AC_focus_window", "observation": "focused"},
+        {"action": "AC_type_text", "observation": "typed"},
+        {"action": "AC_click_mouse", "observation": "Saved successfully"},
+    ]
+    result = evaluate_trajectory(trajectory, {
+        "required_actions": ["AC_type_text", "AC_click_mouse"],
+        "forbidden_actions": ["AC_kill_process"],
+        "max_steps": 10,
+        "success_contains": "Saved",
+    })
+    assert result["passed"]            # 所有適用的檢查都通過
+    print(result["score"], result["checks"])
+
+``score`` 為通過的適用檢查佔比;``passed`` 僅在全部通過時為真;空 rubric 直接通過。
+``checks`` 中每個項目為 ``{name, passed, detail}``,因此失敗時可精準指出被違反的期望。
+
+執行器指令
+----------
+
+``AC_evaluate_trajectory`` 接受 ``trajectory`` 與 ``rubric``\ (從視覺化建構器傳入時為
+JSON 字串,從 JSON action 檔 / MCP 傳入時為已解碼資料),回傳
+``{passed, score, steps, checks}``。相同操作亦提供為 MCP 工具
+``ac_evaluate_trajectory``,以及 Script Builder 中 **Agent** 分類下的指令。
diff --git a/docs/source/Zh/zh_index.rst b/docs/source/Zh/zh_index.rst
@@ -58,6 +58,7 @@ AutoControl 所有功能的完整使用指南。
    doc/new_features/v33_features_doc
    doc/new_features/v34_features_doc
    doc/new_features/v35_features_doc
+   doc/new_features/v36_features_doc
    doc/ocr_backends/ocr_backends_doc
    doc/observability/observability_doc
    doc/operations_layer/operations_layer_doc

diff --git a/je_auto_control/__init__.py b/je_auto_control/__init__.py
@@ -216,6 +216,8 @@
 from je_auto_control.utils.approval import (
     ApprovalResult, approve_artifact, pending_artifacts, verify_artifact,
 )
+# Agent trajectory evaluation: score a recorded run against a rubric
+from je_auto_control.utils.trajectory_eval import evaluate_trajectory
 # Background popup/interrupt watchdog (unattended automation)
 from je_auto_control.utils.watchdog import (
     PopupWatchdog, WatchdogRule, default_popup_watchdog,
@@ -656,6 +658,7 @@ def start_autocontrol_gui(*args, **kwargs):
     "EgressBlocked", "EgressPolicy", "get_egress_policy", "set_egress_policy",
     "ApprovalResult", "approve_artifact", "pending_artifacts",
     "verify_artifact",
+    "evaluate_trajectory",
     # MCP server
     "AuditLogger", "HttpMCPServer", "MCPContent", "MCPPrompt",
     "MCPPromptArgument", "MCPResource", "MCPServer", "MCPTool",

diff --git a/je_auto_control/gui/script_builder/command_schema.py b/je_auto_control/gui/script_builder/command_schema.py
@@ -801,6 +801,16 @@ def _add_misc_specs(specs: List[CommandSpec]) -> None:
                           default=".approvals"),),
         description="List artifacts awaiting approval.",
     ))
+    specs.append(CommandSpec(
+        "AC_evaluate_trajectory", "Agent", "Evaluate Trajectory",
+        fields=(
+            FieldSpec("trajectory", FieldType.STRING,
+                      placeholder='[{"action": "AC_click_mouse"}]'),
+            FieldSpec("rubric", FieldType.STRING,
+                      placeholder='{"required_actions": ["AC_type_text"]}'),
+        ),
+        description="Score an agent trajectory against a rubric (JSON).",
+    ))
     specs.append(CommandSpec(
         "AC_generate_sop", "Report", "Generate SOP Document",
         fields=(

diff --git a/je_auto_control/utils/executor/action_executor.py b/je_auto_control/utils/executor/action_executor.py
@@ -2991,6 +2991,21 @@ def _pending_artifacts(approvals_dir: str = ".approvals") -> Dict[str, Any]:
     return {"pending": pending_artifacts(approvals_dir)}
 
 
+def _evaluate_trajectory(trajectory: Any, rubric: Any) -> Dict[str, Any]:
+    """Adapter: score an agent trajectory against a declarative rubric.
+
+    ``trajectory`` / ``rubric`` may be JSON strings (from the visual builder)
+    or already-decoded list/dict (from JSON action files / MCP).
+    """
+    import json
+    from je_auto_control.utils.trajectory_eval import evaluate_trajectory
+    if isinstance(trajectory, str):
+        trajectory = json.loads(trajectory)
+    if isinstance(rubric, str):
+        rubric = json.loads(rubric)
+    return evaluate_trajectory(trajectory, rubric)
+
+
 class Executor:
     """
     Executor
@@ -3236,6 +3251,7 @@ def __init__(self):
             "AC_verify_artifact": _verify_artifact,
             "AC_approve_artifact": _approve_artifact,
             "AC_pending_artifacts": _pending_artifacts,
+            "AC_evaluate_trajectory": _evaluate_trajectory,
             "AC_a11y_record_start": _a11y_record_start,
             "AC_a11y_record_stop": _a11y_record_stop,
             "AC_a11y_record_events": _a11y_record_events,
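
The `_evaluate_trajectory` adapter above accepts either JSON strings or already-decoded data, and a malformed string from the visual builder would surface as a bare `json.JSONDecodeError`. The sketch below shows one way the same dual-input handling could name the offending field. `coerce_json` is a hypothetical helper, not something in this PR, and whether the executor wants this behavior is a design call for the author.

```python
import json
from typing import Any


def coerce_json(value: Any, field: str) -> Any:
    """Decode `value` if it is a JSON string; pass decoded data through."""
    if not isinstance(value, str):
        return value
    try:
        return json.loads(value)
    except json.JSONDecodeError as error:
        # Name the offending field so a builder user knows which box to fix.
        raise ValueError(f"{field} is not valid JSON: {error.msg}") from error


# The same trajectory, once as builder text and once as decoded data.
from_builder = coerce_json('[{"action": "AC_click_mouse"}]', "trajectory")
from_action_file = coerce_json([{"action": "AC_click_mouse"}], "trajectory")
assert from_builder == from_action_file
```

Both call sites end up with the same list, which is the property the adapter relies on before handing off to `evaluate_trajectory`.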

diff --git a/je_auto_control/utils/mcp_server/tools/_factories.py b/je_auto_control/utils/mcp_server/tools/_factories.py
@@ -2729,6 +2729,27 @@ def approval_testing_tools() -> List[MCPTool]:
     ]
 
 
+def trajectory_eval_tools() -> List[MCPTool]:
+    return [
+        MCPTool(
+            name="ac_evaluate_trajectory",
+            description=("Score an agent's recorded 'trajectory' (a list of "
+                         "{action, args, observation} steps) against a 'rubric' "
+                         "with optional keys required_actions (+ordered), "
+                         "forbidden_actions, max_steps, success_contains. "
+                         "Returns {passed, score, steps, checks} for agent "
+                         "regression testing."),
+            input_schema=schema(
+                {"trajectory": {"type": "array",
+                                "items": {"type": "object"}},
+                 "rubric": {"type": "object"}},
+                ["trajectory", "rubric"]),
+            handler=h.evaluate_trajectory,
+            annotations=READ_ONLY,
+        ),
+    ]
+
+
 def unattended_tools() -> List[MCPTool]:
     return [
         MCPTool(
@@ -3788,6 +3809,7 @@ def media_assert_tools() -> List[MCPTool]:
     ci_annotation_tools, clipboard_history_tools, audit_analysis_tools,
     process_doc_tools, tween_drag_tools, plugin_sdk_tools, governance_tools,
     credential_lease_tools, egress_tools, approval_testing_tools,
+    trajectory_eval_tools,
     screen_record_tools,
     process_and_shell_tools, remote_desktop_tools, gamepad_tools,
     usb_passthrough_tools, assertion_tools, data_source_tools,
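
For reference, the input schema registered above requires `trajectory` (an array of objects) and `rubric` (an object). The sketch below builds a JSON-RPC `tools/call` request of the shape an MCP client would send for this tool. The request `id` and the sample values are arbitrary, and how the message is framed on the wire depends on the transport, which this diff does not show.

```python
import json

# Arguments mirror the registered input schema: both keys are required.
arguments = {
    "trajectory": [
        {"action": "AC_type_text", "observation": "typed"},
        {"action": "AC_click_mouse", "observation": "Saved successfully"},
    ],
    "rubric": {
        "required_actions": ["AC_type_text", "AC_click_mouse"],
        "success_contains": "Saved",
    },
}

request = {
    "jsonrpc": "2.0",
    "id": 1,  # arbitrary request id chosen for this example
    "method": "tools/call",
    "params": {"name": "ac_evaluate_trajectory", "arguments": arguments},
}

# The rubric is plain data, so the whole request serializes without custom
# encoders.
wire = json.dumps(request)
print(wire)
```

Because the MCP handler receives decoded data, it can pass `trajectory` and `rubric` straight through, which matches the thin `h.evaluate_trajectory` wrapper in `_handlers.py`.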

diff --git a/je_auto_control/utils/mcp_server/tools/_handlers.py b/je_auto_control/utils/mcp_server/tools/_handlers.py
@@ -1325,6 +1325,12 @@ def pending_artifacts(approvals_dir: str = ".approvals"):
     return {"pending": _pending(approvals_dir)}
 
 
+def evaluate_trajectory(trajectory, rubric):
+    from je_auto_control.utils.trajectory_eval import (
+        evaluate_trajectory as _evaluate)
+    return _evaluate(trajectory, rubric)
+
+
 def vlm_locate(description: str,
                screen_region: Optional[List[int]] = None,
                model: Optional[str] = None) -> Optional[List[int]]:

diff --git a/je_auto_control/utils/trajectory_eval/__init__.py b/je_auto_control/utils/trajectory_eval/__init__.py
@@ -0,0 +1,6 @@
+"""Agent trajectory evaluation: score a recorded run against a rubric."""
+from je_auto_control.utils.trajectory_eval.trajectory_eval import (
+    evaluate_trajectory,
+)
+
+__all__ = ["evaluate_trajectory"]