7 changes: 7 additions & 0 deletions README.md
@@ -13,6 +13,7 @@

## Table of Contents

- [What's new (2026-06-19) — Agent Trajectory Evaluation](#whats-new-2026-06-19--agent-trajectory-evaluation)
- [What's new (2026-06-19) — Approval Testing (Golden-Master Baselines)](#whats-new-2026-06-19--approval-testing-golden-master-baselines)
- [What's new (2026-06-19) — Network Egress Allowlist Guard](#whats-new-2026-06-19--network-egress-allowlist-guard)
- [What's new (2026-06-19) — Just-In-Time Credential Leases](#whats-new-2026-06-19--just-in-time-credential-leases)
@@ -88,6 +89,12 @@

---

## What's new (2026-06-19) — Agent Trajectory Evaluation

Score an agent run against a rubric. Full reference: [`docs/source/Eng/doc/new_features/v36_features_doc.rst`](docs/source/Eng/doc/new_features/v36_features_doc.rst).

- **`evaluate_trajectory`** (`AC_evaluate_trajectory`, `ac_evaluate_trajectory`): scores a recorded trajectory (ordered `{action, args, observation}` steps) against a declarative rubric: `required_actions` (+`ordered`), `forbidden_actions`, `max_steps`, `success_contains`. Returns `{passed, score, steps, checks}`, where `score` is the fraction of applicable checks that passed and each entry in `checks` reports one expectation, so a failure pinpoints what was violated. A deterministic, dependency-free signal for agent regression testing; the rubric is plain data, so it can live in JSON action files and travel over MCP.
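
The rubric semantics described above can be approximated in a few lines. What follows is a minimal sketch, not the project's `trajectory_eval` implementation (which this diff does not show). `sketch_evaluate` is a hypothetical name, and the check names, the `detail` strings, treating `steps` as a step count, and scoring an empty rubric as 1.0 are all assumptions made for illustration.

```python
from typing import Any, Dict, List


def sketch_evaluate(trajectory: List[Dict[str, Any]],
                    rubric: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical evaluator: one check per rubric key that is present."""
    actions = [step.get("action") for step in trajectory]
    checks: List[Dict[str, Any]] = []

    def add(name: str, passed: bool, detail: str) -> None:
        checks.append({"name": name, "passed": passed, "detail": detail})

    required = rubric.get("required_actions")
    if required is not None:
        missing = [a for a in required if a not in actions]
        add("required_actions", not missing, f"missing: {missing}")
        if rubric.get("ordered"):
            # Subsequence test: membership on an iterator consumes it, so
            # each required action must occur after the previous match.
            remaining = iter(actions)
            in_order = all(a in remaining for a in required)
            add("ordered", in_order, "required actions in relative order")

    forbidden = rubric.get("forbidden_actions")
    if forbidden is not None:
        hits = [a for a in actions if a in forbidden]
        add("forbidden_actions", not hits, f"found: {hits}")

    if "max_steps" in rubric:
        add("max_steps", len(trajectory) <= rubric["max_steps"],
            f"{len(trajectory)} steps")

    needle = rubric.get("success_contains")
    if needle is not None:
        found = any(needle in str(step.get("observation", ""))
                    for step in trajectory)
        add("success_contains", found, f"looking for {needle!r}")

    passed_count = sum(1 for check in checks if check["passed"])
    # Assumption: with no applicable checks, the score is 1.0.
    score = passed_count / len(checks) if checks else 1.0
    return {"passed": passed_count == len(checks), "score": score,
            "steps": len(trajectory), "checks": checks}


# A run that skips a required action and uses a forbidden one.
run = [
    {"action": "AC_type_text", "observation": "typed"},
    {"action": "AC_kill_process", "observation": "killed"},
]
result = sketch_evaluate(run, {
    "required_actions": ["AC_type_text", "AC_click_mouse"],
    "forbidden_actions": ["AC_kill_process"],
    "max_steps": 10,
})
failed = [check["name"] for check in result["checks"] if not check["passed"]]
print(result["passed"], result["score"], failed)
```

Only keys present in the rubric contribute a check, which is one way to read "applicable": here three checks apply and one passes, so the run fails with a partial score.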

## What's new (2026-06-19) — Approval Testing (Golden-Master Baselines)

Lock outputs against a human-approved baseline. Full reference: [`docs/source/Eng/doc/new_features/v35_features_doc.rst`](docs/source/Eng/doc/new_features/v35_features_doc.rst).
7 changes: 7 additions & 0 deletions README/README_zh-CN.md
@@ -12,6 +12,7 @@

## 目录

- [本次更新 (2026-06-19) — Agent 轨迹评估](#本次更新-2026-06-19--agent-轨迹评估)
- [本次更新 (2026-06-19) — 核准式测试(Golden-Master 基准)](#本次更新-2026-06-19--核准式测试golden-master-基准)
- [本次更新 (2026-06-19) — 网络出口允许清单守卫](#本次更新-2026-06-19--网络出口允许清单守卫)
- [本次更新 (2026-06-19) — 即时凭证租约](#本次更新-2026-06-19--即时凭证租约)
@@ -87,6 +88,12 @@

---

## 本次更新 (2026-06-19) — Agent 轨迹评估

依评分标准为 agent 运行评分。完整参考:[`docs/source/Zh/doc/new_features/v36_features_doc.rst`](../docs/source/Zh/doc/new_features/v36_features_doc.rst)。

- **`evaluate_trajectory`**(`AC_evaluate_trajectory`、`ac_evaluate_trajectory`):依声明式评分标准 —— `required_actions`(+`ordered`)、`forbidden_actions`、`max_steps`、`success_contains` —— 为一次记录的轨迹(有序 `{action, args, observation}` 步骤)评分。返回 `{passed, score, steps, checks}`,其中 `score` 为通过的适用检查占比,每个 `check` 精准指出被违反的期望。为 agent 回归测试提供确定性、无依赖的信号;rubric 为纯数据,可存于 JSON action 文件并经 MCP 传递。

## 本次更新 (2026-06-19) — 核准式测试(Golden-Master 基准)

将输出锁定到人工核准的基准。完整参考:[`docs/source/Zh/doc/new_features/v35_features_doc.rst`](../docs/source/Zh/doc/new_features/v35_features_doc.rst)。
7 changes: 7 additions & 0 deletions README/README_zh-TW.md
@@ -12,6 +12,7 @@

## 目錄

- [本次更新 (2026-06-19) — Agent 軌跡評估](#本次更新-2026-06-19--agent-軌跡評估)
- [本次更新 (2026-06-19) — 核准式測試(Golden-Master 基準)](#本次更新-2026-06-19--核准式測試golden-master-基準)
- [本次更新 (2026-06-19) — 網路出口允許清單守衛](#本次更新-2026-06-19--網路出口允許清單守衛)
- [本次更新 (2026-06-19) — 即時憑證租約](#本次更新-2026-06-19--即時憑證租約)
@@ -87,6 +88,12 @@

---

## 本次更新 (2026-06-19) — Agent 軌跡評估

依評分標準為 agent 執行評分。完整參考:[`docs/source/Zh/doc/new_features/v36_features_doc.rst`](../docs/source/Zh/doc/new_features/v36_features_doc.rst)。

- **`evaluate_trajectory`**(`AC_evaluate_trajectory`、`ac_evaluate_trajectory`):依宣告式評分標準 —— `required_actions`(+`ordered`)、`forbidden_actions`、`max_steps`、`success_contains` —— 為一次記錄的軌跡(有序 `{action, args, observation}` 步驟)評分。回傳 `{passed, score, steps, checks}`,其中 `score` 為通過的適用檢查佔比,每個 `check` 精準指出被違反的期望。為 agent 回歸測試提供確定性、無相依的訊號;rubric 為純資料,可存於 JSON action 檔並經 MCP 傳遞。

## 本次更新 (2026-06-19) — 核准式測試(Golden-Master 基準)

將輸出鎖定到人工核准的基準。完整參考:[`docs/source/Zh/doc/new_features/v35_features_doc.rst`](../docs/source/Zh/doc/new_features/v35_features_doc.rst)。
56 changes: 56 additions & 0 deletions docs/source/Eng/doc/new_features/v36_features_doc.rst
@@ -0,0 +1,56 @@
Agent Trajectory Evaluation
===========================

As automations hand control to LLM agents, "did it still work?" becomes "did
the agent take an acceptable path?". :func:`evaluate_trajectory` scores a
recorded run against a declarative **rubric**, giving a deterministic,
dependency-free signal for agent regression testing.

A *trajectory* is the ordered list of steps a run took — each a dict with at
least an ``"action"`` name and optionally ``"args"`` / ``"observation"``. The
*rubric* is plain data (so it lives happily in a JSON action file or arrives
over MCP):

================================ ===================================================
Rubric key                       Meaning
================================ ===================================================
``required_actions``             Actions that must all appear.
``ordered``                      With the above, also require that relative order.
``forbidden_actions``            Actions that must never appear.
``max_steps``                    Upper bound on trajectory length.
``success_contains``             Substring that must appear in some observation.
================================ ===================================================

Headless API
------------

.. code-block:: python

    from je_auto_control import evaluate_trajectory

    trajectory = [
        {"action": "AC_focus_window", "observation": "focused"},
        {"action": "AC_type_text", "observation": "typed"},
        {"action": "AC_click_mouse", "observation": "Saved successfully"},
    ]
    result = evaluate_trajectory(trajectory, {
        "required_actions": ["AC_type_text", "AC_click_mouse"],
        "forbidden_actions": ["AC_kill_process"],
        "max_steps": 10,
        "success_contains": "Saved",
    })
    assert result["passed"]  # every applicable check passed
    print(result["score"], result["checks"])

``score`` is the fraction of applicable checks that passed; ``passed`` is true
only when all pass; an empty rubric trivially passes. Each entry in ``checks``
is ``{name, passed, detail}`` so a failure pinpoints the violated expectation.

Executor command
----------------

``AC_evaluate_trajectory`` takes ``trajectory`` and ``rubric`` (each a JSON
string from the visual builder, or already-decoded data from a JSON action file
/ MCP) and returns ``{passed, score, steps, checks}``. The same operation is
exposed as the MCP tool ``ac_evaluate_trajectory`` and as a Script Builder
command under **Agent**.
1 change: 1 addition & 0 deletions docs/source/Eng/eng_index.rst
@@ -58,6 +58,7 @@ Comprehensive guides for all AutoControl features.
doc/new_features/v33_features_doc
doc/new_features/v34_features_doc
doc/new_features/v35_features_doc
doc/new_features/v36_features_doc
doc/ocr_backends/ocr_backends_doc
doc/observability/observability_doc
doc/operations_layer/operations_layer_doc
52 changes: 52 additions & 0 deletions docs/source/Zh/doc/new_features/v36_features_doc.rst
@@ -0,0 +1,52 @@
Agent 軌跡評估
==============

當自動化把控制權交給 LLM agent,「它還能運作嗎?」就變成「agent 是否走了可接受的路
徑?」。:func:`evaluate_trajectory` 依宣告式\ **評分標準(rubric)**\ 為一次記錄的執行評
分,為 agent 回歸測試提供確定性、無相依的訊號。

*軌跡(trajectory)*\ 是該次執行所採取步驟的有序清單 —— 每步是一個至少含 ``"action"``
名稱、可選含 ``"args"`` / ``"observation"`` 的 dict。*評分標準*\ 為純資料(因此可自在地
存於 JSON action 檔或經 MCP 傳入):

================================ ===================================================
Rubric 鍵                        意義
================================ ===================================================
``required_actions``             必須全部出現的動作。
``ordered``                      搭配上者,還要求其相對順序。
``forbidden_actions``            絕不可出現的動作。
``max_steps``                    軌跡長度上限。
``success_contains``             必須出現在某個 observation 中的子字串。
================================ ===================================================

無頭 API
--------

.. code-block:: python

    from je_auto_control import evaluate_trajectory

    trajectory = [
        {"action": "AC_focus_window", "observation": "focused"},
        {"action": "AC_type_text", "observation": "typed"},
        {"action": "AC_click_mouse", "observation": "Saved successfully"},
    ]
    result = evaluate_trajectory(trajectory, {
        "required_actions": ["AC_type_text", "AC_click_mouse"],
        "forbidden_actions": ["AC_kill_process"],
        "max_steps": 10,
        "success_contains": "Saved",
    })
    assert result["passed"]  # 所有適用的檢查都通過
    print(result["score"], result["checks"])

``score`` 為通過的適用檢查佔比;``passed`` 僅在全部通過時為真;空 rubric 直接通過。
``checks`` 中每個項目為 ``{name, passed, detail}``,因此失敗時可精準指出被違反的期望。

執行器指令
----------

``AC_evaluate_trajectory`` 接受 ``trajectory`` 與 ``rubric``\ (從視覺化建構器傳入時為
JSON 字串,從 JSON action 檔 / MCP 傳入時為已解碼資料),回傳
``{passed, score, steps, checks}``。相同操作亦提供為 MCP 工具
``ac_evaluate_trajectory``,以及 Script Builder 中 **Agent** 分類下的指令。
1 change: 1 addition & 0 deletions docs/source/Zh/zh_index.rst
@@ -58,6 +58,7 @@ AutoControl 所有功能的完整使用指南。
doc/new_features/v33_features_doc
doc/new_features/v34_features_doc
doc/new_features/v35_features_doc
doc/new_features/v36_features_doc
doc/ocr_backends/ocr_backends_doc
doc/observability/observability_doc
doc/operations_layer/operations_layer_doc
3 changes: 3 additions & 0 deletions je_auto_control/__init__.py
@@ -216,6 +216,8 @@
from je_auto_control.utils.approval import (
    ApprovalResult, approve_artifact, pending_artifacts, verify_artifact,
)
# Agent trajectory evaluation: score a recorded run against a rubric
from je_auto_control.utils.trajectory_eval import evaluate_trajectory
# Background popup/interrupt watchdog (unattended automation)
from je_auto_control.utils.watchdog import (
    PopupWatchdog, WatchdogRule, default_popup_watchdog,
@@ -656,6 +658,7 @@ def start_autocontrol_gui(*args, **kwargs):
    "EgressBlocked", "EgressPolicy", "get_egress_policy", "set_egress_policy",
    "ApprovalResult", "approve_artifact", "pending_artifacts",
    "verify_artifact",
    "evaluate_trajectory",
    # MCP server
    "AuditLogger", "HttpMCPServer", "MCPContent", "MCPPrompt",
    "MCPPromptArgument", "MCPResource", "MCPServer", "MCPTool",
10 changes: 10 additions & 0 deletions je_auto_control/gui/script_builder/command_schema.py
@@ -801,6 +801,16 @@ def _add_misc_specs(specs: List[CommandSpec]) -> None:
                      default=".approvals"),),
        description="List artifacts awaiting approval.",
    ))
    specs.append(CommandSpec(
        "AC_evaluate_trajectory", "Agent", "Evaluate Trajectory",
        fields=(
            FieldSpec("trajectory", FieldType.STRING,
                      placeholder='[{"action": "AC_click_mouse"}]'),
            FieldSpec("rubric", FieldType.STRING,
                      placeholder='{"required_actions": ["AC_type_text"]}'),
        ),
        description="Score an agent trajectory against a rubric (JSON).",
    ))
    specs.append(CommandSpec(
        "AC_generate_sop", "Report", "Generate SOP Document",
        fields=(
16 changes: 16 additions & 0 deletions je_auto_control/utils/executor/action_executor.py
@@ -2991,6 +2991,21 @@ def _pending_artifacts(approvals_dir: str = ".approvals") -> Dict[str, Any]:
    return {"pending": pending_artifacts(approvals_dir)}


def _evaluate_trajectory(trajectory: Any, rubric: Any) -> Dict[str, Any]:
    """Adapter: score an agent trajectory against a declarative rubric.

    ``trajectory`` / ``rubric`` may be JSON strings (from the visual builder)
    or already-decoded list/dict (from JSON action files / MCP).
    """
    import json
    from je_auto_control.utils.trajectory_eval import evaluate_trajectory
    if isinstance(trajectory, str):
        trajectory = json.loads(trajectory)
    if isinstance(rubric, str):
        rubric = json.loads(rubric)
    return evaluate_trajectory(trajectory, rubric)
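
The adapter's "JSON string or already-decoded data" behaviour can be exercised in isolation. The sketch below is illustrative only: `decode_if_json`, `make_adapter`, and `echo_evaluate` are hypothetical names, and `echo_evaluate` is a stand-in for the real evaluator so that the example runs without `je_auto_control` installed.

```python
import json
from typing import Any, Callable, Dict


def decode_if_json(value: Any) -> Any:
    """Parse a JSON string; pass already-decoded data through unchanged."""
    return json.loads(value) if isinstance(value, str) else value


def make_adapter(evaluate: Callable[[list, dict], Dict[str, Any]]
                 ) -> Callable[[Any, Any], Dict[str, Any]]:
    """Wrap an evaluator so it accepts JSON strings or decoded data."""
    def adapter(trajectory: Any, rubric: Any) -> Dict[str, Any]:
        return evaluate(decode_if_json(trajectory), decode_if_json(rubric))
    return adapter


def echo_evaluate(trajectory: list, rubric: dict) -> Dict[str, Any]:
    """Hypothetical stand-in evaluator: reports only what it received."""
    return {"steps": len(trajectory), "keys": sorted(rubric)}


adapter = make_adapter(echo_evaluate)

# Visual-builder style input: both arguments arrive as JSON text.
from_builder = adapter('[{"action": "AC_click_mouse"}]',
                       '{"required_actions": ["AC_type_text"]}')
# JSON action file / MCP style input: both arguments are already decoded.
from_decoded = adapter([{"action": "AC_click_mouse"}],
                       {"required_actions": ["AC_type_text"]})
print(from_builder == from_decoded)
```

Because `json.loads` is called without a surrounding `try`, a malformed string raises `json.JSONDecodeError` (a `ValueError` subclass) and propagates to the caller, both in this sketch and in the adapter shown in the diff.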


class Executor:
    """
    Executor
@@ -3236,6 +3251,7 @@ def __init__(self):
            "AC_verify_artifact": _verify_artifact,
            "AC_approve_artifact": _approve_artifact,
            "AC_pending_artifacts": _pending_artifacts,
            "AC_evaluate_trajectory": _evaluate_trajectory,
            "AC_a11y_record_start": _a11y_record_start,
            "AC_a11y_record_stop": _a11y_record_stop,
            "AC_a11y_record_events": _a11y_record_events,
22 changes: 22 additions & 0 deletions je_auto_control/utils/mcp_server/tools/_factories.py
@@ -2729,6 +2729,27 @@ def approval_testing_tools() -> List[MCPTool]:
    ]


def trajectory_eval_tools() -> List[MCPTool]:
    return [
        MCPTool(
            name="ac_evaluate_trajectory",
            description=("Score an agent's recorded 'trajectory' (a list of "
                         "{action, args, observation} steps) against a 'rubric' "
                         "with optional keys required_actions (+ordered), "
                         "forbidden_actions, max_steps, success_contains. "
                         "Returns {passed, score, steps, checks} for agent "
                         "regression testing."),
            input_schema=schema(
                {"trajectory": {"type": "array",
                                "items": {"type": "object"}},
                 "rubric": {"type": "object"}},
                ["trajectory", "rubric"]),
            handler=h.evaluate_trajectory,
            annotations=READ_ONLY,
        ),
    ]
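
The input schema above constrains MCP arguments more tightly than the executor adapter does: `trajectory` must be an array of objects and `rubric` an object, so JSON strings would not satisfy it. The sketch below spells that out with a hand-rolled check. It assumes that `schema(properties, required)` expands to an ordinary JSON Schema object, which this diff does not show; `INPUT_SCHEMA` and `argument_errors` are hypothetical names.

```python
from typing import Any, Dict, List

# Assumed expansion of the schema(...) call into plain JSON Schema.
INPUT_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "trajectory": {"type": "array", "items": {"type": "object"}},
        "rubric": {"type": "object"},
    },
    "required": ["trajectory", "rubric"],
}


def argument_errors(arguments: Dict[str, Any]) -> List[str]:
    """Check only the constraints that INPUT_SCHEMA expresses."""
    errors = [f"missing required argument: {name}"
              for name in INPUT_SCHEMA["required"] if name not in arguments]
    if "trajectory" in arguments:
        trajectory = arguments["trajectory"]
        if not isinstance(trajectory, list):
            errors.append("trajectory must be an array")
        elif not all(isinstance(step, dict) for step in trajectory):
            errors.append("every trajectory step must be an object")
    if "rubric" in arguments and not isinstance(arguments["rubric"], dict):
        errors.append("rubric must be an object")
    return errors


# Decoded data satisfies the schema.
ok = argument_errors({"trajectory": [{"action": "AC_click_mouse"}],
                      "rubric": {}})
# A JSON string, which the executor adapter would accept, does not.
bad = argument_errors({"trajectory": '[{"action": "AC_click_mouse"}]'})
print(ok, bad)
```

A real server would more likely rely on a JSON Schema validator than on code like this; the point is only to show which inputs the declared schema admits.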


def unattended_tools() -> List[MCPTool]:
    return [
        MCPTool(
@@ -3788,6 +3809,7 @@ def media_assert_tools() -> List[MCPTool]:
ci_annotation_tools, clipboard_history_tools, audit_analysis_tools,
process_doc_tools, tween_drag_tools, plugin_sdk_tools, governance_tools,
credential_lease_tools, egress_tools, approval_testing_tools,
trajectory_eval_tools,
screen_record_tools,
process_and_shell_tools, remote_desktop_tools, gamepad_tools,
usb_passthrough_tools, assertion_tools, data_source_tools,
6 changes: 6 additions & 0 deletions je_auto_control/utils/mcp_server/tools/_handlers.py
@@ -1325,6 +1325,12 @@ def pending_artifacts(approvals_dir: str = ".approvals"):
    return {"pending": _pending(approvals_dir)}


def evaluate_trajectory(trajectory, rubric):
    from je_auto_control.utils.trajectory_eval import (
        evaluate_trajectory as _evaluate)
    return _evaluate(trajectory, rubric)


def vlm_locate(description: str,
               screen_region: Optional[List[int]] = None,
               model: Optional[str] = None) -> Optional[List[int]]:
6 changes: 6 additions & 0 deletions je_auto_control/utils/trajectory_eval/__init__.py
@@ -0,0 +1,6 @@
"""Agent trajectory evaluation: score a recorded run against a rubric."""
from je_auto_control.utils.trajectory_eval.trajectory_eval import (
    evaluate_trajectory,
)

__all__ = ["evaluate_trajectory"]