7 changes: 7 additions & 0 deletions README.md
@@ -13,6 +13,7 @@

## Table of Contents

- [What's new (2026-06-22) — Single-Series Anomaly Detection](#whats-new-2026-06-22--single-series-anomaly-detection)
- [What's new (2026-06-22) — Near-Duplicate Text Detection (SimHash / MinHash)](#whats-new-2026-06-22--near-duplicate-text-detection-simhash--minhash)
- [What's new (2026-06-22) — String-Distance Similarity Metrics](#whats-new-2026-06-22--string-distance-similarity-metrics)
- [What's new (2026-06-22) — Time-Series Transforms](#whats-new-2026-06-22--time-series-transforms)
@@ -153,6 +154,12 @@

---

## What's new (2026-06-22) — Single-Series Anomaly Detection

Flag the spike in one live metric series. Full reference: [`docs/source/Eng/doc/new_features/v101_features_doc.rst`](docs/source/Eng/doc/new_features/v101_features_doc.rst).

- **`detect_anomalies` / `mad_anomalies` / `zscore_anomalies` / `ewma_control`** (`AC_detect_anomalies`): `data_drift` detects distribution shift between two batches and `slo.burn_alerts` only thresholds error-budget burn, so neither points at *which* value in one series is anomalous. This feature flags outliers via robust MAD (modified z-score), plain z-score, and an EWMA control chart (with an optional in-control baseline); `detect_anomalies` returns `{index, value, score, is_anomaly}` records. Pure standard library and deterministic.
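
As a rough illustration of why the MAD-based score is the default, the sketch below recomputes both scores with only the standard library, using the population standard deviation for the z-score and the 0.6745 constant for the modified z-score. The helper names `zscores` and `modified_zscores` are stand-ins invented for this example, not the package's API.

```python
import statistics


def zscores(values):
    """Plain z-scores using the population standard deviation."""
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mean) / sigma for v in values]


def modified_zscores(values):
    """Iglewicz-Hoaglin modified z-scores (median / MAD based)."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return [0.0] * len(values)
    return [0.6745 * (v - median) / mad for v in values]


series = [10, 11, 9, 10, 12, 10, 95, 11, 10]  # index 6 is the spike

z_flags = [i for i, s in enumerate(zscores(series)) if abs(s) > 3.0]
mad_flags = [i for i, s in enumerate(modified_zscores(series)) if abs(s) > 3.5]

print(z_flags)    # []  : the spike inflates the stdev, so its z-score stays below 3.0
print(mad_flags)  # [6]
```

With nine points, a population z-score cannot exceed sqrt(8) ≈ 2.83 for any single value, so the 3.0 cutoff can never fire on a series this short; the median-based score has no such ceiling.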

## What's new (2026-06-22) — Near-Duplicate Text Detection (SimHash / MinHash)

Fingerprint text to find near-dups at scale. Full reference: [`docs/source/Eng/doc/new_features/v100_features_doc.rst`](docs/source/Eng/doc/new_features/v100_features_doc.rst).
7 changes: 7 additions & 0 deletions README/README_zh-CN.md
@@ -12,6 +12,7 @@

## Table of Contents

- [What's new (2026-06-22) — Single-Series Anomaly Detection](#本次更新-2026-06-22--单序列异常检测)
- [What's new (2026-06-22) — Near-Duplicate Text Detection (SimHash / MinHash)](#本次更新-2026-06-22--近似重复文本检测simhash--minhash)
- [What's new (2026-06-22) — String-Distance Similarity Metrics](#本次更新-2026-06-22--字符串距离相似度量)
- [What's new (2026-06-22) — Time-Series Transforms](#本次更新-2026-06-22--时间序列变换)
@@ -152,6 +153,12 @@

---

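## What's new (2026-06-22) — Single-Series Anomaly Detection

Flag the spike in a single live metric series. Full reference: [`docs/source/Zh/doc/new_features/v101_features_doc.rst`](../docs/source/Zh/doc/new_features/v101_features_doc.rst).

- **`detect_anomalies` / `mad_anomalies` / `zscore_anomalies` / `ewma_control`** (`AC_detect_anomalies`): `data_drift` covers distribution shift between two batches, and `slo.burn_alerts` only thresholds budget burn; neither can point at *which* value in a single series is anomalous. This feature flags outliers via robust MAD (modified z-score), plain z-score, and an EWMA control chart (with an optional in-control baseline), producing `{index, value, score, is_anomaly}` records. Pure standard library, deterministic.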

## What's new (2026-06-22) — Near-Duplicate Text Detection (SimHash / MinHash)

Fingerprint text to find near-duplicates at scale. Full reference: [`docs/source/Zh/doc/new_features/v100_features_doc.rst`](../docs/source/Zh/doc/new_features/v100_features_doc.rst).
7 changes: 7 additions & 0 deletions README/README_zh-TW.md
@@ -12,6 +12,7 @@

## Table of Contents

- [What's new (2026-06-22) — Single-Series Anomaly Detection](#本次更新-2026-06-22--單序列異常偵測)
- [What's new (2026-06-22) — Near-Duplicate Text Detection (SimHash / MinHash)](#本次更新-2026-06-22--近似重複文字偵測simhash--minhash)
- [What's new (2026-06-22) — String-Distance Similarity Metrics](#本次更新-2026-06-22--字串距離相似度量)
- [What's new (2026-06-22) — Time-Series Transforms](#本次更新-2026-06-22--時間序列轉換)
@@ -152,6 +153,12 @@

---

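## What's new (2026-06-22) — Single-Series Anomaly Detection

Flag the spike in a single live metric series. Full reference: [`docs/source/Zh/doc/new_features/v101_features_doc.rst`](../docs/source/Zh/doc/new_features/v101_features_doc.rst).

- **`detect_anomalies` / `mad_anomalies` / `zscore_anomalies` / `ewma_control`** (`AC_detect_anomalies`): `data_drift` covers distribution shift between two batches, and `slo.burn_alerts` only thresholds budget burn; neither can point at *which* value in a single series is anomalous. This feature flags outliers via robust MAD (modified z-score), plain z-score, and an EWMA control chart (with an optional in-control baseline), producing `{index, value, score, is_anomaly}` records. Pure standard library, deterministic.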

## What's new (2026-06-22) — Near-Duplicate Text Detection (SimHash / MinHash)

Fingerprint text to find near-duplicates at scale. Full reference: [`docs/source/Zh/doc/new_features/v100_features_doc.rst`](../docs/source/Zh/doc/new_features/v100_features_doc.rst).
41 changes: 41 additions & 0 deletions docs/source/Eng/doc/new_features/v101_features_doc.rst
@@ -0,0 +1,41 @@
Single-Series Anomaly Detection
===============================

``data_drift`` answers "did the *distribution* shift between two batches" — it
cannot point at *which* value in one live series is anomalous — and
``slo.burn_alerts`` only thresholds error-budget burn, not arbitrary metric
values (latency spikes, cost spikes, CPU). This flags outliers in a single
series via z-score, robust MAD (modified z-score), and an EWMA control chart.

Pure standard library (``math`` / ``statistics``); imports no ``PySide6``. Every
function is pure (values in, flags out), so it is fully deterministic in CI.

Headless API
------------

.. code-block:: python

    from je_auto_control import detect_anomalies, mad_anomalies, ewma_control

    series = [10, 11, 9, 10, 12, 10, 95, 11, 10]  # index 6 is the spike
    mad_anomalies(series)                          # [6] (robust)
    detect_anomalies(series, method="mad")
    # [{index, value, score, is_anomaly}, ...]

    ewma_control(series, alpha=0.5, target_mean=10, target_sigma=1)  # shift indices

``detect_anomalies`` scores each value (``mad`` default, or ``zscore``) and flags
those past the threshold (3.5 for MAD, 3.0 for z-score). ``mad_anomalies`` /
``zscore_anomalies`` return just the flagged indices, and ``mad_scores`` /
``zscore_scores`` the raw scores. MAD (Iglewicz-Hoaglin modified z-score) is
robust to outliers inflating the spread, so it stays sensitive where a plain
z-score would not. ``ewma_control`` is an EWMA control chart for sustained
level shifts — pass ``target_mean`` / ``target_sigma`` for an in-control
baseline (else the series' own stats).
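
To make the control-limit behaviour concrete, here is a self-contained sketch of an EWMA control chart with time-varying limits, modeled on the description above. `ewma_breaches` is a hypothetical stand-in written for this example; the packaged function is `ewma_control`, and only the arithmetic is meant to carry over.

```python
import math


def ewma_breaches(values, alpha, target_mean, target_sigma, limit=3.0):
    """Return indices where the EWMA leaves mean +/- limit * sigma * spread."""
    ewma = target_mean  # the chart starts at the in-control mean
    flagged = []
    for i, value in enumerate(values):
        ewma = alpha * value + (1 - alpha) * ewma
        # Standard-deviation factor of the EWMA after i + 1 observations.
        spread = math.sqrt(alpha / (2 - alpha) * (1 - (1 - alpha) ** (2 * (i + 1))))
        bound = limit * target_sigma * spread
        if abs(ewma - target_mean) > bound:
            flagged.append(i)
    return flagged


series = [10, 11, 9, 10, 12, 10, 95, 11, 10]
result = ewma_breaches(series, alpha=0.5, target_mean=10, target_sigma=1)
print(result)  # [6, 7, 8]
```

The spike at index 6 keeps the smoothed value outside the limits for two further points, which is why an EWMA chart suits sustained shifts better than isolating a single outlier.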

Executor command
----------------

``AC_detect_anomalies`` takes a ``values`` list (optional ``method`` /
``threshold``) and returns ``{results}``. It is exposed as the MCP tool
``ac_detect_anomalies`` and as a Script Builder command under **Data**.
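
The shape of the command's result can be sketched as follows. This is a stand-in for illustration only: `detect` is a hypothetical name, it covers just the `mad` method, and it assumes (as the adapter in this change does) that `values` may arrive as a JSON string, for example from a Script Builder text field.

```python
import json
import statistics


def detect(values, method="mad", threshold=None):
    """Hypothetical stand-in: decode input, score it, wrap it in {results}."""
    if isinstance(values, str):  # e.g. text typed into a form field
        values = json.loads(values)
    if method != "mad":
        raise ValueError("this sketch only covers method='mad'")
    cutoff = 3.5 if threshold is None else float(threshold)
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    scores = [0.0 if mad == 0 else 0.6745 * (v - median) / mad for v in values]
    return {"results": [
        {"index": i, "value": v, "score": s, "is_anomaly": abs(s) > cutoff}
        for i, (v, s) in enumerate(zip(values, scores))
    ]}


out = detect("[10, 11, 9, 10, 95, 10]")
flagged = [r["index"] for r in out["results"] if r["is_anomaly"]]
print(flagged)  # [4]
```

Every input value gets a record, so a caller can filter on `is_anomaly` or inspect `score` to tune `threshold`.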
1 change: 1 addition & 0 deletions docs/source/Eng/eng_index.rst
@@ -123,6 +123,7 @@ Comprehensive guides for all AutoControl features.
doc/new_features/v98_features_doc
doc/new_features/v99_features_doc
doc/new_features/v100_features_doc
doc/new_features/v101_features_doc
doc/ocr_backends/ocr_backends_doc
doc/observability/observability_doc
doc/operations_layer/operations_layer_doc
35 changes: 35 additions & 0 deletions docs/source/Zh/doc/new_features/v101_features_doc.rst
@@ -0,0 +1,35 @@
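Single-Series Anomaly Detection
===============================

``data_drift`` answers "did the *distribution* shift between two batches"; it
cannot point at *which* value in a single live series is anomalous. Meanwhile,
``slo.burn_alerts`` only thresholds error-budget burn, not arbitrary metric
values (latency spikes, cost spikes, CPU). This feature flags outliers in a
single series via z-score, robust MAD (modified z-score), and an EWMA control
chart.

Pure standard library (``math`` / ``statistics``); does not import ``PySide6``.
Every function is pure (values in, flags out), so it is fully deterministic in CI.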

Headless API
------------

.. code-block:: python

    from je_auto_control import detect_anomalies, mad_anomalies, ewma_control

    series = [10, 11, 9, 10, 12, 10, 95, 11, 10]  # index 6 is the spike
    mad_anomalies(series)                          # [6] (robust)
    detect_anomalies(series, method="mad")
    # [{index, value, score, is_anomaly}, ...]

    ewma_control(series, alpha=0.5, target_mean=10, target_sigma=1)  # shift indices

``detect_anomalies`` scores each value (``mad`` by default, or ``zscore``) and
flags those past the threshold (3.5 for MAD, 3.0 for z-score). ``mad_anomalies`` /
``zscore_anomalies`` return only the flagged indices, and ``mad_scores`` /
``zscore_scores`` return the raw scores. MAD (Iglewicz-Hoaglin modified z-score)
is robust to outliers inflating the spread, so it stays sensitive where a plain
z-score fails. ``ewma_control`` is an EWMA control chart for sustained level
shifts; pass ``target_mean`` / ``target_sigma`` to set an in-control baseline
(otherwise the series' own statistics are used).

Executor command
----------------

``AC_detect_anomalies`` takes a ``values`` list (optional ``method`` /
``threshold``) and returns ``{results}``. It is exposed as the MCP tool
``ac_detect_anomalies`` and as a Script Builder command under the **Data** category.
1 change: 1 addition & 0 deletions docs/source/Zh/zh_index.rst
@@ -123,6 +123,7 @@ Complete usage guides for all AutoControl features.
doc/new_features/v98_features_doc
doc/new_features/v99_features_doc
doc/new_features/v100_features_doc
doc/new_features/v101_features_doc
doc/ocr_backends/ocr_backends_doc
doc/observability/observability_doc
doc/operations_layer/operations_layer_doc
7 changes: 7 additions & 0 deletions je_auto_control/__init__.py
@@ -429,6 +429,11 @@
    ts_delta, ts_downsample, ts_idelta, ts_increase, ts_irate, ts_rate,
    ts_resample,
)
# Single-series anomaly detection (z-score / MAD / EWMA control)
from je_auto_control.utils.anomaly import (
    detect_anomalies, ewma_control, mad_anomalies, mad_scores,
    zscore_anomalies, zscore_scores,
)
# Bulkhead concurrency isolation + rate-limit header parsing
from je_auto_control.utils.bulkhead import (
    Bulkhead, BulkheadFullError, next_delay, parse_ratelimit, parse_retry_after,
@@ -998,6 +1003,8 @@ def start_autocontrol_gui(*args, **kwargs):
    "LatencyDigest", "exact_percentiles",
    "ts_delta", "ts_downsample", "ts_idelta", "ts_increase", "ts_irate",
    "ts_rate", "ts_resample",
    "detect_anomalies", "ewma_control", "mad_anomalies", "mad_scores",
    "zscore_anomalies", "zscore_scores",
    "Bulkhead", "BulkheadFullError", "next_delay", "parse_ratelimit",
    "parse_retry_after",
    "Cassette", "CassetteMissError",
11 changes: 11 additions & 0 deletions je_auto_control/gui/script_builder/command_schema.py
@@ -1968,6 +1968,17 @@ def _add_resilience_specs(specs: List[CommandSpec]) -> None:
        ),
        description="Roll a series into tumbling buckets by aggregate.",
    ))
    specs.append(CommandSpec(
        "AC_detect_anomalies", "Data", "Anomaly: Detect in Series",
        fields=(
            FieldSpec("values", FieldType.STRING,
                      placeholder="[10, 11, 9, 10, 95, 10]"),
            FieldSpec("method", FieldType.STRING, optional=True,
                      placeholder="mad | zscore"),
            FieldSpec("threshold", FieldType.FLOAT, optional=True),
        ),
        description="Flag outliers in a numeric series (MAD / z-score).",
    ))
    specs.append(CommandSpec(
        "AC_diff_rows", "Data", "Dataset Diff: Rows by Key",
        fields=(
10 changes: 10 additions & 0 deletions je_auto_control/utils/anomaly/__init__.py
@@ -0,0 +1,10 @@
"""Single-series anomaly detection for AutoControl."""
from je_auto_control.utils.anomaly.anomaly import (
    detect_anomalies, ewma_control, mad_anomalies, mad_scores,
    zscore_anomalies, zscore_scores,
)

__all__ = [
    "detect_anomalies", "ewma_control", "mad_anomalies", "mad_scores",
    "zscore_anomalies", "zscore_scores",
]
97 changes: 97 additions & 0 deletions je_auto_control/utils/anomaly/anomaly.py
@@ -0,0 +1,97 @@
"""Anomaly detection over a single numeric series.

``data_drift`` answers "did the *distribution* shift between two batches" — it
cannot point at *which* value in one live series is anomalous, and
``slo.burn_alerts`` only thresholds error-budget burn, not arbitrary metric
values (latency spikes, cost spikes, CPU). This flags outliers in one series via
z-score, robust MAD (modified z-score), and an EWMA control chart.

Pure standard library (``math`` / ``statistics``); imports no ``PySide6``. Every
function is pure (values in, flags out), so it is fully deterministic in CI.
"""
import math
import statistics
from typing import Any, Dict, List, Sequence


def zscore_scores(values: Sequence[float]) -> List[float]:
    """Standard z-scores of ``values`` (0.0 when stdev is 0)."""
    count = len(values)
    if count < 2:
        return [0.0] * count
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return [0.0] * count
    return [(value - mean) / sigma for value in values]


def mad_scores(values: Sequence[float]) -> List[float]:
    """Iglewicz-Hoaglin modified z-scores (robust, median/MAD based)."""
    if not values:
        return []
    median = statistics.median(values)
    mad = statistics.median([abs(value - median) for value in values])
    if mad == 0:
        return [0.0] * len(values)
    return [0.6745 * (value - median) / mad for value in values]


def zscore_anomalies(values: Sequence[float], *,
                     threshold: float = 3.0) -> List[int]:
    """Indices whose z-score magnitude exceeds ``threshold``."""
    return [i for i, score in enumerate(zscore_scores(values))
            if abs(score) > threshold]


def mad_anomalies(values: Sequence[float], *,
                  threshold: float = 3.5) -> List[int]:
    """Indices whose modified z-score magnitude exceeds ``threshold``."""
    return [i for i, score in enumerate(mad_scores(values))
            if abs(score) > threshold]


def ewma_control(values: Sequence[float], *, alpha: float = 0.3,
                 limit: float = 3.0, target_mean: Any = None,
                 target_sigma: Any = None) -> List[int]:
    """Indices where the EWMA breaches its control limits (control chart).

    ``target_mean`` / ``target_sigma`` set the in-control baseline; when omitted
    they default to the series' own mean and population stdev.
    """
    if not values:
        return []
    mean = statistics.fmean(values) if target_mean is None else float(target_mean)
    if target_sigma is not None:
        sigma = float(target_sigma)
    else:
        sigma = statistics.pstdev(values) if len(values) > 1 else 0.0
    ewma = mean
    flagged: List[int] = []
    for i, value in enumerate(values):
        ewma = alpha * value + (1 - alpha) * ewma
        spread = math.sqrt(alpha / (2 - alpha)
                           * (1 - (1 - alpha) ** (2 * (i + 1))))
        bound = limit * sigma * spread
        if ewma > mean + bound or ewma < mean - bound:
            flagged.append(i)
    return flagged


def detect_anomalies(values: Sequence[float], *, method: str = "mad",
                     threshold: Any = None) -> List[Dict[str, Any]]:
    """Score each value and flag anomalies via ``method`` (``mad``/``zscore``).

    Returns ``[{index, value, score, is_anomaly}]``.
    """
    if method == "zscore":
        scores = zscore_scores(values)
        cutoff = 3.0 if threshold is None else float(threshold)
    elif method == "mad":
        scores = mad_scores(values)
        cutoff = 3.5 if threshold is None else float(threshold)
    else:
        raise ValueError(f"unknown method: {method!r}; use 'mad' or 'zscore'")
    return [{"index": i, "value": values[i], "score": scores[i],
             "is_anomaly": abs(scores[i]) > cutoff}
            for i in range(len(values))]
12 changes: 12 additions & 0 deletions je_auto_control/utils/executor/action_executor.py
@@ -3412,6 +3412,17 @@ def _ts_downsample(series: Any, bucket_s: Any,
    return {"buckets": [list(point) for point in buckets]}


def _detect_anomalies(values: Any, method: str = "mad",
                      threshold: Any = None) -> Dict[str, Any]:
    """Adapter: flag anomalies in a numeric series (mad/zscore)."""
    import json
    from je_auto_control.utils.anomaly import detect_anomalies
    if isinstance(values, str):
        values = json.loads(values)
    return {"results": detect_anomalies(values, method=method,
                                        threshold=threshold)}


def _evaluate_slo(records: Any, target: float,
                  window_s: Optional[float] = None) -> Dict[str, Any]:
    """Adapter: SLI + error budget for outcome records (list or JSON string)."""
@@ -4512,6 +4523,7 @@ def __init__(self):
"AC_check_compatibility": _check_compatibility,
"AC_ts_rate": _ts_rate,
"AC_ts_downsample": _ts_downsample,
"AC_detect_anomalies": _detect_anomalies,
"AC_detect_drift": _detect_drift,
"AC_categorical_drift": _categorical_drift,
"AC_diff_rows": _diff_rows,
19 changes: 18 additions & 1 deletion je_auto_control/utils/mcp_server/tools/_factories.py
@@ -3548,6 +3548,23 @@ def dataset_diff_tools() -> List[MCPTool]:
    ]


def anomaly_tools() -> List[MCPTool]:
    return [
        MCPTool(
            name="ac_detect_anomalies",
            description=("Flag anomalies in a numeric 'values' series by 'method' "
                         "(mad / zscore) with optional 'threshold'. Returns "
                         "{results: [{index, value, score, is_anomaly}]}."),
            input_schema=schema(
                {"values": {"type": "array"}, "method": {"type": "string"},
                 "threshold": {"type": "number"}},
                ["values"]),
            handler=h.detect_anomalies,
            annotations=READ_ONLY,
        ),
    ]


def timeseries_tools() -> List[MCPTool]:
    return [
        MCPTool(
@@ -5482,7 +5499,7 @@ def media_assert_tools() -> List[MCPTool]:
secret_ref_tools, config_schema_tools, config_redaction_tools,
data_profile_tools, http_problem_tools, dotenv_tools,
sse_client_tools, layered_config_tools, data_drift_tools, schema_compat_tools,
- timeseries_tools,
+ timeseries_tools, anomaly_tools,
dataset_diff_tools, referential_tools, link_header_tools, multipart_tools,
http_content_tools, cookie_jar_tools, http_conditional_tools,
saga_tools, decision_table_tools, locator_repair_tools,
5 changes: 5 additions & 0 deletions je_auto_control/utils/mcp_server/tools/_handlers.py
@@ -1905,6 +1905,11 @@ def ts_downsample(series, bucket_s, agg="avg"):
    return _ts_downsample(series, bucket_s, agg)


def detect_anomalies(values, method="mad", threshold=None):
    from je_auto_control.utils.executor.action_executor import _detect_anomalies
    return _detect_anomalies(values, method, threshold)


def detect_drift(reference, current, threshold=0.25, bins=10):
    from je_auto_control.utils.executor.action_executor import _detect_drift
    return _detect_drift(reference, current, threshold, bins)