diff --git a/README.md b/README.md index 73ef5f29..9dc5e697 100644 --- a/README.md +++ b/README.md @@ -13,6 +13,7 @@ ## Table of Contents +- [What's new (2026-06-22) — Single-Series Anomaly Detection](#whats-new-2026-06-22--single-series-anomaly-detection) - [What's new (2026-06-22) — Near-Duplicate Text Detection (SimHash / MinHash)](#whats-new-2026-06-22--near-duplicate-text-detection-simhash--minhash) - [What's new (2026-06-22) — String-Distance Similarity Metrics](#whats-new-2026-06-22--string-distance-similarity-metrics) - [What's new (2026-06-22) — Time-Series Transforms](#whats-new-2026-06-22--time-series-transforms) @@ -153,6 +154,12 @@ --- +## What's new (2026-06-22) — Single-Series Anomaly Detection + +Flag the spike in one live metric series. Full reference: [`docs/source/Eng/doc/new_features/v101_features_doc.rst`](docs/source/Eng/doc/new_features/v101_features_doc.rst). + +- **`detect_anomalies` / `mad_anomalies` / `zscore_anomalies` / `ewma_control`** (`AC_detect_anomalies`): `data_drift` is two-batch distribution shift and `slo.burn_alerts` only thresholds budget burn — neither points at *which* value in one series is anomalous. This flags outliers via robust MAD (modified z-score), plain z-score, and an EWMA control chart (with an optional in-control baseline) — `{index, value, score, is_anomaly}` records. Pure-stdlib, deterministic. + ## What's new (2026-06-22) — Near-Duplicate Text Detection (SimHash / MinHash) Fingerprint text to find near-dups at scale. Full reference: [`docs/source/Eng/doc/new_features/v100_features_doc.rst`](docs/source/Eng/doc/new_features/v100_features_doc.rst). 
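Review note on the README claim above that the robust MAD score catches what a plain z-score misses. The sketch below is a standalone, stdlib-only re-derivation of the two scoring rules as they appear later in this diff (`zscore_scores` and `mad_scores`); the helper names `zscores` and `modified_zscores` are illustrative and not part of the package. It uses the series from the docs and tests. One relevant fact: with a population standard deviation, no point in an n-sample series can have |z| above sqrt(n - 1), so on this 9-point series the default z-score cutoff of 3.0 cannot fire at all.

```python
import statistics

def zscores(values):
    """Plain z-scores using the population standard deviation."""
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mean) / sigma for v in values]

def modified_zscores(values):
    """Iglewicz-Hoaglin modified z-scores (median / MAD based)."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return [0.0] * len(values)
    return [0.6745 * (v - median) / mad for v in values]

series = [10, 11, 9, 10, 12, 10, 95, 11, 10]  # index 6 is the spike

# Default cutoffs from the diff: 3.0 for z-score, 3.5 for MAD.
z_flagged = [i for i, s in enumerate(zscores(series)) if abs(s) > 3.0]
mad_flagged = [i for i, s in enumerate(modified_zscores(series)) if abs(s) > 3.5]

print(z_flagged)    # []  (the spike inflates sigma, so its own z stays below 3)
print(mad_flagged)  # [6]
```

This is also why `test_zscore_flags_spike` has to pass `threshold=2.0`: the spike's z-score on this series is roughly 2.83.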
diff --git a/README/README_zh-CN.md b/README/README_zh-CN.md index 81f4abff..88181797 100644 --- a/README/README_zh-CN.md +++ b/README/README_zh-CN.md @@ -12,6 +12,7 @@ ## 目录 +- [本次更新 (2026-06-22) — 单序列异常检测](#本次更新-2026-06-22--单序列异常检测) - [本次更新 (2026-06-22) — 近似重复文本检测(SimHash / MinHash)](#本次更新-2026-06-22--近似重复文本检测simhash--minhash) - [本次更新 (2026-06-22) — 字符串距离相似度量](#本次更新-2026-06-22--字符串距离相似度量) - [本次更新 (2026-06-22) — 时间序列变换](#本次更新-2026-06-22--时间序列变换) @@ -152,6 +153,12 @@ --- +## 本次更新 (2026-06-22) — 单序列异常检测 + +标记单一实时度量序列中的尖峰。完整参考:[`docs/source/Zh/doc/new_features/v101_features_doc.rst`](../docs/source/Zh/doc/new_features/v101_features_doc.rst)。 + +- **`detect_anomalies` / `mad_anomalies` / `zscore_anomalies` / `ewma_control`**(`AC_detect_anomalies`):`data_drift` 是两批次分布偏移,`slo.burn_alerts` 只对预算燃烧设门槛 —— 都无法指出单一序列中*哪个*值异常。本功能以稳健 MAD(modified z-score)、纯 z-score 与 EWMA 控制图(可选 in-control 基准)标记离群值 —— `{index, value, score, is_anomaly}` 记录。纯标准库、确定。 + ## 本次更新 (2026-06-22) — 近似重复文本检测(SimHash / MinHash) 为文本生成指纹以大规模找近似重复。完整参考:[`docs/source/Zh/doc/new_features/v100_features_doc.rst`](../docs/source/Zh/doc/new_features/v100_features_doc.rst)。 diff --git a/README/README_zh-TW.md b/README/README_zh-TW.md index 97715983..c564c6da 100644 --- a/README/README_zh-TW.md +++ b/README/README_zh-TW.md @@ -12,6 +12,7 @@ ## 目錄 +- [本次更新 (2026-06-22) — 單序列異常偵測](#本次更新-2026-06-22--單序列異常偵測) - [本次更新 (2026-06-22) — 近似重複文字偵測(SimHash / MinHash)](#本次更新-2026-06-22--近似重複文字偵測simhash--minhash) - [本次更新 (2026-06-22) — 字串距離相似度量](#本次更新-2026-06-22--字串距離相似度量) - [本次更新 (2026-06-22) — 時間序列轉換](#本次更新-2026-06-22--時間序列轉換) @@ -152,6 +153,12 @@ --- +## 本次更新 (2026-06-22) — 單序列異常偵測 + +標記單一即時度量序列中的尖峰。完整參考:[`docs/source/Zh/doc/new_features/v101_features_doc.rst`](../docs/source/Zh/doc/new_features/v101_features_doc.rst)。 + +- **`detect_anomalies` / `mad_anomalies` / `zscore_anomalies` / `ewma_control`**(`AC_detect_anomalies`):`data_drift` 是兩批次分布偏移,`slo.burn_alerts` 只對預算燃燒設門檻 —— 都無法指出單一序列中*哪個*值異常。本功能以穩健 MAD(modified z-score)、純 
z-score 與 EWMA 控制圖(可選 in-control 基準)標記離群值 —— `{index, value, score, is_anomaly}` 記錄。純標準函式庫、具決定性。 + ## 本次更新 (2026-06-22) — 近似重複文字偵測(SimHash / MinHash) 為文字產生指紋以大規模找近似重複。完整參考:[`docs/source/Zh/doc/new_features/v100_features_doc.rst`](../docs/source/Zh/doc/new_features/v100_features_doc.rst)。 diff --git a/docs/source/Eng/doc/new_features/v101_features_doc.rst b/docs/source/Eng/doc/new_features/v101_features_doc.rst new file mode 100644 index 00000000..06b9359d --- /dev/null +++ b/docs/source/Eng/doc/new_features/v101_features_doc.rst @@ -0,0 +1,41 @@ +Single-Series Anomaly Detection +=============================== + +``data_drift`` answers "did the *distribution* shift between two batches" — it +cannot point at *which* value in one live series is anomalous — and +``slo.burn_alerts`` only thresholds error-budget burn, not arbitrary metric +values (latency spikes, cost spikes, CPU). This flags outliers in a single +series via z-score, robust MAD (modified z-score), and an EWMA control chart. + +Pure standard library (``math`` / ``statistics``); imports no ``PySide6``. Every +function is pure (values in, flags out), so it is fully deterministic in CI. + +Headless API +------------ + +.. code-block:: python + + from je_auto_control import detect_anomalies, mad_anomalies, ewma_control + + series = [10, 11, 9, 10, 12, 10, 95, 11, 10] # index 6 is the spike + mad_anomalies(series) # [6] (robust) + detect_anomalies(series, method="mad") + # [{index, value, score, is_anomaly}, ...] + + ewma_control(series, alpha=0.5, target_mean=10, target_sigma=1) # shift indices + +``detect_anomalies`` scores each value (``mad`` default, or ``zscore``) and flags +those past the threshold (3.5 for MAD, 3.0 for z-score). ``mad_anomalies`` / +``zscore_anomalies`` return just the flagged indices, and ``mad_scores`` / +``zscore_scores`` the raw scores. MAD (Iglewicz-Hoaglin modified z-score) is +robust to outliers inflating the spread, so it stays sensitive where a plain +z-score would not.
``ewma_control`` is an EWMA control chart for sustained +level shifts — pass ``target_mean`` / ``target_sigma`` for an in-control +baseline (else the series' own stats). + +Executor command +---------------- + +``AC_detect_anomalies`` takes a ``values`` list (optional ``method`` / +``threshold``) and returns ``{results}``. It is exposed as the MCP tool +``ac_detect_anomalies`` and as a Script Builder command under **Data**. diff --git a/docs/source/Eng/eng_index.rst b/docs/source/Eng/eng_index.rst index c8f8465b..17f7ac91 100644 --- a/docs/source/Eng/eng_index.rst +++ b/docs/source/Eng/eng_index.rst @@ -123,6 +123,7 @@ Comprehensive guides for all AutoControl features. doc/new_features/v98_features_doc doc/new_features/v99_features_doc doc/new_features/v100_features_doc + doc/new_features/v101_features_doc doc/ocr_backends/ocr_backends_doc doc/observability/observability_doc doc/operations_layer/operations_layer_doc diff --git a/docs/source/Zh/doc/new_features/v101_features_doc.rst b/docs/source/Zh/doc/new_features/v101_features_doc.rst new file mode 100644 index 00000000..4f15d536 --- /dev/null +++ b/docs/source/Zh/doc/new_features/v101_features_doc.rst @@ -0,0 +1,35 @@ +單序列異常偵測 +============== + +``data_drift`` 回答的是「兩個批次之間*分布*是否偏移」—— 無法指出單一即時序列中*哪一個*值異常 —— +而 ``slo.burn_alerts`` 只對錯誤預算燃燒設門檻,不針對任意度量值(延遲尖峰、成本尖峰、CPU)。本功能以 +z-score、穩健的 MAD(modified z-score)與 EWMA 控制圖,在單一序列中標記離群值。 + +純標準函式庫(``math`` / ``statistics``);不匯入 ``PySide6``。每個函式皆為純函式(輸入值、輸出旗標), +因此在 CI 中完全具決定性。 + +無頭 API +-------- + +.. code-block:: python + + from je_auto_control import detect_anomalies, mad_anomalies, ewma_control + + series = [10, 11, 9, 10, 12, 10, 95, 11, 10] # 索引 6 為尖峰 + mad_anomalies(series) # [6](穩健) + detect_anomalies(series, method="mad") + # [{index, value, score, is_anomaly}, ...]
+ + ewma_control(series, alpha=0.5, target_mean=10, target_sigma=1) # 偏移索引 + +``detect_anomalies`` 為每個值評分(預設 ``mad``,或 ``zscore``)並標記超過門檻者(MAD 為 3.5、z-score +為 3.0)。``mad_anomalies`` / ``zscore_anomalies`` 只回傳被標記的索引,``mad_scores`` / ``zscore_scores`` +回傳原始分數。MAD(Iglewicz-Hoaglin modified z-score)對離群值膨脹離散度具穩健性,因此在純 z-score 失靈 +之處仍保持敏感。``ewma_control`` 是針對持續性水準偏移的 EWMA 控制圖 —— 傳入 ``target_mean`` / +``target_sigma`` 設定 in-control 基準(否則用序列自身統計)。 + +執行器命令 +---------- + +``AC_detect_anomalies`` 接受 ``values`` 清單(可選 ``method`` / ``threshold``)回傳 ``{results}``。它以 +MCP 工具 ``ac_detect_anomalies`` 以及 Script Builder 中 **Data** 分類下的命令提供。 diff --git a/docs/source/Zh/zh_index.rst b/docs/source/Zh/zh_index.rst index eea510ec..b1457a45 100644 --- a/docs/source/Zh/zh_index.rst +++ b/docs/source/Zh/zh_index.rst @@ -123,6 +123,7 @@ AutoControl 所有功能的完整使用指南。 doc/new_features/v98_features_doc doc/new_features/v99_features_doc doc/new_features/v100_features_doc + doc/new_features/v101_features_doc doc/ocr_backends/ocr_backends_doc doc/observability/observability_doc doc/operations_layer/operations_layer_doc diff --git a/je_auto_control/__init__.py b/je_auto_control/__init__.py index 75b0ba17..485fbadc 100644 --- a/je_auto_control/__init__.py +++ b/je_auto_control/__init__.py @@ -429,6 +429,11 @@ ts_delta, ts_downsample, ts_idelta, ts_increase, ts_irate, ts_rate, ts_resample, ) +# Single-series anomaly detection (z-score / MAD / EWMA control) +from je_auto_control.utils.anomaly import ( + detect_anomalies, ewma_control, mad_anomalies, mad_scores, + zscore_anomalies, zscore_scores, +) # Bulkhead concurrency isolation + rate-limit header parsing from je_auto_control.utils.bulkhead import ( Bulkhead, BulkheadFullError, next_delay, parse_ratelimit, parse_retry_after, @@ -998,6 +1003,8 @@ def start_autocontrol_gui(*args, **kwargs): "LatencyDigest", "exact_percentiles", "ts_delta", "ts_downsample", "ts_idelta", "ts_increase", "ts_irate", "ts_rate", "ts_resample", + "detect_anomalies",
"ewma_control", "mad_anomalies", "mad_scores", + "zscore_anomalies", "zscore_scores", "Bulkhead", "BulkheadFullError", "next_delay", "parse_ratelimit", "parse_retry_after", "Cassette", "CassetteMissError", diff --git a/je_auto_control/gui/script_builder/command_schema.py b/je_auto_control/gui/script_builder/command_schema.py index 3fbb54e4..8edecef3 100644 --- a/je_auto_control/gui/script_builder/command_schema.py +++ b/je_auto_control/gui/script_builder/command_schema.py @@ -1968,6 +1968,17 @@ def _add_resilience_specs(specs: List[CommandSpec]) -> None: ), description="Roll a series into tumbling buckets by aggregate.", )) + specs.append(CommandSpec( + "AC_detect_anomalies", "Data", "Anomaly: Detect in Series", + fields=( + FieldSpec("values", FieldType.STRING, + placeholder="[10, 11, 9, 10, 95, 10]"), + FieldSpec("method", FieldType.STRING, optional=True, + placeholder="mad | zscore"), + FieldSpec("threshold", FieldType.FLOAT, optional=True), + ), + description="Flag outliers in a numeric series (MAD / z-score).", + )) specs.append(CommandSpec( "AC_diff_rows", "Data", "Dataset Diff: Rows by Key", fields=( diff --git a/je_auto_control/utils/anomaly/__init__.py b/je_auto_control/utils/anomaly/__init__.py new file mode 100644 index 00000000..7b29eca2 --- /dev/null +++ b/je_auto_control/utils/anomaly/__init__.py @@ -0,0 +1,10 @@ +"""Single-series anomaly detection for AutoControl.""" +from je_auto_control.utils.anomaly.anomaly import ( + detect_anomalies, ewma_control, mad_anomalies, mad_scores, + zscore_anomalies, zscore_scores, +) + +__all__ = [ + "detect_anomalies", "ewma_control", "mad_anomalies", "mad_scores", + "zscore_anomalies", "zscore_scores", +] diff --git a/je_auto_control/utils/anomaly/anomaly.py b/je_auto_control/utils/anomaly/anomaly.py new file mode 100644 index 00000000..db6100e0 --- /dev/null +++ b/je_auto_control/utils/anomaly/anomaly.py @@ -0,0 +1,97 @@ +"""Anomaly detection over a single numeric series. 
+ +``data_drift`` answers "did the *distribution* shift between two batches" — it +cannot point at *which* value in one live series is anomalous, and +``slo.burn_alerts`` only thresholds error-budget burn, not arbitrary metric +values (latency spikes, cost spikes, CPU). This flags outliers in one series via +z-score, robust MAD (modified z-score), and an EWMA control chart. + +Pure standard library (``math`` / ``statistics``); imports no ``PySide6``. Every +function is pure (values in, flags out), so it is fully deterministic in CI. +""" +import math +import statistics +from typing import Any, Dict, List, Sequence + + +def zscore_scores(values: Sequence[float]) -> List[float]: + """Standard z-scores of ``values`` (0.0 when stdev is 0).""" + count = len(values) + if count < 2: + return [0.0] * count + mean = statistics.fmean(values) + sigma = statistics.pstdev(values) + if sigma == 0: + return [0.0] * count + return [(value - mean) / sigma for value in values] + + +def mad_scores(values: Sequence[float]) -> List[float]: + """Iglewicz-Hoaglin modified z-scores (robust, median/MAD based).""" + if not values: + return [] + median = statistics.median(values) + mad = statistics.median([abs(value - median) for value in values]) + if mad == 0: + return [0.0] * len(values) + return [0.6745 * (value - median) / mad for value in values] + + +def zscore_anomalies(values: Sequence[float], *, + threshold: float = 3.0) -> List[int]: + """Indices whose z-score magnitude exceeds ``threshold``.""" + return [i for i, score in enumerate(zscore_scores(values)) + if abs(score) > threshold] + + +def mad_anomalies(values: Sequence[float], *, + threshold: float = 3.5) -> List[int]: + """Indices whose modified z-score magnitude exceeds ``threshold``.""" + return [i for i, score in enumerate(mad_scores(values)) + if abs(score) > threshold] + + +def ewma_control(values: Sequence[float], *, alpha: float = 0.3, + limit: float = 3.0, target_mean: Any = None, + target_sigma: Any = None) -> 
List[int]: + """Indices where the EWMA breaches its control limits (control chart). + + ``target_mean`` / ``target_sigma`` set the in-control baseline; when omitted + they default to the series' own mean and population stdev. + """ + if not values: + return [] + mean = statistics.fmean(values) if target_mean is None else float(target_mean) + if target_sigma is not None: + sigma = float(target_sigma) + else: + sigma = statistics.pstdev(values) if len(values) > 1 else 0.0 + ewma = mean + flagged: List[int] = [] + for i, value in enumerate(values): + ewma = alpha * value + (1 - alpha) * ewma + spread = math.sqrt(alpha / (2 - alpha) + * (1 - (1 - alpha) ** (2 * (i + 1)))) + bound = limit * sigma * spread + if ewma > mean + bound or ewma < mean - bound: + flagged.append(i) + return flagged + + +def detect_anomalies(values: Sequence[float], *, method: str = "mad", + threshold: Any = None) -> List[Dict[str, Any]]: + """Score each value and flag anomalies via ``method`` (``mad``/``zscore``). + + Returns ``[{index, value, score, is_anomaly}]``. 
+ """ + if method == "zscore": + scores = zscore_scores(values) + cutoff = 3.0 if threshold is None else float(threshold) + elif method == "mad": + scores = mad_scores(values) + cutoff = 3.5 if threshold is None else float(threshold) + else: + raise ValueError(f"unknown method: {method!r}; use 'mad' or 'zscore'") + return [{"index": i, "value": values[i], "score": scores[i], + "is_anomaly": abs(scores[i]) > cutoff} + for i in range(len(values))] diff --git a/je_auto_control/utils/executor/action_executor.py b/je_auto_control/utils/executor/action_executor.py index cf859378..45187de4 100644 --- a/je_auto_control/utils/executor/action_executor.py +++ b/je_auto_control/utils/executor/action_executor.py @@ -3412,6 +3412,17 @@ def _ts_downsample(series: Any, bucket_s: Any, return {"buckets": [list(point) for point in buckets]} +def _detect_anomalies(values: Any, method: str = "mad", + threshold: Any = None) -> Dict[str, Any]: + """Adapter: flag anomalies in a numeric series (mad/zscore).""" + import json + from je_auto_control.utils.anomaly import detect_anomalies + if isinstance(values, str): + values = json.loads(values) + return {"results": detect_anomalies(values, method=method, + threshold=threshold)} + + def _evaluate_slo(records: Any, target: float, window_s: Optional[float] = None) -> Dict[str, Any]: """Adapter: SLI + error budget for outcome records (list or JSON string).""" @@ -4512,6 +4523,7 @@ def __init__(self): "AC_check_compatibility": _check_compatibility, "AC_ts_rate": _ts_rate, "AC_ts_downsample": _ts_downsample, + "AC_detect_anomalies": _detect_anomalies, "AC_detect_drift": _detect_drift, "AC_categorical_drift": _categorical_drift, "AC_diff_rows": _diff_rows, diff --git a/je_auto_control/utils/mcp_server/tools/_factories.py b/je_auto_control/utils/mcp_server/tools/_factories.py index c5e39ae8..706de390 100644 --- a/je_auto_control/utils/mcp_server/tools/_factories.py +++ b/je_auto_control/utils/mcp_server/tools/_factories.py @@ -3548,6 +3548,23 @@ def 
dataset_diff_tools() -> List[MCPTool]: ] +def anomaly_tools() -> List[MCPTool]: + return [ + MCPTool( + name="ac_detect_anomalies", + description=("Flag anomalies in a numeric 'values' series by 'method' " + "(mad / zscore) with optional 'threshold'. Returns " + "{results: [{index, value, score, is_anomaly}]}."), + input_schema=schema( + {"values": {"type": "array"}, "method": {"type": "string"}, + "threshold": {"type": "number"}}, + ["values"]), + handler=h.detect_anomalies, + annotations=READ_ONLY, + ), + ] + + def timeseries_tools() -> List[MCPTool]: return [ MCPTool( @@ -5482,7 +5499,7 @@ def media_assert_tools() -> List[MCPTool]: secret_ref_tools, config_schema_tools, config_redaction_tools, data_profile_tools, http_problem_tools, dotenv_tools, sse_client_tools, layered_config_tools, data_drift_tools, schema_compat_tools, - timeseries_tools, + timeseries_tools, anomaly_tools, dataset_diff_tools, referential_tools, link_header_tools, multipart_tools, http_content_tools, cookie_jar_tools, http_conditional_tools, saga_tools, decision_table_tools, locator_repair_tools, diff --git a/je_auto_control/utils/mcp_server/tools/_handlers.py b/je_auto_control/utils/mcp_server/tools/_handlers.py index 0d991f85..a576f103 100644 --- a/je_auto_control/utils/mcp_server/tools/_handlers.py +++ b/je_auto_control/utils/mcp_server/tools/_handlers.py @@ -1905,6 +1905,11 @@ def ts_downsample(series, bucket_s, agg="avg"): return _ts_downsample(series, bucket_s, agg) +def detect_anomalies(values, method="mad", threshold=None): + from je_auto_control.utils.executor.action_executor import _detect_anomalies + return _detect_anomalies(values, method, threshold) + + def detect_drift(reference, current, threshold=0.25, bins=10): from je_auto_control.utils.executor.action_executor import _detect_drift return _detect_drift(reference, current, threshold, bins) diff --git a/test/unit_test/headless/test_anomaly_batch.py b/test/unit_test/headless/test_anomaly_batch.py new file mode 100644 index 
00000000..8c9b95f6 --- /dev/null +++ b/test/unit_test/headless/test_anomaly_batch.py @@ -0,0 +1,76 @@ +"""Headless tests for single-series anomaly detection. Pure stdlib, no Qt.""" +import json + +import pytest + +import je_auto_control as ac +from je_auto_control.utils.anomaly import ( + detect_anomalies, ewma_control, mad_anomalies, mad_scores, + zscore_anomalies, zscore_scores, +) + +_SERIES = [10, 11, 9, 10, 12, 10, 95, 11, 10] # index 6 is the spike + + +def test_zscore_flags_spike(): + assert 6 in zscore_anomalies(_SERIES, threshold=2.0) + assert zscore_anomalies([5, 5, 5, 5]) == [] # zero variance → none + assert len(zscore_scores(_SERIES)) == len(_SERIES) + + +def test_mad_is_robust(): + # MAD ignores the spike's inflation of the spread, so it still flags it + assert mad_anomalies(_SERIES) == [6] + scores = mad_scores(_SERIES) + assert abs(scores[6]) > abs(scores[0]) + + +def test_ewma_control_flags_shift(): + flat = [10.0] * 10 + assert ewma_control(flat) == [] # stable → no breach + shifted = [10] * 5 + [40] * 5 # sustained level shift + # baseline established from the stable level → the shift breaches + assert ewma_control(shifted, alpha=0.5, + target_mean=10, target_sigma=1) != [] + + +def test_detect_anomalies_records(): + results = detect_anomalies(_SERIES, method="mad") + assert len(results) == len(_SERIES) + spike = results[6] + assert spike["index"] == 6 and spike["value"] == 95 + assert spike["is_anomaly"] is True + assert results[0]["is_anomaly"] is False + + +def test_detect_zscore_and_threshold(): + loose = detect_anomalies(_SERIES, method="zscore", threshold=10.0) + assert all(r["is_anomaly"] is False for r in loose) # threshold too high + + +def test_detect_unknown_method(): + with pytest.raises(ValueError): + detect_anomalies(_SERIES, method="nope") + + +# --- wiring --------------------------------------------------------------- + +def test_executor_round_trip(): + rec = ac.execute_action([[ + "AC_detect_anomalies", {"values": 
json.dumps(_SERIES)}]]) + results = next(v for v in rec.values() if isinstance(v, dict))["results"] + assert results[6]["is_anomaly"] is True + + +def test_wiring(): + assert "AC_detect_anomalies" in ac.executor.known_commands() + from je_auto_control.utils.mcp_server.tools import build_default_tool_registry + assert "ac_detect_anomalies" in {t.name for t in build_default_tool_registry()} + from je_auto_control.gui.script_builder.command_schema import _build_specs + assert "AC_detect_anomalies" in {s.command for s in _build_specs()} + + +def test_facade_exports(): + for attr in ("detect_anomalies", "ewma_control", "mad_anomalies", + "mad_scores", "zscore_anomalies", "zscore_scores"): + assert hasattr(ac, attr) and attr in ac.__all__
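Review note on `test_ewma_control_flags_shift`: the test only asserts that the result is non-empty when an in-control baseline is supplied. The sketch below is a standalone, stdlib-only restatement of the control-limit logic from `ewma_control` in this diff (the name `ewma_breaches` is illustrative, not part of the package). It shows which indices breach on the test's shifted series, and what happens when the baseline falls back to the series' own statistics. Under those assumptions, a tighter assertion and a fallback case might be worth adding to the test file.

```python
import math
import statistics

def ewma_breaches(values, alpha=0.3, limit=3.0,
                  target_mean=None, target_sigma=None):
    """Indices where the EWMA leaves mean +/- limit * sigma * spread."""
    if not values:
        return []
    mean = statistics.fmean(values) if target_mean is None else float(target_mean)
    if target_sigma is not None:
        sigma = float(target_sigma)
    else:
        sigma = statistics.pstdev(values) if len(values) > 1 else 0.0
    ewma = mean  # the chart starts at the in-control mean
    flagged = []
    for i, value in enumerate(values):
        ewma = alpha * value + (1 - alpha) * ewma
        # Time-varying EWMA standard-deviation factor after i + 1 points.
        spread = math.sqrt(alpha / (2 - alpha)
                           * (1 - (1 - alpha) ** (2 * (i + 1))))
        if abs(ewma - mean) > limit * sigma * spread:
            flagged.append(i)
    return flagged

shifted = [10] * 5 + [40] * 5  # sustained level shift starting at index 5

with_baseline = ewma_breaches(shifted, alpha=0.5, target_mean=10, target_sigma=1)
self_baseline = ewma_breaches(shifted, alpha=0.5)

print(with_baseline)  # [5, 6, 7, 8, 9]
print(self_baseline)  # []
```

The empty second result comes from the shift itself inflating the fallback statistics (mean 25, population stdev 15), which widens the limits enough to hide the shift. That is the case where passing `target_mean` / `target_sigma` matters.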