Merged
7 changes: 7 additions & 0 deletions README.md
@@ -13,6 +13,7 @@

## Table of Contents

- [What's new (2026-06-22) — String-Distance Similarity Metrics](#whats-new-2026-06-22--string-distance-similarity-metrics)
- [What's new (2026-06-22) — Time-Series Transforms](#whats-new-2026-06-22--time-series-transforms)
- [What's new (2026-06-22) — Unicode Text Normalisation & Slugify](#whats-new-2026-06-22--unicode-text-normalisation--slugify)
- [What's new (2026-06-22) — JSON-Schema Compatibility Checking](#whats-new-2026-06-22--json-schema-compatibility-checking)
@@ -151,6 +152,12 @@

---

## What's new (2026-06-22) — String-Distance Similarity Metrics

Match typos and reordered tokens. Full reference: [`docs/source/Eng/doc/new_features/v99_features_doc.rst`](docs/source/Eng/doc/new_features/v99_features_doc.rst).

- **`levenshtein` / `damerau_levenshtein` / `jaro` / `jaro_winkler` / `jaccard` / `dice` / `similarity`** (`AC_text_similarity`): `fuzzy` exposed only difflib's gestalt ratio. This adds the edit-distance and token-set metrics it lacks — Jaro-Winkler (standard for short labels), Damerau (transposition-aware), and char-n-gram Jaccard/Dice — plus a unified `similarity()` that normalizes every metric to `[0, 1]`. Pairs with `normalize_text`. Pure-stdlib, deterministic.
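
To make the character-n-gram Jaccard/Dice idea concrete, here is a minimal standalone sketch. It is not the library's implementation: the helper names (`char_ngrams`, `jaccard_sketch`, `dice_sketch`) are hypothetical, and it assumes n-grams are compared as sets (not multisets) and that two strings too short to yield any n-gram count as identical.

```python
from __future__ import annotations


def char_ngrams(text: str, n: int = 2) -> set[str]:
    """Return the set of character n-grams of `text` (empty if len(text) < n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_sketch(a: str, b: str, n: int = 2) -> float:
    """Jaccard index |A & B| / |A | B| over character n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # assumption: nothing to compare means "identical"
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def dice_sketch(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient 2|A & B| / (|A| + |B|) over character n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # same assumption as above
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


# "night" -> {ni, ig, gh, ht}, "nacht" -> {na, ac, ch, ht}; only "ht" is shared.
print(jaccard_sketch("night", "nacht"))  # 1/7
print(dice_sketch("night", "nacht"))     # 2/8
```

Both scores already lie in `[0, 1]`, which is why a unified `similarity()` can pass them through unchanged.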

## What's new (2026-06-22) — Time-Series Transforms

Turn counters into rates; downsample and resample. Full reference: [`docs/source/Eng/doc/new_features/v98_features_doc.rst`](docs/source/Eng/doc/new_features/v98_features_doc.rst).
7 changes: 7 additions & 0 deletions README/README_zh-CN.md
@@ -12,6 +12,7 @@

## 目录

- [本次更新 (2026-06-22) — 字符串距离相似度量](#本次更新-2026-06-22--字符串距离相似度量)
- [本次更新 (2026-06-22) — 时间序列变换](#本次更新-2026-06-22--时间序列变换)
- [本次更新 (2026-06-22) — Unicode 文本规范化与 Slug](#本次更新-2026-06-22--unicode-文本规范化与-slug)
- [本次更新 (2026-06-22) — JSON-Schema 兼容性检查](#本次更新-2026-06-22--json-schema-兼容性检查)
@@ -150,6 +151,12 @@

---

## 本次更新 (2026-06-22) — 字符串距离相似度量

匹配打字错误与重排 token。完整参考:[`docs/source/Zh/doc/new_features/v99_features_doc.rst`](../docs/source/Zh/doc/new_features/v99_features_doc.rst)。

- **`levenshtein` / `damerau_levenshtein` / `jaro` / `jaro_winkler` / `jaccard` / `dice` / `similarity`**(`AC_text_similarity`):`fuzzy` 只提供 difflib 的 gestalt ratio。本功能补上它缺少的编辑距离与 token 集合度量 —— Jaro-Winkler(短标签标准)、Damerau(转置感知)、字符 n-gram Jaccard/Dice —— 并提供统一的 `similarity()` 把每个度量规范化到 `[0, 1]`。可搭配 `normalize_text`。纯标准库、确定。

## 本次更新 (2026-06-22) — 时间序列变换

把计数器转成速率;降采样与重采样。完整参考:[`docs/source/Zh/doc/new_features/v98_features_doc.rst`](../docs/source/Zh/doc/new_features/v98_features_doc.rst)。
7 changes: 7 additions & 0 deletions README/README_zh-TW.md
@@ -12,6 +12,7 @@

## 目錄

- [本次更新 (2026-06-22) — 字串距離相似度量](#本次更新-2026-06-22--字串距離相似度量)
- [本次更新 (2026-06-22) — 時間序列轉換](#本次更新-2026-06-22--時間序列轉換)
- [本次更新 (2026-06-22) — Unicode 文字正規化與 Slug](#本次更新-2026-06-22--unicode-文字正規化與-slug)
- [本次更新 (2026-06-22) — JSON-Schema 相容性檢查](#本次更新-2026-06-22--json-schema-相容性檢查)
@@ -150,6 +151,12 @@

---

## 本次更新 (2026-06-22) — 字串距離相似度量

比對打字錯誤與重排 token。完整參考:[`docs/source/Zh/doc/new_features/v99_features_doc.rst`](../docs/source/Zh/doc/new_features/v99_features_doc.rst)。

- **`levenshtein` / `damerau_levenshtein` / `jaro` / `jaro_winkler` / `jaccard` / `dice` / `similarity`**(`AC_text_similarity`):`fuzzy` 只提供 difflib 的 gestalt ratio。本功能補上它缺少的編輯距離與 token 集合度量 —— Jaro-Winkler(短標籤標準)、Damerau(轉置感知)、字元 n-gram Jaccard/Dice —— 並提供統一的 `similarity()` 把每個度量正規化到 `[0, 1]`。可搭配 `normalize_text`。純標準函式庫、具決定性。

## 本次更新 (2026-06-22) — 時間序列轉換

把計數器轉成速率;降採樣與重採樣。完整參考:[`docs/source/Zh/doc/new_features/v98_features_doc.rst`](../docs/source/Zh/doc/new_features/v98_features_doc.rst)。
44 changes: 44 additions & 0 deletions docs/source/Eng/doc/new_features/v99_features_doc.rst
@@ -0,0 +1,44 @@
String-Distance Similarity Metrics
==================================

``fuzzy`` exposes only difflib's gestalt ratio. This adds the edit-distance and
token-set metrics it lacks — Levenshtein / Damerau-Levenshtein, Jaro and
Jaro-Winkler (the standard for short names and labels), and character-n-gram
Jaccard / Dice — for better matching of typos and reordered tokens, especially
from OCR.

Pure standard library; imports no ``PySide6``. Every function is pure (two
strings in, a number out), so it is fully deterministic in CI. Pair with
``normalize_text`` to make matches accent- and form-insensitive first.
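
As an illustration of the Jaro and Jaro-Winkler metrics named above, the following is a textbook-style sketch, not the code shipped in ``je_auto_control``. The function names and the ``prefix_scale`` / ``max_prefix`` parameters are assumptions; the values 0.1 and 4 are the conventional Winkler defaults.

```python
def jaro_sketch(a: str, b: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are equal and at most `window` apart.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    matched_a = [False] * len(a)
    matched_b = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not matched_b[j] and b[j] == ch:
                matched_a[i] = matched_b[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: half the matched characters that appear in a different order.
    seq_a = [ch for ch, hit in zip(a, matched_a) if hit]
    seq_b = [ch for ch, hit in zip(b, matched_b) if hit]
    transpositions = sum(x != y for x, y in zip(seq_a, seq_b)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler_sketch(a: str, b: str,
                        prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score boosted for a shared prefix of up to `max_prefix` characters."""
    base = jaro_sketch(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


print(round(jaro_sketch("MARTHA", "MARHTA"), 3))          # 0.944
print(round(jaro_winkler_sketch("MARTHA", "MARHTA"), 3))  # 0.961
```

The prefix boost is what makes Jaro-Winkler a good fit for short names and labels: two labels that agree on their first few characters score higher than the plain Jaro value.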

Headless API
------------

.. code-block:: python

   from je_auto_control import (
       levenshtein, damerau_levenshtein, jaro_winkler, jaccard, dice,
       similarity, normalize_text,
   )

   levenshtein("kitten", "sitting")          # 3
   damerau_levenshtein("ab", "ba")           # 1 (transposition)
   jaro_winkler("MARTHA", "MARHTA")          # ~0.961
   jaccard("night", "nacht", n=2)            # char-bigram overlap

   # normalised [0, 1] score for any metric (edit distance -> 1 - d/max_len):
   a, b = "login", "lgoin"                   # example inputs
   similarity(normalize_text(a), normalize_text(b), metric="jaro_winkler")

``levenshtein`` / ``damerau_levenshtein`` return integer edit distances (the
latter counting an adjacent transposition as one edit). ``jaro`` / ``jaro_winkler``
and ``jaccard`` / ``dice`` return ``[0, 1]`` similarities. ``similarity`` is the
unified entry point — it returns the Jaro/Jaccard/Dice metrics directly and
converts edit distances to ``1 - distance / max_len`` so every metric is
comparable on the same scale.
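
The distance-to-similarity conversion described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the package's code: the function names are hypothetical, the transposition-aware variant shown is the restricted "optimal string alignment" form (the source does not say which Damerau variant is used), and two empty strings are assumed to score 1.0.

```python
def levenshtein_sketch(a: str, b: str) -> int:
    """Classic edit distance (insert / delete / substitute), two-row DP."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def osa_distance_sketch(a: str, b: str) -> int:
    """Edit distance that also counts an adjacent transposition as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swap neighbours
    return d[-1][-1]


def distance_to_similarity(distance: int, a: str, b: str) -> float:
    """Map an edit distance onto [0, 1] as 1 - distance / max_len."""
    max_len = max(len(a), len(b))
    if max_len == 0:
        return 1.0  # assumption: two empty strings are identical
    return 1.0 - distance / max_len


print(levenshtein_sketch("kitten", "sitting"))                                   # 3
print(osa_distance_sketch("ab", "ba"))                                           # 1
print(distance_to_similarity(levenshtein_sketch("login", "lgoin"), "login", "lgoin"))  # 0.6
```

Because an edit distance can never exceed the longer string's length, dividing by ``max_len`` keeps the result inside ``[0, 1]``, so it can sit next to the Jaro and n-gram scores.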

Executor command
----------------

``AC_text_similarity`` returns ``{score}`` for two strings ``a`` / ``b`` and an
optional ``metric``. It is exposed as the MCP tool ``ac_text_similarity`` and as
a Script Builder command under **Data**.
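
The shape of the command (two strings plus an optional ``metric`` in, a ``{score}`` mapping out) can be sketched without the package installed. Everything below is a hypothetical stand-in: ``COMMANDS`` and ``run_command`` are illustrative names rather than the executor's real API, and the scoring function uses ``difflib`` only so the example runs on its own; it ignores ``metric`` and is not one of the new metrics.

```python
from difflib import SequenceMatcher
from typing import Any, Callable, Dict


def _stand_in_similarity(a: str, b: str, metric: str = "jaro_winkler") -> float:
    """Placeholder scorer: difflib ratio in [0, 1]; `metric` is ignored here."""
    return SequenceMatcher(None, a, b).ratio()


def text_similarity_adapter(a: str, b: str,
                            metric: str = "jaro_winkler") -> Dict[str, Any]:
    """Wrap the score in a dict, mirroring the documented {score} result."""
    return {"score": _stand_in_similarity(a, b, metric=metric)}


# Hypothetical command table: command name -> adapter callable.
COMMANDS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "AC_text_similarity": text_similarity_adapter,
}


def run_command(name: str, **params: Any) -> Dict[str, Any]:
    """Look up a command by name and call it with keyword parameters."""
    return COMMANDS[name](**params)


result = run_command("AC_text_similarity", a="login", b="lgoin")
print(result)  # a dict with a single "score" key
```

Returning a small dict rather than a bare float keeps the result JSON-friendly, which suits both the MCP tool and the Script Builder command.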
1 change: 1 addition & 0 deletions docs/source/Eng/eng_index.rst
@@ -121,6 +121,7 @@ Comprehensive guides for all AutoControl features.
    doc/new_features/v96_features_doc
    doc/new_features/v97_features_doc
    doc/new_features/v98_features_doc
    doc/new_features/v99_features_doc
    doc/ocr_backends/ocr_backends_doc
    doc/observability/observability_doc
    doc/operations_layer/operations_layer_doc
37 changes: 37 additions & 0 deletions docs/source/Zh/doc/new_features/v99_features_doc.rst
@@ -0,0 +1,37 @@
字串距離相似度量
================

``fuzzy`` 只提供 difflib 的 gestalt ratio。本功能補上它缺少的編輯距離與 token 集合度量 ——
Levenshtein / Damerau-Levenshtein、Jaro 與 Jaro-Winkler(短名稱/標籤的標準),以及字元 n-gram 的
Jaccard / Dice —— 更適合比對打字錯誤與重排 token,尤其來自 OCR。

純標準函式庫;不匯入 ``PySide6``。每個函式皆為純函式(輸入兩個字串、輸出數字),因此在 CI 中完全
具決定性。可先搭配 ``normalize_text`` 讓比對對重音與形式不敏感。

無頭 API
--------

.. code-block:: python

   from je_auto_control import (
       levenshtein, damerau_levenshtein, jaro_winkler, jaccard, dice,
       similarity, normalize_text,
   )

   levenshtein("kitten", "sitting")          # 3
   damerau_levenshtein("ab", "ba")           # 1(轉置)
   jaro_winkler("MARTHA", "MARHTA")          # ~0.961
   jaccard("night", "nacht", n=2)            # 字元 bigram 重疊

   # 任一度量的正規化 [0, 1] 分數(編輯距離 → 1 - d/max_len):
   a, b = "login", "lgoin"                   # 範例輸入
   similarity(normalize_text(a), normalize_text(b), metric="jaro_winkler")

``levenshtein`` / ``damerau_levenshtein`` 回傳整數編輯距離(後者把相鄰轉置算作一次編輯)。``jaro`` /
``jaro_winkler`` 與 ``jaccard`` / ``dice`` 回傳 ``[0, 1]`` 相似度。``similarity`` 是統一入口 —— 直接回傳
Jaro/Jaccard/Dice,並把編輯距離轉成 ``1 - distance / max_len``,讓所有度量在同一尺度上可比較。

執行器命令
----------

``AC_text_similarity`` 對兩個字串 ``a`` / ``b`` 與選用的 ``metric`` 回傳 ``{score}``。它以 MCP 工具
``ac_text_similarity`` 以及 Script Builder 中 **Data** 分類下的命令提供。
1 change: 1 addition & 0 deletions docs/source/Zh/zh_index.rst
@@ -121,6 +121,7 @@ AutoControl 所有功能的完整使用指南。
    doc/new_features/v96_features_doc
    doc/new_features/v97_features_doc
    doc/new_features/v98_features_doc
    doc/new_features/v99_features_doc
    doc/ocr_backends/ocr_backends_doc
    doc/observability/observability_doc
    doc/operations_layer/operations_layer_doc
7 changes: 7 additions & 0 deletions je_auto_control/__init__.py
@@ -257,6 +257,11 @@
from je_auto_control.utils.text_normalize import (
    deaccent, fold_whitespace, normalize_quotes, normalize_text, slugify,
)
# String-distance metrics (Levenshtein / Jaro-Winkler / Jaccard / Dice)
from je_auto_control.utils.text_similarity import (
    damerau_levenshtein, dice, jaccard, jaro, jaro_winkler, levenshtein,
    similarity,
)
# S3-compatible artifact store (optional boto3, injectable client)
from je_auto_control.utils.artifact_store import (
    S3ArtifactStore, configure_default_store, get_default_store,
@@ -928,6 +933,8 @@ def start_autocontrol_gui(*args, **kwargs):
"fuzzy_best_match", "fuzzy_dedupe", "fuzzy_matches", "fuzzy_ratio",
"deaccent", "fold_whitespace", "normalize_quotes", "normalize_text",
"slugify",
"damerau_levenshtein", "dice", "jaccard", "jaro", "jaro_winkler",
"levenshtein", "similarity",
"S3ArtifactStore", "configure_default_store", "get_default_store",
"set_default_store",
"average_hash", "dedupe_images", "dhash", "hamming_distance",
10 changes: 10 additions & 0 deletions je_auto_control/gui/script_builder/command_schema.py
@@ -1668,6 +1668,16 @@ def _add_resilience_specs(specs: List[CommandSpec]) -> None:
        ),
        description="Produce an ASCII slug (de-accent, lowercase, join).",
    ))
    specs.append(CommandSpec(
        "AC_text_similarity", "Data", "Text: Similarity",
        fields=(
            FieldSpec("a", FieldType.STRING, placeholder="login"),
            FieldSpec("b", FieldType.STRING, placeholder="lgoin"),
            FieldSpec("metric", FieldType.STRING, optional=True,
                      placeholder="jaro_winkler | levenshtein | jaccard | dice"),
        ),
        description="Normalised string similarity (Jaro-Winkler / edit / Jaccard).",
    ))
    specs.append(CommandSpec(
        "AC_spans_to_otlp", "Report", "OTLP: Export Spans",
        fields=(
8 changes: 8 additions & 0 deletions je_auto_control/utils/executor/action_executor.py
@@ -3152,6 +3152,13 @@ def _slugify(text: str, sep: str = "-") -> Dict[str, Any]:
    return {"slug": slugify(text, sep=sep)}


def _text_similarity(a: str, b: str,
                     metric: str = "jaro_winkler") -> Dict[str, Any]:
    """Adapter: normalised string similarity for the chosen metric."""
    from je_auto_control.utils.text_similarity import similarity
    return {"score": similarity(a, b, metric=metric)}


def _canonical_log(fields: Any) -> Dict[str, Any]:
    """Adapter: build a canonical log line from a fields dict."""
    import json
@@ -4461,6 +4468,7 @@ def __init__(self):
"AC_spans_to_otlp": _spans_to_otlp,
"AC_normalize_text": _normalize_text,
"AC_slugify": _slugify,
"AC_text_similarity": _text_similarity,
"AC_validate_config": _validate_config,
"AC_resolve_ref": _resolve_ref,
"AC_resolve_refs": _resolve_refs,
19 changes: 18 additions & 1 deletion je_auto_control/utils/mcp_server/tools/_factories.py
@@ -3810,6 +3810,23 @@ def otlp_export_tools() -> List[MCPTool]:
    ]


def text_similarity_tools() -> List[MCPTool]:
    return [
        MCPTool(
            name="ac_text_similarity",
            description=("Normalised [0,1] string similarity between 'a' and 'b' "
                         "for 'metric' (levenshtein / damerau_levenshtein / jaro "
                         "/ jaro_winkler / jaccard / dice). Returns {score}."),
            input_schema=schema(
                {"a": {"type": "string"}, "b": {"type": "string"},
                 "metric": {"type": "string"}},
                ["a", "b"]),
            handler=h.text_similarity,
            annotations=READ_ONLY,
        ),
    ]


def text_normalize_tools() -> List[MCPTool]:
    return [
        MCPTool(
@@ -5435,7 +5452,7 @@ def media_assert_tools() -> List[MCPTool]:
     feature_flag_tools, provenance_tools, json_contract_tools, chaos_tools,
     slo_tools, percentiles_tools, bulkhead_tools, http_cassette_tools,
     trace_context_tools, baggage_tools, canonical_log_tools, otlp_export_tools,
-    text_normalize_tools,
+    text_normalize_tools, text_similarity_tools,
     secret_ref_tools, config_schema_tools, config_redaction_tools,
     data_profile_tools, http_problem_tools, dotenv_tools,
     sse_client_tools, layered_config_tools, data_drift_tools, schema_compat_tools,
5 changes: 5 additions & 0 deletions je_auto_control/utils/mcp_server/tools/_handlers.py
@@ -1746,6 +1746,11 @@ def slugify(text, sep="-"):
    return _slugify(text, sep)


def text_similarity(a, b, metric="jaro_winkler"):
    from je_auto_control.utils.executor.action_executor import _text_similarity
    return _text_similarity(a, b, metric)


def canonical_log(fields):
    from je_auto_control.utils.executor.action_executor import _canonical_log
    return _canonical_log(fields)
10 changes: 10 additions & 0 deletions je_auto_control/utils/text_similarity/__init__.py
@@ -0,0 +1,10 @@
"""String-distance metrics for AutoControl text matching."""
from je_auto_control.utils.text_similarity.text_similarity import (
    damerau_levenshtein, dice, jaccard, jaro, jaro_winkler, levenshtein,
    similarity,
)

__all__ = [
    "damerau_levenshtein", "dice", "jaccard", "jaro", "jaro_winkler",
    "levenshtein", "similarity",
]