diff --git a/README.md b/README.md index 0e3929cc..23afba30 100644 --- a/README.md +++ b/README.md @@ -13,6 +13,7 @@ ## Table of Contents +- [What's new (2026-06-22) — String-Distance Similarity Metrics](#whats-new-2026-06-22--string-distance-similarity-metrics) - [What's new (2026-06-22) — Time-Series Transforms](#whats-new-2026-06-22--time-series-transforms) - [What's new (2026-06-22) — Unicode Text Normalisation & Slugify](#whats-new-2026-06-22--unicode-text-normalisation--slugify) - [What's new (2026-06-22) — JSON-Schema Compatibility Checking](#whats-new-2026-06-22--json-schema-compatibility-checking) @@ -151,6 +152,12 @@ --- +## What's new (2026-06-22) — String-Distance Similarity Metrics + +Match typos and reordered tokens. Full reference: [`docs/source/Eng/doc/new_features/v99_features_doc.rst`](docs/source/Eng/doc/new_features/v99_features_doc.rst). + +- **`levenshtein` / `damerau_levenshtein` / `jaro` / `jaro_winkler` / `jaccard` / `dice` / `similarity`** (`AC_text_similarity`): `fuzzy` exposes only difflib's gestalt ratio. This adds the edit-distance and n-gram-set metrics it lacks: Jaro-Winkler (standard for short labels), Damerau (transposition-aware), and char-n-gram Jaccard/Dice, plus a unified `similarity()` that normalises every metric to `[0, 1]`. Pairs with `normalize_text`. Pure-stdlib, deterministic. + ## What's new (2026-06-22) — Time-Series Transforms Turn counters into rates; downsample and resample. Full reference: [`docs/source/Eng/doc/new_features/v98_features_doc.rst`](docs/source/Eng/doc/new_features/v98_features_doc.rst). 
diff --git a/README/README_zh-CN.md b/README/README_zh-CN.md index 0d97977a..ccdb98a1 100644 --- a/README/README_zh-CN.md +++ b/README/README_zh-CN.md @@ -12,6 +12,7 @@ ## 目录 +- [本次更新 (2026-06-22) — 字符串距离相似度量](#本次更新-2026-06-22--字符串距离相似度量) - [本次更新 (2026-06-22) — 时间序列变换](#本次更新-2026-06-22--时间序列变换) - [本次更新 (2026-06-22) — Unicode 文本规范化与 Slug](#本次更新-2026-06-22--unicode-文本规范化与-slug) - [本次更新 (2026-06-22) — JSON-Schema 兼容性检查](#本次更新-2026-06-22--json-schema-兼容性检查) @@ -150,6 +151,12 @@ --- +## 本次更新 (2026-06-22) — 字符串距离相似度量 + +匹配打字错误与重排 token。完整参考:[`docs/source/Zh/doc/new_features/v99_features_doc.rst`](../docs/source/Zh/doc/new_features/v99_features_doc.rst)。 + +- **`levenshtein` / `damerau_levenshtein` / `jaro` / `jaro_winkler` / `jaccard` / `dice` / `similarity`**(`AC_text_similarity`):`fuzzy` 只提供 difflib 的 gestalt ratio。本功能补上它缺少的编辑距离与 token 集合度量 —— Jaro-Winkler(短标签标准)、Damerau(转置感知)、字符 n-gram Jaccard/Dice —— 并提供统一的 `similarity()` 把每个度量规范化到 `[0, 1]`。可搭配 `normalize_text`。纯标准库、确定。 + ## 本次更新 (2026-06-22) — 时间序列变换 把计数器转成速率;降采样与重采样。完整参考:[`docs/source/Zh/doc/new_features/v98_features_doc.rst`](../docs/source/Zh/doc/new_features/v98_features_doc.rst)。 diff --git a/README/README_zh-TW.md b/README/README_zh-TW.md index 26b8e618..f4b5086b 100644 --- a/README/README_zh-TW.md +++ b/README/README_zh-TW.md @@ -12,6 +12,7 @@ ## 目錄 +- [本次更新 (2026-06-22) — 字串距離相似度量](#本次更新-2026-06-22--字串距離相似度量) - [本次更新 (2026-06-22) — 時間序列轉換](#本次更新-2026-06-22--時間序列轉換) - [本次更新 (2026-06-22) — Unicode 文字正規化與 Slug](#本次更新-2026-06-22--unicode-文字正規化與-slug) - [本次更新 (2026-06-22) — JSON-Schema 相容性檢查](#本次更新-2026-06-22--json-schema-相容性檢查) @@ -150,6 +151,12 @@ --- +## 本次更新 (2026-06-22) — 字串距離相似度量 + +比對打字錯誤與重排 token。完整參考:[`docs/source/Zh/doc/new_features/v99_features_doc.rst`](../docs/source/Zh/doc/new_features/v99_features_doc.rst)。 + +- **`levenshtein` / `damerau_levenshtein` / `jaro` / `jaro_winkler` / `jaccard` / `dice` / `similarity`**(`AC_text_similarity`):`fuzzy` 只提供 difflib 的 gestalt ratio。本功能補上它缺少的編輯距離與 token 集合度量 
—— Jaro-Winkler(短標籤標準)、Damerau(轉置感知)、字元 n-gram Jaccard/Dice —— 並提供統一的 `similarity()` 把每個度量正規化到 `[0, 1]`。可搭配 `normalize_text`。純標準函式庫、具決定性。 + ## 本次更新 (2026-06-22) — 時間序列轉換 把計數器轉成速率;降採樣與重採樣。完整參考:[`docs/source/Zh/doc/new_features/v98_features_doc.rst`](../docs/source/Zh/doc/new_features/v98_features_doc.rst)。 diff --git a/docs/source/Eng/doc/new_features/v99_features_doc.rst b/docs/source/Eng/doc/new_features/v99_features_doc.rst new file mode 100644 index 00000000..a6ecbf1e --- /dev/null +++ b/docs/source/Eng/doc/new_features/v99_features_doc.rst @@ -0,0 +1,44 @@ +String-Distance Similarity Metrics +================================== + +``fuzzy`` exposes only difflib's gestalt ratio. This adds the edit-distance and +n-gram-set metrics it lacks (Levenshtein / Damerau-Levenshtein, Jaro and +Jaro-Winkler, the standard for short names and labels, and character-n-gram +Jaccard / Dice) for better matching of typos and reordered tokens, especially +from OCR. + +Pure standard library; imports no ``PySide6``. Every function is pure (two +strings in, a number out), so it is fully deterministic in CI. Pair with +``normalize_text`` to make matches accent- and form-insensitive first. + +Headless API +------------ + +.. code-block:: python + + from je_auto_control import ( + levenshtein, damerau_levenshtein, jaro_winkler, jaccard, dice, + similarity, normalize_text, + ) + + levenshtein("kitten", "sitting") # 3 + damerau_levenshtein("ab", "ba") # 1 (transposition) + jaro_winkler("MARTHA", "MARHTA") # ~0.961 + jaccard("night", "nacht", n=2) # char-bigram overlap + + # normalised [0, 1] score for any metric (edit distance -> 1 - d/max_len): + similarity(normalize_text(a), normalize_text(b), metric="jaro_winkler") + +``levenshtein`` / ``damerau_levenshtein`` return integer edit distances (the +latter counting an adjacent transposition as one edit). ``jaro`` / ``jaro_winkler`` +and ``jaccard`` / ``dice`` return ``[0, 1]`` similarities. 
``similarity`` is the +unified entry point: it returns the Jaro/Jaccard/Dice metrics directly and +converts edit distances to ``1 - distance / max_len`` so every metric is +comparable on the same scale. + +Executor command +---------------- + +``AC_text_similarity`` returns ``{score}`` for two strings ``a`` / ``b`` and an +optional ``metric``. It is exposed as the MCP tool ``ac_text_similarity`` and as +a Script Builder command under **Data**. diff --git a/docs/source/Eng/eng_index.rst b/docs/source/Eng/eng_index.rst index f984ad30..7680301d 100644 --- a/docs/source/Eng/eng_index.rst +++ b/docs/source/Eng/eng_index.rst @@ -121,6 +121,7 @@ Comprehensive guides for all AutoControl features. doc/new_features/v96_features_doc doc/new_features/v97_features_doc doc/new_features/v98_features_doc + doc/new_features/v99_features_doc doc/ocr_backends/ocr_backends_doc doc/observability/observability_doc doc/operations_layer/operations_layer_doc diff --git a/docs/source/Zh/doc/new_features/v99_features_doc.rst b/docs/source/Zh/doc/new_features/v99_features_doc.rst new file mode 100644 index 00000000..0722db8a --- /dev/null +++ b/docs/source/Zh/doc/new_features/v99_features_doc.rst @@ -0,0 +1,37 @@ +字串距離相似度量 +================ + +``fuzzy`` 只提供 difflib 的 gestalt ratio。本功能補上它缺少的編輯距離與 token 集合度量 —— +Levenshtein / Damerau-Levenshtein、Jaro 與 Jaro-Winkler(短名稱/標籤的標準),以及字元 n-gram 的 +Jaccard / Dice —— 更適合比對打字錯誤與重排 token,尤其來自 OCR。 + +純標準函式庫;不匯入 ``PySide6``。每個函式皆為純函式(輸入兩個字串、輸出數字),因此在 CI 中完全 +具決定性。可先搭配 ``normalize_text`` 讓比對對重音與形式不敏感。 + +無頭 API +-------- + +.. 
code-block:: python + + from je_auto_control import ( + levenshtein, damerau_levenshtein, jaro_winkler, jaccard, dice, + similarity, normalize_text, + ) + + levenshtein("kitten", "sitting") # 3 + damerau_levenshtein("ab", "ba") # 1(轉置) + jaro_winkler("MARTHA", "MARHTA") # ~0.961 + jaccard("night", "nacht", n=2) # 字元 bigram 重疊 + + # 任一度量的正規化 [0, 1] 分數(編輯距離 → 1 - d/max_len): + similarity(normalize_text(a), normalize_text(b), metric="jaro_winkler") + +``levenshtein`` / ``damerau_levenshtein`` 回傳整數編輯距離(後者把相鄰轉置算作一次編輯)。``jaro`` / +``jaro_winkler`` 與 ``jaccard`` / ``dice`` 回傳 ``[0, 1]`` 相似度。``similarity`` 是統一入口 —— 直接回傳 +Jaro/Jaccard/Dice,並把編輯距離轉成 ``1 - distance / max_len``,讓所有度量在同一尺度上可比較。 + +執行器命令 +---------- + +``AC_text_similarity`` 對兩個字串 ``a`` / ``b`` 與選用的 ``metric`` 回傳 ``{score}``。它以 MCP 工具 +``ac_text_similarity`` 以及 Script Builder 中 **Data** 分類下的命令提供。 diff --git a/docs/source/Zh/zh_index.rst b/docs/source/Zh/zh_index.rst index 1775667f..033341ef 100644 --- a/docs/source/Zh/zh_index.rst +++ b/docs/source/Zh/zh_index.rst @@ -121,6 +121,7 @@ AutoControl 所有功能的完整使用指南。 doc/new_features/v96_features_doc doc/new_features/v97_features_doc doc/new_features/v98_features_doc + doc/new_features/v99_features_doc doc/ocr_backends/ocr_backends_doc doc/observability/observability_doc doc/operations_layer/operations_layer_doc diff --git a/je_auto_control/__init__.py b/je_auto_control/__init__.py index 6c313630..f8ff0b87 100644 --- a/je_auto_control/__init__.py +++ b/je_auto_control/__init__.py @@ -257,6 +257,11 @@ from je_auto_control.utils.text_normalize import ( deaccent, fold_whitespace, normalize_quotes, normalize_text, slugify, ) +# String-distance metrics (Levenshtein / Jaro-Winkler / Jaccard / Dice) +from je_auto_control.utils.text_similarity import ( + damerau_levenshtein, dice, jaccard, jaro, jaro_winkler, levenshtein, + similarity, +) # S3-compatible artifact store (optional boto3, injectable client) from je_auto_control.utils.artifact_store import ( S3ArtifactStore, 
configure_default_store, get_default_store, @@ -928,6 +933,8 @@ def start_autocontrol_gui(*args, **kwargs): "fuzzy_best_match", "fuzzy_dedupe", "fuzzy_matches", "fuzzy_ratio", "deaccent", "fold_whitespace", "normalize_quotes", "normalize_text", "slugify", + "damerau_levenshtein", "dice", "jaccard", "jaro", "jaro_winkler", + "levenshtein", "similarity", "S3ArtifactStore", "configure_default_store", "get_default_store", "set_default_store", "average_hash", "dedupe_images", "dhash", "hamming_distance", diff --git a/je_auto_control/gui/script_builder/command_schema.py b/je_auto_control/gui/script_builder/command_schema.py index 76ecfc32..8fd416a9 100644 --- a/je_auto_control/gui/script_builder/command_schema.py +++ b/je_auto_control/gui/script_builder/command_schema.py @@ -1668,6 +1668,17 @@ def _add_resilience_specs(specs: List[CommandSpec]) -> None: ), description="Produce an ASCII slug (de-accent, lowercase, join).", )) + specs.append(CommandSpec( + "AC_text_similarity", "Data", "Text: Similarity", + fields=( + FieldSpec("a", FieldType.STRING, placeholder="login"), + FieldSpec("b", FieldType.STRING, placeholder="lgoin"), + FieldSpec("metric", FieldType.STRING, optional=True, + placeholder="jaro_winkler | jaro | levenshtein | " + "damerau_levenshtein | jaccard | dice"), + ), + description="Normalised string similarity (Jaro / edit distance / n-gram).", + )) specs.append(CommandSpec( "AC_spans_to_otlp", "Report", "OTLP: Export Spans", fields=( diff --git a/je_auto_control/utils/executor/action_executor.py b/je_auto_control/utils/executor/action_executor.py index 02d9b6e8..4f8441d7 100644 --- a/je_auto_control/utils/executor/action_executor.py +++ b/je_auto_control/utils/executor/action_executor.py @@ -3152,6 +3152,13 @@ def _slugify(text: str, sep: str = "-") -> Dict[str, Any]: return {"slug": slugify(text, sep=sep)} +def _text_similarity(a: str, b: str, + metric: str = "jaro_winkler") -> Dict[str, Any]: + """Adapter: normalised string similarity for the chosen metric.""" + from 
je_auto_control.utils.text_similarity import similarity + return {"score": similarity(a, b, metric=metric)} + + def _canonical_log(fields: Any) -> Dict[str, Any]: """Adapter: build a canonical log line from a fields dict.""" import json @@ -4461,6 +4468,7 @@ def __init__(self): "AC_spans_to_otlp": _spans_to_otlp, "AC_normalize_text": _normalize_text, "AC_slugify": _slugify, + "AC_text_similarity": _text_similarity, "AC_validate_config": _validate_config, "AC_resolve_ref": _resolve_ref, "AC_resolve_refs": _resolve_refs, diff --git a/je_auto_control/utils/mcp_server/tools/_factories.py b/je_auto_control/utils/mcp_server/tools/_factories.py index fb975d1b..b15c02c5 100644 --- a/je_auto_control/utils/mcp_server/tools/_factories.py +++ b/je_auto_control/utils/mcp_server/tools/_factories.py @@ -3810,6 +3810,23 @@ def otlp_export_tools() -> List[MCPTool]: ] +def text_similarity_tools() -> List[MCPTool]: + return [ + MCPTool( + name="ac_text_similarity", + description=("Normalised [0,1] string similarity between 'a' and 'b' " + "for 'metric' (levenshtein / damerau_levenshtein / jaro " + "/ jaro_winkler / jaccard / dice). 
Returns {score}."), + input_schema=schema( + {"a": {"type": "string"}, "b": {"type": "string"}, + "metric": {"type": "string"}}, + ["a", "b"]), + handler=h.text_similarity, + annotations=READ_ONLY, + ), + ] + + def text_normalize_tools() -> List[MCPTool]: return [ MCPTool( @@ -5435,7 +5452,7 @@ def media_assert_tools() -> List[MCPTool]: feature_flag_tools, provenance_tools, json_contract_tools, chaos_tools, slo_tools, percentiles_tools, bulkhead_tools, http_cassette_tools, trace_context_tools, baggage_tools, canonical_log_tools, otlp_export_tools, - text_normalize_tools, + text_normalize_tools, text_similarity_tools, secret_ref_tools, config_schema_tools, config_redaction_tools, data_profile_tools, http_problem_tools, dotenv_tools, sse_client_tools, layered_config_tools, data_drift_tools, schema_compat_tools, diff --git a/je_auto_control/utils/mcp_server/tools/_handlers.py b/je_auto_control/utils/mcp_server/tools/_handlers.py index 20b1683a..a2326013 100644 --- a/je_auto_control/utils/mcp_server/tools/_handlers.py +++ b/je_auto_control/utils/mcp_server/tools/_handlers.py @@ -1746,6 +1746,11 @@ def slugify(text, sep="-"): return _slugify(text, sep) +def text_similarity(a, b, metric="jaro_winkler"): + from je_auto_control.utils.executor.action_executor import _text_similarity + return _text_similarity(a, b, metric) + + def canonical_log(fields): from je_auto_control.utils.executor.action_executor import _canonical_log return _canonical_log(fields) diff --git a/je_auto_control/utils/text_similarity/__init__.py b/je_auto_control/utils/text_similarity/__init__.py new file mode 100644 index 00000000..0642af01 --- /dev/null +++ b/je_auto_control/utils/text_similarity/__init__.py @@ -0,0 +1,10 @@ +"""String-distance metrics for AutoControl text matching.""" +from je_auto_control.utils.text_similarity.text_similarity import ( + damerau_levenshtein, dice, jaccard, jaro, jaro_winkler, levenshtein, + similarity, +) + +__all__ = [ + "damerau_levenshtein", "dice", "jaccard", 
"jaro", "jaro_winkler", + "levenshtein", "similarity", +] diff --git a/je_auto_control/utils/text_similarity/text_similarity.py b/je_auto_control/utils/text_similarity/text_similarity.py new file mode 100644 index 00000000..cc009c57 --- /dev/null +++ b/je_auto_control/utils/text_similarity/text_similarity.py @@ -0,0 +1,155 @@ +"""String-distance metrics for matching short labels and OCR text. + +``fuzzy`` exposes only difflib's gestalt ratio. This adds the edit-distance and +token-set metrics it lacks — Levenshtein / Damerau-Levenshtein, Jaro and +Jaro-Winkler (the standard for short names/labels), and char-n-gram Jaccard / +Dice — for better matching of typos and reordered tokens. + +Pure standard library; imports no ``PySide6``. Every function is pure (two +strings in, number out), so it is fully deterministic in CI. +""" +from typing import Set + +_METRICS = ("levenshtein", "damerau_levenshtein", "jaro", "jaro_winkler", + "jaccard", "dice") + + +def levenshtein(a: str, b: str) -> int: + """Levenshtein edit distance between ``a`` and ``b``.""" + if a == b: + return 0 + if not a: + return len(b) + if not b: + return len(a) + previous = list(range(len(b) + 1)) + for i, char_a in enumerate(a, 1): + current = [i] + for j, char_b in enumerate(b, 1): + cost = 0 if char_a == char_b else 1 + current.append(min(previous[j] + 1, current[j - 1] + 1, + previous[j - 1] + cost)) + previous = current + return previous[-1] + + +def _osa_cell(d, a, b, i, j): + cost = 0 if a[i - 1] == b[j - 1] else 1 + value = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost) + if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]: + value = min(value, d[i - 2][j - 2] + 1) + return value + + +def damerau_levenshtein(a: str, b: str) -> int: + """Optimal string-alignment distance (allows adjacent transpositions).""" + if not a: + return len(b) + if not b: + return len(a) + rows, cols = len(a) + 1, len(b) + 1 + d = [[0] * cols for _ in range(rows)] + for i in range(rows): + 
d[i][0] = i + for j in range(cols): + d[0][j] = j + for i in range(1, rows): + for j in range(1, cols): + d[i][j] = _osa_cell(d, a, b, i, j) + return d[-1][-1] + + +def _match_flags(a: str, b: str, max_dist: int): + a_match = [False] * len(a) + b_match = [False] * len(b) + matches = 0 + for i, char_a in enumerate(a): + start = max(0, i - max_dist) + end = min(i + max_dist + 1, len(b)) + for j in range(start, end): + if not b_match[j] and char_a == b[j]: + a_match[i] = b_match[j] = True + matches += 1 + break + return a_match, b_match, matches + + +def _transpositions(a: str, b: str, a_match, b_match) -> int: + transposed = 0 + k = 0 + for i, flag in enumerate(a_match): + if flag: + while not b_match[k]: + k += 1 + if a[i] != b[k]: + transposed += 1 + k += 1 + return transposed // 2 + + +def jaro(a: str, b: str) -> float: + """Jaro similarity in ``[0, 1]``.""" + if a == b: + return 1.0 + if not a or not b: + return 0.0 + max_dist = max(len(a), len(b)) // 2 - 1 + a_match, b_match, matches = _match_flags(a, b, max_dist) + if matches == 0: + return 0.0 + transposed = _transpositions(a, b, a_match, b_match) + return (matches / len(a) + matches / len(b) + + (matches - transposed) / matches) / 3 + + +def jaro_winkler(a: str, b: str, *, prefix_weight: float = 0.1) -> float: + """Jaro-Winkler similarity (boosts a common prefix up to 4 chars).""" + score = jaro(a, b) + prefix = 0 + for char_a, char_b in zip(a, b): + if char_a != char_b or prefix >= 4: + break + prefix += 1 + return score + prefix * prefix_weight * (1 - score) + + +def _ngrams(text: str, n: int) -> Set[str]: + text = text or "" + if len(text) < n: + return {text} if text else set() + return {text[i:i + n] for i in range(len(text) - n + 1)} + + +def jaccard(a: str, b: str, *, n: int = 2) -> float: + """Jaccard similarity of character ``n``-gram sets.""" + set_a, set_b = _ngrams(a, n), _ngrams(b, n) + union = set_a | set_b + return len(set_a & set_b) / len(union) if union else 1.0 + + +def dice(a: str, b: str, 
*, n: int = 2) -> float: + """Sørensen-Dice coefficient of character ``n``-gram sets.""" + set_a, set_b = _ngrams(a, n), _ngrams(b, n) + total = len(set_a) + len(set_b) + return 2 * len(set_a & set_b) / total if total else 1.0 + + +_SIMILARITY_METRICS = {"jaro": jaro, "jaro_winkler": jaro_winkler, + "jaccard": jaccard, "dice": dice} +_DISTANCE_METRICS = {"levenshtein": levenshtein, + "damerau_levenshtein": damerau_levenshtein} + + +def similarity(a: str, b: str, *, metric: str = "jaro_winkler") -> float: + """Return a normalised ``[0, 1]`` similarity for ``metric``. + + Edit-distance metrics are converted to ``1 - distance / max_len``. + """ + if metric in _SIMILARITY_METRICS: + return _SIMILARITY_METRICS[metric](a, b) + if metric in _DISTANCE_METRICS: + longest = max(len(a), len(b)) + if longest == 0: + return 1.0 + return 1 - _DISTANCE_METRICS[metric](a, b) / longest + raise ValueError(f"unknown metric: {metric!r}; choose from {_METRICS}") diff --git a/test/unit_test/headless/test_text_similarity_batch.py b/test/unit_test/headless/test_text_similarity_batch.py new file mode 100644 index 00000000..bb11642c --- /dev/null +++ b/test/unit_test/headless/test_text_similarity_batch.py @@ -0,0 +1,72 @@ +"""Headless tests for string-distance metrics. 
Pure stdlib, no Qt.""" +import pytest + +import je_auto_control as ac +from je_auto_control.utils.text_similarity import ( + damerau_levenshtein, dice, jaccard, jaro, jaro_winkler, levenshtein, + similarity, +) + + +def test_levenshtein(): + assert levenshtein("kitten", "sitting") == 3 + assert levenshtein("abc", "abc") == 0 + assert levenshtein("", "abc") == 3 + + +def test_damerau_handles_transposition(): + assert levenshtein("ab", "ba") == 2 # two substitutions + assert damerau_levenshtein("ab", "ba") == 1 # one transposition + + +def test_jaro_and_winkler(): + assert jaro("MARTHA", "MARHTA") == pytest.approx(0.9444, abs=1e-3) + jw = jaro_winkler("MARTHA", "MARHTA") + assert jw == pytest.approx(0.9611, abs=1e-3) + assert jaro_winkler("abc", "abc") == pytest.approx(1.0) + # common prefix boosts winkler above plain jaro + assert jaro_winkler("prefix", "preXXX") >= jaro("prefix", "preXXX") + + +def test_jaccard_and_dice(): + assert jaccard("night", "night") == pytest.approx(1.0) + assert jaccard("abc", "xyz") == pytest.approx(0.0) + assert dice("night", "nacht", n=2) == pytest.approx(0.25) + assert jaccard("", "") == pytest.approx(1.0) + + +def test_similarity_normalises_edit_distance(): + # levenshtein("kitten","sitting")=3, max_len=7 -> 1 - 3/7 + assert similarity("kitten", "sitting", metric="levenshtein") == \ + pytest.approx(1 - 3 / 7) + assert similarity("abc", "abc", metric="levenshtein") == pytest.approx(1.0) + assert similarity("a", "b", metric="jaro_winkler") == pytest.approx(0.0) + + +def test_similarity_unknown_metric(): + with pytest.raises(ValueError): + similarity("a", "b", metric="nope") + + +# --- wiring --------------------------------------------------------------- + +def test_executor_round_trip(): + rec = ac.execute_action([[ + "AC_text_similarity", + {"a": "login", "b": "lgoin", "metric": "damerau_levenshtein"}]]) + score = next(v for v in rec.values() if isinstance(v, dict))["score"] + assert score == pytest.approx(1 - 1 / 5) # one 
transposition over len 5 + + +def test_wiring(): + assert "AC_text_similarity" in ac.executor.known_commands() + from je_auto_control.utils.mcp_server.tools import build_default_tool_registry + assert "ac_text_similarity" in {t.name for t in build_default_tool_registry()} + from je_auto_control.gui.script_builder.command_schema import _build_specs + assert "AC_text_similarity" in {s.command for s in _build_specs()} + + +def test_facade_exports(): + for attr in ("levenshtein", "damerau_levenshtein", "jaro", "jaro_winkler", + "jaccard", "dice", "similarity"): + assert hasattr(ac, attr) and attr in ac.__all__
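
Reviewer note (not part of the patch): the expected values asserted in `test_jaro_and_winkler` and `test_similarity_normalises_edit_distance` can be reproduced by hand. The sketch below is standalone and does not import `je_auto_control`. The helper names `jaro_from_counts`, `winkler_boost`, and `normalised_edit_similarity` are hypothetical, chosen only for illustration, and the match and transposition counts for `MARTHA` / `MARHTA` are worked out manually rather than computed, so treat it as an arithmetic check under those assumptions and not as the module's implementation.

```python
"""Hand-check of the expected values used in the tests (standalone sketch)."""


def jaro_from_counts(len_a: int, len_b: int, matches: int,
                     transpositions: int) -> float:
    """Jaro similarity from precomputed match and transposition counts."""
    if matches == 0:
        return 0.0
    return (matches / len_a + matches / len_b
            + (matches - transpositions) / matches) / 3


def winkler_boost(jaro_score: float, prefix_len: int,
                  prefix_weight: float = 0.1) -> float:
    """Apply the Winkler prefix boost; the prefix is capped at 4 chars."""
    prefix = min(prefix_len, 4)
    return jaro_score + prefix * prefix_weight * (1 - jaro_score)


def normalised_edit_similarity(distance: int, len_a: int, len_b: int) -> float:
    """Map an edit distance to [0, 1] as 1 - distance / max_len."""
    longest = max(len_a, len_b)
    return 1.0 if longest == 0 else 1 - distance / longest


# "MARTHA" vs "MARHTA": all 6 characters match inside the search window
# (max(6, 6) // 2 - 1 = 2). "TH" / "HT" are out of order, which gives
# 2 mismatched positions, i.e. 1 transposition after halving.
jaro_score = jaro_from_counts(6, 6, matches=6, transpositions=1)
jw_score = winkler_boost(jaro_score, prefix_len=3)  # shared prefix "MAR"

# "kitten" vs "sitting": Levenshtein distance 3, longest length 7.
lev_score = normalised_edit_similarity(3, len("kitten"), len("sitting"))

print(round(jaro_score, 4), round(jw_score, 4), round(lev_score, 4))
# prints: 0.9444 0.9611 0.5714
```

These are the same figures the tests approximate with `pytest.approx(0.9444, abs=1e-3)`, `pytest.approx(0.9611, abs=1e-3)`, and `pytest.approx(1 - 3 / 7)`.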