diff --git a/README.md b/README.md index 505abf4b..573b3e61 100644 --- a/README.md +++ b/README.md @@ -13,6 +13,7 @@ ## Table of Contents +- [What's new (2026-06-22) — Confusable / Homoglyph Detection](#whats-new-2026-06-22--confusable--homoglyph-detection) - [What's new (2026-06-22) — Locale-Aware String Collation](#whats-new-2026-06-22--locale-aware-string-collation) - [What's new (2026-06-22) — Transactional Outbox](#whats-new-2026-06-22--transactional-outbox) - [What's new (2026-06-22) — Optimistic-Concurrency Versioned Store](#whats-new-2026-06-22--optimistic-concurrency-versioned-store) @@ -161,6 +162,12 @@ --- +## What's new (2026-06-22) — Confusable / Homoglyph Detection + +Catch Unicode visual spoofing (IDN-homograph phishing, lookalike labels). Full reference: [`docs/source/Eng/doc/new_features/v109_features_doc.rst`](docs/source/Eng/doc/new_features/v109_features_doc.rst). + +- **`confusable_skeleton` / `is_confusable` / `detect_homoglyphs` / `is_mixed_script` / `scripts_of`** (`AC_confusable_scan`, `AC_confusable_compare`): a Cyrillic `"а"` is pixel-for-pixel a Latin `"a"`, so `"pаypal"` reads as `"paypal"` yet compares unequal. Following Unicode TR39, this folds confusables to a prototype skeleton (strings match when skeletons match) and flags mixed-script tokens. Pure-stdlib (`unicodedata`), deterministic. + ## What's new (2026-06-22) — Locale-Aware String Collation Sort strings the way a reader of the language expects. Full reference: [`docs/source/Eng/doc/new_features/v108_features_doc.rst`](docs/source/Eng/doc/new_features/v108_features_doc.rst). 
diff --git a/README/README_zh-CN.md b/README/README_zh-CN.md index 100a175e..77534cc8 100644 --- a/README/README_zh-CN.md +++ b/README/README_zh-CN.md @@ -12,6 +12,7 @@ ## 目录 +- [本次更新 (2026-06-22) — 易混淆字符 / 同形异义字检测](#本次更新-2026-06-22--易混淆字符--同形异义字检测) - [本次更新 (2026-06-22) — 区域感知字符串排序](#本次更新-2026-06-22--区域感知字符串排序) - [本次更新 (2026-06-22) — 事务型 Outbox](#本次更新-2026-06-22--事务型-outbox) - [本次更新 (2026-06-22) — 乐观并发版本存储](#本次更新-2026-06-22--乐观并发版本存储) @@ -164,6 +165,12 @@ 平滑噪声值序列。完整参考:[`docs/source/Zh/doc/new_features/v102_features_doc.rst`](../docs/source/Zh/doc/new_features/v102_features_doc.rst)。 +## 本次更新 (2026-06-22) — 易混淆字符 / 同形异义字检测 + +抓出 Unicode 视觉仿冒(IDN 同形异义字钓鱼、仿冒标签)。完整参考:[`docs/source/Zh/doc/new_features/v109_features_doc.rst`](../docs/source/Zh/doc/new_features/v109_features_doc.rst)。 + +- **`confusable_skeleton` / `is_confusable` / `detect_homoglyphs` / `is_mixed_script` / `scripts_of`**(`AC_confusable_scan`、`AC_confusable_compare`):西里尔字母 `"а"` 与拉丁字母 `"a"` 在像素上相同,因此 `"pаypal"` 读来是 `"paypal"` 却比较不相等。参照 Unicode TR39,本功能将易混淆字折叠为原型骨架(骨架相同即相符),并标记混用文字系统的令牌。纯标准库(`unicodedata`)、确定。 + ## 本次更新 (2026-06-22) — 区域感知字符串排序 依某语言读者的期望排序字符串。完整参考:[`docs/source/Zh/doc/new_features/v108_features_doc.rst`](../docs/source/Zh/doc/new_features/v108_features_doc.rst)。 diff --git a/README/README_zh-TW.md b/README/README_zh-TW.md index 1106d8cc..ab0595b8 100644 --- a/README/README_zh-TW.md +++ b/README/README_zh-TW.md @@ -12,6 +12,7 @@ ## 目錄 +- [本次更新 (2026-06-22) — 易混淆字元 / 同形異義字偵測](#本次更新-2026-06-22--易混淆字元--同形異義字偵測) - [本次更新 (2026-06-22) — 地區感知字串排序](#本次更新-2026-06-22--地區感知字串排序) - [本次更新 (2026-06-22) — 交易型 Outbox](#本次更新-2026-06-22--交易型-outbox) - [本次更新 (2026-06-22) — 樂觀並行版本儲存](#本次更新-2026-06-22--樂觀並行版本儲存) @@ -164,6 +165,12 @@ 平滑雜訊值序列。完整參考:[`docs/source/Zh/doc/new_features/v102_features_doc.rst`](../docs/source/Zh/doc/new_features/v102_features_doc.rst)。 +## 本次更新 (2026-06-22) — 易混淆字元 / 同形異義字偵測 + +抓出 Unicode 視覺仿冒(IDN 
同形異義字釣魚、仿冒標籤)。完整參考:[`docs/source/Zh/doc/new_features/v109_features_doc.rst`](../docs/source/Zh/doc/new_features/v109_features_doc.rst)。 + +- **`confusable_skeleton` / `is_confusable` / `detect_homoglyphs` / `is_mixed_script` / `scripts_of`**(`AC_confusable_scan`、`AC_confusable_compare`):西里爾字母 `"а"` 與拉丁字母 `"a"` 在像素上相同,因此 `"pаypal"` 讀來是 `"paypal"` 卻比較不相等。參照 Unicode TR39,本功能將易混淆字折疊為原型骨架(骨架相同即相符),並標記混用文字系統的權杖。純標準函式庫(`unicodedata`)、具決定性。 + ## 本次更新 (2026-06-22) — 地區感知字串排序 依某語言讀者的期望排序字串。完整參考:[`docs/source/Zh/doc/new_features/v108_features_doc.rst`](../docs/source/Zh/doc/new_features/v108_features_doc.rst)。 diff --git a/docs/source/Eng/doc/new_features/v109_features_doc.rst b/docs/source/Eng/doc/new_features/v109_features_doc.rst new file mode 100644 index 00000000..a77ac77b --- /dev/null +++ b/docs/source/Eng/doc/new_features/v109_features_doc.rst @@ -0,0 +1,45 @@ +Confusable / Homoglyph Detection +================================ + +``secrets_scan`` finds secret-shaped tokens and ``guardrail`` screens text for +prompt injection, but nothing catches *visual* spoofing: a Cyrillic ``"а"`` +(U+0430) is pixel-for-pixel a Latin ``"a"`` (U+0061), so ``"pаypal"`` (with a +Cyrillic ``а``) reads as ``"paypal"`` to a human yet compares unequal — the basis +of IDN-homograph phishing and lookalike UI labels. + +Following the idea of Unicode TR39, this folds confusable characters to a +prototype *skeleton* (two strings are confusable when their skeletons match) and +flags strings that mix scripts. Pure standard library (``unicodedata``); imports +no ``PySide6``. Every function is pure, so it is fully deterministic in CI. + +Headless API +------------ + +.. 
code-block:: python + + from je_auto_control import ( + confusable_skeleton, is_confusable, detect_homoglyphs, + is_mixed_script, scripts_of, + ) + + confusable_skeleton("pаypal") # 'paypal' (Cyrillic а -> a) + is_confusable("pаypal", "paypal") # True + detect_homoglyphs("pаypal") # [{'index': 1, 'char': 'а', 'prototype': 'a'}] + is_mixed_script("pаypal") # True (Latin + Cyrillic) + scripts_of("pаypal") # {'LATIN', 'CYRILLIC'} + +``confusable_skeleton`` NFKC-normalises (folding fullwidth, ligatures and math +alphanumerics), then maps each remaining cross-script lookalike to its Latin +prototype. ``is_confusable`` is true only for *distinct* strings with equal +skeletons. ``detect_homoglyphs`` returns the offending characters with their +prototype and position (an index into the NFKC-normalised text). ``scripts_of`` +/ ``is_mixed_script`` classify characters by Unicode block (ignoring digits, +punctuation and spaces) so a single mixed-script token can be flagged alone. + +Executor commands +----------------- + +``AC_confusable_scan`` returns ``{skeleton, homoglyphs, mixed_script, scripts}`` +for one string; ``AC_confusable_compare`` returns ``{confusable}`` for a pair. +Both are exposed as MCP tools (``ac_confusable_scan`` / ``ac_confusable_compare``) +and as Script Builder commands under **Data**. diff --git a/docs/source/Eng/eng_index.rst b/docs/source/Eng/eng_index.rst index 03f73773..9ab9e0e9 100644 --- a/docs/source/Eng/eng_index.rst +++ b/docs/source/Eng/eng_index.rst @@ -131,6 +131,7 @@ Comprehensive guides for all AutoControl features. 
doc/new_features/v106_features_doc doc/new_features/v107_features_doc doc/new_features/v108_features_doc + doc/new_features/v109_features_doc doc/ocr_backends/ocr_backends_doc doc/observability/observability_doc doc/operations_layer/operations_layer_doc diff --git a/docs/source/Zh/doc/new_features/v109_features_doc.rst b/docs/source/Zh/doc/new_features/v109_features_doc.rst new file mode 100644 index 00000000..073d3f6d --- /dev/null +++ b/docs/source/Zh/doc/new_features/v109_features_doc.rst @@ -0,0 +1,38 @@ +易混淆字元 / 同形異義字偵測 +========================== + +``secrets_scan`` 找出疑似機密的權杖、``guardrail`` 篩檢提示注入,但沒有任何功能能抓出*視覺*仿冒:西里爾字母 +``"а"``(U+0430)與拉丁字母 ``"a"``(U+0061)在像素上完全相同,因此 ``"pаypal"``(其中 ``а`` 為西里爾字母) +對人類讀來就是 ``"paypal"``,但比較卻不相等——這正是 IDN 同形異義字釣魚與仿冒 UI 標籤的根源。 + +本功能參照 Unicode TR39 的概念,將易混淆字元折疊為原型*骨架*(skeleton)(兩字串骨架相同即為易混淆),並標記 +混用多種文字系統的字串。純標準函式庫(``unicodedata``);不匯入 ``PySide6``。每個函式皆為純函式,因此在 CI 中 +完全具決定性。 + +無頭 API +-------- + +.. code-block:: python + + from je_auto_control import ( + confusable_skeleton, is_confusable, detect_homoglyphs, + is_mixed_script, scripts_of, + ) + + confusable_skeleton("pаypal") # 'paypal' (西里爾 а -> a) + is_confusable("pаypal", "paypal") # True + detect_homoglyphs("pаypal") # [{'index': 1, 'char': 'а', 'prototype': 'a'}] + is_mixed_script("pаypal") # True (拉丁 + 西里爾) + scripts_of("pаypal") # {'LATIN', 'CYRILLIC'} + +``confusable_skeleton`` 先以 NFKC 正規化(折疊全形、連字與數學英數字),再將每個剩餘的跨文字系統仿冒字對映到 +其拉丁原型。``is_confusable`` 僅在兩個*不同*字串骨架相同時為真。``detect_homoglyphs`` 回傳有問題的字元連同其 +位置與原型。``scripts_of`` / ``is_mixed_script`` 依 Unicode 區塊將字元分類(忽略數字、標點與空白),因此可單獨 +標記一個混用文字系統的權杖。 + +執行器命令 +---------- + +``AC_confusable_scan`` 對單一字串回傳 ``{skeleton, homoglyphs, mixed_script, scripts}``;``AC_confusable_compare`` +對一組字串回傳 ``{confusable}``。兩者皆以 MCP 工具(``ac_confusable_scan`` / ``ac_confusable_compare``)以及 +Script Builder 中 **Data** 分類下的命令提供。 diff --git a/docs/source/Zh/zh_index.rst b/docs/source/Zh/zh_index.rst index 3600e172..8c7a5ad7 100644 --- 
a/docs/source/Zh/zh_index.rst +++ b/docs/source/Zh/zh_index.rst @@ -131,6 +131,7 @@ AutoControl 所有功能的完整使用指南。 doc/new_features/v106_features_doc doc/new_features/v107_features_doc doc/new_features/v108_features_doc + doc/new_features/v109_features_doc doc/ocr_backends/ocr_backends_doc doc/observability/observability_doc doc/operations_layer/operations_layer_doc diff --git a/je_auto_control/__init__.py b/je_auto_control/__init__.py index 7d206198..62eb8be1 100644 --- a/je_auto_control/__init__.py +++ b/je_auto_control/__init__.py @@ -218,6 +218,11 @@ collation_key, sort_strings, ) from je_auto_control.utils.locale_collation import compare as collation_compare +# Confusable / homoglyph detection (Unicode-spoofing skeletons) +from je_auto_control.utils.confusables import ( + detect_homoglyphs, is_confusable, is_mixed_script, scripts_of, +) +from je_auto_control.utils.confusables import skeleton as confusable_skeleton # CI workflow annotations (GitHub Actions) from je_auto_control.utils.ci_annotations import ( emit_annotations, format_annotation, @@ -951,6 +956,11 @@ def start_autocontrol_gui(*args, **kwargs): "collation_key", "collation_compare", "sort_strings", + "confusable_skeleton", + "detect_homoglyphs", + "is_confusable", + "is_mixed_script", + "scripts_of", "emit_annotations", "format_annotation", "ClipboardHistory", "default_clipboard_history", "analyze_heal_log", "heal_stats", "scan_secrets", diff --git a/je_auto_control/gui/script_builder/command_schema.py b/je_auto_control/gui/script_builder/command_schema.py index 87c5c2ca..3cb0e15f 100644 --- a/je_auto_control/gui/script_builder/command_schema.py +++ b/je_auto_control/gui/script_builder/command_schema.py @@ -2090,6 +2090,21 @@ def _add_resilience_specs(specs: List[CommandSpec]) -> None: ), description="Locale-aware compare; returns order -1/0/1.", )) + specs.append(CommandSpec( + "AC_confusable_scan", "Data", "Text: Confusable Scan", + fields=( + FieldSpec("text", FieldType.STRING, 
placeholder="pаypal.com"), + ), + description="Homoglyph / mixed-script spoofing report for a string.", + )) + specs.append(CommandSpec( + "AC_confusable_compare", "Data", "Text: Confusable Compare", + fields=( + FieldSpec("first", FieldType.STRING, placeholder="paypal"), + FieldSpec("second", FieldType.STRING, placeholder="pаypal"), + ), + description="Whether two strings share the same confusable skeleton.", + )) specs.append(CommandSpec( "AC_diff_rows", "Data", "Dataset Diff: Rows by Key", fields=( diff --git a/je_auto_control/utils/confusables/__init__.py b/je_auto_control/utils/confusables/__init__.py new file mode 100644 index 00000000..5f31e92e --- /dev/null +++ b/je_auto_control/utils/confusables/__init__.py @@ -0,0 +1,9 @@ +"""Confusable / homoglyph detection (Unicode-spoofing skeletons).""" +from je_auto_control.utils.confusables.confusables import ( + detect_homoglyphs, is_confusable, is_mixed_script, scripts_of, skeleton, +) + +__all__ = [ + "detect_homoglyphs", "is_confusable", "is_mixed_script", "scripts_of", + "skeleton", +] diff --git a/je_auto_control/utils/confusables/confusables.py b/je_auto_control/utils/confusables/confusables.py new file mode 100644 index 00000000..b094ff32 --- /dev/null +++ b/je_auto_control/utils/confusables/confusables.py @@ -0,0 +1,103 @@ +"""Confusable / homoglyph detection (Unicode-spoofing skeletons + mixed script). + +``secrets_scan`` finds secret-shaped tokens and ``guardrail`` screens text for +prompt injection, but nothing catches *visual* spoofing: a Cyrillic ``"а"`` +(U+0430) is pixel-for-pixel a Latin ``"a"`` (U+0061), so ``"pаypal"`` (with a +Cyrillic ``а``) reads as ``"paypal"`` to a human yet compares unequal — the basis +of IDN-homograph phishing and lookalike UI labels. + +Following the idea of Unicode TR39, this folds confusable characters to a +prototype *skeleton* (two strings are confusable when their skeletons match) and +flags strings that mix scripts (Latin + Cyrillic). 
Pure standard library +(``unicodedata``); imports no ``PySide6``. Every function is pure, so it is fully +deterministic in CI. +""" +import unicodedata +from typing import Dict, List, Set, Tuple + +# Cross-script homoglyphs that NFKC does not fold. Maps each lookalike to its +# Latin/ASCII prototype. (Fullwidth, math-alphanumerics, etc. are handled by the +# NFKC pass in ``skeleton`` and need no entry here.) +_CONFUSABLES: Dict[str, str] = { + # Cyrillic lowercase + "а": "a", "е": "e", "о": "o", "р": "p", "с": "c", "у": "y", "х": "x", + "і": "i", "ј": "j", "ѕ": "s", "ԁ": "d", "һ": "h", "ѵ": "v", "ԛ": "q", + "ԝ": "w", "ё": "e", "г": "r", "п": "n", + # Cyrillic uppercase + "А": "A", "В": "B", "Е": "E", "К": "K", "М": "M", "Н": "H", "О": "O", + "Р": "P", "С": "C", "Т": "T", "Х": "X", "І": "I", "Ј": "J", "Ѕ": "S", + "У": "Y", "Ԛ": "Q", "Ԝ": "W", "Г": "r", + # Greek lowercase + "ο": "o", "α": "a", "ν": "v", "ρ": "p", "ε": "e", "ι": "i", "κ": "k", + "μ": "u", "τ": "t", "υ": "u", "χ": "x", "γ": "y", + # Greek uppercase + "Α": "A", "Β": "B", "Ε": "E", "Ζ": "Z", "Η": "H", "Ι": "I", "Κ": "K", + "Μ": "M", "Ν": "N", "Ο": "O", "Ρ": "P", "Τ": "T", "Υ": "Y", "Χ": "X", +} + +# Script blocks for mixed-script detection. Ranges are inclusive; characters +# outside every range (digits, punctuation, spaces, symbols) count as COMMON and +# are ignored when deciding whether scripts are mixed. +_SCRIPT_RANGES: Tuple[Tuple[int, int, str], ...] 
= ( + (0x0041, 0x005A, "LATIN"), (0x0061, 0x007A, "LATIN"), + (0x00C0, 0x024F, "LATIN"), (0x1E00, 0x1EFF, "LATIN"), + (0x0370, 0x03FF, "GREEK"), (0x1F00, 0x1FFF, "GREEK"), + (0x0400, 0x052F, "CYRILLIC"), + (0x0530, 0x058F, "ARMENIAN"), + (0x0590, 0x05FF, "HEBREW"), + (0x0600, 0x06FF, "ARABIC"), + (0x3040, 0x309F, "HIRAGANA"), (0x30A0, 0x30FF, "KATAKANA"), + (0x3400, 0x9FFF, "HAN"), (0xAC00, 0xD7AF, "HANGUL"), +) + + +def _script_of(char: str) -> str: + """Return the script name of a character (``COMMON`` if not a letter).""" + code = ord(char) + for start, end, name in _SCRIPT_RANGES: + if start <= code <= end: + return name + return "COMMON" + + +def skeleton(text: str) -> str: + """Return the confusable skeleton of ``text`` (TR39-style). + + NFKC-normalises (folding fullwidth, ligatures, math alphanumerics), then maps + each remaining cross-script homoglyph to its Latin prototype. Two strings are + confusable exactly when their skeletons are equal. + """ + normalised = unicodedata.normalize("NFKC", text or "") + return "".join(_CONFUSABLES.get(char, char) for char in normalised) + + +def is_confusable(first: str, second: str) -> bool: + """Whether two *distinct* strings render to the same skeleton.""" + return first != second and skeleton(first) == skeleton(second) + + +def detect_homoglyphs(text: str) -> List[Dict[str, object]]: + """List the confusable characters in ``text``. + + Each entry is ``{index, char, prototype}`` for a character whose skeleton + differs from itself (i.e. a cross-script lookalike). 
+ """ + findings: List[Dict[str, object]] = [] + for index, char in enumerate(unicodedata.normalize("NFKC", text or "")): + prototype = _CONFUSABLES.get(char) + if prototype is not None: + findings.append({"index": index, "char": char, + "prototype": prototype}) + return findings + + +def scripts_of(text: str) -> Set[str]: + """Return the set of (non-common) scripts present in ``text``.""" + scripts = {_script_of(char) for char in text or ""} + scripts.discard("COMMON") + return scripts + + +def is_mixed_script(text: str) -> bool: + """Whether ``text`` mixes more than one script (a spoofing red flag).""" + return len(scripts_of(text)) > 1 diff --git a/je_auto_control/utils/executor/action_executor.py b/je_auto_control/utils/executor/action_executor.py index d60514e0..765a8e4c 100644 --- a/je_auto_control/utils/executor/action_executor.py +++ b/je_auto_control/utils/executor/action_executor.py @@ -2976,6 +2976,23 @@ def _collation_compare(first: str, second: str, strength: str = "tertiary", tailoring=tailoring or None)} +def _confusable_scan(text: str) -> Dict[str, Any]: + """Adapter: homoglyph / mixed-script spoofing report for a string.""" + from je_auto_control.utils.confusables import ( + detect_homoglyphs, is_mixed_script, scripts_of, skeleton, + ) + return {"skeleton": skeleton(text), + "homoglyphs": detect_homoglyphs(text), + "mixed_script": is_mixed_script(text), + "scripts": sorted(scripts_of(text))} + + +def _confusable_compare(first: str, second: str) -> Dict[str, Any]: + """Adapter: whether two strings render to the same skeleton.""" + from je_auto_control.utils.confusables import is_confusable + return {"confusable": is_confusable(first, second)} + + def _cas_put(name: str, key: str, value: Any, expected_version: Any = None) -> Dict[str, Any]: """Adapter: optimistic put into a named versioned store.""" @@ -4660,6 +4677,8 @@ def __init__(self): "AC_outbox_pending": _outbox_pending, "AC_collation_sort": _collation_sort, "AC_collation_compare": 
_collation_compare, + "AC_confusable_scan": _confusable_scan, + "AC_confusable_compare": _confusable_compare, "AC_detect_drift": _detect_drift, "AC_categorical_drift": _categorical_drift, "AC_diff_rows": _diff_rows, diff --git a/je_auto_control/utils/mcp_server/tools/_factories.py b/je_auto_control/utils/mcp_server/tools/_factories.py index c0699e5a..fe89590d 100644 --- a/je_auto_control/utils/mcp_server/tools/_factories.py +++ b/je_auto_control/utils/mcp_server/tools/_factories.py @@ -3630,6 +3630,29 @@ def locale_collation_tools() -> List[MCPTool]: ] +def confusables_tools() -> List[MCPTool]: + return [ + MCPTool( + name="ac_confusable_scan", + description=("Homoglyph / mixed-script spoofing report for 'text'. " + "Returns {skeleton, homoglyphs, mixed_script, scripts}."), + input_schema=schema({"text": {"type": "string"}}, ["text"]), + handler=h.confusable_scan, + annotations=READ_ONLY, + ), + MCPTool( + name="ac_confusable_compare", + description=("Whether 'first' and 'second' render to the same " + "confusable skeleton. 
Returns {confusable}."), + input_schema=schema( + {"first": {"type": "string"}, "second": {"type": "string"}}, + ["first", "second"]), + handler=h.confusable_compare, + annotations=READ_ONLY, + ), + ] + + def sequence_gap_tools() -> List[MCPTool]: return [ MCPTool( @@ -5670,7 +5693,7 @@ def media_assert_tools() -> List[MCPTool]: sse_client_tools, layered_config_tools, data_drift_tools, schema_compat_tools, timeseries_tools, anomaly_tools, smoothing_tools, idempotency_tools, dedup_window_tools, sequence_gap_tools, optimistic_tools, outbox_tools, - locale_collation_tools, + locale_collation_tools, confusables_tools, dataset_diff_tools, referential_tools, link_header_tools, multipart_tools, http_content_tools, cookie_jar_tools, http_conditional_tools, saga_tools, decision_table_tools, locator_repair_tools, diff --git a/je_auto_control/utils/mcp_server/tools/_handlers.py b/je_auto_control/utils/mcp_server/tools/_handlers.py index a73280e4..b4f65539 100644 --- a/je_auto_control/utils/mcp_server/tools/_handlers.py +++ b/je_auto_control/utils/mcp_server/tools/_handlers.py @@ -1972,6 +1972,16 @@ def collation_compare(first, second, strength="tertiary", tailoring=None): return _collation_compare(first, second, strength, tailoring) +def confusable_scan(text): + from je_auto_control.utils.executor.action_executor import _confusable_scan + return _confusable_scan(text) + + +def confusable_compare(first, second): + from je_auto_control.utils.executor.action_executor import _confusable_compare + return _confusable_compare(first, second) + + def detect_drift(reference, current, threshold=0.25, bins=10): from je_auto_control.utils.executor.action_executor import _detect_drift return _detect_drift(reference, current, threshold, bins) diff --git a/test/unit_test/headless/test_confusables_batch.py b/test/unit_test/headless/test_confusables_batch.py new file mode 100644 index 00000000..1458672f --- /dev/null +++ b/test/unit_test/headless/test_confusables_batch.py @@ -0,0 +1,66 @@ 
+"""Headless tests for confusable / homoglyph detection. No Qt.""" +import je_auto_control as ac +from je_auto_control.utils.confusables import ( + detect_homoglyphs, is_confusable, is_mixed_script, scripts_of, skeleton, +) + +# "pаypal" with a Cyrillic 'а' (U+0430) in place of the second letter. +_SPOOF = "pаypal" + + +def test_skeleton_folds_cyrillic_and_fullwidth(): + assert skeleton(_SPOOF) == "paypal" + assert skeleton("\uff41\uff42\uff43") == "abc" # fullwidth a b c (U+FF41..U+FF43) via NFKC + assert skeleton("paypal") == "paypal" + + +def test_is_confusable_requires_distinct_inputs(): + assert is_confusable(_SPOOF, "paypal") is True + assert is_confusable("paypal", "paypal") is False # identical -> not flagged + assert is_confusable("apple", "orange") is False + + +def test_detect_homoglyphs_reports_position(): + findings = detect_homoglyphs(_SPOOF) + assert findings == [{"index": 1, "char": "а", "prototype": "a"}] + assert detect_homoglyphs("paypal") == [] + + +def test_mixed_script_flag(): + assert is_mixed_script(_SPOOF) is True + assert is_mixed_script("paypal") is False + assert is_mixed_script("hello world 123!") is False # COMMON ignored + + +def test_scripts_of_ignores_common(): + assert scripts_of(_SPOOF) == {"LATIN", "CYRILLIC"} + assert scripts_of("123 ...") == set() + + +# --- wiring --------------------------------------------------------------- + +def test_executor_round_trip(): + rec = ac.execute_action([["AC_confusable_scan", {"text": _SPOOF}]]) + out = next(v for v in rec.values() if isinstance(v, dict)) + assert out["skeleton"] == "paypal" and out["mixed_script"] is True + assert out["scripts"] == ["CYRILLIC", "LATIN"] + rec2 = ac.execute_action([[ + "AC_confusable_compare", {"first": _SPOOF, "second": "paypal"}]]) + assert next(v for v in rec2.values() if isinstance(v, dict))["confusable"] is True + + +def test_wiring(): + known = ac.executor.known_commands() + assert {"AC_confusable_scan", "AC_confusable_compare"} <= set(known) + from 
je_auto_control.utils.mcp_server.tools import build_default_tool_registry + names = {t.name for t in build_default_tool_registry()} + assert {"ac_confusable_scan", "ac_confusable_compare"} <= names + from je_auto_control.gui.script_builder.command_schema import _build_specs + specs = {s.command for s in _build_specs()} + assert {"AC_confusable_scan", "AC_confusable_compare"} <= specs + + +def test_facade_exports(): + for attr in ("confusable_skeleton", "is_confusable", "detect_homoglyphs", + "is_mixed_script", "scripts_of"): + assert hasattr(ac, attr) and attr in ac.__all__
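
Review note on `detect_homoglyphs`: as written in this diff, it enumerates the NFKC-normalised text, so the reported `index` refers to the normalised string rather than the caller's input. The two differ whenever NFKC changes the length, for example when the ligature U+FB01 expands to `fi`. The sketch below illustrates the shift and one possible variant that reports original positions. It is a minimal illustration under stated assumptions, not the module's code: `CONFUSABLES` is a hypothetical two-entry stand-in for `_CONFUSABLES`, and `detect_normalised` / `detect_original` are names invented here.

```python
import unicodedata

# Hypothetical stand-in for the module's _CONFUSABLES table (assumption).
CONFUSABLES = {"\u0430": "a", "\u0435": "e"}  # Cyrillic а and е


def detect_normalised(text):
    """Mirror the diff's approach: indexes refer to the NFKC-normalised text."""
    normalised = unicodedata.normalize("NFKC", text or "")
    return [
        {"index": i, "char": c, "prototype": CONFUSABLES[c]}
        for i, c in enumerate(normalised)
        if c in CONFUSABLES
    ]


def detect_original(text):
    """Hypothetical variant: indexes refer to the caller's original string."""
    findings = []
    for i, original in enumerate(text or ""):
        # Normalise one input character at a time so positions stay aligned.
        for c in unicodedata.normalize("NFKC", original):
            if c in CONFUSABLES:
                findings.append(
                    {"index": i, "char": original, "prototype": CONFUSABLES[c]}
                )
    return findings


# U+FB01 (fi ligature) + "l" + Cyrillic е: three characters in, four after NFKC.
sample = "\ufb01l\u0435"
normalised_hits = detect_normalised(sample)
original_hits = detect_original(sample)
print(normalised_hits[0]["index"], original_hits[0]["index"])
```

Normalising one character at a time can differ from whole-string NFKC when a combining mark composes with the preceding base character, so the variant is a trade-off rather than a drop-in replacement. Documenting that `index` is relative to the normalised text may be the simpler option.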