7 changes: 7 additions & 0 deletions README.md
@@ -13,6 +13,7 @@

## Table of Contents

- [What's new (2026-06-22) — Confusable / Homoglyph Detection](#whats-new-2026-06-22--confusable--homoglyph-detection)
- [What's new (2026-06-22) — Locale-Aware String Collation](#whats-new-2026-06-22--locale-aware-string-collation)
- [What's new (2026-06-22) — Transactional Outbox](#whats-new-2026-06-22--transactional-outbox)
- [What's new (2026-06-22) — Optimistic-Concurrency Versioned Store](#whats-new-2026-06-22--optimistic-concurrency-versioned-store)
@@ -161,6 +162,12 @@

---

## What's new (2026-06-22) — Confusable / Homoglyph Detection

Catch Unicode visual spoofing (IDN-homograph phishing, lookalike labels). Full reference: [`docs/source/Eng/doc/new_features/v109_features_doc.rst`](docs/source/Eng/doc/new_features/v109_features_doc.rst).

- **`confusable_skeleton` / `is_confusable` / `detect_homoglyphs` / `is_mixed_script` / `scripts_of`** (`AC_confusable_scan`, `AC_confusable_compare`): a Cyrillic `"а"` (U+0430) renders identically to a Latin `"a"` (U+0061) in most fonts, so `"pаypal"` reads as `"paypal"` yet compares unequal. Following Unicode TR39, these functions fold confusables to a prototype skeleton (two strings are confusable when their skeletons match) and flag mixed-script tokens. Pure standard library (`unicodedata`) and deterministic.
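
Why the two spellings compare unequal can be checked with the standard library alone. The following is a minimal sketch that does not use this package's API; the variable names `latin` and `spoofed` are illustrative only.

```python
import unicodedata

latin = "paypal"
spoofed = "p\u0430ypal"  # second character is CYRILLIC SMALL LETTER A (U+0430)

# Same length and near-identical rendering, yet the strings are not equal.
print(latin == spoofed)

# Report every position where the code points differ.
for index, (a, b) in enumerate(zip(latin, spoofed)):
    if a != b:
        print(index,
              f"U+{ord(a):04X}", unicodedata.name(a), "vs",
              f"U+{ord(b):04X}", unicodedata.name(b))
```

Only index 1 is reported, which is the character a skeleton comparison would fold back to a Latin `a`.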

## What's new (2026-06-22) — Locale-Aware String Collation

Sort strings the way a reader of the language expects. Full reference: [`docs/source/Eng/doc/new_features/v108_features_doc.rst`](docs/source/Eng/doc/new_features/v108_features_doc.rst).
7 changes: 7 additions & 0 deletions README/README_zh-CN.md
@@ -12,6 +12,7 @@

## 目录

- [本次更新 (2026-06-22) — 易混淆字符 / 同形异义字检测](#本次更新-2026-06-22--易混淆字符--同形异义字检测)
- [本次更新 (2026-06-22) — 区域感知字符串排序](#本次更新-2026-06-22--区域感知字符串排序)
- [本次更新 (2026-06-22) — 事务型 Outbox](#本次更新-2026-06-22--事务型-outbox)
- [本次更新 (2026-06-22) — 乐观并发版本存储](#本次更新-2026-06-22--乐观并发版本存储)
@@ -164,6 +165,12 @@

平滑噪声值序列。完整参考:[`docs/source/Zh/doc/new_features/v102_features_doc.rst`](../docs/source/Zh/doc/new_features/v102_features_doc.rst)。

## 本次更新 (2026-06-22) — 易混淆字符 / 同形异义字检测

抓出 Unicode 视觉仿冒(IDN 同形异义字钓鱼、仿冒标签)。完整参考:[`docs/source/Zh/doc/new_features/v109_features_doc.rst`](../docs/source/Zh/doc/new_features/v109_features_doc.rst)。

- **`confusable_skeleton` / `is_confusable` / `detect_homoglyphs` / `is_mixed_script` / `scripts_of`**(`AC_confusable_scan`、`AC_confusable_compare`):西里尔字母 `"а"` 与拉丁字母 `"a"` 在像素上相同,因此 `"pаypal"` 读来是 `"paypal"` 却比较不相等。参照 Unicode TR39,本功能将易混淆字折叠为原型骨架(骨架相同即相符),并标记混用文字系统的令牌。纯标准库(`unicodedata`)、确定。

## 本次更新 (2026-06-22) — 区域感知字符串排序

依某语言读者的期望排序字符串。完整参考:[`docs/source/Zh/doc/new_features/v108_features_doc.rst`](../docs/source/Zh/doc/new_features/v108_features_doc.rst)。
7 changes: 7 additions & 0 deletions README/README_zh-TW.md
@@ -12,6 +12,7 @@

## 目錄

- [本次更新 (2026-06-22) — 易混淆字元 / 同形異義字偵測](#本次更新-2026-06-22--易混淆字元--同形異義字偵測)
- [本次更新 (2026-06-22) — 地區感知字串排序](#本次更新-2026-06-22--地區感知字串排序)
- [本次更新 (2026-06-22) — 交易型 Outbox](#本次更新-2026-06-22--交易型-outbox)
- [本次更新 (2026-06-22) — 樂觀並行版本儲存](#本次更新-2026-06-22--樂觀並行版本儲存)
@@ -164,6 +165,12 @@

平滑雜訊值序列。完整參考:[`docs/source/Zh/doc/new_features/v102_features_doc.rst`](../docs/source/Zh/doc/new_features/v102_features_doc.rst)。

## 本次更新 (2026-06-22) — 易混淆字元 / 同形異義字偵測

抓出 Unicode 視覺仿冒(IDN 同形異義字釣魚、仿冒標籤)。完整參考:[`docs/source/Zh/doc/new_features/v109_features_doc.rst`](../docs/source/Zh/doc/new_features/v109_features_doc.rst)。

- **`confusable_skeleton` / `is_confusable` / `detect_homoglyphs` / `is_mixed_script` / `scripts_of`**(`AC_confusable_scan`、`AC_confusable_compare`):西里爾字母 `"а"` 與拉丁字母 `"a"` 在像素上相同,因此 `"pаypal"` 讀來是 `"paypal"` 卻比較不相等。參照 Unicode TR39,本功能將易混淆字折疊為原型骨架(骨架相同即相符),並標記混用文字系統的權杖。純標準函式庫(`unicodedata`)、具決定性。

## 本次更新 (2026-06-22) — 地區感知字串排序

依某語言讀者的期望排序字串。完整參考:[`docs/source/Zh/doc/new_features/v108_features_doc.rst`](../docs/source/Zh/doc/new_features/v108_features_doc.rst)。
45 changes: 45 additions & 0 deletions docs/source/Eng/doc/new_features/v109_features_doc.rst
@@ -0,0 +1,45 @@
Confusable / Homoglyph Detection
================================

``secrets_scan`` finds secret-shaped tokens and ``guardrail`` screens text for
prompt injection, but neither catches *visual* spoofing. A Cyrillic ``"а"``
(U+0430) renders identically to a Latin ``"a"`` (U+0061) in most fonts, so
``"pаypal"`` (with a Cyrillic ``а``) reads as ``"paypal"`` to a human yet
compares unequal. This is the basis of IDN-homograph phishing and lookalike UI
labels.

Following the idea of Unicode TR39, this module folds confusable characters to a
prototype *skeleton* (two strings are confusable when their skeletons match) and
flags strings that mix scripts. It uses only the standard library
(``unicodedata``) and does not import ``PySide6``. Every function is pure, so
results are deterministic in CI.

Headless API
------------

.. code-block:: python

    from je_auto_control import (
        confusable_skeleton, is_confusable, detect_homoglyphs,
        is_mixed_script, scripts_of,
    )

    confusable_skeleton("pаypal")        # 'paypal'  (Cyrillic а -> a)
    is_confusable("pаypal", "paypal")    # True
    detect_homoglyphs("pаypal")          # [{'index': 1, 'char': 'а', 'prototype': 'a'}]
    is_mixed_script("pаypal")            # True  (Latin + Cyrillic)
    scripts_of("pаypal")                 # {'LATIN', 'CYRILLIC'}

``confusable_skeleton`` NFKC-normalises the input (folding fullwidth forms,
ligatures and math alphanumerics), then maps each remaining cross-script
lookalike to its Latin prototype. ``is_confusable`` is true only for *distinct*
strings with equal skeletons. ``detect_homoglyphs`` returns the offending
characters with their prototype and their index in the NFKC-normalised text.
``scripts_of`` / ``is_mixed_script`` classify characters by Unicode code-point
range (ignoring digits, punctuation and spaces), so a single mixed-script token
can be flagged on its own.
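
The split between the NFKC pass and the lookalike table can be seen with ``unicodedata`` alone. This is a small sketch, not the module's code; the ``samples`` dictionary and its labels are made up for illustration. It shows that NFKC folds compatibility forms to ASCII but leaves a Cyrillic letter untouched, which is why a separate cross-script table is needed.

```python
import unicodedata

# Illustrative inputs only; the labels are not part of any API.
samples = {
    "fullwidth": "\uff50\uff41\uff59",  # fullwidth letters p, a, y
    "ligature": "\ufb01x",              # LATIN SMALL LIGATURE FI followed by x
    "math bold": "\U0001d41a",          # MATHEMATICAL BOLD SMALL A
    "cyrillic": "p\u0430y",             # Cyrillic а between Latin letters
}

for label, text in samples.items():
    folded = unicodedata.normalize("NFKC", text)
    # isascii() stays False when NFKC alone could not reach an ASCII prototype.
    print(label, repr(folded), folded.isascii())
```

The first three samples fold to plain ASCII; the Cyrillic sample is returned unchanged by NFKC.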

Executor commands
-----------------

``AC_confusable_scan`` returns ``{skeleton, homoglyphs, mixed_script, scripts}``
for one string; ``AC_confusable_compare`` returns ``{confusable}`` for a pair.
Both are exposed as MCP tools (``ac_confusable_scan`` / ``ac_confusable_compare``)
and as Script Builder commands under **Data**.
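
The shape of the scan report can be illustrated with a standalone sketch. Everything here is an assumption for illustration: ``DEMO_TABLE`` is a hypothetical three-entry table, and ``demo_script`` derives the script from the first word of the Unicode character name rather than from the code-point ranges the module uses. Only the four result keys are taken from the description above.

```python
import unicodedata
from typing import Any, Dict

# Hypothetical three-entry table for illustration; not the module's table.
DEMO_TABLE = {"\u0430": "a", "\u0435": "e", "\u043e": "o"}


def demo_script(char: str) -> str:
    """Rough script label: first word of the Unicode name, letters only."""
    if not char.isalpha():
        return "COMMON"
    return unicodedata.name(char, "COMMON").split()[0]


def demo_scan(text: str) -> Dict[str, Any]:
    """Build a report with the same four keys as the scan command."""
    normalised = unicodedata.normalize("NFKC", text)
    scripts = {demo_script(char) for char in text} - {"COMMON"}
    return {
        "skeleton": "".join(DEMO_TABLE.get(c, c) for c in normalised),
        "homoglyphs": [
            {"index": i, "char": c, "prototype": DEMO_TABLE[c]}
            for i, c in enumerate(normalised) if c in DEMO_TABLE
        ],
        "mixed_script": len(scripts) > 1,
        "scripts": sorted(scripts),
    }


report = demo_scan("p\u0430ypal.com")
print(report)
```

Returning ``scripts`` as a sorted list rather than a set keeps the report JSON-serialisable and stable across runs.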
1 change: 1 addition & 0 deletions docs/source/Eng/eng_index.rst
@@ -131,6 +131,7 @@ Comprehensive guides for all AutoControl features.
doc/new_features/v106_features_doc
doc/new_features/v107_features_doc
doc/new_features/v108_features_doc
doc/new_features/v109_features_doc
doc/ocr_backends/ocr_backends_doc
doc/observability/observability_doc
doc/operations_layer/operations_layer_doc
38 changes: 38 additions & 0 deletions docs/source/Zh/doc/new_features/v109_features_doc.rst
@@ -0,0 +1,38 @@
易混淆字元 / 同形異義字偵測
==========================

``secrets_scan`` 找出疑似機密的權杖、``guardrail`` 篩檢提示注入,但沒有任何功能能抓出*視覺*仿冒:西里爾字母
``"а"``(U+0430)與拉丁字母 ``"a"``(U+0061)在像素上完全相同,因此 ``"pаypal"``(其中 ``а`` 為西里爾字母)
對人類讀來就是 ``"paypal"``,但比較卻不相等——這正是 IDN 同形異義字釣魚與仿冒 UI 標籤的根源。

本功能參照 Unicode TR39 的概念,將易混淆字元折疊為原型*骨架*(skeleton)(兩字串骨架相同即為易混淆),並標記
混用多種文字系統的字串。純標準函式庫(``unicodedata``);不匯入 ``PySide6``。每個函式皆為純函式,因此在 CI 中
完全具決定性。

無頭 API
--------

.. code-block:: python

    from je_auto_control import (
        confusable_skeleton, is_confusable, detect_homoglyphs,
        is_mixed_script, scripts_of,
    )

    confusable_skeleton("pаypal")        # 'paypal'  (西里爾 а -> a)
    is_confusable("pаypal", "paypal")    # True
    detect_homoglyphs("pаypal")          # [{'index': 1, 'char': 'а', 'prototype': 'a'}]
    is_mixed_script("pаypal")            # True  (拉丁 + 西里爾)
    scripts_of("pаypal")                 # {'LATIN', 'CYRILLIC'}

``confusable_skeleton`` 先以 NFKC 正規化(折疊全形、連字與數學英數字),再將每個剩餘的跨文字系統仿冒字對映到
其拉丁原型。``is_confusable`` 僅在兩個*不同*字串骨架相同時為真。``detect_homoglyphs`` 回傳有問題的字元連同其
位置與原型。``scripts_of`` / ``is_mixed_script`` 依 Unicode 區塊將字元分類(忽略數字、標點與空白),因此可單獨
標記一個混用文字系統的權杖。

執行器命令
----------

``AC_confusable_scan`` 對單一字串回傳 ``{skeleton, homoglyphs, mixed_script, scripts}``;``AC_confusable_compare``
對一組字串回傳 ``{confusable}``。兩者皆以 MCP 工具(``ac_confusable_scan`` / ``ac_confusable_compare``)以及
Script Builder 中 **Data** 分類下的命令提供。
1 change: 1 addition & 0 deletions docs/source/Zh/zh_index.rst
@@ -131,6 +131,7 @@ AutoControl 所有功能的完整使用指南。
doc/new_features/v106_features_doc
doc/new_features/v107_features_doc
doc/new_features/v108_features_doc
doc/new_features/v109_features_doc
doc/ocr_backends/ocr_backends_doc
doc/observability/observability_doc
doc/operations_layer/operations_layer_doc
10 changes: 10 additions & 0 deletions je_auto_control/__init__.py
@@ -218,6 +218,11 @@
collation_key, sort_strings,
)
from je_auto_control.utils.locale_collation import compare as collation_compare
# Confusable / homoglyph detection (Unicode-spoofing skeletons)
from je_auto_control.utils.confusables import (
detect_homoglyphs, is_confusable, is_mixed_script, scripts_of,
)
from je_auto_control.utils.confusables import skeleton as confusable_skeleton
# CI workflow annotations (GitHub Actions)
from je_auto_control.utils.ci_annotations import (
emit_annotations, format_annotation,
@@ -951,6 +956,11 @@ def start_autocontrol_gui(*args, **kwargs):
"collation_key",
"collation_compare",
"sort_strings",
"confusable_skeleton",
"detect_homoglyphs",
"is_confusable",
"is_mixed_script",
"scripts_of",
"emit_annotations", "format_annotation",
"ClipboardHistory", "default_clipboard_history",
"analyze_heal_log", "heal_stats", "scan_secrets",
15 changes: 15 additions & 0 deletions je_auto_control/gui/script_builder/command_schema.py
@@ -2090,6 +2090,21 @@ def _add_resilience_specs(specs: List[CommandSpec]) -> None:
        ),
        description="Locale-aware compare; returns order -1/0/1.",
    ))
    specs.append(CommandSpec(
        "AC_confusable_scan", "Data", "Text: Confusable Scan",
        fields=(
            FieldSpec("text", FieldType.STRING, placeholder="pаypal.com"),
        ),
        description="Homoglyph / mixed-script spoofing report for a string.",
    ))
    specs.append(CommandSpec(
        "AC_confusable_compare", "Data", "Text: Confusable Compare",
        fields=(
            FieldSpec("first", FieldType.STRING, placeholder="paypal"),
            FieldSpec("second", FieldType.STRING, placeholder="pаypal"),
        ),
        description="Whether two strings share the same confusable skeleton.",
    ))
    specs.append(CommandSpec(
        "AC_diff_rows", "Data", "Dataset Diff: Rows by Key",
        fields=(
9 changes: 9 additions & 0 deletions je_auto_control/utils/confusables/__init__.py
@@ -0,0 +1,9 @@
"""Confusable / homoglyph detection (Unicode-spoofing skeletons)."""
from je_auto_control.utils.confusables.confusables import (
detect_homoglyphs, is_confusable, is_mixed_script, scripts_of, skeleton,
)

__all__ = [
"detect_homoglyphs", "is_confusable", "is_mixed_script", "scripts_of",
"skeleton",
]
103 changes: 103 additions & 0 deletions je_auto_control/utils/confusables/confusables.py
@@ -0,0 +1,103 @@
"""Confusable / homoglyph detection (Unicode-spoofing skeletons + mixed script).

``secrets_scan`` finds secret-shaped tokens and ``guardrail`` screens text for
prompt injection, but neither catches *visual* spoofing. A Cyrillic ``"а"``
(U+0430) renders identically to a Latin ``"a"`` (U+0061) in most fonts, so
``"pаypal"`` (with a Cyrillic ``а``) reads as ``"paypal"`` to a human yet
compares unequal. This is the basis of IDN-homograph phishing and lookalike UI
labels.

Following the idea of Unicode TR39, this module folds confusable characters to a
prototype *skeleton* (two strings are confusable when their skeletons match) and
flags strings that mix scripts (for example Latin + Cyrillic). It uses only the
standard library (``unicodedata``) and does not import ``PySide6``. Every
function is pure, so results are deterministic in CI.
"""
import unicodedata
from typing import Dict, List, Set, Tuple

# Cross-script homoglyphs that NFKC does not fold. Maps each lookalike to its
# Latin/ASCII prototype. (Fullwidth, math-alphanumerics, etc. are handled by the
# NFKC pass in ``skeleton`` and need no entry here.)
_CONFUSABLES: Dict[str, str] = {
# Cyrillic lowercase
"а": "a", "е": "e", "о": "o", "р": "p", "с": "c", "у": "y", "х": "x",
"і": "i", "ј": "j", "ѕ": "s", "ԁ": "d", "һ": "h", "ѵ": "v", "ԛ": "q",
"ԝ": "w", "ё": "e", "г": "r", "п": "n",
# Cyrillic uppercase
"А": "A", "В": "B", "Е": "E", "К": "K", "М": "M", "Н": "H", "О": "O",
"Р": "P", "С": "C", "Т": "T", "Х": "X", "І": "I", "Ј": "J", "Ѕ": "S",
"У": "Y", "Ԛ": "Q", "Ԝ": "W", "Г": "r",
# Greek lowercase
"ο": "o", "α": "a", "ν": "v", "ρ": "p", "ε": "e", "ι": "i", "κ": "k",
"μ": "u", "τ": "t", "υ": "u", "χ": "x", "γ": "y",
# Greek uppercase
"Α": "A", "Β": "B", "Ε": "E", "Ζ": "Z", "Η": "H", "Ι": "I", "Κ": "K",
"Μ": "M", "Ν": "N", "Ο": "O", "Ρ": "P", "Τ": "T", "Υ": "Y", "Χ": "X",
}

# Script ranges for mixed-script detection. Ranges are inclusive. Characters
# outside every range (digits, punctuation, spaces, symbols, and letters of
# scripts not listed here) count as COMMON and are ignored when deciding whether
# scripts are mixed.
_SCRIPT_RANGES: Tuple[Tuple[int, int, str], ...] = (
(0x0041, 0x005A, "LATIN"), (0x0061, 0x007A, "LATIN"),
(0x00C0, 0x024F, "LATIN"), (0x1E00, 0x1EFF, "LATIN"),
(0x0370, 0x03FF, "GREEK"), (0x1F00, 0x1FFF, "GREEK"),
(0x0400, 0x052F, "CYRILLIC"),
(0x0530, 0x058F, "ARMENIAN"),
(0x0590, 0x05FF, "HEBREW"),
(0x0600, 0x06FF, "ARABIC"),
(0x3040, 0x309F, "HIRAGANA"), (0x30A0, 0x30FF, "KATAKANA"),
(0x3400, 0x9FFF, "HAN"), (0xAC00, 0xD7AF, "HANGUL"),
)


def _script_of(char: str) -> str:
    """Return the script of ``char`` (``COMMON`` if in no listed range)."""
    code = ord(char)
    for start, end, name in _SCRIPT_RANGES:
        if start <= code <= end:
            return name
    return "COMMON"


def skeleton(text: str) -> str:
    """Return the confusable skeleton of ``text`` (TR39-style).

    NFKC-normalises (folding fullwidth, ligatures, math alphanumerics), then maps
    each remaining cross-script homoglyph to its Latin prototype. Two distinct
    strings are treated as confusable when their skeletons are equal.
    """
    normalised = unicodedata.normalize("NFKC", text or "")
    return "".join(_CONFUSABLES.get(char, char) for char in normalised)


def is_confusable(first: str, second: str) -> bool:
    """Whether two *distinct* strings render to the same skeleton."""
    return first != second and skeleton(first) == skeleton(second)


def detect_homoglyphs(text: str) -> List[Dict[str, object]]:
    """List the confusable characters in ``text``.

    Each entry is ``{index, char, prototype}`` for a character that has an entry
    in ``_CONFUSABLES`` (i.e. a cross-script lookalike). ``index`` is the
    position in the NFKC-normalised text, which can differ from the position in
    ``text`` when normalisation changes the length.
    """
    findings: List[Dict[str, object]] = []
    for index, char in enumerate(unicodedata.normalize("NFKC", text or "")):
        prototype = _CONFUSABLES.get(char)
        if prototype is not None:
            findings.append({"index": index, "char": char,
                             "prototype": prototype})
    return findings


def scripts_of(text: str) -> Set[str]:
    """Return the set of (non-common) scripts present in ``text``."""
    scripts = {_script_of(char) for char in text or ""}
    scripts.discard("COMMON")
    return scripts


def is_mixed_script(text: str) -> bool:
    """Whether ``text`` mixes more than one script (a spoofing red flag)."""
    return len(scripts_of(text)) > 1
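
Because `detect_homoglyphs` enumerates the NFKC-normalised text, a reported `index` can differ from the character's position in the caller's input whenever normalisation changes the string length. A minimal standard-library sketch of that effect (the sample string is made up for illustration):

```python
import unicodedata

# LATIN SMALL LIGATURE FI followed by Cyrillic х (U+0445).
original = "\ufb01\u0445"
normalised = unicodedata.normalize("NFKC", original)

# The ligature expands to two letters, so the string grows by one character.
print(len(original), len(normalised))

# The Cyrillic letter therefore moves one position to the right.
print(original.index("\u0445"), normalised.index("\u0445"))
```

Callers that need offsets into the raw input would have to map indices back themselves, or normalise the input before passing it in.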
19 changes: 19 additions & 0 deletions je_auto_control/utils/executor/action_executor.py
@@ -2976,6 +2976,23 @@ def _collation_compare(first: str, second: str, strength: str = "tertiary",
tailoring=tailoring or None)}


def _confusable_scan(text: str) -> Dict[str, Any]:
    """Adapter: homoglyph / mixed-script spoofing report for a string."""
    from je_auto_control.utils.confusables import (
        detect_homoglyphs, is_mixed_script, scripts_of, skeleton,
    )
    return {"skeleton": skeleton(text),
            "homoglyphs": detect_homoglyphs(text),
            "mixed_script": is_mixed_script(text),
            "scripts": sorted(scripts_of(text))}


def _confusable_compare(first: str, second: str) -> Dict[str, Any]:
    """Adapter: whether two strings render to the same skeleton."""
    from je_auto_control.utils.confusables import is_confusable
    return {"confusable": is_confusable(first, second)}


def _cas_put(name: str, key: str, value: Any,
             expected_version: Any = None) -> Dict[str, Any]:
    """Adapter: optimistic put into a named versioned store."""
@@ -4660,6 +4677,8 @@ def __init__(self):
"AC_outbox_pending": _outbox_pending,
"AC_collation_sort": _collation_sort,
"AC_collation_compare": _collation_compare,
"AC_confusable_scan": _confusable_scan,
"AC_confusable_compare": _confusable_compare,
"AC_detect_drift": _detect_drift,
"AC_categorical_drift": _categorical_drift,
"AC_diff_rows": _diff_rows,
25 changes: 24 additions & 1 deletion je_auto_control/utils/mcp_server/tools/_factories.py
@@ -3630,6 +3630,29 @@ def locale_collation_tools() -> List[MCPTool]:
    ]


def confusables_tools() -> List[MCPTool]:
    return [
        MCPTool(
            name="ac_confusable_scan",
            description=("Homoglyph / mixed-script spoofing report for 'text'. "
                         "Returns {skeleton, homoglyphs, mixed_script, scripts}."),
            input_schema=schema({"text": {"type": "string"}}, ["text"]),
            handler=h.confusable_scan,
            annotations=READ_ONLY,
        ),
        MCPTool(
            name="ac_confusable_compare",
            description=("Whether 'first' and 'second' render to the same "
                         "confusable skeleton. Returns {confusable}."),
            input_schema=schema(
                {"first": {"type": "string"}, "second": {"type": "string"}},
                ["first", "second"]),
            handler=h.confusable_compare,
            annotations=READ_ONLY,
        ),
    ]


def sequence_gap_tools() -> List[MCPTool]:
    return [
        MCPTool(
@@ -5670,7 +5693,7 @@ def media_assert_tools() -> List[MCPTool]:
     sse_client_tools, layered_config_tools, data_drift_tools, schema_compat_tools,
     timeseries_tools, anomaly_tools, smoothing_tools, idempotency_tools,
     dedup_window_tools, sequence_gap_tools, optimistic_tools, outbox_tools,
-    locale_collation_tools,
+    locale_collation_tools, confusables_tools,
     dataset_diff_tools, referential_tools, link_header_tools, multipart_tools,
     http_content_tools, cookie_jar_tools, http_conditional_tools,
     saga_tools, decision_table_tools, locator_repair_tools,