diff --git a/README.md b/README.md index c290b819..c5bb6a50 100644 --- a/README.md +++ b/README.md @@ -13,6 +13,7 @@ ## Table of Contents +- [What's new (2026-06-22) — Unicode Text Normalisation & Slugify](#whats-new-2026-06-22--unicode-text-normalisation--slugify) - [What's new (2026-06-22) — JSON-Schema Compatibility Checking](#whats-new-2026-06-22--json-schema-compatibility-checking) - [What's new (2026-06-22) — Typed Configuration Schema](#whats-new-2026-06-22--typed-configuration-schema) - [What's new (2026-06-22) — OTLP/JSON Span Export](#whats-new-2026-06-22--otlpjson-span-export) @@ -149,6 +150,12 @@ --- +## What's new (2026-06-22) — Unicode Text Normalisation & Slugify + +Canonicalize text before fuzzy/search/OCR matching. Full reference: [`docs/source/Eng/doc/new_features/v97_features_doc.rst`](docs/source/Eng/doc/new_features/v97_features_doc.rst). + +- **`normalize_text` / `deaccent` / `slugify` / `normalize_quotes` / `fold_whitespace`** (`AC_normalize_text`, `AC_slugify`): `fuzzy` and `search_index.tokenize` only lowercase and OCR matching only `.lower()`+substring, so `"Café"` (NFC) vs `"Café"` (NFD) vs `"cafe"` compare unequal. This adds the missing canonicalization layer (NFKC + casefold + whitespace fold, accent stripping, smart-quote mapping, ASCII slugs). Pure-stdlib (`unicodedata`), deterministic. + ## What's new (2026-06-22) — JSON-Schema Compatibility Checking Classify schema changes as backward/forward/full. Full reference: [`docs/source/Eng/doc/new_features/v96_features_doc.rst`](docs/source/Eng/doc/new_features/v96_features_doc.rst). 
diff --git a/README/README_zh-CN.md b/README/README_zh-CN.md index 7cb50319..8aa13d5d 100644 --- a/README/README_zh-CN.md +++ b/README/README_zh-CN.md @@ -12,6 +12,7 @@ ## 目录 +- [本次更新 (2026-06-22) — Unicode 文本规范化与 Slug](#本次更新-2026-06-22--unicode-文本规范化与-slug) - [本次更新 (2026-06-22) — JSON-Schema 兼容性检查](#本次更新-2026-06-22--json-schema-兼容性检查) - [本次更新 (2026-06-22) — 具类型的配置结构](#本次更新-2026-06-22--具类型的配置结构) - [本次更新 (2026-06-22) — OTLP/JSON Span 导出](#本次更新-2026-06-22--otlpjson-span-导出) @@ -148,6 +149,12 @@ --- +## 本次更新 (2026-06-22) — Unicode 文本规范化与 Slug + +在 fuzzy/search/OCR 匹配前规范化文本。完整参考:[`docs/source/Zh/doc/new_features/v97_features_doc.rst`](../docs/source/Zh/doc/new_features/v97_features_doc.rst)。 + +- **`normalize_text` / `deaccent` / `slugify` / `normalize_quotes` / `fold_whitespace`**(`AC_normalize_text`、`AC_slugify`):`fuzzy` 与 `search_index.tokenize` 只做小写,OCR 匹配只做 `.lower()`+子串,因此 `"Café"`(NFC)、`"Café"`(NFD)、`"cafe"` 会匹配不相等。本功能补上缺少的规范化层(NFKC + casefold + 空白折叠、去重音、智能引号映射、ASCII slug)。纯标准库(`unicodedata`)、确定。 + ## 本次更新 (2026-06-22) — JSON-Schema 兼容性检查 把结构变更分类为 backward/forward/full。完整参考:[`docs/source/Zh/doc/new_features/v96_features_doc.rst`](../docs/source/Zh/doc/new_features/v96_features_doc.rst)。 diff --git a/README/README_zh-TW.md b/README/README_zh-TW.md index f5f9da23..3848c564 100644 --- a/README/README_zh-TW.md +++ b/README/README_zh-TW.md @@ -12,6 +12,7 @@ ## 目錄 +- [本次更新 (2026-06-22) — Unicode 文字正規化與 Slug](#本次更新-2026-06-22--unicode-文字正規化與-slug) - [本次更新 (2026-06-22) — JSON-Schema 相容性檢查](#本次更新-2026-06-22--json-schema-相容性檢查) - [本次更新 (2026-06-22) — 具型別的設定結構](#本次更新-2026-06-22--具型別的設定結構) - [本次更新 (2026-06-22) — OTLP/JSON Span 匯出](#本次更新-2026-06-22--otlpjson-span-匯出) @@ -148,6 +149,12 @@ --- +## 本次更新 (2026-06-22) — Unicode 文字正規化與 Slug + +在 fuzzy/search/OCR 比對前正規化文字。完整參考:[`docs/source/Zh/doc/new_features/v97_features_doc.rst`](../docs/source/Zh/doc/new_features/v97_features_doc.rst)。 + +- **`normalize_text` / `deaccent` / `slugify` / `normalize_quotes` / 
`fold_whitespace`**(`AC_normalize_text`、`AC_slugify`):`fuzzy` 與 `search_index.tokenize` 只做小寫,OCR 比對只做 `.lower()`+子字串,因此 `"Café"`(NFC)、`"Café"`(NFD)、`"cafe"` 會比對不相等。本功能補上缺少的正規化層(NFKC + casefold + 空白折疊、去重音、智慧引號對應、ASCII slug)。純標準函式庫(`unicodedata`)、具決定性。 + ## 本次更新 (2026-06-22) — JSON-Schema 相容性檢查 把結構變更分類為 backward/forward/full。完整參考:[`docs/source/Zh/doc/new_features/v96_features_doc.rst`](../docs/source/Zh/doc/new_features/v96_features_doc.rst)。 diff --git a/docs/source/Eng/doc/new_features/v97_features_doc.rst b/docs/source/Eng/doc/new_features/v97_features_doc.rst new file mode 100644 index 00000000..8550f360 --- /dev/null +++ b/docs/source/Eng/doc/new_features/v97_features_doc.rst @@ -0,0 +1,40 @@ +Unicode Text Normalisation & Slugify +==================================== + +``fuzzy`` and ``search_index.tokenize`` only lowercase, and OCR +``find_text_matches`` only ``.lower()`` + substring — so ``"Café"`` (NFC) versus +``"Café"`` (NFD) versus OCR ``"cafe"`` compare unequal. This adds the +canonicalisation layer they should run before matching. + +Pure standard library (``unicodedata`` / ``re``); imports no ``PySide6``. Every +function is pure (text in, text out), so it is fully deterministic in CI. + +Headless API +------------ + +.. code-block:: python + + from je_auto_control import ( + normalize_text, deaccent, slugify, normalize_quotes, fold_whitespace, + ) + + normalize_text("CAFÉ Menu") # "café menu" (NFKC + casefold + ws) + deaccent("résumé") # "resume" + slugify("Café Menu! 2026") # "cafe-menu-2026" + normalize_quotes("“Hi” — it’s…") # '"Hi" - it\'s...' + +``normalize_text`` applies a Unicode ``form`` (default ``NFKC``), optional +casefolding, and whitespace folding, so the same text in different code-point +forms compares equal. 
``deaccent`` strips combining marks; ``fold_whitespace`` +collapses runs to single spaces; ``normalize_quotes`` maps smart quotes, dashes, +ellipsis and NBSP to ASCII; ``slugify`` produces an ASCII slug (de-accent, +lowercase, join alphanumeric runs with a separator). Run ``normalize_text`` +before fuzzy/search/OCR matching for case- and form-insensitivity; add ``deaccent`` to ignore accents. + +Executor commands +----------------- + +``AC_normalize_text`` returns ``{text}`` (with optional ``form`` / ``casefold`` +/ ``collapse_ws``); ``AC_slugify`` returns ``{slug}``. Both are exposed as MCP +tools (``ac_normalize_text`` / ``ac_slugify``) and as Script Builder commands +under **Data**. diff --git a/docs/source/Eng/eng_index.rst b/docs/source/Eng/eng_index.rst index 30b4d9fc..5e52e253 100644 --- a/docs/source/Eng/eng_index.rst +++ b/docs/source/Eng/eng_index.rst @@ -119,6 +119,7 @@ Comprehensive guides for all AutoControl features. doc/new_features/v94_features_doc doc/new_features/v95_features_doc doc/new_features/v96_features_doc + doc/new_features/v97_features_doc doc/ocr_backends/ocr_backends_doc doc/observability/observability_doc doc/operations_layer/operations_layer_doc diff --git a/docs/source/Zh/doc/new_features/v97_features_doc.rst b/docs/source/Zh/doc/new_features/v97_features_doc.rst new file mode 100644 index 00000000..eb2bfcc7 --- /dev/null +++ b/docs/source/Zh/doc/new_features/v97_features_doc.rst @@ -0,0 +1,35 @@ +Unicode 文字正規化與 Slug +======================== + +``fuzzy`` 與 ``search_index.tokenize`` 只做小寫化,OCR ``find_text_matches`` 只做 ``.lower()`` + 子字串 +比對 —— 因此 ``"Café"``(NFC)、``"Café"``(NFD)與 OCR 的 ``"cafe"`` 會比對不相等。本功能補上它們在比對 +前應執行的正規化層。 + +純標準函式庫(``unicodedata`` / ``re``);不匯入 ``PySide6``。每個函式皆為純函式(輸入文字、輸出文字), +因此在 CI 中完全具決定性。 + +無頭 API +-------- + +.. 
code-block:: python + + from je_auto_control import ( + normalize_text, deaccent, slugify, normalize_quotes, fold_whitespace, + ) + + normalize_text("CAFÉ Menu") # "café menu"(NFKC + casefold + 空白) + deaccent("résumé") # "resume" + slugify("Café Menu! 2026") # "cafe-menu-2026" + normalize_quotes("“Hi” — it’s…") # '"Hi" - it\'s...' + +``normalize_text`` 套用 Unicode ``form``(預設 ``NFKC``)、選用 casefold 與空白折疊,讓不同碼點形式的相同 +文字比對相等。``deaccent`` 去除組合附加符號;``fold_whitespace`` 把連續空白收成單一空格;``normalize_quotes`` +把智慧引號、破折號、省略號與 NBSP 對應成 ASCII;``slugify`` 產生 ASCII slug(去重音、小寫、以分隔符連接 +英數段)。在 fuzzy/search/OCR 比對前先執行 ``normalize_text`` 可讓比對對重音與形式不敏感。 + +執行器命令 +---------- + +``AC_normalize_text`` 回傳 ``{text}``(可選 ``form`` / ``casefold`` / ``collapse_ws``);``AC_slugify`` 回傳 +``{slug}``。兩者皆以 MCP 工具(``ac_normalize_text`` / ``ac_slugify``)以及 Script Builder 中 **Data** 分類下 +的命令提供。 diff --git a/docs/source/Zh/zh_index.rst b/docs/source/Zh/zh_index.rst index d7980a97..ed23b32a 100644 --- a/docs/source/Zh/zh_index.rst +++ b/docs/source/Zh/zh_index.rst @@ -119,6 +119,7 @@ AutoControl 所有功能的完整使用指南。 doc/new_features/v94_features_doc doc/new_features/v95_features_doc doc/new_features/v96_features_doc + doc/new_features/v97_features_doc doc/ocr_backends/ocr_backends_doc doc/observability/observability_doc doc/operations_layer/operations_layer_doc diff --git a/je_auto_control/__init__.py b/je_auto_control/__init__.py index d492b9b6..0c0985b2 100644 --- a/je_auto_control/__init__.py +++ b/je_auto_control/__init__.py @@ -253,6 +253,10 @@ from je_auto_control.utils.fuzzy import ( fuzzy_best_match, fuzzy_dedupe, fuzzy_matches, fuzzy_ratio, ) +# Unicode text normalisation + slugify (canonicalise before matching) +from je_auto_control.utils.text_normalize import ( + deaccent, fold_whitespace, normalize_quotes, normalize_text, slugify, +) # S3-compatible artifact store (optional boto3, injectable client) from je_auto_control.utils.artifact_store import ( S3ArtifactStore, configure_default_store, 
get_default_store, @@ -917,6 +921,8 @@ def start_autocontrol_gui(*args, **kwargs): "VideoStep", "build_overlay_plan", "render_overlay_frame", "write_step_video", "fuzzy_best_match", "fuzzy_dedupe", "fuzzy_matches", "fuzzy_ratio", + "deaccent", "fold_whitespace", "normalize_quotes", "normalize_text", + "slugify", "S3ArtifactStore", "configure_default_store", "get_default_store", "set_default_store", "average_hash", "dedupe_images", "dhash", "hamming_distance", diff --git a/je_auto_control/gui/script_builder/command_schema.py b/je_auto_control/gui/script_builder/command_schema.py index b988b7d7..8a194672 100644 --- a/je_auto_control/gui/script_builder/command_schema.py +++ b/je_auto_control/gui/script_builder/command_schema.py @@ -1648,6 +1648,26 @@ def _add_resilience_specs(specs: List[CommandSpec]) -> None: ), description="Build a canonical wide-event log line (rendered as JSON).", )) + specs.append(CommandSpec( + "AC_normalize_text", "Data", "Text: Normalize (Unicode)", + fields=( + FieldSpec("text", FieldType.STRING, placeholder="Café Menu"), + FieldSpec("form", FieldType.STRING, optional=True, + placeholder="NFKC"), + FieldSpec("casefold", FieldType.BOOL, optional=True, default=True), + FieldSpec("collapse_ws", FieldType.BOOL, optional=True, + default=True), + ), + description="Unicode-normalise text (form + casefold + ws fold).", + )) + specs.append(CommandSpec( + "AC_slugify", "Data", "Text: Slugify", + fields=( + FieldSpec("text", FieldType.STRING, placeholder="Café Menu!"), + FieldSpec("sep", FieldType.STRING, optional=True, placeholder="-"), + ), + description="Produce an ASCII slug (de-accent, lowercase, join).", + )) specs.append(CommandSpec( "AC_spans_to_otlp", "Report", "OTLP: Export Spans", fields=( diff --git a/je_auto_control/utils/executor/action_executor.py b/je_auto_control/utils/executor/action_executor.py index eeab8f07..d459705f 100644 --- a/je_auto_control/utils/executor/action_executor.py +++ 
b/je_auto_control/utils/executor/action_executor.py @@ -3138,6 +3138,20 @@ def _baggage_parse(header: str) -> Dict[str, Any]: return {"items": parse_baggage(header).to_dict()} +def _normalize_text(text: str, form: str = "NFKC", casefold: Any = True, + collapse_ws: Any = True) -> Dict[str, Any]: + """Adapter: Unicode-normalise text into {text}.""" + from je_auto_control.utils.text_normalize import normalize_text + return {"text": normalize_text(text, form=form, casefold=bool(casefold), + collapse_ws=bool(collapse_ws))} + + +def _slugify(text: str, sep: str = "-") -> Dict[str, Any]: + """Adapter: produce an ASCII slug from text.""" + from je_auto_control.utils.text_normalize import slugify + return {"slug": slugify(text, sep=sep)} + + def _canonical_log(fields: Any) -> Dict[str, Any]: """Adapter: build a canonical log line from a fields dict.""" import json @@ -4424,6 +4438,8 @@ def __init__(self): "AC_baggage_format": _baggage_format, "AC_canonical_log": _canonical_log, "AC_spans_to_otlp": _spans_to_otlp, + "AC_normalize_text": _normalize_text, + "AC_slugify": _slugify, "AC_validate_config": _validate_config, "AC_resolve_ref": _resolve_ref, "AC_resolve_refs": _resolve_refs, diff --git a/je_auto_control/utils/mcp_server/tools/_factories.py b/je_auto_control/utils/mcp_server/tools/_factories.py index 81d02516..c19fa56f 100644 --- a/je_auto_control/utils/mcp_server/tools/_factories.py +++ b/je_auto_control/utils/mcp_server/tools/_factories.py @@ -3782,6 +3782,35 @@ def otlp_export_tools() -> List[MCPTool]: ] +def text_normalize_tools() -> List[MCPTool]: + return [ + MCPTool( + name="ac_normalize_text", + description=("Unicode-normalise 'text' (form NFKC/NFC/..., casefold, " + "collapse whitespace) for robust matching. 
Returns " + "{text}."), + input_schema=schema( + {"text": {"type": "string"}, "form": {"type": "string"}, + "casefold": {"type": "boolean"}, + "collapse_ws": {"type": "boolean"}}, + ["text"]), + handler=h.normalize_text, + annotations=READ_ONLY, + ), + MCPTool( + name="ac_slugify", + description=("Produce an ASCII slug from 'text' (de-accent, " + "lowercase, join alnum runs with 'sep'). Returns " + "{slug}."), + input_schema=schema( + {"text": {"type": "string"}, "sep": {"type": "string"}}, + ["text"]), + handler=h.slugify, + annotations=READ_ONLY, + ), + ] + + def canonical_log_tools() -> List[MCPTool]: return [ MCPTool( @@ -5378,6 +5407,7 @@ def media_assert_tools() -> List[MCPTool]: feature_flag_tools, provenance_tools, json_contract_tools, chaos_tools, slo_tools, percentiles_tools, bulkhead_tools, http_cassette_tools, trace_context_tools, baggage_tools, canonical_log_tools, otlp_export_tools, + text_normalize_tools, secret_ref_tools, config_schema_tools, config_redaction_tools, data_profile_tools, http_problem_tools, dotenv_tools, sse_client_tools, layered_config_tools, data_drift_tools, schema_compat_tools, diff --git a/je_auto_control/utils/mcp_server/tools/_handlers.py b/je_auto_control/utils/mcp_server/tools/_handlers.py index 8b9be0d5..9c68be9d 100644 --- a/je_auto_control/utils/mcp_server/tools/_handlers.py +++ b/je_auto_control/utils/mcp_server/tools/_handlers.py @@ -1736,6 +1736,16 @@ def baggage_format(items): return _baggage_format(items) +def normalize_text(text, form="NFKC", casefold=True, collapse_ws=True): + from je_auto_control.utils.executor.action_executor import _normalize_text + return _normalize_text(text, form, casefold, collapse_ws) + + +def slugify(text, sep="-"): + from je_auto_control.utils.executor.action_executor import _slugify + return _slugify(text, sep) + + def canonical_log(fields): from je_auto_control.utils.executor.action_executor import _canonical_log return _canonical_log(fields) diff --git 
a/je_auto_control/utils/text_normalize/__init__.py b/je_auto_control/utils/text_normalize/__init__.py new file mode 100644 index 00000000..5ec8cdaa --- /dev/null +++ b/je_auto_control/utils/text_normalize/__init__.py @@ -0,0 +1,9 @@ +"""Unicode text normalisation and slug generation for AutoControl.""" +from je_auto_control.utils.text_normalize.text_normalize import ( + deaccent, fold_whitespace, normalize_quotes, normalize_text, slugify, +) + +__all__ = [ + "deaccent", "fold_whitespace", "normalize_quotes", "normalize_text", + "slugify", +] diff --git a/je_auto_control/utils/text_normalize/text_normalize.py b/je_auto_control/utils/text_normalize/text_normalize.py new file mode 100644 index 00000000..628ca6b6 --- /dev/null +++ b/je_auto_control/utils/text_normalize/text_normalize.py @@ -0,0 +1,54 @@ +"""Unicode text normalisation and slug generation for robust text matching. + +``fuzzy`` and ``search_index.tokenize`` only lowercase, and OCR +``find_text_matches`` only ``.lower()`` + substring — so ``"Café"`` (NFC) versus +``"Café"`` (NFD) versus OCR ``"cafe"`` compare unequal. This is the +canonicalisation layer they should run before matching. + +Pure standard library (``unicodedata`` / ``re``); imports no ``PySide6``. Every +function is pure (text in, text out), so it is fully deterministic in CI. 
+""" +import re +import unicodedata + +_QUOTE_MAP = { + "‘": "'", "’": "'", "‚": "'", "‛": "'", + "“": '"', "”": '"', "„": '"', "‟": '"', + "–": "-", "—": "-", "−": "-", "…": "...", + " ": " ", +} +_QUOTE_TABLE = str.maketrans(_QUOTE_MAP) + + +def fold_whitespace(text: str) -> str: + """Collapse runs of whitespace to single spaces and strip the ends.""" + return " ".join((text or "").split()) + + +def deaccent(text: str) -> str: + """Strip combining diacritical marks (``café`` -> ``cafe``).""" + decomposed = unicodedata.normalize("NFD", text or "") + return "".join(ch for ch in decomposed if not unicodedata.combining(ch)) + + +def normalize_quotes(text: str) -> str: + """Replace smart quotes, dashes, ellipsis and NBSP with ASCII equivalents.""" + return (text or "").translate(_QUOTE_TABLE) + + +def normalize_text(text: str, *, form: str = "NFKC", casefold: bool = True, + collapse_ws: bool = True) -> str: + """Canonicalise ``text``: Unicode ``form``, optional casefold + ws fold.""" + result = unicodedata.normalize(form, text or "") + if casefold: + result = result.casefold() + if collapse_ws: + result = fold_whitespace(result) + return result + + +def slugify(text: str, *, sep: str = "-") -> str: + """Produce an ASCII slug: de-accent, lowercase, join alnum runs with ``sep``.""" + base = deaccent(unicodedata.normalize("NFKC", text or "")).lower() + slug = re.sub(r"[^a-z0-9]+", sep, base) + return slug.strip(sep) if sep else slug diff --git a/test/unit_test/headless/test_text_normalize_batch.py b/test/unit_test/headless/test_text_normalize_batch.py new file mode 100644 index 00000000..e217eb27 --- /dev/null +++ b/test/unit_test/headless/test_text_normalize_batch.py @@ -0,0 +1,78 @@ +"""Headless tests for Unicode text normalisation. 
Pure stdlib, no Qt.""" +import unicodedata + +import je_auto_control as ac +from je_auto_control.utils.text_normalize import ( + deaccent, fold_whitespace, normalize_quotes, normalize_text, slugify, +) + + +def test_nfc_nfd_match_after_normalize(): + nfc = "Café" # é as one code point + nfd = "Café" # e + combining acute + assert nfc != nfd + assert normalize_text(nfc) == normalize_text(nfd) + assert normalize_text(nfc) == "café" # casefolded, NFKC + + +def test_deaccent(): + assert deaccent("Café résumé naïve") == "Cafe resume naive" + assert deaccent("") == "" + + +def test_fold_whitespace(): + assert fold_whitespace(" a\t b\n c ") == "a b c" + + +def test_normalize_quotes(): + assert normalize_quotes("“Hi” — it’s here…") == \ + '"Hi" - it\'s here...' + + +def test_normalize_text_options(): + assert normalize_text("AbC", casefold=False, collapse_ws=False) == "AbC" + assert normalize_text("A B") == "a b" + # NFKC folds fullwidth/compatibility forms + assert normalize_text("AB", casefold=False) == "AB" + + +def test_slugify(): + assert slugify("Café Menu! 
2026") == "cafe-menu-2026" + assert slugify(" multiple spaces ") == "multiple-spaces" + assert slugify("Hello World", sep="_") == "hello_world" + assert slugify("!!!") == "" + + +def test_normalize_helps_matching(): + # the canonicalisation layer fuzzy/search/OCR lack + on_screen = normalize_text("CAFÉ") + ocr = normalize_text(unicodedata.normalize("NFD", "café")) + assert on_screen == ocr + + +# --- wiring --------------------------------------------------------------- + +def test_executor_round_trip(): + rec = ac.execute_action([["AC_normalize_text", {"text": "Café Menu"}]]) + assert next(v for v in rec.values() + if isinstance(v, dict))["text"] == "café menu" + rec2 = ac.execute_action([["AC_slugify", {"text": "Café Menu!"}]]) + assert next(v for v in rec2.values() + if isinstance(v, dict))["slug"] == "cafe-menu" + + +def test_wiring(): + known = ac.executor.known_commands() + assert {"AC_normalize_text", "AC_slugify"} <= set(known) + from je_auto_control.utils.mcp_server.tools import build_default_tool_registry + names = {t.name for t in build_default_tool_registry()} + assert {"ac_normalize_text", "ac_slugify"} <= names + from je_auto_control.gui.script_builder.command_schema import _build_specs + specs = {s.command for s in _build_specs()} + assert {"AC_normalize_text", "AC_slugify"} <= specs + + +def test_facade_exports(): + for attr in ("normalize_text", "deaccent", "slugify", "normalize_quotes", + "fold_whitespace"): + assert hasattr(ac, attr) and attr in ac.__all__
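
Review note on matching semantics: as written in this patch, `normalize_text` applies NFKC, casefold, and whitespace folding, which makes NFC and NFD spellings of `"Café"` compare equal but still leaves the accent in place, so neither matches OCR output `"cafe"`. A minimal sketch of a combined comparison key follows. It uses only `unicodedata` and re-implements the two steps inline so it runs without the package; `match_key` is a hypothetical helper name, not something this diff adds.

```python
import unicodedata


def match_key(text: str) -> str:
    """Hypothetical helper: NFKC + casefold + whitespace fold, then strip accents."""
    # Same steps as normalize_text with its defaults in the patch.
    folded = " ".join(unicodedata.normalize("NFKC", text or "").casefold().split())
    # Same steps as deaccent in the patch: decompose, drop combining marks.
    decomposed = unicodedata.normalize("NFD", folded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


nfc = "Caf\u00e9"   # é as a single code point
nfd = "Cafe\u0301"  # e followed by a combining acute accent

# Form and case folding alone keep the accent, so this is still not "cafe".
form_only = unicodedata.normalize("NFKC", nfc).casefold()
print(form_only == "cafe")  # False

# With accents stripped as well, all three spellings share one key.
print(match_key(nfc) == match_key(nfd) == match_key("cafe"))  # True
```

With the package installed, the equivalent composition would presumably be `deaccent(normalize_text(s))`. Whether accent stripping belongs in the default matching path is a design choice, since it also merges words that differ only by diacritics.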