7 changes: 7 additions & 0 deletions README.md
@@ -13,6 +13,7 @@

## Table of Contents

- [What's new (2026-06-22) — Unicode Text Normalisation & Slugify](#whats-new-2026-06-22--unicode-text-normalisation--slugify)
- [What's new (2026-06-22) — JSON-Schema Compatibility Checking](#whats-new-2026-06-22--json-schema-compatibility-checking)
- [What's new (2026-06-22) — Typed Configuration Schema](#whats-new-2026-06-22--typed-configuration-schema)
- [What's new (2026-06-22) — OTLP/JSON Span Export](#whats-new-2026-06-22--otlpjson-span-export)
@@ -149,6 +150,12 @@

---

## What's new (2026-06-22) — Unicode Text Normalisation & Slugify

Canonicalize text before fuzzy/search/OCR matching. Full reference: [`docs/source/Eng/doc/new_features/v97_features_doc.rst`](docs/source/Eng/doc/new_features/v97_features_doc.rst).

- **`normalize_text` / `deaccent` / `slugify` / `normalize_quotes` / `fold_whitespace`** (`AC_normalize_text`, `AC_slugify`): `fuzzy` and `search_index.tokenize` only lowercase their input, and OCR matching only applies `.lower()` plus a substring test, so `"Café"` (NFC), `"Café"` (NFD) and `"cafe"` all compare unequal. This adds the missing canonicalization layer: `normalize_text` (NFKC + casefold + whitespace fold), `deaccent` (accent stripping), `normalize_quotes` (smart-quote mapping) and `slugify` (ASCII slugs). Pure stdlib (`unicodedata`, `re`), deterministic.
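
The mismatch described above can be reproduced with the standard library alone. This is a minimal sketch that does not import `je_auto_control`; the variable names are illustrative only.

```python
import unicodedata

# The same visible word in two code-point forms (escapes keep the forms explicit).
nfc = "Caf\u00e9"     # precomposed e-acute, 4 code points
nfd = "Cafe\u0301"    # 'e' followed by a combining acute accent, 5 code points

# Lowercasing alone does not reconcile the two forms.
same_after_lower = nfc.lower() == nfd.lower()

# Normalising both sides to one form (NFKC here) and casefolding does.
same_after_nfkc = (unicodedata.normalize("NFKC", nfc).casefold()
                   == unicodedata.normalize("NFKC", nfd).casefold())

print(len(nfc), len(nfd), same_after_lower, same_after_nfkc)
```

Neither step makes `"café"` equal to plain `"cafe"`; that needs accent stripping as well.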

## What's new (2026-06-22) — JSON-Schema Compatibility Checking

Classify schema changes as backward/forward/full. Full reference: [`docs/source/Eng/doc/new_features/v96_features_doc.rst`](docs/source/Eng/doc/new_features/v96_features_doc.rst).
7 changes: 7 additions & 0 deletions README/README_zh-CN.md
@@ -12,6 +12,7 @@

## 目录

- [本次更新 (2026-06-22) — Unicode 文本规范化与 Slug](#本次更新-2026-06-22--unicode-文本规范化与-slug)
- [本次更新 (2026-06-22) — JSON-Schema 兼容性检查](#本次更新-2026-06-22--json-schema-兼容性检查)
- [本次更新 (2026-06-22) — 具类型的配置结构](#本次更新-2026-06-22--具类型的配置结构)
- [本次更新 (2026-06-22) — OTLP/JSON Span 导出](#本次更新-2026-06-22--otlpjson-span-导出)
@@ -148,6 +149,12 @@

---

## 本次更新 (2026-06-22) — Unicode 文本规范化与 Slug

在 fuzzy/search/OCR 匹配前规范化文本。完整参考:[`docs/source/Zh/doc/new_features/v97_features_doc.rst`](../docs/source/Zh/doc/new_features/v97_features_doc.rst)。

- **`normalize_text` / `deaccent` / `slugify` / `normalize_quotes` / `fold_whitespace`**(`AC_normalize_text`、`AC_slugify`):`fuzzy` 与 `search_index.tokenize` 只做小写,OCR 匹配只做 `.lower()`+子串,因此 `"Café"`(NFC)、`"Café"`(NFD)、`"cafe"` 会匹配不相等。本功能补上缺少的规范化层(NFKC + casefold + 空白折叠、去重音、智能引号映射、ASCII slug)。纯标准库(`unicodedata`)、确定。

## 本次更新 (2026-06-22) — JSON-Schema 兼容性检查

把结构变更分类为 backward/forward/full。完整参考:[`docs/source/Zh/doc/new_features/v96_features_doc.rst`](../docs/source/Zh/doc/new_features/v96_features_doc.rst)。
7 changes: 7 additions & 0 deletions README/README_zh-TW.md
@@ -12,6 +12,7 @@

## 目錄

- [本次更新 (2026-06-22) — Unicode 文字正規化與 Slug](#本次更新-2026-06-22--unicode-文字正規化與-slug)
- [本次更新 (2026-06-22) — JSON-Schema 相容性檢查](#本次更新-2026-06-22--json-schema-相容性檢查)
- [本次更新 (2026-06-22) — 具型別的設定結構](#本次更新-2026-06-22--具型別的設定結構)
- [本次更新 (2026-06-22) — OTLP/JSON Span 匯出](#本次更新-2026-06-22--otlpjson-span-匯出)
@@ -148,6 +149,12 @@

---

## 本次更新 (2026-06-22) — Unicode 文字正規化與 Slug

在 fuzzy/search/OCR 比對前正規化文字。完整參考:[`docs/source/Zh/doc/new_features/v97_features_doc.rst`](../docs/source/Zh/doc/new_features/v97_features_doc.rst)。

- **`normalize_text` / `deaccent` / `slugify` / `normalize_quotes` / `fold_whitespace`**(`AC_normalize_text`、`AC_slugify`):`fuzzy` 與 `search_index.tokenize` 只做小寫,OCR 比對只做 `.lower()`+子字串,因此 `"Café"`(NFC)、`"Café"`(NFD)、`"cafe"` 會比對不相等。本功能補上缺少的正規化層(NFKC + casefold + 空白折疊、去重音、智慧引號對應、ASCII slug)。純標準函式庫(`unicodedata`)、具決定性。

## 本次更新 (2026-06-22) — JSON-Schema 相容性檢查

把結構變更分類為 backward/forward/full。完整參考:[`docs/source/Zh/doc/new_features/v96_features_doc.rst`](../docs/source/Zh/doc/new_features/v96_features_doc.rst)。
40 changes: 40 additions & 0 deletions docs/source/Eng/doc/new_features/v97_features_doc.rst
@@ -0,0 +1,40 @@
Unicode Text Normalisation & Slugify
====================================

``fuzzy`` and ``search_index.tokenize`` only lowercase, and OCR
``find_text_matches`` only ``.lower()`` + substring — so ``"Café"`` (NFC) versus
``"Café"`` (NFD) versus OCR ``"cafe"`` compare unequal. This adds the
canonicalisation layer they should run before matching.

Pure standard library (``unicodedata`` / ``re``); imports no ``PySide6``. Every
function is pure (text in, text out), so it is fully deterministic in CI.

Headless API
------------

.. code-block:: python

    from je_auto_control import (
        normalize_text, deaccent, slugify, normalize_quotes, fold_whitespace,
    )

    normalize_text("CAFÉ Menu")         # "café menu" (NFKC + casefold + ws)
    deaccent("résumé")                  # "resume"
    slugify("Café Menu! 2026")          # "cafe-menu-2026"
    normalize_quotes("“Hi” — it’s…")    # '"Hi" - it\'s...'

``normalize_text`` applies a Unicode ``form`` (default ``NFKC``), optional
casefolding, and whitespace folding, so the same text in different code-point
forms compares equal. ``deaccent`` strips combining marks; ``fold_whitespace``
collapses whitespace runs to single spaces and trims both ends;
``normalize_quotes`` maps smart quotes, dashes, ellipsis and NBSP to ASCII;
``slugify`` produces an ASCII slug (de-accent, lowercase, join alphanumeric
runs with a separator). Run ``normalize_text`` before fuzzy/search/OCR matching
to make matches case-, whitespace- and form-insensitive. It does not remove
accents, so also apply ``deaccent`` when ``"café"`` should match ``"cafe"``.
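
As a rough illustration of combining these steps into one comparison key, the sketch below re-implements the pipeline with ``unicodedata`` only, so it runs without the package installed. ``match_key`` is a hypothetical helper name, not part of the API added here.

```python
import unicodedata


def match_key(text: str) -> str:
    """Hypothetical helper: NFKC + casefold + whitespace fold, then strip accents."""
    folded = " ".join(unicodedata.normalize("NFKC", text).casefold().split())
    decomposed = unicodedata.normalize("NFD", folded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# NFC, NFD, plain ASCII, and an upper-case padded variant of the same word.
variants = ["Caf\u00e9", "Cafe\u0301", "cafe", "  CAFE\u0301 "]
keys = {match_key(v) for v in variants}

# Form normalisation plus casefold alone still keeps the accent.
form_only = unicodedata.normalize("NFKC", "Caf\u00e9").casefold()

print(keys, form_only == "cafe")
```

All four variants collapse to one key only because the accent-stripping step is included; with form normalisation and casefolding alone, the accented spellings stay distinct from ``"cafe"``.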

Executor commands
-----------------

``AC_normalize_text`` returns ``{text}`` (with optional ``form`` / ``casefold``
/ ``collapse_ws``); ``AC_slugify`` returns ``{slug}``. Both are exposed as MCP
tools (``ac_normalize_text`` / ``ac_slugify``) and as Script Builder commands
under **Data**.
1 change: 1 addition & 0 deletions docs/source/Eng/eng_index.rst
@@ -119,6 +119,7 @@ Comprehensive guides for all AutoControl features.
doc/new_features/v94_features_doc
doc/new_features/v95_features_doc
doc/new_features/v96_features_doc
doc/new_features/v97_features_doc
doc/ocr_backends/ocr_backends_doc
doc/observability/observability_doc
doc/operations_layer/operations_layer_doc
35 changes: 35 additions & 0 deletions docs/source/Zh/doc/new_features/v97_features_doc.rst
@@ -0,0 +1,35 @@
Unicode 文字正規化與 Slug
=========================

``fuzzy`` 與 ``search_index.tokenize`` 只做小寫化,OCR ``find_text_matches`` 只做 ``.lower()`` + 子字串
比對 —— 因此 ``"Café"``(NFC)、``"Café"``(NFD)與 OCR 的 ``"cafe"`` 會比對不相等。本功能補上它們在比對
前應執行的正規化層。

純標準函式庫(``unicodedata`` / ``re``);不匯入 ``PySide6``。每個函式皆為純函式(輸入文字、輸出文字),
因此在 CI 中完全具決定性。

無頭 API
--------

.. code-block:: python

    from je_auto_control import (
        normalize_text, deaccent, slugify, normalize_quotes, fold_whitespace,
    )

    normalize_text("CAFÉ Menu")         # "café menu"(NFKC + casefold + 空白)
    deaccent("résumé")                  # "resume"
    slugify("Café Menu! 2026")          # "cafe-menu-2026"
    normalize_quotes("“Hi” — it’s…")    # '"Hi" - it\'s...'

``normalize_text`` 套用 Unicode ``form``(預設 ``NFKC``)、選用 casefold 與空白折疊,讓不同碼點形式的相同
文字比對相等。``deaccent`` 去除組合附加符號;``fold_whitespace`` 把連續空白收成單一空格並去除頭尾空白;
``normalize_quotes`` 把智慧引號、破折號、省略號與 NBSP 對應成 ASCII;``slugify`` 產生 ASCII slug(去重音、
小寫、以分隔符連接英數段)。在 fuzzy/search/OCR 比對前先執行 ``normalize_text`` 可讓比對不受大小寫、空白與
碼點形式影響;它不會去除重音,若要讓 ``"café"`` 與 ``"cafe"`` 相符,請再套用 ``deaccent``。

執行器命令
----------

``AC_normalize_text`` 回傳 ``{text}``(可選 ``form`` / ``casefold`` / ``collapse_ws``);``AC_slugify`` 回傳
``{slug}``。兩者皆以 MCP 工具(``ac_normalize_text`` / ``ac_slugify``)以及 Script Builder 中 **Data** 分類下
的命令提供。
1 change: 1 addition & 0 deletions docs/source/Zh/zh_index.rst
@@ -119,6 +119,7 @@ AutoControl 所有功能的完整使用指南。
doc/new_features/v94_features_doc
doc/new_features/v95_features_doc
doc/new_features/v96_features_doc
doc/new_features/v97_features_doc
doc/ocr_backends/ocr_backends_doc
doc/observability/observability_doc
doc/operations_layer/operations_layer_doc
6 changes: 6 additions & 0 deletions je_auto_control/__init__.py
@@ -253,6 +253,10 @@
from je_auto_control.utils.fuzzy import (
    fuzzy_best_match, fuzzy_dedupe, fuzzy_matches, fuzzy_ratio,
)
# Unicode text normalisation + slugify (canonicalise before matching)
from je_auto_control.utils.text_normalize import (
    deaccent, fold_whitespace, normalize_quotes, normalize_text, slugify,
)
# S3-compatible artifact store (optional boto3, injectable client)
from je_auto_control.utils.artifact_store import (
    S3ArtifactStore, configure_default_store, get_default_store,
@@ -917,6 +921,8 @@ def start_autocontrol_gui(*args, **kwargs):
    "VideoStep", "build_overlay_plan", "render_overlay_frame",
    "write_step_video",
    "fuzzy_best_match", "fuzzy_dedupe", "fuzzy_matches", "fuzzy_ratio",
    "deaccent", "fold_whitespace", "normalize_quotes", "normalize_text",
    "slugify",
    "S3ArtifactStore", "configure_default_store", "get_default_store",
    "set_default_store",
    "average_hash", "dedupe_images", "dhash", "hamming_distance",
20 changes: 20 additions & 0 deletions je_auto_control/gui/script_builder/command_schema.py
@@ -1648,6 +1648,26 @@ def _add_resilience_specs(specs: List[CommandSpec]) -> None:
        ),
        description="Build a canonical wide-event log line (rendered as JSON).",
    ))
    specs.append(CommandSpec(
        "AC_normalize_text", "Data", "Text: Normalize (Unicode)",
        fields=(
            FieldSpec("text", FieldType.STRING, placeholder="Café Menu"),
            FieldSpec("form", FieldType.STRING, optional=True,
                      placeholder="NFKC"),
            FieldSpec("casefold", FieldType.BOOL, optional=True, default=True),
            FieldSpec("collapse_ws", FieldType.BOOL, optional=True,
                      default=True),
        ),
        description="Unicode-normalise text (form + casefold + ws fold).",
    ))
    specs.append(CommandSpec(
        "AC_slugify", "Data", "Text: Slugify",
        fields=(
            FieldSpec("text", FieldType.STRING, placeholder="Café Menu!"),
            FieldSpec("sep", FieldType.STRING, optional=True, placeholder="-"),
        ),
        description="Produce an ASCII slug (de-accent, lowercase, join).",
    ))
    specs.append(CommandSpec(
        "AC_spans_to_otlp", "Report", "OTLP: Export Spans",
        fields=(
16 changes: 16 additions & 0 deletions je_auto_control/utils/executor/action_executor.py
@@ -3138,6 +3138,20 @@ def _baggage_parse(header: str) -> Dict[str, Any]:
    return {"items": parse_baggage(header).to_dict()}


def _normalize_text(text: str, form: str = "NFKC", casefold: Any = True,
                    collapse_ws: Any = True) -> Dict[str, Any]:
    """Adapter: Unicode-normalise text into {text}."""
    from je_auto_control.utils.text_normalize import normalize_text
    return {"text": normalize_text(text, form=form, casefold=bool(casefold),
                                   collapse_ws=bool(collapse_ws))}


def _slugify(text: str, sep: str = "-") -> Dict[str, Any]:
    """Adapter: produce an ASCII slug from text."""
    from je_auto_control.utils.text_normalize import slugify
    return {"slug": slugify(text, sep=sep)}
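
One point worth checking about the `bool(casefold)` / `bool(collapse_ws)` coercion in the adapter above: in Python, any non-empty string is truthy, so a flag that arrives as the string `"false"` would be treated as enabled. Whether these adapters ever receive string values is an assumption, not something this diff shows. The sketch below illustrates the behaviour with a hypothetical `coerce_flag` helper that is not part of the PR.

```python
from typing import Any

# Spellings treated as "off" by this sketch; the set is an illustrative choice.
_FALSE_STRINGS = {"", "0", "false", "no", "off"}


def coerce_flag(value: Any) -> bool:
    """Hypothetical helper: interpret common string spellings of a boolean."""
    if isinstance(value, str):
        return value.strip().lower() not in _FALSE_STRINGS
    return bool(value)


plain = bool("false")            # non-empty string, so truthy
coerced = coerce_flag("false")   # recognised as "off"
print(plain, coerced, coerce_flag(False), coerce_flag("True"))
```

If the executor only ever passes real booleans, the existing `bool(...)` calls are sufficient.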


def _canonical_log(fields: Any) -> Dict[str, Any]:
    """Adapter: build a canonical log line from a fields dict."""
    import json
@@ -4424,6 +4438,8 @@ def __init__(self):
"AC_baggage_format": _baggage_format,
"AC_canonical_log": _canonical_log,
"AC_spans_to_otlp": _spans_to_otlp,
"AC_normalize_text": _normalize_text,
"AC_slugify": _slugify,
"AC_validate_config": _validate_config,
"AC_resolve_ref": _resolve_ref,
"AC_resolve_refs": _resolve_refs,
30 changes: 30 additions & 0 deletions je_auto_control/utils/mcp_server/tools/_factories.py
@@ -3782,6 +3782,35 @@ def otlp_export_tools() -> List[MCPTool]:
    ]


def text_normalize_tools() -> List[MCPTool]:
    return [
        MCPTool(
            name="ac_normalize_text",
            description=("Unicode-normalise 'text' (form NFKC/NFC/..., casefold, "
                         "collapse whitespace) for robust matching. Returns "
                         "{text}."),
            input_schema=schema(
                {"text": {"type": "string"}, "form": {"type": "string"},
                 "casefold": {"type": "boolean"},
                 "collapse_ws": {"type": "boolean"}},
                ["text"]),
            handler=h.normalize_text,
            annotations=READ_ONLY,
        ),
        MCPTool(
            name="ac_slugify",
            description=("Produce an ASCII slug from 'text' (de-accent, "
                         "lowercase, join alnum runs with 'sep'). Returns "
                         "{slug}."),
            input_schema=schema(
                {"text": {"type": "string"}, "sep": {"type": "string"}},
                ["text"]),
            handler=h.slugify,
            annotations=READ_ONLY,
        ),
    ]


def canonical_log_tools() -> List[MCPTool]:
    return [
        MCPTool(
@@ -5378,6 +5407,7 @@ def media_assert_tools() -> List[MCPTool]:
feature_flag_tools, provenance_tools, json_contract_tools, chaos_tools,
slo_tools, percentiles_tools, bulkhead_tools, http_cassette_tools,
trace_context_tools, baggage_tools, canonical_log_tools, otlp_export_tools,
text_normalize_tools,
secret_ref_tools, config_schema_tools, config_redaction_tools,
data_profile_tools, http_problem_tools, dotenv_tools,
sse_client_tools, layered_config_tools, data_drift_tools, schema_compat_tools,
10 changes: 10 additions & 0 deletions je_auto_control/utils/mcp_server/tools/_handlers.py
@@ -1736,6 +1736,16 @@ def baggage_format(items):
    return _baggage_format(items)


def normalize_text(text, form="NFKC", casefold=True, collapse_ws=True):
    from je_auto_control.utils.executor.action_executor import _normalize_text
    return _normalize_text(text, form, casefold, collapse_ws)


def slugify(text, sep="-"):
    from je_auto_control.utils.executor.action_executor import _slugify
    return _slugify(text, sep)


def canonical_log(fields):
    from je_auto_control.utils.executor.action_executor import _canonical_log
    return _canonical_log(fields)
9 changes: 9 additions & 0 deletions je_auto_control/utils/text_normalize/__init__.py
@@ -0,0 +1,9 @@
"""Unicode text normalisation and slug generation for AutoControl."""
from je_auto_control.utils.text_normalize.text_normalize import (
    deaccent, fold_whitespace, normalize_quotes, normalize_text, slugify,
)

__all__ = [
    "deaccent", "fold_whitespace", "normalize_quotes", "normalize_text",
    "slugify",
]
54 changes: 54 additions & 0 deletions je_auto_control/utils/text_normalize/text_normalize.py
@@ -0,0 +1,54 @@
"""Unicode text normalisation and slug generation for robust text matching.

``fuzzy`` and ``search_index.tokenize`` only lowercase, and OCR
``find_text_matches`` only ``.lower()`` + substring — so ``"Café"`` (NFC) versus
``"Café"`` (NFD) versus OCR ``"cafe"`` compare unequal. This is the
canonicalisation layer they should run before matching.

Pure standard library (``unicodedata`` / ``re``); imports no ``PySide6``. Every
function is pure (text in, text out), so it is fully deterministic in CI.
"""
import re
import unicodedata

_QUOTE_MAP = {
    "‘": "'", "’": "'", "‚": "'", "‛": "'",
    "“": '"', "”": '"', "„": '"', "‟": '"',
    "–": "-", "—": "-", "−": "-", "…": "...",
    "\u00a0": " ",  # no-break space (NBSP), written as an escape so it stays visible
}
_QUOTE_TABLE = str.maketrans(_QUOTE_MAP)


def fold_whitespace(text: str) -> str:
    """Collapse runs of whitespace to single spaces and strip the ends."""
    return " ".join((text or "").split())


def deaccent(text: str) -> str:
    """Strip combining diacritical marks (``café`` -> ``cafe``)."""
    decomposed = unicodedata.normalize("NFD", text or "")
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_quotes(text: str) -> str:
    """Replace smart quotes, dashes, ellipsis and NBSP with ASCII equivalents."""
    return (text or "").translate(_QUOTE_TABLE)


def normalize_text(text: str, *, form: str = "NFKC", casefold: bool = True,
                   collapse_ws: bool = True) -> str:
    """Canonicalise ``text``: Unicode ``form``, optional casefold + ws fold."""
    result = unicodedata.normalize(form, text or "")
    if casefold:
        result = result.casefold()
    if collapse_ws:
        result = fold_whitespace(result)
    return result


def slugify(text: str, *, sep: str = "-") -> str:
    """Produce an ASCII slug: de-accent, lowercase, join alnum runs with ``sep``."""
    base = deaccent(unicodedata.normalize("NFKC", text or "")).lower()
    slug = re.sub(r"[^a-z0-9]+", sep, base)
    return slug.strip(sep) if sep else slug
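
To make the slug behaviour on a few edge inputs concrete, the sketch below repeats the `deaccent` and `slugify` bodies from this file so that it runs standalone, then exercises them. The sample inputs are illustrative and not taken from the PR's tests.

```python
import re
import unicodedata


def deaccent(text: str) -> str:
    # Same logic as the module above: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFD", text or "")
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def slugify(text: str, *, sep: str = "-") -> str:
    # Same logic as the module above.
    base = deaccent(unicodedata.normalize("NFKC", text or "")).lower()
    slug = re.sub(r"[^a-z0-9]+", sep, base)
    return slug.strip(sep) if sep else slug


print(slugify("Caf\u00e9 Menu! 2026"))      # cafe-menu-2026
print(slugify("\ufb01le name"))             # file-name (NFKC expands the "fi" ligature)
print(slugify("Stra\u00dfe"))               # stra-e (sharp s has no decomposition)
print(slugify("Sm\u00f8rrebr\u00f8d"))      # sm-rrebr-d (o-slash has no decomposition)
print(repr(slugify("\u6771\u4eac")))        # '' (no ASCII letters or digits survive)
print(slugify("Caf\u00e9 Menu!", sep=""))   # cafemenu
```

Letters that do not decompose into an ASCII base plus combining marks are replaced by the separator rather than transliterated, and text with no ASCII letters or digits yields an empty slug, so callers that use slugs as identifiers may want a fallback for that case.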