Merged
7 changes: 7 additions & 0 deletions README.md
@@ -13,6 +13,7 @@

## Table of Contents

- [What's new (2026-06-22) — Locale-Aware String Collation](#whats-new-2026-06-22--locale-aware-string-collation)
- [What's new (2026-06-22) — Transactional Outbox](#whats-new-2026-06-22--transactional-outbox)
- [What's new (2026-06-22) — Optimistic-Concurrency Versioned Store](#whats-new-2026-06-22--optimistic-concurrency-versioned-store)
- [What's new (2026-06-22) — Per-Stream Sequence-Gap Detection](#whats-new-2026-06-22--per-stream-sequence-gap-detection)
@@ -160,6 +161,12 @@

---

## What's new (2026-06-22) — Locale-Aware String Collation

Sort strings the way a reader of the language expects. Full reference: [`docs/source/Eng/doc/new_features/v108_features_doc.rst`](docs/source/Eng/doc/new_features/v108_features_doc.rst).

- **`sort_strings` / `collation_compare` / `collation_key`** (`AC_collation_sort`, `AC_collation_compare`): Python's default `sorted` is codepoint order, so `"Z" < "a"` and `"ä"` lands far from `"a"`. This Unicode-Collation-lite key orders by base letter, then accent (secondary), then case (tertiary), with an optional `tailoring` alphabet so Swedish puts `å ä ö` after `z`. Pure-stdlib (`unicodedata`), deterministic across platforms — unlike `locale.strxfrm`.
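
The difference from codepoint order can be illustrated with a minimal sketch that uses only `unicodedata`. The helper `lite_key` below is a hypothetical stand-in written for this example, not the library's `collation_key`; it only shows how a three-level key (base letter, accent, case) reorders a list compared with plain `sorted`.

```python
import unicodedata


def lite_key(text):
    """Hypothetical three-level key: (base letters, accents, case flags)."""
    primary, secondary, tertiary = [], [], []
    for char in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(char):
            # Combining marks (accents) only matter at the secondary level.
            secondary.append(ord(char))
            continue
        folded = char.casefold()
        primary.append(ord(folded[0]))
        # 0 for lowercase, 1 for uppercase, so lowercase sorts first.
        tertiary.append(1 if char != folded else 0)
    return tuple(primary), tuple(secondary), tuple(tertiary)


words = ["Zebra", "apple", "äpple", "Apple"]

codepoint_order = sorted(words)
# ['Apple', 'Zebra', 'apple', 'äpple']

collated_order = sorted(words, key=lite_key)
# ['apple', 'Apple', 'äpple', 'Zebra']
```

Because Python compares tuples element by element, the accent and case levels are only consulted when the base letters are identical.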

## What's new (2026-06-22) — Transactional Outbox

Durably buffer events and drain them at-least-once. Full reference: [`docs/source/Eng/doc/new_features/v107_features_doc.rst`](docs/source/Eng/doc/new_features/v107_features_doc.rst).
7 changes: 7 additions & 0 deletions README/README_zh-CN.md
@@ -12,6 +12,7 @@

## 目录

- [本次更新 (2026-06-22) — 区域感知字符串排序](#本次更新-2026-06-22--区域感知字符串排序)
- [本次更新 (2026-06-22) — 事务型 Outbox](#本次更新-2026-06-22--事务型-outbox)
- [本次更新 (2026-06-22) — 乐观并发版本存储](#本次更新-2026-06-22--乐观并发版本存储)
- [本次更新 (2026-06-22) — 逐流序号间隙检测](#本次更新-2026-06-22--逐流序号间隙检测)
@@ -163,6 +164,12 @@

平滑噪声值序列。完整参考:[`docs/source/Zh/doc/new_features/v102_features_doc.rst`](../docs/source/Zh/doc/new_features/v102_features_doc.rst)。

## 本次更新 (2026-06-22) — 区域感知字符串排序

依某语言读者的期望排序字符串。完整参考:[`docs/source/Zh/doc/new_features/v108_features_doc.rst`](../docs/source/Zh/doc/new_features/v108_features_doc.rst)。

- **`sort_strings` / `collation_compare` / `collation_key`**(`AC_collation_sort`、`AC_collation_compare`):Python 默认的 `sorted` 是码位顺序,因此 `"Z" < "a"`,而 `"ä"` 离 `"a"` 很远。本 Unicode-Collation-lite 键先依基底字母、再依变音符号(次层)、再依大小写(三层)排序,并可用 `tailoring` 字母表让瑞典文将 `å ä ö` 排在 `z` 之后。纯标准库(`unicodedata`)、跨平台确定——不像 `locale.strxfrm`。

## 本次更新 (2026-06-22) — 事务型 Outbox

持久化缓冲事件并以至少一次传递排空。完整参考:[`docs/source/Zh/doc/new_features/v107_features_doc.rst`](../docs/source/Zh/doc/new_features/v107_features_doc.rst)。
7 changes: 7 additions & 0 deletions README/README_zh-TW.md
@@ -12,6 +12,7 @@

## 目錄

- [本次更新 (2026-06-22) — 地區感知字串排序](#本次更新-2026-06-22--地區感知字串排序)
- [本次更新 (2026-06-22) — 交易型 Outbox](#本次更新-2026-06-22--交易型-outbox)
- [本次更新 (2026-06-22) — 樂觀並行版本儲存](#本次更新-2026-06-22--樂觀並行版本儲存)
- [本次更新 (2026-06-22) — 逐串流序號間隙偵測](#本次更新-2026-06-22--逐串流序號間隙偵測)
@@ -163,6 +164,12 @@

平滑雜訊值序列。完整參考:[`docs/source/Zh/doc/new_features/v102_features_doc.rst`](../docs/source/Zh/doc/new_features/v102_features_doc.rst)。

## 本次更新 (2026-06-22) — 地區感知字串排序

依某語言讀者的期望排序字串。完整參考:[`docs/source/Zh/doc/new_features/v108_features_doc.rst`](../docs/source/Zh/doc/new_features/v108_features_doc.rst)。

- **`sort_strings` / `collation_compare` / `collation_key`**(`AC_collation_sort`、`AC_collation_compare`):Python 預設的 `sorted` 是碼位順序,因此 `"Z" < "a"`,而 `"ä"` 離 `"a"` 很遠。本 Unicode-Collation-lite 鍵先依基底字母、再依變音符號(次層)、再依大小寫(三層)排序,並可用 `tailoring` 字母表讓瑞典文將 `å ä ö` 排在 `z` 之後。純標準函式庫(`unicodedata`)、跨平台具決定性——不像 `locale.strxfrm`。

## 本次更新 (2026-06-22) — 交易型 Outbox

持久化緩衝事件並以至少一次傳遞排空。完整參考:[`docs/source/Zh/doc/new_features/v107_features_doc.rst`](../docs/source/Zh/doc/new_features/v107_features_doc.rst)。
47 changes: 47 additions & 0 deletions docs/source/Eng/doc/new_features/v108_features_doc.rst
@@ -0,0 +1,47 @@
Locale-Aware String Collation
=============================

``text_normalize`` canonicalises text and ``locale_parse`` formats numbers, but
nothing sorts strings the way a reader of a given language expects. Python's
default ``sorted`` is codepoint order, so ``"Z" < "a"`` and ``"ä"`` lands far
from ``"a"``. A real collation orders by *base letter* first, then *accent*,
then *case*, and lets a locale tailor the alphabet (Swedish sorts ``å ä ö`` after
``z``).

This builds a Unicode-Collation-lite sort key with three levels — primary (base
letter), secondary (diacritics), tertiary (case) — plus an optional alphabet
``tailoring``. Pure standard library (``unicodedata``); imports no ``PySide6``.
Every function is pure, so it is fully deterministic across platforms (unlike
``locale.strxfrm``, which depends on the host's installed locales).
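
The split between the primary and secondary levels rests on Unicode decomposition. As a small illustration (``split_letter`` is a hypothetical helper for this example, not part of the module), NFKD normalisation separates a precomposed letter into its base letter and a combining mark, and ``unicodedata.combining`` reports a non-zero class for the mark:

```python
import unicodedata


def split_letter(letter):
    """Return (base letter, name of the combining mark) for a precomposed letter."""
    base, mark = unicodedata.normalize("NFKD", letter)
    return base, unicodedata.name(mark)


split_letter("å")  # ('a', 'COMBINING RING ABOVE')
split_letter("ä")  # ('a', 'COMBINING DIAERESIS')
split_letter("é")  # ('e', 'COMBINING ACUTE ACCENT')

# A non-zero combining class marks a character as a diacritic.
unicodedata.combining("\u0308")  # 230
unicodedata.combining("a")       # 0
```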

Headless API
------------

.. code-block:: python

    from je_auto_control import sort_strings, collation_compare, collation_key

    sort_strings(["résumé", "rest", "resume"])
    # ['rest', 'resume', 'résumé']   (accent is a secondary difference)

    swedish = "abcdefghijklmnopqrstuvwxyzåäö"
    sort_strings(["zebra", "äpple", "apple"], tailoring=swedish)
    # ['apple', 'zebra', 'äpple']    (å ä ö sort after z)

    collation_compare("apple", "Apple")          # -1 (lowercase before uppercase)
    sort_strings(rows, key=lambda r: r["name"])  # sort dicts by a field

``strength`` (``primary`` / ``secondary`` / ``tertiary``) caps the levels
compared, so ``strength="primary"`` is accent- and case-insensitive.
``tailoring`` is an ordered alphabet whose characters sort in the given order and
before any unlisted character; a precomposed letter such as ``"å"`` keeps its
alphabet rank instead of decomposing to ``a`` + combining ring above. ``collation_key``
returns the raw comparable tuple for use as a ``sorted`` key.
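
The interaction of ``strength`` and ``tailoring`` can be sketched as follows. ``sketch_key`` is a hypothetical, simplified stand-in written for illustration (for instance, it does not case-fold the tailoring alphabet), so treat it as an approximation of the behaviour described above rather than the module's implementation.

```python
import unicodedata

LEVELS = {"primary": 1, "secondary": 2, "tertiary": 3}


def sketch_key(text, strength="tertiary", tailoring=None):
    """Hypothetical stand-in for a three-level key with optional tailoring."""
    ranks = {char: index for index, char in enumerate(tailoring or "")}
    offset = len(ranks)  # unlisted letters are weighted after listed ones
    primary, secondary, tertiary = [], [], []
    for char in text:
        folded = char.casefold()
        if folded in ranks:
            # Listed letters are atomic: rank comes from the alphabet,
            # and the letter is not decomposed.
            primary.append(ranks[folded])
            tertiary.append(1 if char != folded else 0)
            continue
        for sub in unicodedata.normalize("NFKD", char):
            if unicodedata.combining(sub):
                secondary.append(ord(sub))
                continue
            subfold = sub.casefold()
            primary.append(offset + ord(subfold[0]))
            tertiary.append(1 if sub != subfold else 0)
    levels = (tuple(primary), tuple(secondary), tuple(tertiary))
    return levels[:LEVELS[strength]]


# strength="primary" ignores accents and case.
same_base = (sketch_key("Résumé", strength="primary")
             == sketch_key("resume", strength="primary"))  # True

words = ["zebra", "äpple", "apple"]
swedish = "abcdefghijklmnopqrstuvwxyzåäö"

untailored = sorted(words, key=sketch_key)
# ['apple', 'äpple', 'zebra']

tailored = sorted(words, key=lambda w: sketch_key(w, tailoring=swedish))
# ['apple', 'zebra', 'äpple']
```

Slicing the tuple of levels is what makes ``strength`` cheap: a lower strength simply drops the trailing levels from the comparison.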

Executor commands
-----------------

``AC_collation_sort`` takes a JSON list and returns ``{sorted}``;
``AC_collation_compare`` returns ``{order: -1|0|1}``. Both accept ``strength``
and ``tailoring``, are exposed as MCP tools (``ac_collation_sort`` /
``ac_collation_compare``) and as Script Builder commands under **Data**.
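
The result shapes can be illustrated with a self-contained sketch. Both functions below are hypothetical stand-ins: they order by ``str.casefold`` rather than by a collation key, and only mirror the ``{sorted}`` / ``{order}`` payloads and the acceptance of a JSON-encoded list described above.

```python
import json


def sketch_sort_command(items, reverse=False):
    """Hypothetical adapter: accept a list or a JSON list, return {sorted}."""
    if isinstance(items, str):
        items = json.loads(items)
    ordered = sorted(items, key=str.casefold, reverse=bool(reverse))
    return {"sorted": ordered}


def sketch_compare_command(first, second):
    """Hypothetical adapter: return {order} as -1, 0 or 1."""
    a, b = first.casefold(), second.casefold()
    return {"order": (a > b) - (a < b)}


sketch_sort_command('["zebra", "apple"]')   # {'sorted': ['apple', 'zebra']}
sketch_compare_command("apple", "zebra")    # {'order': -1}
```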
1 change: 1 addition & 0 deletions docs/source/Eng/eng_index.rst
@@ -130,6 +130,7 @@ Comprehensive guides for all AutoControl features.
doc/new_features/v105_features_doc
doc/new_features/v106_features_doc
doc/new_features/v107_features_doc
doc/new_features/v108_features_doc
doc/ocr_backends/ocr_backends_doc
doc/observability/observability_doc
doc/operations_layer/operations_layer_doc
39 changes: 39 additions & 0 deletions docs/source/Zh/doc/new_features/v108_features_doc.rst
@@ -0,0 +1,39 @@
地區感知字串排序(Collation)
============================

``text_normalize`` 正規化文字、``locale_parse`` 格式化數字,但沒有任何功能能依某語言讀者的期望排序字串。
Python 預設的 ``sorted`` 是碼位順序,因此 ``"Z" < "a"``,而 ``"ä"`` 會離 ``"a"`` 很遠。真正的排序會先依
*基底字母*、再依*變音符號*、再依*大小寫*,並讓地區得以調整字母表(瑞典文將 ``å ä ö`` 排在 ``z`` 之後)。

本功能建立一個 Unicode-Collation-lite 排序鍵,含三個層級——主層(基底字母)、次層(變音符號)、三層(大小寫)
——以及選用的字母表 ``tailoring``。純標準函式庫(``unicodedata``);不匯入 ``PySide6``。每個函式皆為純函式,
因此跨平台完全具決定性(不像 ``locale.strxfrm`` 取決於主機已安裝的地區設定)。

無頭 API
--------

.. code-block:: python

    from je_auto_control import sort_strings, collation_compare, collation_key

    sort_strings(["résumé", "rest", "resume"])
    # ['rest', 'resume', 'résumé']   (變音符號為次層差異)

    swedish = "abcdefghijklmnopqrstuvwxyzåäö"
    sort_strings(["zebra", "äpple", "apple"], tailoring=swedish)
    # ['apple', 'zebra', 'äpple']    (å ä ö 排在 z 之後)

    collation_compare("apple", "Apple")          # -1 (小寫在大寫之前)
    sort_strings(rows, key=lambda r: r["name"])  # 依欄位排序字典

``strength``(``primary`` / ``secondary`` / ``tertiary``)限制比較的層級,因此 ``strength="primary"`` 為
不分變音符號與大小寫。``tailoring`` 是有序字母表,所列字元依給定順序排序,且排在任何未列字元之前;像 ``"å"``
這類預組字元會保有其字母表排名,而非分解為 ``a`` + 上圓圈符號。``collation_key`` 回傳可比較的原始 tuple,供作
``sorted`` 的 key 使用。

執行器命令
----------

``AC_collation_sort`` 接受 JSON 列表並回傳 ``{sorted}``;``AC_collation_compare`` 回傳 ``{order: -1|0|1}``。
兩者皆接受 ``strength`` 與 ``tailoring``,並以 MCP 工具(``ac_collation_sort`` / ``ac_collation_compare``)
以及 Script Builder 中 **Data** 分類下的命令提供。
1 change: 1 addition & 0 deletions docs/source/Zh/zh_index.rst
@@ -130,6 +130,7 @@ AutoControl 所有功能的完整使用指南。
doc/new_features/v105_features_doc
doc/new_features/v106_features_doc
doc/new_features/v107_features_doc
doc/new_features/v108_features_doc
doc/ocr_backends/ocr_backends_doc
doc/observability/observability_doc
doc/operations_layer/operations_layer_doc
8 changes: 8 additions & 0 deletions je_auto_control/__init__.py
@@ -213,6 +213,11 @@
)
# Transactional outbox (durable at-least-once event delivery)
from je_auto_control.utils.outbox import Outbox
# Locale-aware string collation (deterministic multi-level sort keys)
from je_auto_control.utils.locale_collation import (
    collation_key, sort_strings,
)
from je_auto_control.utils.locale_collation import compare as collation_compare
# CI workflow annotations (GitHub Actions)
from je_auto_control.utils.ci_annotations import (
    emit_annotations, format_annotation,
@@ -943,6 +948,9 @@ def start_autocontrol_gui(*args, **kwargs):
    "DedupWindow", "SequenceTracker",
    "VersionConflict", "VersionedStore", "check_if_match", "if_match_header",
    "Outbox",
    "collation_key",
    "collation_compare",
    "sort_strings",
    "emit_annotations", "format_annotation",
    "ClipboardHistory", "default_clipboard_history",
    "analyze_heal_log", "heal_stats", "scan_secrets",
24 changes: 24 additions & 0 deletions je_auto_control/gui/script_builder/command_schema.py
@@ -2066,6 +2066,30 @@ def _add_resilience_specs(specs: List[CommandSpec]) -> None:
        ),
        description="List events still awaiting successful delivery.",
    ))
    specs.append(CommandSpec(
        "AC_collation_sort", "Data", "Text: Collation Sort",
        fields=(
            FieldSpec("items", FieldType.STRING,
                      placeholder='["zebra", "apple", "Äpple"]'),
            FieldSpec("strength", FieldType.STRING, optional=True,
                      placeholder="tertiary"),
            FieldSpec("tailoring", FieldType.STRING, optional=True,
                      placeholder="abc...xyzåäö"),
            FieldSpec("reverse", FieldType.BOOL, optional=True),
        ),
        description="Locale-aware sort (base letter, then accent, then case).",
    ))
    specs.append(CommandSpec(
        "AC_collation_compare", "Data", "Text: Collation Compare",
        fields=(
            FieldSpec("first", FieldType.STRING, placeholder="apple"),
            FieldSpec("second", FieldType.STRING, placeholder="Äpple"),
            FieldSpec("strength", FieldType.STRING, optional=True,
                      placeholder="tertiary"),
            FieldSpec("tailoring", FieldType.STRING, optional=True),
        ),
        description="Locale-aware compare; returns order -1/0/1.",
    ))
    specs.append(CommandSpec(
        "AC_diff_rows", "Data", "Dataset Diff: Rows by Key",
        fields=(
22 changes: 22 additions & 0 deletions je_auto_control/utils/executor/action_executor.py
@@ -2956,6 +2956,26 @@ def _outbox_pending(name: str) -> Dict[str, Any]:
    return {"pending": outbox.pending()}


def _collation_sort(items: Any, strength: str = "tertiary",
                    tailoring: Any = None, reverse: Any = False) -> Dict[str, Any]:
    """Adapter: locale-aware sort of a list of strings."""
    import json
    from je_auto_control.utils.locale_collation import sort_strings
    if isinstance(items, str):
        items = json.loads(items)
    ordered = sort_strings(list(items), strength=strength,
                           tailoring=tailoring or None, reverse=bool(reverse))
    return {"sorted": ordered}


def _collation_compare(first: str, second: str, strength: str = "tertiary",
                       tailoring: Any = None) -> Dict[str, Any]:
    """Adapter: locale-aware comparison of two strings."""
    from je_auto_control.utils.locale_collation import compare
    return {"order": compare(first, second, strength=strength,
                             tailoring=tailoring or None)}


def _cas_put(name: str, key: str, value: Any,
             expected_version: Any = None) -> Dict[str, Any]:
    """Adapter: optimistic put into a named versioned store."""
@@ -4638,6 +4658,8 @@ def __init__(self):
            "AC_cas_get": _cas_get,
            "AC_outbox_enqueue": _outbox_enqueue,
            "AC_outbox_pending": _outbox_pending,
            "AC_collation_sort": _collation_sort,
            "AC_collation_compare": _collation_compare,
            "AC_detect_drift": _detect_drift,
            "AC_categorical_drift": _categorical_drift,
            "AC_diff_rows": _diff_rows,
6 changes: 6 additions & 0 deletions je_auto_control/utils/locale_collation/__init__.py
@@ -0,0 +1,6 @@
"""Locale-aware string collation (deterministic multi-level sort keys)."""
from je_auto_control.utils.locale_collation.locale_collation import (
    collation_key, compare, sort_strings,
)

__all__ = ["collation_key", "compare", "sort_strings"]
122 changes: 122 additions & 0 deletions je_auto_control/utils/locale_collation/locale_collation.py
@@ -0,0 +1,122 @@
"""Locale-aware string collation (deterministic multi-level sort keys).

``text_normalize`` canonicalises text and ``locale_parse`` formats numbers, but
nothing sorts strings the way a human reading a given language expects: Python's
default ``sorted`` is codepoint order, so ``"Z" < "a"`` and ``"ä"`` lands far
from ``"a"``. A real collation orders by base letter first, then accent, then
case, and lets a locale tailor the alphabet (Swedish sorts ``å ä ö`` after
``z``).

This builds a Unicode-Collation-lite sort key with three levels — primary (base
letter), secondary (diacritics), tertiary (case) — plus an optional alphabet
``tailoring``. Pure standard library (``unicodedata``); imports no ``PySide6``.
Every function is pure (text in, key/order out), so it is fully deterministic in
CI and across platforms (unlike ``locale.strxfrm``).
"""
import unicodedata
from typing import Callable, Dict, List, Optional, Sequence, Tuple

_STRENGTHS = {"primary": 1, "secondary": 2, "tertiary": 3}

CollationKey = Tuple[Tuple[int, ...], ...]


def _build_tailoring(tailoring: Optional[str]) -> Optional[Dict[str, int]]:
    """Map each character of an ordered alphabet to its primary rank."""
    if not tailoring:
        return None
    ranks: Dict[str, int] = {}
    for index, char in enumerate(tailoring):
        folded = char.casefold()
        if folded not in ranks:
            ranks[folded] = index
    return ranks


def _untailored_weight(base: str, ranks: Optional[Dict[str, int]],
                       offset: int) -> int:
    """Primary weight of a folded base character outside any tailoring."""
    if not base:
        return offset if ranks is not None else 0
    return offset + ord(base[0]) if ranks is not None else ord(base[0])


def _char_weights(char: str, ranks: Optional[Dict[str, int]],
                  offset: int) -> Tuple[List[int], List[int], List[int]]:
    """Primary/secondary/tertiary weight contributions of one character.

    A tailored character is treated atomically (no decomposition) so a
    precomposed letter like ``"å"`` keeps its alphabet rank; everything else is
    NFKD-decomposed so diacritics fall to the secondary level.
    """
    folded = char.casefold()
    if ranks is not None and folded in ranks:
        return [ranks[folded]], [], [1 if char != folded else 0]
    primary: List[int] = []
    secondary: List[int] = []
    tertiary: List[int] = []
    for sub in unicodedata.normalize("NFKD", char):
        if unicodedata.combining(sub):
            secondary.append(ord(sub))
            continue
        subfold = sub.casefold()
        primary.append(_untailored_weight(subfold, ranks, offset))
        tertiary.append(1 if sub != subfold else 0)
    return primary, secondary, tertiary


def collation_key(text: str, *, strength: str = "tertiary",
                  tailoring: Optional[str] = None) -> CollationKey:
    """Return a comparable multi-level sort key for ``text``.

    Levels: primary (base letter), secondary (diacritics), tertiary (case,
    lowercase before uppercase). ``strength`` (``primary`` / ``secondary`` /
    ``tertiary``) caps the levels compared. ``tailoring`` is an ordered alphabet
    whose characters sort in the given order and before any unlisted character
    (so a Swedish ``"...xyzåäö"`` puts ``å`` after ``z``).
    """
    level = _STRENGTHS.get(strength)
    if level is None:
        raise ValueError(f"unknown strength: {strength!r}")
    ranks = _build_tailoring(tailoring)
    offset = len(tailoring) if tailoring else 0
    primary: List[int] = []
    secondary: List[int] = []
    tertiary: List[int] = []
    for char in text or "":
        char_primary, char_secondary, char_tertiary = _char_weights(
            char, ranks, offset)
        primary.extend(char_primary)
        secondary.extend(char_secondary)
        tertiary.extend(char_tertiary)
    levels = (tuple(primary), tuple(secondary), tuple(tertiary))
    return levels[:level]


def compare(first: str, second: str, *, strength: str = "tertiary",
            tailoring: Optional[str] = None) -> int:
    """Return ``-1`` / ``0`` / ``1`` ordering ``first`` against ``second``."""
    key_first = collation_key(first, strength=strength, tailoring=tailoring)
    key_second = collation_key(second, strength=strength, tailoring=tailoring)
    if key_first < key_second:
        return -1
    if key_first > key_second:
        return 1
    return 0


def sort_strings(items: Sequence[str], *, strength: str = "tertiary",
                 tailoring: Optional[str] = None, reverse: bool = False,
                 key: Optional[Callable[[object], str]] = None) -> List[object]:
    """Return ``items`` sorted by collation key.

    ``key`` extracts the string from each item (default: the item itself), so
    dicts or tuples can be sorted by one of their fields.
    """
    extract = key or (lambda item: item)

    def sort_key(item: object) -> CollationKey:
        return collation_key(str(extract(item)), strength=strength,
                             tailoring=tailoring)

    return sorted(items, key=sort_key, reverse=reverse)