
Add Unicode text normalisation and slugify #305

Merged: JE-Chen merged 1 commit into dev from feat/text-normalize-batch on Jun 21, 2026

@JE-Chen JE-Chen commented Jun 21, 2026

Member

What

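fuzzy and search_index.tokenize only lowercase their input, and OCR find_text_matches only applies .lower() plus a substring check, so "Café" in NFC (precomposed é, U+00E9), "Café" in NFD (e followed by the combining acute accent U+0301), and "cafe" all compare unequal. This PR adds a canonicalization layer to run before matching.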
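
To make the mismatch concrete, here is a small stdlib-only illustration (not code from this PR) showing that lowercasing alone leaves the three spellings distinct. The variable names are purely illustrative.

```python
import unicodedata

# Build the same visible word in both canonical forms.
nfc = unicodedata.normalize("NFC", "Caf\u00e9")  # precomposed U+00E9
nfd = unicodedata.normalize("NFD", "Caf\u00e9")  # 'e' followed by U+0301
plain = "cafe"

# The two forms differ in length because NFD splits the accent off.
print(len(nfc), len(nfd))  # 4 5

# .lower() does not reconcile them: all three stay distinct strings.
lowered = {nfc.lower(), nfd.lower(), plain.lower()}
print(len(lowered))  # 3

# A substring check built on .lower() therefore misses the NFD spelling.
print(nfc.lower() in nfd.lower())  # False
```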

  • normalize_text(text, *, form="NFKC", casefold=True, collapse_ws=True).
  • deaccent, normalize_quotes (smart quotes/dashes/ellipsis/NBSP → ASCII), fold_whitespace, slugify(text, sep="-").
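
A minimal sketch of how helpers with these names could be built on `unicodedata` and `re`. This is not the merged implementation: only the `normalize_text` and `slugify` signatures come from the description above, while the single-argument signatures of the other helpers, the exact character map, and the slug character set (`a-z0-9`) are assumptions made for illustration.

```python
import re
import unicodedata

# Assumed mapping: the PR only states smart quotes/dashes/ellipsis/NBSP -> ASCII.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # horizontal ellipsis
    "\u00a0": " ",                  # no-break space
})


def fold_whitespace(text: str) -> str:
    """Collapse runs of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def normalize_quotes(text: str) -> str:
    """Replace typographic punctuation with ASCII equivalents."""
    return text.translate(_QUOTE_MAP)


def deaccent(text: str) -> str:
    """Drop combining marks after decomposing, then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def normalize_text(text, *, form="NFKC", casefold=True, collapse_ws=True):
    """Unicode-normalise, then optionally casefold and fold whitespace."""
    result = unicodedata.normalize(form, text)
    if casefold:
        result = result.casefold()
    if collapse_ws:
        result = fold_whitespace(result)
    return result


def slugify(text, sep="-"):
    """Build an ASCII slug; the allowed character set is an assumption."""
    base = deaccent(normalize_text(normalize_quotes(text)))
    # A function replacement avoids escape handling in `sep`.
    return re.sub(r"[^a-z0-9]+", lambda _m: sep, base).strip(sep)


# NFC and NFD spellings now agree, and deaccent bridges to plain ASCII.
print(normalize_text("Caf\u00e9") == normalize_text("Cafe\u0301"))  # True
print(deaccent(normalize_text("Caf\u00e9")))  # cafe
print(slugify("Caf\u00e9 \u201cNoir\u201d \u2014 2026"))  # cafe-noir-2026
```

Note that in this sketch `normalize_text` alone does not make "café" equal to "cafe"; accent removal is a separate, opt-in step because it discards information.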

Round-10 research pick (text-normalization lane). Before this change, no text module imported unicodedata.

Layers

  • Headless core: utils/text_normalize/ (pure stdlib unicodedata/re, zero PySide6).
  • Facade: 5 symbols + __all__.
  • Executor: AC_normalize_text, AC_slugify.
  • MCP: ac_normalize_text, ac_slugify (read-only).
  • Script Builder: both under Data.
  • Tests: test/unit_test/headless/test_text_normalize_batch.py (10 tests, incl. NFC/NFD match).
  • Docs: v97_features_doc.rst (EN + Zh) + toctrees + 3 README What's-new sections.
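
As a rough idea of what an NFC/NFD test in such a batch might look like, here is a pytest-style sketch. The real test file and the import path of `normalize_text` are not shown in this PR page, so the function below is a local stand-in that only mirrors the documented signature, and the test names are hypothetical.

```python
import unicodedata


def normalize_text(text, *, form="NFKC", casefold=True, collapse_ws=True):
    """Local stand-in mirroring the documented signature (assumption)."""
    result = unicodedata.normalize(form, text)
    if casefold:
        result = result.casefold()
    if collapse_ws:
        # str.split() with no arguments splits on any whitespace run.
        result = " ".join(result.split())
    return result


def test_nfc_and_nfd_forms_match():
    nfc = unicodedata.normalize("NFC", "Caf\u00e9")
    nfd = unicodedata.normalize("NFD", "Caf\u00e9")
    assert nfc != nfd
    assert normalize_text(nfc) == normalize_text(nfd)


def test_compatibility_forms_fold():
    # NFKC maps fullwidth letters to their ASCII counterparts.
    assert normalize_text("\uff21\uff22") == "ab"


def test_flags_can_be_disabled():
    assert normalize_text("ABC", casefold=False) == "ABC"
    assert normalize_text("a  b", collapse_ws=False) == "a  b"
```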

Verification

  • pytest test/unit_test/headless/test_text_normalize_batch.py → 10 passed.
  • ruff check je_auto_control/ clean; pylint 10.00/10; bandit clean; radon CC clean.
  • Package stays Qt-free.

fuzzy and search_index.tokenize only lowercase their input, and OCR
find_text_matches only applies .lower() plus a substring check, so the
same text in different Unicode forms (NFC/NFD), with accents, or with
smart quotes compares unequal. Add normalize_text (NFKC + casefold +
whitespace fold), deaccent, normalize_quotes, fold_whitespace, and
slugify as the canonicalisation layer to run before matching. Wired
through facade, executor (AC_normalize_text / AC_slugify), MCP, and the
Script Builder with a headless test batch and EN/Zh docs.
@codacy-production


Up to standards ✅

🟢 Issues: 0 new issues

🟢 Metrics: complexity 33 · duplication 0


@JE-Chen JE-Chen merged commit 331a6c2 into dev Jun 21, 2026
16 checks passed
@JE-Chen JE-Chen deleted the feat/text-normalize-batch branch June 21, 2026 20:37
