Add string-distance similarity metrics #307

Merged
JE-Chen merged 2 commits into dev from feat/text-similarity-batch on Jun 21, 2026

Conversation

JE-Chen (Member) commented Jun 21, 2026

What

fuzzy exposed only difflib's gestalt ratio. This adds the edit-distance and token-set metrics it lacks.
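For context, the gestalt ratio mentioned above is what the standard library's `difflib.SequenceMatcher.ratio()` returns. The sketch below is not taken from this PR; the wrapper name `gestalt_ratio` is hypothetical and only illustrates the baseline that the new metrics complement.

```python
from difflib import SequenceMatcher


def gestalt_ratio(a: str, b: str) -> float:
    """Return difflib's ratio: 2 * M / T.

    M is the number of matched characters and T is the combined length
    of both strings. `gestalt_ratio` is a hypothetical name for this sketch.
    """
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    # "bcd" is the shared block: 2 * 3 / (4 + 4)
    print(gestalt_ratio("abcd", "bcde"))
```

The ratio rewards long shared blocks, so it says nothing directly about how many single-character edits separate two strings, which is the gap the edit-distance metrics fill.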

  • levenshtein / damerau_levenshtein (adjacent-transposition aware): integer edit distances.
  • jaro / jaro_winkler: similarities in [0, 1] (Jaro-Winkler boosts a common prefix; standard for short labels).
  • jaccard / dice: character-n-gram set similarity.
  • similarity(a, b, *, metric): unified entry point; normalizes edit distances to 1 - d/max_len so every metric returns a value in [0, 1].
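The edit-distance bullets and the 1 - d/max_len normalization can be sketched as follows. This is an illustrative, stand-alone version and not the code merged in this PR. Two assumptions are labeled in the comments: the transposition-aware distance uses the restricted "optimal string alignment" variant (the PR does not say which variant it implements), and two empty strings are treated as identical. The names `osa_distance` and `normalized_similarity` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Transposition-aware distance (optimal string alignment variant).

    Assumption: this restricted variant stands in for the PR's
    damerau_levenshtein; an adjacent swap counts as a single edit.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            swapped = (i > 1 and j > 1
                       and a[i - 1] == b[j - 2]
                       and a[i - 2] == b[j - 1])
            if swapped:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def normalized_similarity(distance: int, a: str, b: str) -> float:
    """Map an edit distance to [0, 1] as 1 - d / max_len."""
    max_len = max(len(a), len(b))
    if max_len == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - distance / max_len


if __name__ == "__main__":
    d = levenshtein("kitten", "sitting")
    print(d, normalized_similarity(d, "kitten", "sitting"))
    print(levenshtein("ab", "ba"), osa_distance("ab", "ba"))
```

Dividing by the longer length keeps the result in [0, 1] because neither distance can exceed max_len.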

Pairs with normalize_text (round-10). Extracting helpers keeps the cyclomatic complexity (CC) of the Jaro and Damerau implementations low.
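A minimal sketch of Jaro and Jaro-Winkler, written independently of this PR, shows why splitting the work into small steps helps with complexity: matching, transposition counting, and the prefix boost are separate concerns. The parameters `prefix_scale=0.1` and `max_prefix=4` are the conventional Winkler defaults and are an assumption here, not values confirmed by the PR.

```python
def jaro(a: str, b: str) -> float:
    """Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a)
            + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix.

    Assumption: conventional defaults (scale 0.1, prefix capped at 4).
    """
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


if __name__ == "__main__":
    print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

MARTHA/MARHTA and DIXON/DICKSONX are widely published reference pairs for these metrics; whether the PR's test batch uses exactly these vectors is not stated.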

Layers

  • Headless core: utils/text_similarity/ (pure stdlib, zero PySide6).
  • Facade: 7 symbols + __all__.
  • Executor: AC_text_similarity.
  • MCP: ac_text_similarity (read-only).
  • Script Builder: under Data.
  • Tests: test/unit_test/headless/test_text_similarity_batch.py (9 tests; known Jaro/Jaro-Winkler vectors).
  • Docs: v99_features_doc.rst (EN + Zh) + toctrees + 3 README What's-new sections.

Verification

  • pytest test/unit_test/headless/test_text_similarity_batch.py → 9 passed.
  • ruff check je_auto_control/ clean; pylint 10.00/10; bandit clean; radon CC clean.
  • Package stays Qt-free.

fuzzy exposes only difflib's gestalt ratio. Add the edit-distance and
token-set metrics it lacks: levenshtein, damerau_levenshtein
(transposition-aware), jaro / jaro_winkler, char-n-gram jaccard / dice, and a unified
similarity() that normalises every metric to [0, 1]. Pairs with
normalize_text. Wired through facade, executor (AC_text_similarity), MCP,
and the Script Builder with a headless test batch and EN/Zh docs.
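The character-n-gram set metrics named in the commit message can be illustrated with a short stand-alone sketch. It is not the merged code: the default n-gram size of 2, the handling of strings shorter than n, and the helper name `char_ngrams` are all assumptions made for this example.

```python
def char_ngrams(text: str, n: int = 2) -> set:
    """Return the set of character n-grams of `text`.

    Assumption: a non-empty string shorter than n yields itself as the
    only gram, so very short labels still compare sensibly.
    """
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int = 2) -> float:
    """|A ∩ B| / |A ∪ B| over character n-gram sets."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x and not y:
        return 1.0  # assumption: two empty strings count as identical
    return len(x & y) / len(x | y)


def dice(a: str, b: str, n: int = 2) -> float:
    """2 * |A ∩ B| / (|A| + |B|) over character n-gram sets."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x and not y:
        return 1.0  # assumption: two empty strings count as identical
    return 2 * len(x & y) / (len(x) + len(y))


if __name__ == "__main__":
    # "night" and "nacht" share only the bigram "ht".
    print(jaccard("night", "nacht"))  # 1 shared / 7 distinct bigrams
    print(dice("night", "nacht"))     # 2 * 1 / (4 + 4)
```

Both metrics are already bounded by [0, 1], so unlike the edit distances they need no extra normalization step.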

codacy-production Bot commented Jun 21, 2026

Up to standards ✅

🟢 Issues: 0 new issues

🟢 Metrics: 68 complexity · 0 duplication


JE-Chen merged commit 481bbee into dev on Jun 21, 2026
2 checks passed
JE-Chen deleted the feat/text-similarity-batch branch on June 21, 2026 23:10