Add near-duplicate text detection (SimHash / MinHash)#308

Merged: JE-Chen merged 2 commits into dev from feat/near-dup-batch on Jun 21, 2026

@JE-Chen JE-Chen commented Jun 21, 2026

Member

What

fuzzy_dedupe compares every pair (O(n²)) and produces no stable fingerprint, and image_dedup only hashes pixels. This PR adds the text analog.

  • simhash(text, *, bits=64) + hamming_distance (shared with image_dedup) → near-dup by bit distance.
  • near_duplicates(texts, *, max_distance=3) → clusters of indices (a partition).
  • minhash_signature / minhash_similarity → estimated Jaccard for set-overlap dedup.
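As a rough illustration of the first two bullets, here is a minimal SimHash and clustering sketch. It is not the merged code. The function names and keyword arguments mirror the bullets, but the tokenizer (`\w+` on lowercased text), the equal token weights, the union-find grouping, and the pairwise fingerprint comparison are all assumptions made for brevity. The real `utils/near_dup/` module may tokenize, weight, and cluster differently.

```python
import hashlib
import re
from typing import Dict, List, Sequence


def simhash(text: str, *, bits: int = 64) -> int:
    """Return a SimHash fingerprint of ``text`` as a non-negative int."""
    if bits <= 0 or bits % 8 != 0 or bits > 512:
        raise ValueError("bits must be a positive multiple of 8, at most 512")
    weights = [0] * bits
    # Assumption: word tokens, lowercased, each with weight 1.
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.blake2b(
            token.encode("utf-8"), digest_size=bits // 8
        ).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(bits):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicates(
    texts: Sequence[str], *, max_distance: int = 3
) -> List[List[int]]:
    """Partition indices of ``texts`` into clusters of near-duplicates."""
    fingerprints = [simhash(text) for text in texts]
    parent = list(range(len(texts)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise comparison of fingerprints, kept simple for the sketch.
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if hamming_distance(fingerprints[i], fingerprints[j]) <= max_distance:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[max(root_i, root_j)] = min(root_i, root_j)

    clusters: Dict[int, List[int]] = {}
    for i in range(len(texts)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

Every index lands in exactly one cluster, singletons included, which is what "a partition" implies. Identical texts always share a cluster because their fingerprints are equal (distance 0).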

Uses a fixed blake2b hash (not the salted built-in hash()) so fingerprints are deterministic across runs. hamming_distance is not re-exported from the facade (image_dedup already exports it with identical int semantics).
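The same fixed-hash idea carries over to MinHash. The sketch below is illustrative only and assumes details the description does not state: the input is an iterable of string tokens, `num_perm` (a hypothetical parameter name) sets the signature length, and each "permutation" is simulated by prefixing a seed to the token before hashing with `hashlib.blake2b`. Because nothing depends on the built-in `hash()`, the signature does not change between interpreter runs.

```python
import hashlib
from typing import Iterable, List, Sequence

_MAX_HASH = (1 << 64) - 1  # placeholder minimum for an empty token set


def _seeded_hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of ``token`` under a given seed."""
    data = seed.to_bytes(4, "big") + token.encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens: Iterable[str], *, num_perm: int = 64) -> List[int]:
    """Return one minimum hash per seed over the set of distinct tokens."""
    unique = set(tokens)
    return [
        min((_seeded_hash(seed, token) for token in unique), default=_MAX_HASH)
        for seed in range(num_perm)
    ]


def minhash_similarity(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Estimate Jaccard similarity as the fraction of matching positions."""
    if not sig_a or len(sig_a) != len(sig_b):
        raise ValueError("signatures must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Token order and repeats do not affect the signature, since only the set of distinct tokens is hashed. A longer signature tightens the estimate at the cost of more hashing per document.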

Layers

  • Headless core: utils/near_dup/ (pure stdlib hashlib/re, zero PySide6).
  • Facade: simhash, near_duplicates, minhash_signature, minhash_similarity + __all__.
  • Executor: AC_simhash, AC_near_duplicates.
  • MCP: ac_simhash, ac_near_duplicates (read-only).
  • Script Builder: both under Data.
  • Tests: test/unit_test/headless/test_near_dup_batch.py (8 tests).
  • Docs: v100_features_doc.rst (EN + Zh) + toctrees + 3 README What's-new sections.

Verification

  • pytest test/unit_test/headless/test_near_dup_batch.py → 8 passed.
  • ruff check je_auto_control/ clean; pylint 10.00/10; bandit clean; radon CC clean.
  • Package stays Qt-free.

fuzzy_dedupe is O(n²) pairwise with no stable fingerprint and image_dedup
only hashes pixels. Add the text analog: simhash + hamming_distance +
near_duplicates clustering, and minhash_signature + minhash_similarity for
estimated Jaccard. Uses a fixed blake2b hash for deterministic
fingerprints. hamming_distance is shared with image_dedup (not re-exported
from the facade to avoid a name clash). Wired through facade, executor
(AC_simhash / AC_near_duplicates), MCP, and the Script Builder with a
headless test batch and EN/Zh docs.

codacy-production Bot commented Jun 21, 2026

Up to standards ✅

🟢 Issues: 0 new issues

🟢 Metrics: complexity 52 · duplication 0

@JE-Chen JE-Chen merged commit e57e4c8 into dev Jun 21, 2026
16 checks passed
@JE-Chen JE-Chen deleted the feat/near-dup-batch branch June 21, 2026 23:29