feat(search): add unicode character removal for fuzzy matching#4360

Open
4yinn wants to merge 3 commits into Flow-Launcher:dev from 4yinn:feature/4149-diacritics-insensitive-search

Conversation

@4yinn 4yinn commented Mar 15, 2026

Description

Added a preprocessing step to FuzzySearch to normalize Unicode characters and remove diacritics before matching.

This allows searches to be accent-insensitive, improving usability. For example, searching for camera will also match câmera.
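
For illustration only, the preprocessing described above can be sketched in Python, the same language the review bot later uses to simulate it. The function name `remove_accents` is hypothetical; the C# change itself is summarized in this PR as `RemoveAccents(string)` built on FormD/FormC normalization and a NonSpacingMark filter, and this sketch mirrors that description rather than the actual diff.

```python
import unicodedata


def remove_accents(text: str) -> str:
    """Strip combining marks so 'câmera' compares equal to 'camera'."""
    # NFD splits precomposed letters into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" is Unicode's nonspacing mark class (accents, tildes, ...).
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # NFC recomposes whatever is left into its canonical composed form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(remove_accents("câmera"))
```

Letters that are not built from a base letter plus a combining mark, such as ø or ß, have no canonical decomposition and pass through this approach unchanged.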

Related Issue

Closes #4149


Summary by cubic

Makes fuzzy search accent-insensitive by normalizing and stripping Unicode diacritics before matching. Adds null/empty input checks to avoid unnecessary processing and false matches (relates to #4149).

  • Summary of changes
    • Changed: FuzzyMatch now trims and removes diacritics from both query and target before translation/matching; minor formatting tweaks only.
    • Added: RemoveAccents(string) using Unicode normalization (FormD) and filtering NonSpacingMark, then re-composing (FormC); early return when query is whitespace or stringToCompare is empty after preprocessing.
    • Removed: None.
    • Memory: Small per-call allocations for normalized strings and a short-lived StringBuilder; early check avoids extra work; no long-lived impact.
    • Security: No new external inputs or calls; stricter input validation reduces edge-case risks; scope limited to matching behavior.
    • Tests: No new unit tests.

Written for commit f905530. Summary will update on new commits.

Introduced a string preprocessing step in FuzzySearch that removes unicode characters. This improves the search experience by allowing users to find results regardless of accents or special formatting.
@github-actions github-actions bot added this to the 2.2.0 milestone Mar 15, 2026
gitstream-cm bot commented Mar 15, 2026

🥷 Code experts: Jack251970

Jack251970 has most 👩‍💻 activity in the files.
Jack251970 has most 🧠 knowledge in the files.

Flow.Launcher.Infrastructure/StringMatcher.cs

Activity based on git-commit: Jack251970 (monthly chart, OCT through MAR; values not recovered)

Knowledge based on git-blame:
Jack251970: 100%

gitstream-cm bot commented Mar 15, 2026

Be a legend 🏆 by adding a before and after screenshot of the changes you made, especially if they are around UI/UX.

@coderabbitai coderabbitai bot added the enhancement New feature or request label Mar 15, 2026
Contributor

coderabbitai bot commented Mar 15, 2026

Walkthrough

Accent-insensitive matching was added to FuzzyMatch by normalizing and stripping diacritical marks from both the query and comparison strings via a new RemoveAccents helper using Unicode normalization.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Accent-Insensitive Matching (Flow.Launcher.Infrastructure/StringMatcher.cs) | Adds RemoveAccents(string) using Unicode normalization (FormD/FormC) and NonSpacingMark filtering; applies it to query and string-to-compare at start of FuzzyMatch. Adds System.Globalization and System.Text usings. Minor formatting/argument adjustments only. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Linked Issues check | ❓ Inconclusive | The PR partially addresses issue #4149 by implementing accent-insensitive matching, but lacks the requested toggle option allowing users to choose whether to ignore diacritics. | Clarify whether the diacritics-insensitive matching is always enabled or if a user-configurable toggle was implemented as originally requested in #4149. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: adding Unicode character removal (diacritics) for fuzzy matching, which is the primary feature introduced in this PR. |
| Out of Scope Changes check | ✅ Passed | All changes are focused on implementing diacritics-insensitive fuzzy matching in StringMatcher.cs with supporting Unicode normalization utilities; no unrelated changes detected. |
| Description check | ✅ Passed | The pull request description clearly relates to the changeset, explaining the accent-insensitive matching feature and referencing the closed issue #4149. |

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
Flow.Launcher.Infrastructure/StringMatcher.cs (1)

66-73: Consider adding an option to toggle diacritics-insensitive matching.

The linked issue #4149 specifically requests an option to enable/disable diacritics-insensitive matching, similar to the Everything app. The current implementation always removes accents with no way to opt out.

If this is an intentional scope reduction, consider documenting it. Otherwise, you could add a setting flag:

if (_settings.IgnoreDiacritics)
{
    query = RemoveAccents(query);
    stringToCompare = RemoveAccents(stringToCompare);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Flow.Launcher.Infrastructure/StringMatcher.cs` around lines 66 - 73,
FuzzyMatch currently always calls RemoveAccents on query and stringToCompare
which forces diacritics-insensitive matching; add a configurable toggle (e.g. a
boolean setting like _settings.IgnoreDiacritics) and only call RemoveAccents
when that flag is true, updating FuzzyMatch's logic and any settings class to
expose the option so consumers can enable/disable diacritics-insensitive
matching; reference the RemoveAccents and FuzzyMatch methods and the MatchOption
usage when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Flow.Launcher.Infrastructure/StringMatcher.cs`:
- Around line 249-250: In StringMatcher.cs (inside the method using
Char.GetUnicodeCategory), rename the local variable unicodedCategory to
unicodeCategory to fix the typo and consistency, and also replace the type alias
use from Char.GetUnicodeCategory(c) to char.GetUnicodeCategory(c) for C# keyword
consistency; update any subsequent references to the variable name
(unicodeCategory) accordingly.
- Around line 72-73: The matching indices are computed on the post-RemoveAccents
normalized string but applied to the original input, causing mismatches when
combining marks change string length; update the flow in the method that calls
RemoveAccents to 1) keep the original stringToCompare (save
originalStringToCompare before normalization), 2) produce an index mapping from
normalized-to-original positions when RemoveAccents transforms the string
(similar to the existing TranslationMapping for alphabet translation), and 3)
before returning, remap all indices in indexList and any positions inside
acronymMatchData using that accent-removal mapping (in addition to the existing
TranslationMapping) so returned indices align with the original string. Ensure
the mapping logic is used wherever indices are translated back (same spot where
TranslationMapping is applied).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0172967a-b14a-495a-9c63-e4cd5ee587ab

📥 Commits

Reviewing files that changed from the base of the PR and between c9dbc33 and d2f8663.

📒 Files selected for processing (1)
  • Flow.Launcher.Infrastructure/StringMatcher.cs

Comment on lines +72 to +73
query = RemoveAccents(query);
stringToCompare = RemoveAccents(stringToCompare);
Contributor

@coderabbitai coderabbitai bot Mar 15, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for MatchData usage to understand the highlighting mechanism
rg -n -C3 'MatchData' --type=cs

Repository: Flow-Launcher/Flow.Launcher

Length of output: 12041


🏁 Script executed:

#!/bin/bash
# Find the context around indexList in StringMatcher.cs
rg -n 'indexList' Flow.Launcher.Infrastructure/StringMatcher.cs -B2 -A2

Repository: Flow-Launcher/Flow.Launcher

Length of output: 1657


🏁 Script executed:

#!/bin/bash
# Find where MatchResult is constructed in StringMatcher.cs to see all return paths
rg -n 'new MatchResult\|return new' Flow.Launcher.Infrastructure/StringMatcher.cs -B1 -A1

Repository: Flow-Launcher/Flow.Launcher

Length of output: 53


🏁 Script executed:

#!/bin/bash
# Get the full context of StringMatcher.cs around lines 60-90 to understand the flow
sed -n '60,90p' Flow.Launcher.Infrastructure/StringMatcher.cs | cat -n

Repository: Flow-Launcher/Flow.Launcher

Length of output: 2168


🏁 Script executed:

#!/bin/bash
# Get context around line 79 where translationMapping is mentioned
sed -n '70,150p' Flow.Launcher.Infrastructure/StringMatcher.cs | cat -n

Repository: Flow-Launcher/Flow.Launcher

Length of output: 4371


🏁 Script executed:

#!/bin/bash
# Check how FuzzyMatch is called and what string is highlighted
# Look at a specific plugin example
sed -n '100,110p' Plugins/Flow.Launcher.Plugin.Explorer/Search/ResultManager.cs | cat -n

Repository: Flow-Launcher/Flow.Launcher

Length of output: 703


🏁 Script executed:

#!/bin/bash
# Check the Result model to understand how TitleHighlightData is used
rg -n 'TitleHighlightData' Flow.Launcher.Plugin/SharedModels/ -B2 -A2

Repository: Flow-Launcher/Flow.Launcher

Length of output: 53


🏁 Script executed:

#!/bin/bash
# Search for where TitleHighlightData is actually applied (UI rendering)
rg -n 'TitleHighlightData' --type=cs | grep -v 'test\|Test' | head -20

Repository: Flow-Launcher/Flow.Launcher

Length of output: 1514


🏁 Script executed:

#!/bin/bash
# Find RemoveAccents method implementation
rg -n 'private.*RemoveAccents|public.*RemoveAccents' Flow.Launcher.Infrastructure/StringMatcher.cs -A10

Repository: Flow-Launcher/Flow.Launcher

Length of output: 566


🏁 Script executed:

#!/bin/bash
# Check test cases for accent handling
find . -name '*StringMatcher*Test*' -o -name '*Test*StringMatcher*' | head -5

Repository: Flow-Launcher/Flow.Launcher

Length of output: 53


🏁 Script executed:

#!/bin/bash
# Search for test files
fd -e cs -p '*test*' | rg -i 'stringmatcher|fuzzymatch' | head -10

Repository: Flow-Launcher/Flow.Launcher

Length of output: 356


🏁 Script executed:

#!/bin/bash
# Get the complete RemoveAccents method
sed -n '240,260p' Flow.Launcher.Infrastructure/StringMatcher.cs | cat -n

Repository: Flow-Launcher/Flow.Launcher

Length of output: 1034


🏁 Script executed:

#!/bin/bash
# Search for test files properly
fd --glob '**/*Test*.cs' | rg -i 'stringmatcher|fuzzymatch'

Repository: Flow-Launcher/Flow.Launcher

Length of output: 109


🏁 Script executed:

#!/bin/bash
# Also check if there are unit test files
find . -type f -name '*.cs' | xargs grep -l 'RemoveAccents\|FuzzyMatch' | grep -i test

Repository: Flow-Launcher/Flow.Launcher

Length of output: 111


🏁 Script executed:

#!/bin/bash
# Check the test file for accent/diacritic test cases
rg -n 'RemoveAccents\|accent\|diacritic\|café\|é' Flow.Launcher.Test/FuzzyMatcherTest.cs -i

Repository: Flow-Launcher/Flow.Launcher

Length of output: 53


🏁 Script executed:

#!/bin/bash
# Get the full FuzzyMatcherTest.cs to understand test coverage
wc -l Flow.Launcher.Test/FuzzyMatcherTest.cs

Repository: Flow-Launcher/Flow.Launcher

Length of output: 113


🏁 Script executed:

#!/bin/bash
# Look at the test file to see if accents are tested
sed -n '1,100p' Flow.Launcher.Test/FuzzyMatcherTest.cs | cat -n

Repository: Flow-Launcher/Flow.Launcher

Length of output: 4471


🏁 Script executed:

#!/bin/bash
# Search through the entire test file for any accent or diacritic related tests
cat Flow.Launcher.Test/FuzzyMatcherTest.cs | grep -n -i 'accent\|diacritic\|café\|é\|ñ\|à\|ü' | head -20

Repository: Flow-Launcher/Flow.Launcher

Length of output: 53


🏁 Script executed:

#!/bin/bash
# Let's check what's in the second half of the test file
sed -n '200,300p' Flow.Launcher.Test/FuzzyMatcherTest.cs | cat -n

Repository: Flow-Launcher/Flow.Launcher

Length of output: 7418


🏁 Script executed:

#!/bin/bash
# Create a simple test to understand RemoveAccents behavior
python3 << 'EOF'
# Simulate C# NormalizationForm.FormD decomposition and accent removal

import unicodedata

test_strings = [
    "café",      # Regular accent
    "naïve",     # Diaeresis
    "Zürich",    # Umlaut
    "São Paulo", # Tilde
    "e̊",         # e with combining ring
]

for s in test_strings:
    # FormD = Decomposed
    decomposed = unicodedata.normalize('NFD', s)
    
    # Remove combining marks (NonSpacingMark category)
    removed = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
    
    # FormC = Composed (what the code does)
    recomposed = unicodedata.normalize('NFC', removed)
    
    print(f"Original:     '{s}' (length: {len(s)})")
    print(f"Decomposed:   '{decomposed}' (length: {len(decomposed)})")
    print(f"Accents removed: '{removed}' (length: {len(removed)})")
    print(f"Recomposed:   '{recomposed}' (length: {len(recomposed)})")
    print(f"Match indices broken? {len(s) != len(recomposed)}")
    print()
EOF

Repository: Flow-Launcher/Flow.Launcher

Length of output: 916


Index mismatch when original string contains combining marks or is in decomposed form.

When RemoveAccents normalizes the string to FormD (decomposed), removes combining marks, then recomposes to FormC, the string length may change. This happens whenever the original contains standalone combining marks, such as the combining ring above (U+030A), because each mark occupies its own position in the original string and disappears from the normalized one. The indices computed during matching are based on the post-RemoveAccents string, but are used to highlight the original input string. If the original string was in decomposed form or contains combining marks, the indices will be misaligned.

Unlike the translationMapping for alphabet translation (created after accent removal), there is no equivalent mapping to adjust indices from the normalized string back to the original. No test cases currently cover this scenario.

Consider:

  1. Storing the original stringToCompare before normalization
  2. Creating an index mapping for accent removal changes (similar to TranslationMapping)
  3. Applying this mapping to both indexList and acronymMatchData before returning
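
The three steps above can be sketched as follows. This is a minimal illustration in Python, not the C# change itself: `remove_accents_with_mapping` is a hypothetical name, the sketch skips the final FormC recomposition for simplicity, and Python indexes strings by code point whereas C# indexes by UTF-16 code unit, so a port would need to account for surrogate pairs.

```python
import unicodedata


def remove_accents_with_mapping(text):
    """Return (stripped, mapping); mapping[i] is the index in `text`
    of the original character that produced stripped[i]."""
    out_chars = []
    mapping = []
    for original_index, ch in enumerate(text):
        # Decompose one original character at a time so every surviving
        # piece can remember where it came from.
        for piece in unicodedata.normalize("NFD", ch):
            if unicodedata.category(piece) != "Mn":  # keep non-mark pieces only
                out_chars.append(piece)
                mapping.append(original_index)
    return "".join(out_chars), mapping


if __name__ == "__main__":
    original = "Cafe\u0301 menu"  # decomposed form: 'e' followed by U+0301
    stripped, mapping = remove_accents_with_mapping(original)
    start = stripped.index("menu")
    # Translate match positions back to positions in the original string.
    highlight = [mapping[i] for i in range(start, start + len("menu"))]
    print(stripped, highlight)
```

Matching would run on `stripped`, and every index in the result (the equivalents of indexList and acronymMatchData here) would be passed through `mapping` before being returned for highlighting.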

Member

@4yinn Do you think we need to handle this?

Contributor

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Author

From my perspective as a Flow Launcher user, I think this feature is quite necessary. Many times when I search for programs that contain accents, such as “Câmera”, I tend to type the word without accents, like “camera”, because that’s how I usually type when searching in Windows.

Because of this, I often have difficulty finding applications in Flow when their names contain accents. The same issue was mentioned in the GitHub issue opened by another Brazilian user, which is why I decided to implement this improvement.

Sorry for the late response.

Member

I'm sorry, it seems you misunderstood me. What I meant was that from AI's review, this removal may change the length of the string which can cause some issues like incorrect highlights.

Author

Sorry for the late reply. hmm ok. Do u think this implementation isn’t necessary, or does it just need further optimization?

Member

@4yinn I'm not sure if this Removal will bring about any changes to the string. Since this kind of character does not exist in my language, you need to confirm this issue. If so, improvements are needed here, that is, we need to map the string back and handle the content related to highlighting.

Author

Yeah, my language is pretty boring when it comes to accents. Sorry for the late reply again — I was working. I’ll check this now and look for improvements.

Member

Thanks for your reply!

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

2 issues found across 1 file

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="Flow.Launcher.Infrastructure/StringMatcher.cs">

<violation number="1" location="Flow.Launcher.Infrastructure/StringMatcher.cs:72">
P1: `query` is not revalidated after accent stripping, so mark-only Unicode input can become empty and crash at `querySubstrings[0]`.</violation>

<violation number="2" location="Flow.Launcher.Infrastructure/StringMatcher.cs:73">
P2: Match indices are computed after accent-stripping but never mapped back to original string positions, causing incorrect highlight offsets for decomposed Unicode text.</violation>
</file>
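
Violation 1 can be illustrated with a small Python sketch: a query made only of combining marks is not whitespace before stripping, yet is empty afterwards, so the emptiness check has to run again after preprocessing. `prepare_query` and `remove_accents` are hypothetical names used for illustration; they do not appear in the PR.

```python
import unicodedata


def remove_accents(text):
    """Drop nonspacing marks (category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def prepare_query(raw_query):
    """Return the accent-stripped query, or None when nothing searchable remains."""
    if raw_query is None:
        return None
    # Validate AFTER stripping: the input may be non-empty before it
    # and empty afterwards.
    query = remove_accents(raw_query).strip()
    return query or None


if __name__ == "__main__":
    print(prepare_query("\u0301"))   # mark-only input
    print(prepare_query("câmera"))
```

A caller would treat `None` as "no match" and return early instead of indexing into the split query.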


Jack251970 and others added 2 commits March 16, 2026 01:17
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
@Jack251970 Jack251970 added the review in progress Indicates that a review is in progress for this PR label Mar 15, 2026
@prlabeler prlabeler bot added the bug Something isn't working label Mar 15, 2026
Member

@Jack251970 Jack251970 left a comment

Thanks for your contribution! But from AI's review, this removal may change the length of the string which can cause some issues like incorrect highlights.

@Jack251970 Jack251970 removed the enhancement New feature or request label Mar 15, 2026
@coderabbitai coderabbitai bot removed the bug Something isn't working label Mar 15, 2026
@VictoriousRaptor
Contributor

Please avoid including too many code formatting changes, so that we can trace changes in the future.

Member

@jjw24 jjw24 left a comment

Thank you for your PR.

StringMatcher is performance-critical, so we must minimize additional overhead. Please address the following:

  1. Consolidate Loops: The current implementation performs an initial loop to identify and remove Unicode characters, followed by a second loop for the main matching logic. Please merge it into the main loop to avoid redundant iteration.

  2. Conditional Unicode Processing: We should make Unicode removal optional—for instance, by adding a toggle for non-English languages. This prevents unnecessary processing when the feature isn't required.
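
One possible shape for the second request, sketched in Python under stated assumptions: the `ignore_diacritics` flag stands in for a user setting that does not exist in this PR, and the ASCII check is an extra shortcut that is not part of the review comments. The sketch does not merge the stripping into the main matching loop as item 1 asks; it only shows two cheap ways to skip the extra pass when it cannot change the result.

```python
import unicodedata


def remove_accents(text):
    """Drop nonspacing marks (category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def preprocess(text, ignore_diacritics):
    """Strip accents only when the (hypothetical) setting is on and it can matter."""
    # Toggle off: behave exactly as before the feature existed.
    if not ignore_diacritics:
        return text
    # Pure-ASCII text contains no combining marks or accented letters,
    # so normalization would return it unchanged; skip the work.
    if text.isascii():
        return text
    return remove_accents(text)


if __name__ == "__main__":
    print(preprocess("câmera", ignore_diacritics=True))
    print(preprocess("câmera", ignore_diacritics=False))
```

With the toggle off or with ASCII-only input, the function returns the input string itself without allocating anything.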

Author

4yinn commented Mar 16, 2026

Yes, I can see the problem. I'll think about a way to implement it with better optimizations.

@Jack251970 Jack251970 added the enhancement New feature or request label Mar 17, 2026
Labels

enhancement New feature or request review in progress Indicates that a review is in progress for this PR

Development

Successfully merging this pull request may close these issues.

[Enhancement] Allow for searching without diacritics match (e.g., “camera” = “câmera”)

4 participants