feat(core)!: more explicit handling of case-sensitivity in dictionaries#2630
feat(core)!: more explicit handling of case-sensitivity in dictionaries#263086xsk wants to merge 159 commits intoAutomattic:masterfrom
Conversation
`SpellCheck` shouldn't handle capitalization if `OrthographicConsistency` is going to do it anyway.
- Splits `WordId` into `CanonicalWordId` and `CaseFoldedWordId`. - Updates dictionary functions to more explicitly handle casing. There are now functions to get a specific word case-sensitively, multiple words case-insensitively, and ditto but merge all metadata (which was the old behavior). - Fixes issue where `SpellCheck` would sometimes mark words as incorrect if an identical entry with different casing existed in the dictionary (e.g. OS, PR, etc.). - Makes `SpellCheck` no longer care about casing, since that is handled by `OrthographicConsistency`.
…mattic#2476)" This partially reverts commit 5230d6a. Returns the word to the dictionary, since removing it should no longer be necessary.
Since casing-related issues are now handled by `OrthographicConsistency`, not `SpellCheck`.
Expands the criteria in which `OrthographicConsistency` will lint for incorrect capitalization. Makes the `no_improper_suggestion_for_macos` test pass.
Allow all word casing/orthography that are defined in the dictionary. If the dictionary contains the exact word, `OrthographicConsistency` will skip it.
Both variants are defined in the dictionary, and appear to be valid in this case.
The test would expect 'Al' to be linted by `OrthographicConsistency` for not being all-caps.
Remove needless borrow.
|
Does it handle the same word having multiple This happens a lot more than expected and causes some bugs of this type to crop up. I had a PR for it for many months that just tracked them but didn't really do anything with them as I wasn't sure what was and wasn't planned etc. |
|
I don't recall everything perfectly at the moment but both I think both cases are possible and only one is handled:
I believe you're talking about 1. which works but is behind a few bugs. But 2. is distinct. I found my old PR: #1035 Apologies if I've mixed up any concepts myself - it's been a while (-: |
|
Oh, I see what you mean now. Yeah no, at the moment I think this PR mostly sticks with the old behavior where once a I did change the field to store a I'll take another look though. I wasn't too certain with that change as is, and the PR you mentioned does bring up an issue I haven't even considered. |
If I recall correctly the old behaviour is for each new one to blindly stomp on the previous
Yes I never took on the case folding as I was worried about stepping on the toes of the main devs thinking it might've been an intentional design choice plus I didn't find out it was definitely case folding until months after I did the multi-
It looks like you're fixing up a whole bunch of sticky issues at once - good stuff! |
Support a word having multiple words that it's derived from.
I could be wrong, but I believe it will just keep the first value that was set, since in derived_from: self.derived_from.or(other.derived_from),...it sets the I threw together the changes so that it now supports a word having multiple |
Cool. I remember when I was working on that stuff that it turned out I missed some cases where a word could have multiple |
Avoid unnecessary cloning when using `DictWordMetadata::append`. This reduces the performance regression introduced by storing a Vec inside `DictWordMetadata`.
Don't fall back to `WordMap`/`MutableDictionary` implementations of `Dictionary` when dynamic dispatch is possible. This should hopefully fix the issue where certain tests weren't getting the suggestions they expected due to the wrong fuzzy matching implementation being used.
Added to `OrthFlags`: - `CASE_FLAGS` (const) - `case_flags` (fn)
Prevent `OrthographicConsistency` from suggesting a conversion to ALLCAPS for words that have multiple valid case-variants.
Add note regarding `Self: Sized` bound in some `Dictionary` functions.
Remove redundant code introduced by merge.
Derive `Ord` for `CanonicalWordId` and `CaseFoldedWordId`.
Use `BTreeSet` instead of `Vec` as the underlying type for `dict_word_metadata::DerivedFrom`. This fixes Eq and Ord related implementations for `DictWordMetadata` by removing from consideration the order in which the words were inserted into `derived_from`.

Issues
Fixes #2585
Fixes #3073
Fixes #3080
Related to #1688
Related to #2411
(Probably a few more, this has been a long-running issue.)
Description
WordIdwithCanonicalWordIdandCaseFoldedWordId.CanonicalWordIdhashes the input word as is, without lowercasing or normalization.CaseFoldedWordIdlowercases and normalizes the word before hashing it (this is the current behavior ofWordId).WordIdPairandto make it easier to work withEitherWordIdCanonicalWordId/CaseFoldedWordId.1 functions.get_word_metadata_combinedSpellCheckdespite being present in the dictionary.MakesSpellCheckno longer care about capitalization, fully transferring that responsibility toOrthographicConsistency(which already handles it anyway for the most part).derived_fromfield inDictWordMetadatato contain multiple words.DerivedFromstruct to contain this data without exposing implementation details. The current implementation uses a Vec which allows for the insertion of unique values only.Runningcargo bench --bench parse_essayshowed a performance regression of about ~3% across the board with these changes. I didn't notice any performance issues when testing the modified build ofharper-ls.After some performance work these changes now show a ~7-27% performance improvement on the benchmarks over master, see #2630 (comment).Changes which produced these performance improvements have been split off into their own PRs and merged separately.
How Has This Been Tested?
cargo testChecklist
Footnotes
get_word_metadata_combinedhas been renamed toget_word_metadatain 5a5ad04, preserving the current name. ↩