-
Notifications
You must be signed in to change notification settings - Fork 467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Python spellchecking API #3425
base: main
Are you sure you want to change the base?
Commits on May 25, 2024
-
Refactor: Move some code to new files for reuse
No new code is introduced; only existing code is shuffled around and the functions moved are unchanged as well.
Configuration menu - View commit details
-
Copy full SHA for b28a5a3 - Browse repository at this point
Copy the full SHA b28a5a3View commit details -
Replace
data: str
withcandidates: Sequence[str]
When the spelling dictionaries are loaded, previously the correction line was just stored in memory as a simple text. Through out the code, callers would then have to deal with the `data` attribute, correctly `split()` + `strip()` it. With this change, the dictionary parsing code now encapsulates this problem. The auto-correction works from the assumption that there is only one candidate. This assumption is invariant and seem to be properly maintained in the code. Therefore, we can just pick the first candidate word when doing a correction. In the code, the following name changes are performed: * `Misspelling.data` -> `Misspelling.candidates` * `fixword` -> `candidates` when used for multiple candidates (`fixword` remains for when it is a correction) On performance: Performance-wise, this change moves computation from "checking" time to "startup" time. The performance cost does not appear to be noticeable in my baseline (codespell-project#3419). Though, keep the corpus weakness on the ratio of cased vs. non-cased corrections with multiple candidates in mind. The all lowercase typo is now slightly more expensive (it was passed throughout `fix_case` and fed directly into the `print` in the original code. In the new code, it will always need a `join`). There are still an overweight of lower-case only corrections in general, so the unconditional `.join` alone is not sufficient to affect the performance noticeably.
Configuration menu - View commit details
-
Copy full SHA for 824bd7c - Browse repository at this point
Copy the full SHA 824bd7cView commit details -
Refactor dictionary into a new
Spellchecker
classThis is as close to a 1:1 conversion as possible. It might change whhen we get to designing the API. The callers have been refactored to only perform the lookup once. This was mostly to keep the code more readable. The performance cost does seem noticable, which is unsurprising. This method has a higher cost towards non-matches which is the most common case. This commit causes the performance to drop roughly 10% on its and we are now slower than the goal.
Configuration menu - View commit details
-
Copy full SHA for aa7792f - Browse repository at this point
Copy the full SHA aa7792fView commit details -
Refactor line tokenization to simplify an outer loop
The refactor is a stepping stone towards the next commit where the inner loop is moved to the `Spellchecker`.
Configuration menu - View commit details
-
Copy full SHA for ef5096c - Browse repository at this point
Copy the full SHA ef5096cView commit details -
Rewrite line spellchecking and move most of it into the
Spellchecker
With this rewrite, performance improved slightly and is now down to 7% slower than the baseline (6s vs. 5.6s). There is deliberate an over-indentation left in this commit, since that makes this commit easier to review (without ignoring space changes).
Configuration menu - View commit details
-
Copy full SHA for cd57087 - Browse repository at this point
Copy the full SHA cd57087View commit details -
De-indent loop body (whitespace-only / reformatting-only change)
Deliberately in a separate. There are no functional changes, but there are some reformatting changes (line merges) as a consequence of the de-indent.
Configuration menu - View commit details
-
Copy full SHA for 8bd3517 - Browse repository at this point
Copy the full SHA 8bd3517View commit details -
Support non-regex based tokens for
spellcheck_line
The `Spellchecker` only needs the `group` method from the `re.Match`. With a bit of generics and typing protocols, we can make the `Spellchecker` work with any token type that has a `group()` method. The `codespell` command line tool still assumes `re.Match` but it can get that via its own line tokenizer, so it all works out for everyone.
Configuration menu - View commit details
-
Copy full SHA for 7273c77 - Browse repository at this point
Copy the full SHA 7273c77View commit details -
Speed up spellchecking by ignoring whitespace-only lines
The new API has introduced extra overhead per line being spellchecked. One way of optimizing out this overhead, is to spellcheck fewer lines. An obvious choice here, is to optimize out empty and whitespace-only lines, since they will not have any typos at all (on account of not having any words). A side-effect of this change is that we now spellcheck lines with trailing whitespace stripped. Semantically, this gives the same result (per "whitespace never has typos"). Performance-wise, it is faster in theory because the strings are now shorter (since we were calling `.rstrip()` anyway). In pratice, I am not sure we are going to find any real corpus where the trailing whitespace is noteworthy from a performance point of view. On the performance corpus from codespell-project#3491, this takes out ~0.4s of runtime brining us down to slightly above the 5.6s that made the baseline.
Configuration menu - View commit details
-
Copy full SHA for c4d1738 - Browse repository at this point
Copy the full SHA c4d1738View commit details -
Move
codespell:ignore
check intoSpellchecker
This makes the API automatically avoid some declared false-positives that the command line tool would also filter.
Configuration menu - View commit details
-
Copy full SHA for 3c08c9b - Browse repository at this point
Copy the full SHA 3c08c9bView commit details -
Speed up
codespell:ignore
check by skipping the regex in most casesThe changes to provide a public API had some performance related costs of about 1% runtime. There is no trivial way to offset this any further without undermining the API we are building. However, we can pull performance-related shenanigans to compenstate for the cost introduced. The codespell codebase unsurprisingly spends a vast majority of its runtime in various regex related code such as `search` and `finditer`. The best way to optimize runtime spend in regexes is to not do a regex in the first place, since the regex engine has a rather steep overhead over regular string primitives (that is the cost of flexibility). If the regex rarely matches and there is a very easy static substring that can be used to rule out the match, then you can speed up the code by using `substring in string` as a conditional to skip the regex. This is assuming the regex is used enough for the performance to matter. An obvious choice here falls on the `codespell:ignore` regex, because it has a very distinctive substring in the form of `codespell:ignore`, which will rule out almost all lines that will not match. With this little trick, runtime goes from ~5.6s to ~4.9s on the corpus mentioned in codespell-project#3419.
Configuration menu - View commit details
-
Copy full SHA for ce280c9 - Browse repository at this point
Copy the full SHA ce280c9View commit details -
Refactor: Rename
spellchecker.py
to_spellchecker.py
Per review comment.
Configuration menu - View commit details
-
Copy full SHA for ae0e8d2 - Browse repository at this point
Copy the full SHA ae0e8d2View commit details
Commits on Jun 1, 2024
-
Configuration menu - View commit details
-
Copy full SHA for 37d4b38 - Browse repository at this point
Copy the full SHA 37d4b38View commit details