Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Python spellchecking API #3425

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Commits on May 25, 2024

  1. Refactor: Move some code to new files for reuse

    No new code is introduced; only existing code is shuffled around and
    the functions moved are unchanged as well.
    nthykier committed May 25, 2024
    Configuration menu
    Copy the full SHA
    b28a5a3 View commit details
    Browse the repository at this point in the history
  2. Replace data: str with candidates: Sequence[str]

    When the spelling dictionaries are loaded, previously the correction
    line was just stored in memory as a simple text. Through out the code,
    callers would then have to deal with the `data` attribute, correctly
    `split()` + `strip()` it. With this change, the dictionary parsing
    code now encapsulates this problem.
    
    The auto-correction works from the assumption that there is only one
    candidate. This assumption is invariant and seem to be properly
    maintained in the code. Therefore, we can just pick the first
    candidate word when doing a correction.
    
    In the code, the following name changes are performed:
    
     * `Misspelling.data` -> `Misspelling.candidates`
     * `fixword` -> `candidates` when used for multiple candidates
       (`fixword` remains for when it is a correction)
    
    On performance:
    
    Performance-wise, this change moves computation from "checking" time
    to "startup" time.  The performance cost does not appear to be
    noticeable in my baseline (codespell-project#3419). Though, keep the corpus weakness on
    the ratio of cased vs. non-cased corrections with multiple candidates
    in mind.
    
    The all lowercase typo is now slightly more expensive (it was passed
    throughout `fix_case` and fed directly into the `print` in the
    original code. In the new code, it will always need a `join`).  There
    are still an overweight of lower-case only corrections in general, so
    the unconditional `.join` alone is not sufficient to affect the
    performance noticeably.
    nthykier committed May 25, 2024
    Configuration menu
    Copy the full SHA
    824bd7c View commit details
    Browse the repository at this point in the history
  3. Refactor dictionary into a new Spellchecker class

    This is as close to a 1:1 conversion as possible. It might change
    whhen we get to designing the API. The callers have been refactored to
    only perform the lookup once. This was mostly to keep the code more
    readable.
    
    The performance cost does seem noticable, which is unsurprising. This
    method has a higher cost towards non-matches which is the most common
    case.  This commit causes the performance to drop roughly 10% on its
    and we are now slower than the goal.
    nthykier committed May 25, 2024
    Configuration menu
    Copy the full SHA
    aa7792f View commit details
    Browse the repository at this point in the history
  4. Refactor line tokenization to simplify an outer loop

    The refactor is a stepping stone towards the next commit where the
    inner loop is moved to the `Spellchecker`.
    nthykier committed May 25, 2024
    Configuration menu
    Copy the full SHA
    ef5096c View commit details
    Browse the repository at this point in the history
  5. Rewrite line spellchecking and move most of it into the Spellchecker

    With this rewrite, performance improved slightly and is now down to 7%
    slower than the baseline (6s vs. 5.6s).
    
    There is deliberate an over-indentation left in this commit, since
    that makes this commit easier to review (without ignoring space
    changes).
    nthykier committed May 25, 2024
    Configuration menu
    Copy the full SHA
    cd57087 View commit details
    Browse the repository at this point in the history
  6. De-indent loop body (whitespace-only / reformatting-only change)

    Deliberately in a separate. There are no functional changes, but there
    are some reformatting changes (line merges) as a consequence of the
    de-indent.
    nthykier committed May 25, 2024
    Configuration menu
    Copy the full SHA
    8bd3517 View commit details
    Browse the repository at this point in the history
  7. Support non-regex based tokens for spellcheck_line

    The `Spellchecker` only needs the `group` method from the `re.Match`.
    With a bit of generics and typing protocols, we can make the
    `Spellchecker` work with any token type that has a `group()` method.
    
    The `codespell` command line tool still assumes `re.Match` but it can
    get that via its own line tokenizer, so it all works out for everyone.
    nthykier committed May 25, 2024
    Configuration menu
    Copy the full SHA
    7273c77 View commit details
    Browse the repository at this point in the history
  8. Speed up spellchecking by ignoring whitespace-only lines

    The new API has introduced extra overhead per line being spellchecked.
    One way of optimizing out this overhead, is to spellcheck fewer lines.
    An obvious choice here, is to optimize out empty and whitespace-only
    lines, since they will not have any typos at all (on account of not
    having any words).
    
    A side-effect of this change is that we now spellcheck lines with
    trailing whitespace stripped. Semantically, this gives the same result
    (per "whitespace never has typos"). Performance-wise, it is faster in
    theory because the strings are now shorter (since we were calling
    `.rstrip()` anyway). In pratice, I am not sure we are going to find
    any real corpus where the trailing whitespace is noteworthy from a
    performance point of view.
    
    On the performance corpus from codespell-project#3491, this takes out ~0.4s of
    runtime brining us down to slightly above the 5.6s that made the
    baseline.
    nthykier committed May 25, 2024
    Configuration menu
    Copy the full SHA
    c4d1738 View commit details
    Browse the repository at this point in the history
  9. Move codespell:ignore check into Spellchecker

    This makes the API automatically avoid some declared false-positives
    that the command line tool would also filter.
    nthykier committed May 25, 2024
    Configuration menu
    Copy the full SHA
    3c08c9b View commit details
    Browse the repository at this point in the history
  10. Speed up codespell:ignore check by skipping the regex in most cases

    The changes to provide a public API had some performance related costs
    of about 1% runtime. There is no trivial way to offset this any
    further without undermining the API we are building. However, we can
    pull performance-related shenanigans to compenstate for the cost
    introduced.
    
    The codespell codebase unsurprisingly spends a vast majority of its
    runtime in various regex related code such as `search` and `finditer`.
    
    The best way to optimize runtime spend in regexes is to not do a regex
    in the first place, since the regex engine has a rather steep overhead
    over regular string primitives (that is the cost of flexibility). If
    the regex rarely matches and there is a very easy static substring
    that can be used to rule out the match, then you can speed up the code
    by using `substring in string` as a conditional to skip the
    regex. This is assuming the regex is used enough for the performance
    to matter.
    
    An obvious choice here falls on the `codespell:ignore` regex, because
    it has a very distinctive substring in the form of `codespell:ignore`,
    which will rule out almost all lines that will not match.
    
    With this little trick, runtime goes from ~5.6s to ~4.9s on the corpus
    mentioned in codespell-project#3419.
    nthykier committed May 25, 2024
    Configuration menu
    Copy the full SHA
    ce280c9 View commit details
    Browse the repository at this point in the history
  11. Refactor: Rename spellchecker.py to _spellchecker.py

    Per review comment.
    nthykier committed May 25, 2024
    Configuration menu
    Copy the full SHA
    ae0e8d2 View commit details
    Browse the repository at this point in the history

Commits on Jun 1, 2024

  1. Configuration menu
    Copy the full SHA
    37d4b38 View commit details
    Browse the repository at this point in the history