Avoid scikit-learn UserWarning for vectorizer parameter token_pattern #729

Merged
osma merged 2 commits into main from fix-sklearn-userwarning-token-pattern
Aug 16, 2023

Conversation

osma
Member

@osma osma commented Aug 16, 2023

scikit-learn vectorizers used by Annif (CountVectorizer, TfidfVectorizer) trigger this warning:

UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'

This is a bit surprising, since we are not setting token_pattern ourselves. The warning appears because the default value of token_pattern is not None, so scikit-learn sees both parameters as set. This PR avoids the warning by explicitly setting token_pattern=None wherever Annif code passes the tokenizer parameter.
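
For illustration, here is a minimal sketch of the behaviour described above. It is not the Annif code changed in this PR: `simple_tokenizer` and `count_token_pattern_warnings` are hypothetical names used only for this example, and it assumes a scikit-learn version that emits the warning at fit time.

```python
import warnings

from sklearn.feature_extraction.text import TfidfVectorizer


def simple_tokenizer(text):
    """Hypothetical stand-in for the tokenizer that Annif passes in."""
    return text.lower().split()


def count_token_pattern_warnings(**vectorizer_kwargs):
    """Fit a vectorizer and count the warnings that mention token_pattern."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        TfidfVectorizer(**vectorizer_kwargs).fit(["a small sample document"])
    return sum("token_pattern" in str(w.message) for w in caught)


# Custom tokenizer with the default token_pattern left in place:
# scikit-learn is expected to warn that token_pattern will not be used.
with_default = count_token_pattern_warnings(tokenizer=simple_tokenizer)

# Custom tokenizer with token_pattern explicitly set to None:
# no token_pattern warning is expected.
with_none = count_token_pattern_warnings(
    tokenizer=simple_tokenizer, token_pattern=None
)

print(with_default, with_none)
```

The same keyword arguments are accepted by CountVectorizer, so the approach should carry over to both vectorizers mentioned above.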

@osma osma added the bug label Aug 16, 2023
@osma osma added this to the 1.0 milestone Aug 16, 2023
@osma osma self-assigned this Aug 16, 2023
@codecov

codecov bot commented Aug 16, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (40cc2fd) 99.67% compared to head (a84e466) 99.67%.
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #729   +/-   ##
=======================================
  Coverage   99.67%   99.67%           
=======================================
  Files          89       89           
  Lines        6397     6401    +4     
=======================================
+ Hits         6376     6380    +4     
  Misses         21       21           
Files Changed Coverage Δ
annif/lexical/mllm.py 100.00% <ø> (ø)
annif/backend/mixins.py 97.82% <100.00%> (+0.20%) ⬆️


@osma osma changed the title from "fix scikit-learn UserWarning for vectorizer parameter token_pattern" to "Avoid scikit-learn UserWarning for vectorizer parameter token_pattern" Aug 16, 2023
@sonarcloud

sonarcloud bot commented Aug 16, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

Coverage: no coverage information
Duplication: 0.0%

@osma osma merged commit fef6e43 into main Aug 16, 2023
13 of 14 checks passed
@osma osma deleted the fix-sklearn-userwarning-token-pattern branch August 16, 2023 07:30