quality.duplicate_ngram_fraction(...) expected values ? #343
Comments
Thanks for the report! @KennethEnevoldsen, this is something you and Dan worked on - can you take a look?
Hi @sondalex, thanks for the report. The reason we use the boolean array is to avoid counting duplicates twice. E.g. imagine the sentence: It is clearly all duplicates, so n_duplicate_characters = n_characters and the fraction is 1/1. However, you can get into a situation where some characters are counted twice across overlapping duplicate 2-grams ("a sentence", "sentence."), which gives a fraction n/1 where n > 1. A solution to this would simply be to create a boolean array of characters (instead of the current array of tokens) and set the respective characters to True (I believe this would also be faster). However, with your specific example:
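The character-level boolean array idea can be sketched roughly as follows. This is only a minimal illustration, not the TextDescriptives implementation: the function name `duplicate_ngram_char_fraction` is hypothetical, tokenization is plain whitespace splitting rather than spaCy tokens, and it assumes that every occurrence of a repeated n-gram counts as duplicated.

```python
import re
from collections import defaultdict


def duplicate_ngram_char_fraction(text: str, n: int) -> float:
    """Fraction of characters covered by word n-grams occurring more than once.

    Hypothetical helper for illustration; not part of TextDescriptives.
    """
    if not text:
        return 0.0

    # Whitespace tokenization with character offsets (an assumption; the
    # library operates on spaCy tokens instead).
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

    # Map each n-gram (tuple of token strings) to the character spans
    # at which it occurs.
    spans = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        window = tokens[i : i + n]
        key = tuple(tok for tok, _, _ in window)
        spans[key].append((window[0][1], window[-1][2]))

    # One boolean per character: overlapping duplicates set the same
    # positions to True again instead of adding to a running count.
    is_duplicate = [False] * len(text)
    for occurrences in spans.values():
        if len(occurrences) > 1:
            for start, end in occurrences:
                for j in range(start, end):
                    is_duplicate[j] = True

    return sum(is_duplicate) / len(text)
```

For the text "a a a" with n=2, the two occurrences of ("a", "a") overlap on the middle token. Summing the span lengths would count 6 characters in a 5-character text, whereas the mask marks each character at most once, so the fraction cannot exceed 1.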
Is this bug something you'd have time to take a closer look at, @KennethEnevoldsen? Otherwise, a PR is more than welcome, @sondalex!
Kenneth is busy with the MTEB sprint and I'm wrapping up my dissertation, so neither of us has the time to look into this right now. You're more than welcome to submit a PR if you find a solution for this, @sondalex; otherwise it might have to wait a while before we get the chance to take a deeper look.
Hi @KennethEnevoldsen, thank you for your explanation, and thank you @HLasse for following up. I am planning to look into this in more depth when I have some time. I will definitely open a PR once I find a solution.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I have added a unit test for duplicate_ngram_fraction on my fork's branch add_quality_unittest. The test fails, which makes me wonder whether my expected values are wrong or whether the current behaviour is not intended.
The report of my test
pytest tests/test_quality.py -k test_duplicate_ngram_fraction
With a bit more debug information:
(result for n=2)
Should the spans be equivalent across the two for-loop blocks?
Respectively,
TextDescriptives/src/textdescriptives/components/quality.py
Lines 262 to 270 in 93b1d59
TextDescriptives/src/textdescriptives/components/quality.py
Lines 274 to 276 in 93b1d59
In this case, the function could be modified as such: