quality.duplicate_ngram_fraction(...) expected values? #343

Open
sondalex opened this issue Apr 9, 2024 · 6 comments
Labels
bug Something isn't working

Comments

sondalex (Contributor) commented Apr 9, 2024

I have added a unit test for duplicate_ngram_fraction on my fork's branch add_quality_unittest. The test fails, which makes me wonder whether my expected values are wrong or whether the current behaviour is unintended.

The report from running pytest tests/test_quality.py -k test_duplicate_ngram_fraction:

============================= test session starts ==============================
platform linux -- Python 3.10.12, pytest-8.1.1, pluggy-1.4.0
rootdir: /home/sondalex/alex/Source/TextDescriptives
configfile: pyproject.toml
collected 50 items / 49 deselected / 1 selected

tests/test_quality.py F                                                  [100%]

=================================== FAILURES ===================================
________________________ test_duplicate_ngram_fraction _________________________

nlp = <spacy.lang.en.English object at 0x7efec116b730>

    def test_duplicate_ngram_fraction(nlp):
        text = "This is a test. This is another test."
        doc = nlp(text)
        span = doc[0:]
        result = duplicate_ngram_fraction(span, (2, 3))
    
        expected = {2: (len("This is") * 2 + len("test.") * 2) / len(span.text), 3: 0}
>       assert result == expected
E       assert {2: 0.6756756...56757, 3: 0.0} == {2: 0.6486486486486487, 3: 0}
E         
E         Omitting 1 identical items, use -vv to show
E         Differing items:
E         {2: 0.6756756756756757} != {2: 0.6486486486486487}
E         Use -v to get more diff

tests/test_quality.py:161: AssertionError
=========================== short test summary info ============================
FAILED tests/test_quality.py::test_duplicate_ngram_fraction - assert {2: 0.67...
======================= 1 failed, 49 deselected in 2.18s =======================

With a bit more debug information:

------------------------------------------------------------------------------------ Captured stdout call -------------------------------------------------------------------------------------
Spans of ngrams with count higher than 1:
This is
This is
test.
test.

-------------------

Spans tagged as duplicates:
range=(0, 2), span=This is
range=(3, 7), span=test. This is
range=(8, 10), span=test.

...

(result for n=2)
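
For reference, my own arithmetic on these numbers (not part of the test output): the span text is 37 characters long, so the expected value is (7 + 7 + 5 + 5) / 37 = 24/37 ≈ 0.6486, while the reported 0.6757 corresponds to 25/37. The extra character appears to come from the merged range (3, 7) ("test. This is", 13 characters): 7 + 13 + 5 = 25, i.e. the space between "test." and "This" gets counted even though it is not part of any duplicated 2-gram.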

Should the spans be equivalent across the two for-loop blocks?

Respectively,

  1.

     for ngram_size, ngrams in shingles_count.items():
         # create a boolean array of the same length as the text
         # where True indicates that the token is a duplicate
         is_duplicate = np.zeros(max_len, dtype=bool)
         # set duplicate tokens to True
         for ngram, count in ngrams.items():
             if count["count"] > 1:  # type: ignore
                 for ngram_span in count["span"]:  # type: ignore
                     is_duplicate[ngram_span.start : ngram_span.end] = True

  2.

     for start, end in get_ranges(is_duplicate):
         _span = span[start:end]
         duplicate_chars += _span.end_char - _span.start_char
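
For concreteness, here is a rough walk-through of how the token-level array plays out on the test sentence (my own sketch, not library code; it only mimics the marking-and-merging step):

    import numpy as np

    # "This is a test. This is another test." tokenizes to 10 tokens:
    # [This, is, a, test, ., This, is, another, test, .]
    # The 2-grams with count > 1 are "This is" (token spans (0, 2) and (5, 7))
    # and "test ." (token spans (3, 5) and (8, 10)).
    is_duplicate = np.zeros(10, dtype=bool)
    for start, end in [(0, 2), (5, 7), (3, 5), (8, 10)]:
        is_duplicate[start:end] = True
    # is_duplicate -> [T, T, F, T, T, T, T, F, T, T]
    # Merging consecutive True runs gives the ranges (0, 2), (3, 7), (8, 10):
    # the middle range glues "test ." and "This is" together, and its character
    # width (end_char - start_char) includes the space between them.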

In this case, the function could be modified as follows:

def duplicate_ngram_fraction(
    span: Union[Span, Doc],
    ngram_range: Tuple[int, int],
) -> Dict[int, float]:
    """
    ...
    """
    chr_len = len(span.text)
    if chr_len == 0:
        return {n: 0.0 for n in range(ngram_range[0], ngram_range[1] + 1)}
    shingles_count = span_ngrams(span, ngram_range)
    duplicate_chr_fraction = {}
    for ngram_size, ngrams in shingles_count.items():
        duplicate_chars = 0
        for ngram, count in ngrams.items():
            if count["count"] > 1:
                for ngram_span in count["span"]:
                    duplicate_chars += ngram_span.end_char - ngram_span.start_char

        duplicate_chr_fraction[ngram_size] = duplicate_chars / chr_len
    return duplicate_chr_fraction
sondalex added the bug label Apr 9, 2024
HLasse (Owner) commented Apr 10, 2024

Thanks for the report! @KennethEnevoldsen, this is something you and Dan worked on - can you take a look?

KennethEnevoldsen (Collaborator) commented

Hi @sondalex, thanks for the report. The reason why we use the boolean array is to avoid counting duplicates twice. E.g. imagine the sentence:
"a sentence. a sentence."

It is clearly all duplicates, so n_duplicate_characters = n_characters and the fraction is 1/1. However, you can get into a situation where you count some characters twice for the duplicate 2-grams ("a sentence", "sentence.") and end up with an n/1 fraction where n > 1. A solution to this would simply be to create a boolean array of characters (instead of the current array of tokens) and set the respective characters to True (I believe this would also be faster).

However, with your specific example, the range (3, 7) with span "test. This is" does not seem to be a duplicate, which seems to be a bug, though I can't see a clear point where that error should happen.
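
A minimal sketch of that character-level variant (interpreting the suggestion above; it assumes the same span_ngrams helper and "count"/"span" structure used in the current implementation, and is untested):

    from typing import Dict, Tuple, Union

    import numpy as np
    from spacy.tokens import Doc, Span


    def duplicate_ngram_fraction_chars(
        span: Union[Span, Doc],
        ngram_range: Tuple[int, int],
    ) -> Dict[int, float]:
        # mark duplicated *characters* (rather than tokens) in a boolean array,
        # so overlapping duplicate n-grams are only counted once
        chr_len = len(span.text)
        if chr_len == 0:
            return {n: 0.0 for n in range(ngram_range[0], ngram_range[1] + 1)}
        offset = span[0].idx  # character offset of the span within the Doc
        # span_ngrams is assumed to be the same helper used by duplicate_ngram_fraction
        shingles_count = span_ngrams(span, ngram_range)
        duplicate_chr_fraction = {}
        for ngram_size, ngrams in shingles_count.items():
            is_duplicate_char = np.zeros(chr_len, dtype=bool)
            for _, count in ngrams.items():
                if count["count"] > 1:
                    for ngram_span in count["span"]:
                        start = ngram_span.start_char - offset
                        end = ngram_span.end_char - offset
                        is_duplicate_char[start:end] = True
            duplicate_chr_fraction[ngram_size] = is_duplicate_char.sum() / chr_len
        return duplicate_chr_fraction

On the example above this should give 24 duplicated characters out of 37 (the space inside the merged token range is no longer swept up), matching the expected value in the failing test.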

HLasse (Owner) commented Apr 11, 2024

Is this bug something you'd have time to take a closer look at, @KennethEnevoldsen? Otherwise, a PR is more than welcome, @sondalex!

HLasse (Owner) commented Apr 22, 2024

Kenneth is busy with the MTEB sprint and I'm wrapping up my dissertation, so neither of us has the time to look into this right now. You're more than welcome to submit a PR if you find a solution for this, @sondalex; otherwise it might have to wait a while before we get the chance to take a deeper look.

sondalex (Contributor, Author) commented

Hi @KennethEnevoldsen, thank you for your explanation, and thank you @HLasse for following up. I am planning to look into this in more depth when I have some time. I will definitely open a PR once I find a solution.
