quality.duplicate_ngram_fraction(...) expected values? #343

Open
sondalex opened this issue Apr 9, 2024 · 6 comments
Labels
bug Something isn't working

Comments

sondalex (Contributor) commented Apr 9, 2024

I have added a unit test for duplicate_ngram_fraction on my fork's branch add_quality_unittest. The test fails, which makes me wonder whether my expected values are wrong or whether the current behaviour is unintended.

The report from running pytest tests/test_quality.py -k test_duplicate_ngram_fraction:

============================= test session starts ==============================
platform linux -- Python 3.10.12, pytest-8.1.1, pluggy-1.4.0
rootdir: /home/sondalex/alex/Source/TextDescriptives
configfile: pyproject.toml
collected 50 items / 49 deselected / 1 selected

tests/test_quality.py F                                                  [100%]

=================================== FAILURES ===================================
________________________ test_duplicate_ngram_fraction _________________________

nlp = <spacy.lang.en.English object at 0x7efec116b730>

    def test_duplicate_ngram_fraction(nlp):
        text = "This is a test. This is another test."
        doc = nlp(text)
        span = doc[0:]
        result = duplicate_ngram_fraction(span, (2, 3))
    
        expected = {2: (len("This is") * 2 + len("test.") * 2) / len(span.text), 3: 0}
>       assert result == expected
E       assert {2: 0.6756756...56757, 3: 0.0} == {2: 0.6486486486486487, 3: 0}
E         
E         Omitting 1 identical items, use -vv to show
E         Differing items:
E         {2: 0.6756756756756757} != {2: 0.6486486486486487}
E         Use -v to get more diff

tests/test_quality.py:161: AssertionError
=========================== short test summary info ============================
FAILED tests/test_quality.py::test_duplicate_ngram_fraction - assert {2: 0.67...
======================= 1 failed, 49 deselected in 2.18s =======================

With a bit more debug information:

------------------------------------------------------------------------------------ Captured stdout call -------------------------------------------------------------------------------------
Spans of ngrams with count higher than 1:
This is
This is
test.
test.

-------------------

Spans tagged as duplicates:
range=(0, 2), span=This is
range=(3, 7), span=test. This is
range=(8, 10), span=test.

...

(result for n=2)
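
For reference, my own arithmetic on these numbers (not part of the test output): the span text is 37 characters long, so the expected value is (7 + 7 + 5 + 5) / 37 = 24/37 ≈ 0.6486, while the reported 0.6757 corresponds to 25/37. The extra character appears to come from the merged range (3, 7) ("test. This is", 13 characters): 7 + 13 + 5 = 25, i.e. the space between "test." and "This" gets counted even though it is not part of any duplicated 2-gram.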

Should the spans be equivalent across the two for-loop blocks?

Respectively,

  1.

     for ngram_size, ngrams in shingles_count.items():
         # create a boolean array of the same length as the text
         # where True indicates that the token is a duplicate
         is_duplicate = np.zeros(max_len, dtype=bool)
         # set duplicate tokens to True
         for ngram, count in ngrams.items():
             if count["count"] > 1:  # type: ignore
                 for ngram_span in count["span"]:  # type: ignore
                     is_duplicate[ngram_span.start : ngram_span.end] = True

  2.

     for start, end in get_ranges(is_duplicate):
         _span = span[start:end]
         duplicate_chars += _span.end_char - _span.start_char
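
For concreteness, here is a rough walk-through of how the token-level array plays out on the test sentence (my own sketch, not library code; it only mimics the marking-and-merging step):

    import numpy as np

    # "This is a test. This is another test." tokenizes to 10 tokens:
    # [This, is, a, test, ., This, is, another, test, .]
    # The 2-grams with count > 1 are "This is" (token spans (0, 2) and (5, 7))
    # and "test ." (token spans (3, 5) and (8, 10)).
    is_duplicate = np.zeros(10, dtype=bool)
    for start, end in [(0, 2), (5, 7), (3, 5), (8, 10)]:
        is_duplicate[start:end] = True
    # is_duplicate -> [T, T, F, T, T, T, T, F, T, T]
    # Merging consecutive True runs gives the ranges (0, 2), (3, 7), (8, 10):
    # the middle range glues "test ." and "This is" together, and its character
    # width (end_char - start_char) includes the space between them.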

In this case, the function could be modified as follows:

def duplicate_ngram_fraction(
    span: Union[Span, Doc],
    ngram_range: Tuple[int, int],
) -> Dict[int, float]:
    """
    ...
    """
    chr_len = len(span.text)
    if chr_len == 0:
        return {n: 0.0 for n in range(ngram_range[0], ngram_range[1] + 1)}
    shingles_count = span_ngrams(span, ngram_range)
    duplicate_chr_fraction = {}
    for ngram_size, ngrams in shingles_count.items():
        duplicate_chars = 0
        for ngram, count in ngrams.items():
            if count["count"] > 1:
                for ngram_span in count["span"]:
                    duplicate_chars += ngram_span.end_char - ngram_span.start_char

        duplicate_chr_fraction[ngram_size] = duplicate_chars / chr_len
    return duplicate_chr_fraction
sondalex added the bug label Apr 9, 2024
HLasse (Owner) commented Apr 10, 2024

Thanks for the report! @KennethEnevoldsen, this is something you and Dan worked on - can you take a look?

KennethEnevoldsen (Collaborator) commented

Hi @sondalex, thanks for the report. The reason why we use the boolean array is to avoid counting duplicates twice. E.g. imagine the sentence:
"a sentence. a sentence."

It is clearly all duplicates, so n_duplicate_characters = n_characters and the fraction is 1/1. However, you can get into a situation where you count some characters twice for the duplicate 2-grams ("a sentence", "sentence.") and end up with an n/1 fraction where n > 1. A solution to this would simply be to create a boolean array of characters (instead of the current array of tokens) and set the respective characters to True (I believe this would also be faster).

However, with your specific example, the range (3, 7) with span "test. This is" does not seem to be a duplicate, which seems to be a bug, though I can't see a clear point where that error should happen.
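
A minimal sketch of that character-level variant (interpreting the suggestion above; it assumes the same span_ngrams helper and "count"/"span" structure used in the current implementation, and is untested):

    from typing import Dict, Tuple, Union

    import numpy as np
    from spacy.tokens import Doc, Span


    def duplicate_ngram_fraction_chars(
        span: Union[Span, Doc],
        ngram_range: Tuple[int, int],
    ) -> Dict[int, float]:
        # mark duplicated *characters* (rather than tokens) in a boolean array,
        # so overlapping duplicate n-grams are only counted once
        chr_len = len(span.text)
        if chr_len == 0:
            return {n: 0.0 for n in range(ngram_range[0], ngram_range[1] + 1)}
        offset = span[0].idx  # character offset of the span within the Doc
        # span_ngrams is assumed to be the same helper used by duplicate_ngram_fraction
        shingles_count = span_ngrams(span, ngram_range)
        duplicate_chr_fraction = {}
        for ngram_size, ngrams in shingles_count.items():
            is_duplicate_char = np.zeros(chr_len, dtype=bool)
            for _, count in ngrams.items():
                if count["count"] > 1:
                    for ngram_span in count["span"]:
                        start = ngram_span.start_char - offset
                        end = ngram_span.end_char - offset
                        is_duplicate_char[start:end] = True
            duplicate_chr_fraction[ngram_size] = is_duplicate_char.sum() / chr_len
        return duplicate_chr_fraction

On the example above this should give 24 duplicated characters out of 37 (the space inside the merged token range is no longer swept up), matching the expected value in the failing test.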

HLasse (Owner) commented Apr 11, 2024

Is this bug something you'd have time to take a closer look at, @KennethEnevoldsen? Otherwise, a PR is more than welcome, @sondalex!

HLasse (Owner) commented Apr 22, 2024

Kenneth is busy with the MTEB sprint and I'm wrapping up my dissertation, so neither of us has the time to look into this right now. You're more than welcome to submit a PR if you find a solution for this, @sondalex; otherwise it might have to wait a while before we get the chance to take a deeper look.

sondalex (Contributor, Author) commented

Hi @KennethEnevoldsen, thank you for your explanation, and thank you @HLasse for following up. I am planning to look into this in more depth when I have some time. I will definitely open a PR once I find a solution.
