Deduplication helpers for lexicon modification jobs #458

Icemole · 2023-10-27T13:50:31Z

This PR introduces some improvements to how the lemma duplication is handled in lexicon.conversion.LexiconUniqueOrthJob and lexicon.modification.MergeLexiconJob.

For the former, I added a flag that makes special lemmas go through the unique lemma creation. In my view this is the most relevant part of the PR.

For the latter, I added a flag to deduplicate any duplicated lemmas that might appear as a result of the merging. This change might be debatable since MergeLexiconJob combined with LexiconUniqueOrthJob has the same result. I can remove this flag if it's too redundant; let me know about this.

Finally, I also added a Lemma._equals method and overwrote == and != for the Lemma class, which I think might be useful in the long term (I actually used this in MergeLexiconJob).

lexicon/conversion.py

lexicon/modification.py

lib/lexicon.py

albertz · 2023-10-27T14:06:09Z

lib/lexicon.py

@@ -104,6 +107,42 @@ def from_element(cls, e):
        synt = None if not synt else synt[0]
        return Lemma(orth, phon, synt, eval, special)

+    def _equals(self, other: Lemma, same_order: bool = True) -> bool:


Why is the default same_order=True here, but the only usage of this function actually uses same_order=False?

The default should reflect the most reasonable expected common usage, or if this might be ambiguous, there should not be a default.

You're right, probably the most common usage of equality would be checking without explicit order. At least I think it that way. What would other reviewers expect from the default of this function?

albertz · 2023-10-27T14:07:52Z

lib/lexicon.py

+            )
+
+    def __eq__(self, other: Lemma) -> bool:
+        return self._equals(other, same_order=False)


Is there a reason you want same_order=False here?

It probably depends on the use case of __eq__ here (what actually is your use case?) but it feels kind of arbitrary.

This should be discussed, but in my view a lemma is equal to another even if it has different orth order, or different pronunciation order. This is why it's set to False here.

Edit: the use case is basically the one specified in MergeLexiconJob, but I think it's reasonable to think that a lemma with two pronunciations X, Y is the same lemma as any other lemma with the same pronunciations Y, X (assuming the rest of the attributes are the same as well), even if X and Y are not inserted in the exact same order.

I think not enforcing a strict ordering is the best way of comparing two lemmas, but I would understand that some users might want strict ordering comparison, which is why I added the same_order flag. Then, the user who wants to enforce strict ordering can directly call _equals.

I think it depends on the use case. Or if you say, only same_order=False makes sense anyway, then why is this an option at all for _equals?

What actually is your use case?

My use case is the following: in MergeLexiconJob I added functionality (present in this PR) that will remove "duplicated" lemmas. According to my definition, duplicated lemmas are those lemmas that have the same orthographies and pronunciations. Note that my definition implies no ordering, so I would be fine with same_order=False.

However, I wanted to give the freedom to let the user be able to decide whether the order in their orths and prons matters or not. Maybe some users might want to store the orths and prons of a lexicon in lexicographical order, and thus the order would matter. Maybe some other users already have a specific ordering, and therefore the addition of a lemma with a different order helps them notice that something's wrong with their pipeline.

Summarizing: same_order=False is enough for my needs/use case, but I added this flag since I was unsure of the other use cases for this function, since each user might have a particular use case.

If you think it won't be used and only ==/!= will be used as I have been using it (with same_order=False), let me know. I won't use this flag, and if keeping it or removing it will spark a debate, I'd rather just remove it now and let another user, who might find it more necessary, implement it.

So, if you think there is no clear non-ambiguous definition of an equal-relation, thus having such flag makes sense, then I'm not sure if defining __eq__ makes sense here. Or usually __eq__ should always be most restrictive, i.e. use same_order=True.

Now __eq__ is very arbitrary, supporting exactly your case, but as you say yourself, there might be other cases where the equality should only be with same_order=True.

I agree, __eq__ matches my use case but it doesn't cover the most restrictive use case.

What do others think? Is it too overkill to have same_order to test lemma equality? Should we just have the case of "two lemmas are equal whenever they contain the same data, regardless of their order"?

At every comment I'm more inclined to remove same_order and just check for what I said above, since I think no realistic use case would consider the ordering of data inside a lemma... At least I can't think of any specific use case for this...

In general, we should not add code/logic for cases which are only hypothetical and not used currently, but only for things we really are using currently. Thus only the logic same_order=False. And then __eq__ is also non-ambiguous.

But let's wait what other think about this.

I would consider how it is treated in RASR:

phon: order does not matter; all are hypothesized and best one is selected

orth:

training: order does not matter

recognition: the first orth is the one that is printed in the output (if no eval is given) and used for LM purposes (if no synt is given)

synt: order does matter, as it defines the order used for the LM

eval: order does matter, as it defines the order the tokens are printed for output

So for phon, synt, and eval it is (in my opinion and as far as my understanding goes) clear what to do. For orth, which is probably mostly what you are interested in, it is not immediately obvious if the order should be considered.

Co-authored-by: Albert Zeyer <[email protected]>

albertz · 2023-10-30T08:34:30Z

lexicon/modification.py

@@ -178,6 +184,10 @@ def run(self):
                for lemma in lex.lemmata:
                    # sort by first orth entry
                    orth_key = lemma.orth[0] if lemma.orth else ""
+                    if self.deduplicate_lemmata:
+                        # don't add the lemma when there's already an equal lemma
+                        if len(lemma_dict[orth_key]) > 0 and lemma == lemma_dict[orth_key][0]:


I realize, this == uses your definition below. Maybe you should use is_equal(other, same_order=False) here.

Yes, this is the use case what I was referring to, I might not have made this clear.

Right now I'm even more inclined to remove the same_order flag altogether and just keep same_order=False as the default. I'm waiting for what the rest of the reviewers might have to say about this.

In general, thinking about use cases as you said, I couldn't think of any actual use case for same_order=True that wouldn't be covered by same_order=False, only hypothetical situations in which the user might want the same order, but I don't think that would have any sort of impact on the final lexicon.

albertz · 2023-10-30T08:39:21Z

lexicon/conversion.py

@@ -119,11 +119,14 @@ def __init__(self, bliss_lexicon, merge_multi_orths_lemmata=False):
                    ** having a synt <=> synt is not None
                this could lead to INFORMATION LOSS if there are several
                different synt token sequences in the to-be-merged lemmata
+        :param bool deduplicate_special_lemmata: if True, special lemmata will also be considered


If you think there are possibly cases where same_order=True makes sense, maybe this should not just be a bool but sth like deduplicate_special_lemmata_type or deduplicate_special_lemmata_opts?

JackTemaki

I would like to have the option of same_order in there, as there are some setups that just pick the first phon sequence, and then the order matters.

Thus _equals should actually be equals, so that it is not private but is supposed to be used outside.

Icemole added 3 commits October 27, 2023 09:40

Implement lemma equality

5c4e2a6

Lemma deduplication helper on unique orth job

dee0609

Lemma deduplication helper on merge lexicon job

ba03170

Icemole requested review from albertz, curufinwe, christophmluscher, sarahberanek and JackTemaki October 27, 2023 13:50

albertz reviewed Oct 27, 2023

View reviewed changes

lexicon/conversion.py Outdated Show resolved Hide resolved

albertz reviewed Oct 27, 2023

View reviewed changes

lexicon/modification.py Outdated Show resolved Hide resolved

albertz reviewed Oct 27, 2023

View reviewed changes

lib/lexicon.py Outdated Show resolved Hide resolved

albertz reviewed Oct 27, 2023

View reviewed changes

Icemole and others added 3 commits October 27, 2023 17:11

Reformat docstring

6f75911

Co-authored-by: Albert Zeyer <[email protected]>

Force same_order as kwarg

db4e426

Co-authored-by: Albert Zeyer <[email protected]>

Remove print

73ff8e6

Co-authored-by: Albert Zeyer <[email protected]>

albertz reviewed Oct 30, 2023

View reviewed changes

JackTemaki requested changes Jan 18, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deduplication helpers for lexicon modification jobs #458

Deduplication helpers for lexicon modification jobs #458

Icemole commented Oct 27, 2023

albertz Oct 27, 2023

Icemole Oct 27, 2023

albertz Oct 27, 2023

Icemole Oct 27, 2023 •

edited

Loading

albertz Oct 27, 2023 •

edited

Loading

Icemole Oct 30, 2023

albertz Oct 30, 2023

Icemole Oct 30, 2023

albertz Oct 30, 2023

michelwi Dec 4, 2023

albertz Oct 30, 2023

Icemole Oct 30, 2023

albertz Oct 30, 2023

JackTemaki left a comment

Deduplication helpers for lexicon modification jobs #458

Are you sure you want to change the base?

Deduplication helpers for lexicon modification jobs #458

Conversation

Icemole commented Oct 27, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Icemole Oct 27, 2023 • edited Loading

Choose a reason for hiding this comment

albertz Oct 27, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JackTemaki left a comment

Choose a reason for hiding this comment

Icemole Oct 27, 2023 •

edited

Loading

albertz Oct 27, 2023 •

edited

Loading