-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deduplication helpers for lexicon modification jobs #458
base: main
Are you sure you want to change the base?
Conversation
lib/lexicon.py
Outdated
@@ -104,6 +107,42 @@ def from_element(cls, e): | |||
synt = None if not synt else synt[0] | |||
return Lemma(orth, phon, synt, eval, special) | |||
|
|||
def _equals(self, other: Lemma, same_order: bool = True) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the default same_order=True
here, but the only usage of this function actually uses same_order=False
?
The default should reflect the most reasonable expected common usage, or if this might be ambiguous, there should not be a default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, probably the most common usage of equality would be checking without explicit order. At least I think it that way. What would other reviewers expect from the default of this function?
) | ||
|
||
def __eq__(self, other: Lemma) -> bool: | ||
return self._equals(other, same_order=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason you want same_order=False
here?
It probably depends on the use case of __eq__
here (what actually is your use case?) but it feels kind of arbitrary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be discussed, but in my view a lemma is equal to another even if it has different orth order, or different pronunciation order. This is why it's set to False
here.
Edit: the use case is basically the one specified in MergeLexiconJob
, but I think it's reasonable to think that a lemma with two pronunciations X, Y is the same lemma as any other lemma with the same pronunciations Y, X (assuming the rest of the attributes are the same as well), even if X and Y are not inserted in the exact same order.
I think not enforcing a strict ordering is the best way of comparing two lemmas, but I would understand that some users might want strict ordering comparison, which is why I added the same_order
flag. Then, the user who wants to enforce strict ordering can directly call _equals
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it depends on the use case. Or if you say, only same_order=False
makes sense anyway, then why is this an option at all for _equals
?
What actually is your use case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My use case is the following: in MergeLexiconJob
I added functionality (present in this PR) that will remove "duplicated" lemmas. According to my definition, duplicated lemmas are those lemmas that have the same orthographies and pronunciations. Note that my definition implies no ordering, so I would be fine with same_order=False
.
However, I wanted to give the freedom to let the user be able to decide whether the order in their orths and prons matters or not. Maybe some users might want to store the orths and prons of a lexicon in lexicographical order, and thus the order would matter. Maybe some other users already have a specific ordering, and therefore the addition of a lemma with a different order helps them notice that something's wrong with their pipeline.
Summarizing: same_order=False
is enough for my needs/use case, but I added this flag since I was unsure of the other use cases for this function, since each user might have a particular use case.
If you think it won't be used and only ==
/!=
will be used as I have been using it (with same_order=False
), let me know. I won't use this flag, and if keeping it or removing it will spark a debate, I'd rather just remove it now and let another user, who might find it more necessary, implement it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if you think there is no clear non-ambiguous definition of an equal-relation, thus having such flag makes sense, then I'm not sure if defining __eq__
makes sense here. Or usually __eq__
should always be most restrictive, i.e. use same_order=True
.
Now __eq__
is very arbitrary, supporting exactly your case, but as you say yourself, there might be other cases where the equality should only be with same_order=True
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, __eq__
matches my use case but it doesn't cover the most restrictive use case.
What do others think? Is it too overkill to have same_order
to test lemma equality? Should we just have the case of "two lemmas are equal whenever they contain the same data, regardless of their order"?
At every comment I'm more inclined to remove same_order
and just check for what I said above, since I think no realistic use case would consider the ordering of data inside a lemma... At least I can't think of any specific use case for this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, we should not add code/logic for cases which are only hypothetical and not used currently, but only for things we really are using currently. Thus only the logic same_order=False
. And then __eq__
is also non-ambiguous.
But let's wait what other think about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would consider how it is treated in RASR:
- phon: order does not matter; all are hypothesized and best one is selected
- orth:
- training: order does not matter
- recognition: the first orth is the one that is printed in the output (if no eval is given) and used for LM purposes (if no synt is given)
- synt: order does matter, as it defines the order used for the LM
- eval: order does matter, as it defines the order the tokens are printed for output
So for phon, synt, and eval it is (in my opinion and as far as my understanding goes) clear what to do. For orth, which is probably mostly what you are interested in, it is not immediately obvious if the order should be considered.
Co-authored-by: Albert Zeyer <[email protected]>
Co-authored-by: Albert Zeyer <[email protected]>
Co-authored-by: Albert Zeyer <[email protected]>
@@ -178,6 +184,10 @@ def run(self): | |||
for lemma in lex.lemmata: | |||
# sort by first orth entry | |||
orth_key = lemma.orth[0] if lemma.orth else "" | |||
if self.deduplicate_lemmata: | |||
# don't add the lemma when there's already an equal lemma | |||
if len(lemma_dict[orth_key]) > 0 and lemma == lemma_dict[orth_key][0]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize, this ==
uses your definition below. Maybe you should use is_equal(other, same_order=False)
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is the use case what I was referring to, I might not have made this clear.
Right now I'm even more inclined to remove the same_order
flag altogether and just keep same_order=False
as the default. I'm waiting for what the rest of the reviewers might have to say about this.
In general, thinking about use cases as you said, I couldn't think of any actual use case for same_order=True
that wouldn't be covered by same_order=False
, only hypothetical situations in which the user might want the same order, but I don't think that would have any sort of impact on the final lexicon.
@@ -119,11 +119,14 @@ def __init__(self, bliss_lexicon, merge_multi_orths_lemmata=False): | |||
** having a synt <=> synt is not None | |||
this could lead to INFORMATION LOSS if there are several | |||
different synt token sequences in the to-be-merged lemmata | |||
:param bool deduplicate_special_lemmata: if True, special lemmata will also be considered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you think there are possibly cases where same_order=True
makes sense, maybe this should not just be a bool but sth like deduplicate_special_lemmata_type
or deduplicate_special_lemmata_opts
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to have the option of same_order
in there, as there are some setups that just pick the first phon sequence, and then the order matters.
Thus _equals
should actually be equals
, so that it is not private but is supposed to be used outside.
This PR introduces some improvements to how the lemma duplication is handled in
lexicon.conversion.LexiconUniqueOrthJob
andlexicon.modification.MergeLexiconJob
.For the former, I added a flag that makes special lemmas go through the unique lemma creation. In my view this is the most relevant part of the PR.
For the latter, I added a flag to deduplicate any duplicated lemmas that might appear as a result of the merging. This change might be debatable since
MergeLexiconJob
combined withLexiconUniqueOrthJob
has the same result. I can remove this flag if it's too redundant; let me know about this.Finally, I also added a
Lemma._equals
method and overwrote==
and!=
for theLemma
class, which I think might be useful in the long term (I actually used this inMergeLexiconJob
).