Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deduplication helpers for lexicon modification jobs #458

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 6 additions & 3 deletions lexicon/conversion.py
Original file line number Diff line number Diff line change
Expand Up @@ -99,9 +99,9 @@ def run(self):
class LexiconUniqueOrthJob(Job):
"""Merge lemmata with the same orthography."""

__sis_hash_exclude__ = {"merge_multi_orths_lemmata": False}
__sis_hash_exclude__ = {"merge_multi_orths_lemmata": False, "deduplicate_special_lemmata": False}

def __init__(self, bliss_lexicon, merge_multi_orths_lemmata=False):
def __init__(self, bliss_lexicon, merge_multi_orths_lemmata=False, deduplicate_special_lemmata=False):
"""
:param tk.Path bliss_lexicon: lexicon file to be handeled
:param bool merge_multi_orths_lemmata: if True, also lemmata containing
Expand All @@ -119,11 +119,14 @@ def __init__(self, bliss_lexicon, merge_multi_orths_lemmata=False):
** having a synt <=> synt is not None
this could lead to INFORMATION LOSS if there are several
different synt token sequences in the to-be-merged lemmata
:param bool deduplicate_special_lemmata: if True, special lemmata will also be considered
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think there are possibly cases where same_order=True makes sense, maybe this should not just be a bool but sth like deduplicate_special_lemmata_type or deduplicate_special_lemmata_opts?

in the unique process.
Icemole marked this conversation as resolved.
Show resolved Hide resolved
"""
self.set_vis_name("Make Lexicon Orths Unique")

self.bliss_lexicon = bliss_lexicon
self.merge_multi_orths_lemmata = merge_multi_orths_lemmata
self.deduplicate_special_lemmata = deduplicate_special_lemmata

self.out_bliss_lexicon = self.output_path(os.path.basename(tk.uncached_path(bliss_lexicon)), cached=True)

Expand All @@ -137,7 +140,7 @@ def run(self):
orth2lemmata = collections.defaultdict(list)

for lemma in lex.lemmata:
if lemma.special:
if lemma.special and not self.deduplicate_special_lemmata:
continue
num_orths = len(lemma.orth)
if num_orths < 1:
Expand Down
13 changes: 12 additions & 1 deletion lexicon/modification.py
Original file line number Diff line number Diff line change
Expand Up @@ -130,16 +130,22 @@ class MergeLexiconJob(Job):
will create a new lexicon that might be incompatible to previously generated alignments.
"""

def __init__(self, bliss_lexica, sort_phonemes=False, sort_lemmata=False, compressed=True):
__sis_hash_exclude__ = {"deduplicate_lemmata": False}

def __init__(
self, bliss_lexica, sort_phonemes=False, sort_lemmata=False, compressed=True, deduplicate_lemmata=False
):
"""
:param list[Path] bliss_lexica: list of bliss lexicon files (plain or gz)
:param bool sort_phonemes: sort phoneme inventory alphabetically
:param bool sort_lemmata: sort lemmata alphabetically based on first orth entry
:param bool compressed: compress final lexicon
:param bool deduplicate_lemmata: whether to deduplicate lemmatas, only applied when sort_lemmata=True
"""
self.lexica = bliss_lexica
self.sort_phonemes = sort_phonemes
self.sort_lemmata = sort_lemmata
self.deduplicate_lemmata = deduplicate_lemmata

self.out_bliss_lexicon = self.output_path("lexicon.xml.gz" if compressed else "lexicon.xml")

Expand Down Expand Up @@ -178,7 +184,12 @@ def run(self):
for lemma in lex.lemmata:
# sort by first orth entry
orth_key = lemma.orth[0] if lemma.orth else ""
if self.deduplicate_lemmata:
# don't add the lemma when there's already an equal lemma
if len(lemma_dict[orth_key]) > 0 and lemma == lemma_dict[orth_key][0]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize, this == uses your definition below. Maybe you should use is_equal(other, same_order=False) here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is the use case what I was referring to, I might not have made this clear.

Right now I'm even more inclined to remove the same_order flag altogether and just keep same_order=False as the default. I'm waiting for what the rest of the reviewers might have to say about this.

In general, thinking about use cases as you said, I couldn't think of any actual use case for same_order=True that wouldn't be covered by same_order=False, only hypothetical situations in which the user might want the same order, but I don't think that would have any sort of impact on the final lexicon.

continue
lemma_dict[orth_key].append(lemma)
print(lemma_dict)
Icemole marked this conversation as resolved.
Show resolved Hide resolved
merged_lex.lemmata = list(itertools.chain(*[lemma_dict[key] for key in sorted(lemma_dict.keys())]))
else:
for lex in lexica:
Expand Down
41 changes: 40 additions & 1 deletion lib/lexicon.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,13 @@

For format details visit: `https://www-i6.informatik.rwth-aachen.de/rwth-asr/manual/index.php/Lexicon`_
"""
from __future__ import annotations

__all__ = ["Lemma", "Lexicon"]

from collections import OrderedDict
from typing import Optional, List
import itertools
from typing import Optional, List, Set
import xml.etree.ElementTree as ET

from i6_core.util import uopen
Expand Down Expand Up @@ -104,6 +107,42 @@ def from_element(cls, e):
synt = None if not synt else synt[0]
return Lemma(orth, phon, synt, eval, special)

def _equals(self, other: Lemma, same_order: bool = True) -> bool:
Icemole marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the default same_order=True here, but the only usage of this function actually uses same_order=False?

The default should reflect the most reasonable expected common usage, or if this might be ambiguous, there should not be a default.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, probably the most common usage of equality would be checking without explicit order. At least I think it that way. What would other reviewers expect from the default of this function?

"""
Check for lemma equality.

:param other: Other lemma to compare :param:`self` to.
:param same_order: Whether the order in the different lemma elements matters or not.
:return: Whether :param:`self` and :param:`other` are equal or not.
"""
if same_order:
return (
self.orth == other.orth
and self.phon == other.phon
and self.special == other.special
and self.synt == other.synt
and self.eval == other.eval
)
else:
if self.synt is not None and other.synt is not None:
equal_synt = set(self.synt) == set(other.synt)
else:
equal_synt = self.synt == other.synt

return (
set(self.orth) == set(other.orth)
and set(self.phon) == set(other.phon)
and self.special == other.special
and equal_synt
and set(itertools.chain(*self.eval)) == set(itertools.chain(*other.eval))
)

def __eq__(self, other: Lemma) -> bool:
return self._equals(other, same_order=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you want same_order=False here?

It probably depends on the use case of __eq__ here (what actually is your use case?) but it feels kind of arbitrary.

Copy link
Collaborator Author

@Icemole Icemole Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be discussed, but in my view a lemma is equal to another even if it has different orth order, or different pronunciation order. This is why it's set to False here.

Edit: the use case is basically the one specified in MergeLexiconJob, but I think it's reasonable to think that a lemma with two pronunciations X, Y is the same lemma as any other lemma with the same pronunciations Y, X (assuming the rest of the attributes are the same as well), even if X and Y are not inserted in the exact same order.

I think not enforcing a strict ordering is the best way of comparing two lemmas, but I would understand that some users might want strict ordering comparison, which is why I added the same_order flag. Then, the user who wants to enforce strict ordering can directly call _equals.

Copy link
Member

@albertz albertz Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends on the use case. Or if you say, only same_order=False makes sense anyway, then why is this an option at all for _equals?

What actually is your use case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My use case is the following: in MergeLexiconJob I added functionality (present in this PR) that will remove "duplicated" lemmas. According to my definition, duplicated lemmas are those lemmas that have the same orthographies and pronunciations. Note that my definition implies no ordering, so I would be fine with same_order=False.

However, I wanted to give the freedom to let the user be able to decide whether the order in their orths and prons matters or not. Maybe some users might want to store the orths and prons of a lexicon in lexicographical order, and thus the order would matter. Maybe some other users already have a specific ordering, and therefore the addition of a lemma with a different order helps them notice that something's wrong with their pipeline.

Summarizing: same_order=False is enough for my needs/use case, but I added this flag since I was unsure of the other use cases for this function, since each user might have a particular use case.

If you think it won't be used and only ==/!= will be used as I have been using it (with same_order=False), let me know. I won't use this flag, and if keeping it or removing it will spark a debate, I'd rather just remove it now and let another user, who might find it more necessary, implement it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if you think there is no clear non-ambiguous definition of an equal-relation, thus having such flag makes sense, then I'm not sure if defining __eq__ makes sense here. Or usually __eq__ should always be most restrictive, i.e. use same_order=True.

Now __eq__ is very arbitrary, supporting exactly your case, but as you say yourself, there might be other cases where the equality should only be with same_order=True.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, __eq__ matches my use case but it doesn't cover the most restrictive use case.

What do others think? Is it too overkill to have same_order to test lemma equality? Should we just have the case of "two lemmas are equal whenever they contain the same data, regardless of their order"?

At every comment I'm more inclined to remove same_order and just check for what I said above, since I think no realistic use case would consider the ordering of data inside a lemma... At least I can't think of any specific use case for this...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, we should not add code/logic for cases which are only hypothetical and not used currently, but only for things we really are using currently. Thus only the logic same_order=False. And then __eq__ is also non-ambiguous.

But let's wait what other think about this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider how it is treated in RASR:

  • phon: order does not matter; all are hypothesized and best one is selected
  • orth:
    • training: order does not matter
    • recognition: the first orth is the one that is printed in the output (if no eval is given) and used for LM purposes (if no synt is given)
  • synt: order does matter, as it defines the order used for the LM
  • eval: order does matter, as it defines the order the tokens are printed for output

So for phon, synt, and eval it is (in my opinion and as far as my understanding goes) clear what to do. For orth, which is probably mostly what you are interested in, it is not immediately obvious if the order should be considered.


def __ne__(self, other: Lemma) -> bool:
return not self.__eq__(other)


class Lexicon:
"""
Expand Down
Loading