Strip newlines when parsing TextDicts to avoid OverflowError #540
Conversation
util.py (outdated):

```python
def parse_text_dict(path: Union[str, tk.Path]) -> Dict[str, str]:
    """
    Loads the text dict at :param:`path` making sure not to trigger line counter overflow.
```
Can you expand this explanation?

- Put a ref to the issue here: Parsing a large TextDict fails on recent python versions #539
- Also quote the exception you got: `OverflowError: line number table is too long`
- Explain the exception. What does it mean? Why does this function fix it? I don't really understand what the exception means. My guess would be that the dict becomes too big. But then this function here would have the same problem, because as I understand it, you intend to create just the same dict, just in a different way?
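For context, a minimal sketch of the failure mode under discussion. The entry count and the affected versions are assumptions based on this thread (the error was observed on Python 3.10 with a ~280k-line file and reportedly no longer occurs on newer versions):

```python
# Sketch: compiling a literal that spans very many source lines can
# overflow the compiler's line number table on affected CPython versions.
# 300_000 is illustrative, not a verified threshold.
src = "{\n" + "\n".join(f'"seq-{i}": "some text",' for i in range(300_000)) + "\n}"
d = eval(src)  # OverflowError: line number table is too long (on affected versions)
```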
util.py (outdated):

```python
    """

    with uopen(path, "rt") as text_dict_file:
        txt = text_dict_file.read()
```
I wonder a bit: as I understand it, your problem is that the dict is really huge (how huge? how many entries? how big is the file?). But then, isn't it a problem to load it all into memory? Also via the following code (`strip`, `splitlines`, etc.), you are even creating multiple copies of the data in memory.
It is my understanding that the memory itself is not an issue, but rather that the file in my case is about 280k lines long, which is where Python's line counter overflows for some reason. This is why I'm not worried about memory consumption itself in this module, but rather just about the way the string is parsed.
What is the Python line counter? There are no "lines" in a dict? Or do you mean for Python source code?
But anyway, you should add this explanation to the code, so that one can later understand why this code is there.
util.py (outdated):

```python
    # parse chunkwise to avoid line counter overflow when the text dict is very large
    for chunk in chunks(lines, max(1, len(lines) // 1000))
    for k, v in eval("\n".join(["{", *chunk, "}"]), {"nan": float("nan"), "inf": float("inf")}).items()
```
This code does not work in all cases. E.g. consider a dict formatted like this:

```python
{
"a":
42
}
```

(Yea, this probably will not happen when it was written with some of our tools, but who knows, maybe it comes from elsewhere, and/or there was some `black` run on the generated output, or whatever.)

This should at least be discussed in a code comment.
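A sketch of that failure, assuming line-based chunking as in the code above: when a chunk boundary falls inside a multi-line entry, each per-chunk `eval` sees an incomplete literal, so at least it fails loudly:

```python
# Sketch: one dict entry spread over two lines ends up in two chunks.
lines = ['"a":', '42,']
chunk1, chunk2 = [lines[0]], [lines[1]]
eval("\n".join(["{", *chunk1, "}"]))  # SyntaxError -- the entry was cut in half
```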
Note, in RETURNN, we use `returnn.util.literal_py_to_pickle.literal_eval`, which would solve this problem in another way, which should be faster and more generic.
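For comparison, the stdlib's `ast.literal_eval` takes a similar route: it evaluates the parsed AST directly instead of compiling to bytecode, so no line number table is built at all. One caveat (an observation, not from this thread): it only accepts literals, so the bare `nan`/`inf` names that `repr()` emits for special floats are rejected:

```python
import ast

# Works regardless of how many lines the literal spans, since nothing is
# compiled to bytecode:
d = ast.literal_eval('{\n"a": 1,\n"b": 2,\n}')

# Caveat: names are not literals, so special floats written as bare
# nan/inf are rejected:
ast.literal_eval("inf")  # ValueError: malformed node or string
```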
> This should at least be discussed in a code comment.

Yes, I'll add one.

> Note, in RETURNN, we use `returnn.util.literal_py_to_pickle.literal_eval`, which would solve this problem in another way, which should be faster and more generic.

That implementation contains a native C++ module; I don't think this should be made part of i6_core.

Why did we end up with an implementation based on Python literals anyway, and why haven't we used a proper data format (e.g. JSON) from the get-go for this? All this custom code would never have had to be written, and the issue here would likely never have occurred in the first place with a proper data exchange format.
> native C++ module

Well, you could argue the Python stdlib is full of native code... I don't really see being "native" as a problem here. The main problem is maybe that RETURNN then becomes a dependency for those specific jobs. That's maybe unwanted, but I'm not sure it's really so much of a problem. But then, maybe we can also use this code here for now and keep that in mind. (Maybe add a comment that this is an alternative.)

> Why did we end up with an implementation based on Python literals anyway, and why haven't we used a proper data format (e.g. JSON) from the get-go for this?

My assumption was that it's faster and simpler. Those assumptions turned out to be wrong, so yes, JSON would have been better.

Python literals are also more flexible and generic, but in many cases (like here) this is not really needed, and/or you can work around it. But JSON is really restricted. I think it also does not support inf/nan, so it doesn't properly work for N-best lists (rarely, but sometimes we get inf in the score; it would be annoying if it then rarely crashes, and/or you need a stupid workaround just to clip it to 1e30 or so).
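On the inf/nan point: strict JSON indeed has no representation for them. Python's `json` module works around this with non-standard `Infinity`/`NaN` tokens (enabled by default via `allow_nan`), which other parsers may reject; a quick illustration:

```python
import json

# The default allow_nan=True emits non-standard tokens and parses them back:
s = json.dumps({"score": float("inf")})  # '{"score": Infinity}'
json.loads(s)                            # {'score': inf}

# Strictly standard-compliant output refuses special floats instead:
json.dumps({"score": float("inf")}, allow_nan=False)  # raises ValueError
```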
The title was changed from "… TextDicts chunkwise to avoid OverflowError" to "… TextDicts to avoid OverflowError".
I've noticed you can also simply strip any newline from the text dict before parsing; this also avoids the error. Python versions newer than 3.10 also don't seem to run into this error anymore.
But this can break the code even more, and even silently? The other complaint I had before would at least fail with an error if the chunking is incorrect. This stripping can corrupt the data silently, without an error. I definitely would not do that. Example:

```python
{
"a":
"""
a
b
c
"""
}
```
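A sketch of that silent corruption, using the example above (the `join` stands in for the proposed newline stripping):

```python
# Stripping newlines changes the value of the triple-quoted string
# instead of raising an error:
src = '{\n"a":\n"""\na\nb\nc\n"""\n}'
eval(src)                        # {'a': '\na\nb\nc\n'}
eval("".join(src.splitlines()))  # {'a': 'abc'} -- different data, no error
```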
Closes #539