Strip newlines when parsing TextDicts to avoid OverflowError #540
Conversation
util.py (outdated):

```python
def parse_text_dict(path: Union[str, tk.Path]) -> Dict[str, str]:
    """
    Loads the text dict at :param:`path` making sure not to trigger line counter overflow.
```
Can you expand this explanation?

- Put a ref to the issue here: Parsing a large TextDict fails on recent python versions #539
- Also quote the exception you got: `OverflowError: line number table is too long`
- Explain the exception. What does it mean? Why does this function fix it? I don't really understand what the exception means. My guess would be that the dict becomes too big. But then this function here would have the same problem, because as I understand it, you intend to create just the same dict, just in a different way?
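For context, a minimal sketch of the failure mode under discussion. The entry count and the affected versions are assumptions based on this thread (the error was observed on Python 3.10 with a ~280k-line file and reportedly no longer occurs on newer versions):

```python
# Sketch: compiling a literal that spans very many source lines can
# overflow the compiler's line number table on affected CPython versions.
# 300_000 is illustrative, not a verified threshold.
src = "{\n" + "\n".join(f'"seq-{i}": "some text",' for i in range(300_000)) + "\n}"
d = eval(src)  # OverflowError: line number table is too long (on affected versions)
```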
util.py (outdated):

```python
    """

    with uopen(path, "rt") as text_dict_file:
        txt = text_dict_file.read()
```
I wonder a bit: as I understand it, your problem is that the dict is really huge (how huge? how many entries? how big is the file?). But then, isn't it a problem to load it all into memory? Also via the following code (`strip`, `splitlines`, etc.), you are even creating multiple copies of the data in memory.
It is my understanding that the memory itself is not an issue, but rather that the file in my case is about 280k lines long, which is where Python's line counter overflows for some reason. This is why I'm not worried about memory consumption itself in this module, but rather just about the way the string is parsed.
What is the Python line counter? There are no "lines" in a dict? Or do you mean for Python source code?
But anyway, you should add this explanation to the code, so that one can later understand why this code is there.
util.py (outdated):

```python
    # parse chunkwise to avoid line counter overflow when the text dict is very large
    for chunk in chunks(lines, max(1, len(lines) // 1000))
    for k, v in eval("\n".join(["{", *chunk, "}"]), {"nan": float("nan"), "inf": float("inf")}).items()
```
This code does not work in all cases. E.g. consider a dict formatted like this:

```python
{
"a":
42
}
```

(Yea, this probably will not happen when it was written with some of our tools, but who knows, maybe it comes from elsewhere, and/or there was some `black` run on the generated output, or whatever.)

This should at least be discussed in a code comment.
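A sketch of that failure, assuming line-based chunking as in the code above: when a chunk boundary falls inside a multi-line entry, each per-chunk `eval` sees an incomplete literal, so at least it fails loudly:

```python
# Sketch: one dict entry spread over two lines ends up in two chunks.
lines = ['"a":', '42,']
chunk1, chunk2 = [lines[0]], [lines[1]]
eval("\n".join(["{", *chunk1, "}"]))  # SyntaxError -- the entry was cut in half
```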
Note, in RETURNN, we use `returnn.util.literal_py_to_pickle.literal_eval`, which would solve this problem in another way, which should be faster and more generic.
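For comparison, the stdlib's `ast.literal_eval` takes a similar route: it evaluates the parsed AST directly instead of compiling to bytecode, so no line number table is built at all. One caveat (an observation, not from this thread): it only accepts literals, so the bare `nan`/`inf` names that `repr()` emits for special floats are rejected:

```python
import ast

# Works regardless of how many lines the literal spans, since nothing is
# compiled to bytecode:
d = ast.literal_eval('{\n"a": 1,\n"b": 2,\n}')

# Caveat: names are not literals, so special floats written as bare
# nan/inf are rejected:
ast.literal_eval("inf")  # ValueError: malformed node or string
```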
> This should at least be discussed in a code comment.

Yes, I'll add one.

> Note, in RETURNN, we use `returnn.util.literal_py_to_pickle.literal_eval`, which would solve this problem in another way, which should be faster and more generic.

That implementation contains a native C++ module; I don't think this should be made part of i6_core.

Why did we end up with an implementation based on Python literals anyway, and why haven't we used a proper data format (e.g. JSON) from the get-go for this? All this custom code would never have had to be written, and the issue here would likely never have occurred in the first place with a proper data exchange format.
> native C++ module

Well, you could argue the Python stdlib is full of native code... I don't really see being "native" as a problem here. The main problem is maybe that RETURNN then becomes a dependency for those specific jobs. That's maybe unwanted, but I'm not sure it's really so much of a problem. But then, maybe we can also use this code here for now and keep that in mind. (Maybe add a comment that this is an alternative.)

> Why did we end up with an implementation based on Python literals anyway, and why haven't we used a proper data format (e.g. JSON) from the get-go for this?

My assumption was that it's faster and simpler. Those assumptions turned out to be wrong, so yes, JSON would have been better.

Python literals are also more flexible and generic, but in many cases (like here) this is not really needed, and/or you can work around it. But JSON is really restricted. I think it also does not support inf/nan, so it doesn't properly work for N-best lists (rarely, but sometimes we get inf in the score; it would be annoying if it then rarely crashes, and/or you need a stupid workaround just to clip it to 1e30 or so).
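On the inf/nan point: strict JSON indeed has no representation for them. Python's `json` module works around this with non-standard `Infinity`/`NaN` tokens (enabled by default via `allow_nan`), which other parsers may reject; a quick illustration:

```python
import json

# The default allow_nan=True emits non-standard tokens and parses them back:
s = json.dumps({"score": float("inf")})  # '{"score": Infinity}'
json.loads(s)                            # {'score': inf}

# Strictly standard-compliant output refuses special floats instead:
json.dumps({"score": float("inf")}, allow_nan=False)  # raises ValueError
```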
The title was changed from "… TextDicts chunkwise to avoid OverflowError" to "… TextDicts to avoid OverflowError".
I've noticed you can also simply strip any newline from the text dict before parsing; this also avoids the error. Python versions newer than 3.10 also don't seem to run into this error anymore.
But this can break the code even more, and even silently? The other complaint I had before would at least fail with an error if the chunking is incorrect. This stripping can corrupt the data silently, without an error. I definitely would not do that. Example:

```python
{
"a":
"""
a
b
c
"""
}
```
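A sketch of that silent corruption, using the example above (the `join` stands in for the proposed newline stripping):

```python
# Stripping newlines changes the value of the triple-quoted string
# instead of raising an error:
src = '{\n"a":\n"""\na\nb\nc\n"""\n}'
eval(src)                        # {'a': '\na\nb\nc\n'}
eval("".join(src.splitlines()))  # {'a': 'abc'} -- different data, no error
```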
Closes #539