Improve iteration for multiple documents in one file. #120

Open
wants to merge 5 commits into
base: master

contradict

The first patch implements some missing functions from the iteration interface that describe the iterator; utilities like collect need some of these.

The second patch fixes load_all_file to not close the file before it is done reading it.

I also added a test for each of these.

@codecov

codecov bot commented Sep 8, 2021

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.23%. Comparing base (c6ac0b3) to head (b13ddde).
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #120      +/-   ##
==========================================
+ Coverage   86.07%   87.23%   +1.16%     
==========================================
  Files          15       15              
  Lines        1616     1669      +53     
==========================================
+ Hits         1391     1456      +65     
+ Misses        225      213      -12     


Contributor

@Paalon Paalon left a comment


A YAML document can be a single node with a scalar or a sequence, so Base.eltype should not return Dict{Any, Any}.

@contradict
Author

Good point!

I suppose the safest thing to do is change IteratorEltype to return EltypeUnknown() since a file could switch back and forth between complex documents and single scalar documents.
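A minimal sketch of the traits being discussed, assuming a hypothetical `DocIterator` type (not the package's actual iterator) that simply wraps a vector of already-parsed documents. It shows how declaring `SizeUnknown()` and `EltypeUnknown()` lets `collect` work when the number and type of documents are not known in advance:

```julia
# Hypothetical stand-in for a multi-document iterator; the type name and
# its field are assumptions made for illustration only.
struct DocIterator
    docs::Vector{Any}
end

# The number of documents is not known until the stream is fully parsed.
Base.IteratorSize(::Type{DocIterator}) = Base.SizeUnknown()

# A document may be a mapping, a sequence, or a scalar, so no single
# element type is promised.
Base.IteratorEltype(::Type{DocIterator}) = Base.EltypeUnknown()

function Base.iterate(it::DocIterator, state::Int = 1)
    state > length(it.docs) && return nothing
    return it.docs[state], state + 1
end

# collect consults the two traits above instead of calling length or eltype.
mixed = collect(DocIterator(Any[Dict("a" => 1), [1, 2, 3], "scalar"]))
```

Without the `IteratorSize` method, the default trait is `HasLength()`, and `collect` would call `length` on the iterator, which this sketch does not define.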

Is this package now maintained? I'd be happy to keep working on this if there are signs of life!

@kescobo
Collaborator

kescobo commented Jun 7, 2024

@contradict it is very loosely (and poorly) maintained by me. It's kind of an accident - I didn't write the package, have never dug into the internals, and don't use it day-to-day.

In general, I'm happy to review PRs, and if they provide tests and don't break anything, I'll usually merge and release, though I may need some reminders to pay attention. Sorry this languished so long.

@contradict
Author

Ok, rebased and comment addressed. Unfortunately, this is still broken on Windows. I posted here looking for advice when I made this PR and I haven't made any progress on that part of the problem since.

src/YAML.jl Outdated
load_all(input, args...; kwargs...)
end

load_all(open(filename, "r"), args...; kwargs...)
Contributor


I'm fairly sure this won't close the file, in contrast to the open(...) do construction. Is that intentional?

Author


Yes. Since parsing is lazy, using do here results in the file being closed before anything is read.
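The interaction between lazy iteration and the `do` form can be illustrated with a small sketch. Here `lazy_lines` is a hypothetical stand-in for a lazy parser (it is not part of the package); it returns a generator that reads nothing until it is iterated:

```julia
# Hypothetical lazy reader: no data is read from `io` until iteration starts.
lazy_lines(io::IO) = (line for line in eachline(io))

# Write a small multi-document file to a temporary path.
path, tmp = mktemp()
write(tmp, "a: 1\n---\nb: 2\n")
close(tmp)

# The do-block closes the file as soon as the block returns, so the
# generator that escapes the block refers to an already-closed stream.
stream, lazy = open(path, "r") do f
    (f, lazy_lines(f))
end
stream_is_open = isopen(stream)

# Consuming the iterator inside the block works, at the cost of laziness.
eager = open(path, "r") do f
    collect(lazy_lines(f))
end

rm(path)
```

After this runs, `stream_is_open` is `false` while `lazy` has not yet read a single line, which is the situation described above.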

Contributor


Ok, but it's still problematic to leave a file unclosed. It's entirely possible that's the cause of the Windows failures. Even if it seems to work on the other platforms, it is a file descriptor leak: if you read too many YAML files this way, you will eventually fail because you have run out of file descriptors.

There are some options to handle this, with varying advantages and disadvantages.

  1. Design the API with a manually called close operation.
  2. Set up a finalizer in YAMLDocIterator which closes the file when the iterator is garbage collected.
  3. Read the raw file eagerly and do the lazy parsing from an internal IOBuffer. That way the file can be closed immediately after reading.

Contributor


After looking closer at the code I think 3 is the better option. The main drawback is that it requires memory to store the raw file contents but if your file is large enough that that becomes a problem, you should use the load_all(::IO) method directly instead.
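Option 3 might look roughly like the sketch below. The function name `load_all_file_sketch` and the `lazy_parse` argument are assumptions for illustration; `lazy_parse` stands in for whatever lazy `load_all(::IO)`-style method is used, and this is not necessarily how the package implements it:

```julia
# Sketch of option 3: read the raw bytes eagerly, then parse lazily from
# an in-memory buffer. `lazy_parse` is a hypothetical placeholder for a
# lazy parser that accepts any IO.
function load_all_file_sketch(lazy_parse, filename::AbstractString)
    # read(filename) opens the file, reads every byte, and closes it
    # before returning, so no file handle outlives this line.
    buffer = IOBuffer(read(filename))
    return lazy_parse(buffer)
end

# Usage with `eachline` as a trivial lazy "parser".
sketch_path, sketch_io = mktemp()
write(sketch_io, "a: 1\n---\nb: 2\n")
close(sketch_io)

pending = load_all_file_sketch(eachline, sketch_path)
rm(sketch_path)             # the file is no longer needed at this point
parsed = collect(pending)   # iteration reads from the in-memory copy
```

The trade-off is the one noted above: the whole file is held in memory for as long as the iterator is alive.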

Author


OK, I'll try again with that approach. Though a quick test shows that the finalizer attached during open does eventually close the file in the approach used here.

Contributor


Good observation that the IOStream object would have a finalizer. Unfortunately, there are no guarantees that it will run before you run out of file descriptors, so it is better to avoid relying on it.

when parsing multi-document files. Since parsing is lazy, just using
`do` results in the file being closed before parsing happens. Leaving
the file open and waiting for the finalizer to close it was deemed
undesirable.
4 participants