SASLib.open hangs for large files #50

Open
xiaodaigh opened this issue Aug 15, 2018 · 4 comments

@xiaodaigh (Contributor)

I have a large (144G) SAS dataset that sits on a shared drive, and I wanted to see how quickly I could read the file incrementally. handler = SASLib.open(fp) was taking a long time; it's been 10 minutes and it still hasn't returned.
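
For reference, the incremental-read pattern I'm trying to follow looks roughly like this (the row count is just for illustration):

handler = SASLib.open(fp)            # open the file and read the metadata
rows = SASLib.read(handler, 10_000)  # fetch the first chunk of rows
SASLib.close(handler)                # release the file handle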

So I tried to trace through the code, and I found that it hangs at read_file_metadata(handler), where I came across these comments:

TODO however, this is inefficient since it reads a lot of data from disk

TODO can we tell if metadata is complete and break out of loop early?

So I think this is a known issue.

xiaodaigh changed the title from "SASLib.open hangs at" to "SASLib.open hangs for large files" on Aug 15, 2018
@tk3369 (Owner) commented Aug 15, 2018

This inefficiency was introduced by the fix for #44 (see the comments there). If you can't wait for a proper fix and you are on Julia 0.6, you can pin to the prior stable release, v0.5.2.
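
On Julia 0.6 the pin would be something like:

Pkg.pin("SASLib", v"0.5.2")   # stay on the release before the #44 fix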

As for the proper fix, I'm not sure what the best approach is. I was surprised to find metadata split between the beginning and the end of the file. Should I assume that metadata may appear on the last page and jump directly there? What if there are two or more meta pages at the end? Another option would be to let the user specify a "force" option that makes the program break out of the loop without checking the end of the file. Either option seems ugly.
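
For illustration only, the force option could surface as an extra keyword to SASLib.open; the name below is just a placeholder, not an implemented API:

# hypothetical keyword: stop scanning once the first data page is reached,
# instead of walking the whole file looking for trailing metadata pages
handler = SASLib.open(fp, skip_trailing_meta = true)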

@tafia commented Aug 15, 2018

The initialization step (parsing the metadata) is indeed inefficient for SAS files.

On my side, I had to read just one page at a time as long as the page type is Meta, Amd, or Compressed.
In pseudocode it looks something like this:

header = read_header();   // reads lots of metadata, including the page count, etc.
page_type = Meta;
sub_header_data = null;   // a page can contain metadata and possibly some data in sub-headers
page_data_start = -1;
page_data_remaining_blocks = -1;
while page_type in [Meta, Amd, Compressed] && sub_header_data == null && page_data_start == -1 {
    // reads the page header and, when present, the data stored in its sub-headers
    (page_type, sub_header_data, page_data_start, page_data_remaining_blocks) = read_page_header_meta();
}
// All metadata is now processed and everything is *immutable* from here on; what remains is pure reading,
// which means you can process many pages in parallel if needed, possibly starting
// with the current page (either its sub-headers or the data in its last blocks).

EDIT: As for the metadata at the end of the file, I think it is mainly there for extensions; I mostly ignore it.
This approach is a bit longer to write, but you only ever hold one page in memory at first, and then only a few pages at a time.
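
The same loop sketched in Julia, assuming placeholder helpers read_header and read_page_header_meta (these are not actual SASLib functions):

# Early-exit metadata scan: stop as soon as the first data (or sub-header data) is seen.
function read_metadata(io::IO)
    header = read_header(io)        # page size, page count, column info, etc.
    page_type = :meta
    sub_header_data = nothing       # data rows embedded in sub-headers, if any
    page_data_start = -1            # offset of the first pure data block
    while page_type in (:meta, :amd, :compressed) &&
          sub_header_data === nothing && page_data_start == -1
        # read one page header and, when present, the data held in its sub-headers
        page_type, sub_header_data, page_data_start = read_page_header_meta(io, header)
    end
    # metadata is complete here; the remaining pages can be read lazily or in parallel
    return header, sub_header_data, page_data_start
end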

@xiaodaigh (Contributor, Author)

> Another option would be to let the user specify a "force option"

I like this option. It doesn't make sense for an incremental read to have to scan the whole file before it can return the first chunk, so currently the cure is worse than the disease.

I would also say this force behaviour should be the default, as I think metadata being spread across the file is the rarer case.

@xiaodaigh (Contributor, Author) commented Oct 28, 2018

A small bounty is available for this issue.

The amount of money on offer doesn't match the work involved, so please don't feel offended in any way. I just hope more people can pitch in, and I also wanted to try out the Bountysource platform.
