SASLib.open hangs for large files #50
Comments
This inefficiency was introduced by fix #44 (see the comments there). If you can't wait for the fix and you are on Julia 0.6, you can simply pin to the prior stable release, v0.5.2. As for the proper fix, I'm uncertain what the best approach is. I was surprised to find metadata split between the beginning and the end of the file. Should I assume that metadata may appear on the last page and jump directly there? What if there are two or more meta pages at the end? Another option would be to let the user specify a "force" option that makes the program break out of the loop without checking the end. Either option seems ugly.
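For anyone who needs the workaround right away, pinning looks roughly like this, assuming Julia 0.6's old package manager (`Pkg.pin`/`Pkg.free`):

```julia
# Pin SASLib to the release before fix #44 introduced the slow metadata scan.
Pkg.pin("SASLib", v"0.5.2")

# Later, once the proper fix lands, resume tracking new releases:
Pkg.free("SASLib")
```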
The initialization step (parsing of metadata) is indeed inefficient in sas. On my side, I had to read just one page at a time for as long as the page type is Meta, Amd, or Compressed:

```
header = read_header();  // defines lots of data, including the page count, etc.
page_type = Meta;
sub_header_data = null;  // a page can contain metadata and possibly some data in sub-headers
page_data_start = -1;
page_data_remaining_blocks = -1;
while page_type in [Meta, Amd, Compressed] && sub_header_data == null && page_data_start == -1 {
    // read the page header, and possibly the data in sub-headers (might happen)
    (page_type, sub_header_data, page_data_start, page_data_remaining_blocks) = read_page_header_meta();
}
// All metadata is processed; everything is *immutable* from now on, just pure reading,
// which means you can now process many pages in parallel if needed, starting if necessary
// with the current page (either sub-headers or the data in the last blocks).
```

EDIT: As for the metadata at the end of the file, I think it is mainly for extensions. I mostly ignore them.
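The early-exit scan above can be sketched in Julia with a mocked page reader. `PageType`, `MockReader`, and `next_page!` are hypothetical names for illustration, not SASLib internals; the point is that the loop stops at the first non-metadata page instead of walking the whole file:

```julia
# Page types, roughly matching the sas7bdat page kinds mentioned above.
@enum PageType Meta Amd Comp Data Mix

# Stand-in for a file reader: just a list of page types in file order.
struct MockReader
    pages::Vector{PageType}
end

# Pretend to parse one page header and return its type.
next_page!(r::MockReader, i::Int) = r.pages[i]

# Scan pages from the front; stop at the first page that is not
# Meta/Amd/Compressed, rather than scanning to the end of the file.
function scan_metadata(r::MockReader)
    i = 0
    page_type = Meta
    while page_type in (Meta, Amd, Comp)
        i += 1
        i > length(r.pages) && break
        page_type = next_page!(r, i)
    end
    return i  # index of the first page that holds row data
end

reader = MockReader([Meta, Meta, Amd, Data, Data, Data])
scan_metadata(reader)  # returns 4; pages 5 and 6 are never touched
```

This makes `open` cost proportional to the number of leading metadata pages rather than the file size, which is what matters for a 144 GB file.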
I like this option. It doesn't make sense for the incremental reader to have to read the whole file before it can return the first chunk, so currently the cure is worse than the disease.
A small bounty is available for this issue. The amount on offer doesn't match the work involved, so please don't feel offended in any way. But I do hope more people can pitch in, and I just wanted to try out the Bountysource platform.
I have a large (144 GB) SAS dataset that sits on a shared drive, and I wanted to see how quickly I could read this file incrementally.

`handler = SASLib.open(fp)`

was taking a long time; after 10 minutes it still had not returned. So I traced through the code and found that it hangs at

`read_file_metadata(handler)`

and then I found the comment in the source explaining the behavior. So I think this is a known issue.
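For context, the incremental-read pattern this issue is exercising looks roughly like the following (a sketch assuming SASLib's open/read/close flow; `fp` and the chunk size are placeholders). The delay reported above happens entirely inside `open`, before any rows are requested:

```julia
using SASLib

fp = "large_dataset.sas7bdat"  # hypothetical path to the 144 GB file

handler = SASLib.open(fp)           # hangs here: the metadata scan walks the whole file
rs = SASLib.read(handler, 10_000)   # only then can the first 10,000 rows be fetched
SASLib.close(handler)
```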