SASLib.open hangs for large files #50

Open
xiaodaigh opened this issue Aug 15, 2018 · 4 comments

@xiaodaigh (Contributor)

I have a large (144G) SAS dataset that sits on a shared drive, and I wanted to see how quickly I could read the file incrementally. handler = SASLib.open(fp) was taking a long time; it's been 10 minutes and it still hasn't returned.
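
For reference, the incremental-read pattern I'm trying to follow looks roughly like this (the row count is just for illustration):

handler = SASLib.open(fp)            # open the file and read the metadata
rows = SASLib.read(handler, 10_000)  # fetch the first chunk of rows
SASLib.close(handler)                # release the file handle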

So I tried to trace through the code, and I found that it hangs at read_file_metadata(handler), where I came across these comments:

TODO however, this is inefficient since it reads a lot of data from disk

TODO can we tell if metadata is complete and break out of loop early?

So I think this is a known issue.

xiaodaigh changed the title from "SASLib.open hangs at" to "SASLib.open hangs for large files" on Aug 15, 2018
@tk3369 (Owner) commented Aug 15, 2018

This inefficiency was introduced by the fix for #44 (see the comments there). If you can't wait for a proper fix and you are on Julia 0.6, you can pin to the prior stable release, v0.5.2.
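
On Julia 0.6 the pin would be something like:

Pkg.pin("SASLib", v"0.5.2")   # stay on the release before the #44 fix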

As for the proper fix, I'm not sure what the best approach is. I was surprised to find metadata split between the beginning and the end of the file. Should I assume that metadata may appear on the last page and jump directly there? What if there are two or more meta pages at the end? Another option would be to let the user specify a "force" option that makes the program break out of the loop without checking the end of the file. Either option seems ugly.
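
For illustration only, the force option could surface as an extra keyword to SASLib.open; the name below is just a placeholder, not an implemented API:

# hypothetical keyword: stop scanning once the first data page is reached,
# instead of walking the whole file looking for trailing metadata pages
handler = SASLib.open(fp, skip_trailing_meta = true)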

@tafia commented Aug 15, 2018

The initialization step (parsing the metadata) is indeed inefficient for SAS files.

On my side, I had to read just one page at a time as long as the page type is Meta, Amd, or Compressed.
In pseudocode it looks something like this:

header = read_header();   // reads lots of metadata, including the page count, etc.
page_type = Meta;
sub_header_data = null;   // a page can contain metadata and possibly some data in sub-headers
page_data_start = -1;
page_data_remaining_blocks = -1;
while page_type in [Meta, Amd, Compressed] && sub_header_data == null && page_data_start == -1 {
    // reads the page header and, when present, the data stored in its sub-headers
    (page_type, sub_header_data, page_data_start, page_data_remaining_blocks) = read_page_header_meta();
}
// All metadata is now processed and everything is *immutable* from here on; what remains is pure reading,
// which means you can process many pages in parallel if needed, possibly starting
// with the current page (either its sub-headers or the data in its last blocks).

EDIT: As for the metadata at the end of the file, I think it is mainly there for extensions; I mostly ignore it.
This approach is a bit longer to write, but you only ever hold one page in memory at first, and then only a few pages at a time.
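
The same loop sketched in Julia, assuming placeholder helpers read_header and read_page_header_meta (these are not actual SASLib functions):

# Early-exit metadata scan: stop as soon as the first data (or sub-header data) is seen.
function read_metadata(io::IO)
    header = read_header(io)        # page size, page count, column info, etc.
    page_type = :meta
    sub_header_data = nothing       # data rows embedded in sub-headers, if any
    page_data_start = -1            # offset of the first pure data block
    while page_type in (:meta, :amd, :compressed) &&
          sub_header_data === nothing && page_data_start == -1
        # read one page header and, when present, the data held in its sub-headers
        page_type, sub_header_data, page_data_start = read_page_header_meta(io, header)
    end
    # metadata is complete here; the remaining pages can be read lazily or in parallel
    return header, sub_header_data, page_data_start
end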

@xiaodaigh (Contributor, Author)

> Another option would be to let the user specify a "force option"

I like this option. It doesn't make sense for an incremental read to have to scan the whole file before it can return the first chunk, so currently the cure is worse than the disease.

I would also say this force behaviour should be the default, as I think metadata being spread across the file is the rarer case.

@xiaodaigh (Contributor, Author) commented Oct 28, 2018

A small bounty is available for this issue.

The amount of money on offer doesn't match the work involved, so please don't feel offended in any way. I just hope more people can pitch in, and I also wanted to try out the Bountysource platform.
