possible to pass IO instead of filename? #68

Open
tbeason opened this issue Apr 2, 2020 · 4 comments

@tbeason
Contributor

tbeason commented Apr 2, 2020

Right now all of the reading functions require a filename. That prohibits at least one usecase -- reading from compressed files. Suppose I compress a large file using gzip so that it becomes bigfile.sas7bdat.gz. Julia is cool and has lots of cool packages, like CodecZlib.jl, which might let me do something like

using SASLib, CodecZlib
stream = GzipDecompressorStream(open("bigfile.sas7bdat.gz"))
rs = readsas(stream)

except this doesn't work.

Would this be possible to add? Or would it slow down parsing so much that it just makes sense to unzip the file before reading it in?

@tk3369
Owner

tk3369 commented Apr 3, 2020

Right. Currently, it opens the file and reads blocks of data. Because of the way it processes metadata, it also needs to seek to specific positions occasionally, so it can't really stream data per se.

However, I think it would be possible to read from the stream into an IOBuffer; readsas could then work with the data entirely in memory. The drawback is that the file would be decompressed into memory, so a large file would have a higher memory footprint until it is completely processed.
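
For illustration, a rough sketch of what that might look like, assuming a hypothetical readsas method that accepts an IO (no such method exists today):

using SASLib, CodecZlib

# decompress the whole gzipped file into memory as a byte vector
bytes = open(io -> read(GzipDecompressorStream(io)), "bigfile.sas7bdat.gz")

# HYPOTHETICAL: readsas does not currently accept an IO; this is only what
# the proposed API might look like. An IOBuffer does support seek(), which
# is what the metadata pass requires.
rs = readsas(IOBuffer(bytes))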

It seems easier to just decompress the file outside of SASLib and read it normally. Also, the SAS data file itself could be compressed (SASLib supports two compression algorithms today), so gzip'ing the file may not save much disk space anyway. Did you get the file from some other source?
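
For reference, a minimal sketch of that workaround using CodecZlib.jl; the file name is just a placeholder:

using SASLib, CodecZlib

# decompress to a temporary file, then read it with SASLib as usual
tmp = tempname() * ".sas7bdat"
open("bigfile.sas7bdat.gz") do io
    write(tmp, read(GzipDecompressorStream(io)))
end
rs = readsas(tmp)
rm(tmp)  # remove the decompressed copy once the data is loaded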

@tbeason
Contributor Author

tbeason commented Apr 3, 2020

I have not had much success with the SAS compression algorithms. I get >80% compression with gzip.

The scenario is that I get my data from a remote server which only has SAS installed. I gzip the files so that I can download them locally to work with them. Sometimes they can be quite large, so it would be nice if I didn't have to unzip them.

@rvasil

rvasil commented Apr 3, 2020

Hi, just a side note: SAS dataset compression is not as efficient as file compression because it operates on a per-record basis. You'll usually see a good compression ratio if you have quite long character variables with lots of blanks that get "compressed", or lots of missing values. But I usually get a zip file at least 10x smaller than the input SAS dataset. See the SAS docs.

Let me use this opportunity to thank @tk3369 for this package. It has made one of my recent projects possible and actually drove me to Julia. On our data, the performance of the Python libraries for reading SAS datasets made them infeasible for the task.
Kind regards,
Rastislav

@tk3369
Owner

tk3369 commented Apr 15, 2020

I am glad that the package is helpful. I also use it in a production system for my day job.

The process I mentioned above for reading metadata was introduced in PR #44 to fix a problem where the metadata are split across the head and tail pages of the file. I don't really like that fix because it makes reading large files slow (as described in #50) and hinders our ability to decompress on the fly. My hunch is that the original problem was an edge case anyway. Perhaps I should move that fix to a branch and do something different in master.

When I have more time, I will come back and take a look at these things.
