possible to pass IO instead of filename? #68

Open
tbeason opened this issue Apr 2, 2020 · 4 comments

@tbeason
Contributor

tbeason commented Apr 2, 2020

Right now all of the reading functions require a filename. That prohibits at least one usecase -- reading from compressed files. Suppose I compress a large file using gzip so that it becomes bigfile.sas7bdat.gz. Julia is cool and has lots of cool packages, like CodecZlib.jl, which might let me do something like

using SASLib, CodecZlib
stream = GzipDecompressorStream(open("bigfile.sas7bdat.gz"))
rs = readsas(stream)

except this doesn't work.

Would this be possible to add? Or would it slow down parsing so much that it just makes sense to unzip the file before reading it in?

@tk3369
Owner

tk3369 commented Apr 3, 2020

Right. Currently, it opens the file and reads blocks of data. Because of the way it processes metadata, it also needs to seek to specific positions occasionally, so it can't really stream data per se.

However, I think it would be possible to read from the stream into an IOBuffer; readsas could then work with the data entirely in memory. The drawback is that the file would be decompressed into memory, so a large file would have a higher memory footprint until it is completely processed.
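
For illustration, a rough sketch of what that might look like, assuming a hypothetical readsas method that accepts an IO (no such method exists today):

using SASLib, CodecZlib

# decompress the whole gzipped file into memory as a byte vector
bytes = open(io -> read(GzipDecompressorStream(io)), "bigfile.sas7bdat.gz")

# HYPOTHETICAL: readsas does not currently accept an IO; this is only what
# the proposed API might look like. An IOBuffer does support seek(), which
# is what the metadata pass requires.
rs = readsas(IOBuffer(bytes))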

It seems easier to just decompress the file outside of SASLib and read it normally. Also, the SAS data file itself could be compressed (SASLib supports two compression algorithms today), so gzip'ing the file may not save much disk space anyway. Did you get the file from some other source?
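
For reference, a minimal sketch of that workaround using CodecZlib.jl; the file name is just a placeholder:

using SASLib, CodecZlib

# decompress to a temporary file, then read it with SASLib as usual
tmp = tempname() * ".sas7bdat"
open("bigfile.sas7bdat.gz") do io
    write(tmp, read(GzipDecompressorStream(io)))
end
rs = readsas(tmp)
rm(tmp)  # remove the decompressed copy once the data is loaded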

@tbeason
Contributor Author

tbeason commented Apr 3, 2020

I have not had much success with the SAS compression algorithms. I get >80% compression with gzip.

The scenario is that I get my data from a remote server which only has SAS installed. I gzip the files so that I can download them locally to work with them. Sometimes they can be quite large, so it would be nice if I didn't have to unzip them.

@rvasil

rvasil commented Apr 3, 2020

Hi, just a side note: SAS dataset compression is not as efficient as file compression because it operates on a per-record basis. You'll usually see a good compression ratio if you have quite long character variables with lots of blanks that get "compressed", or lots of missing values. But I usually get a zip file at least 10x smaller than the input SAS dataset. See the SAS docs.

Let me use this opportunity to thank @tk3369 for this package. It has made one of my recent projects possible and actually drove me to Julia. On our data, the performance of the Python libraries for reading SAS datasets made them infeasible for the task.
Kind regards,
Rastislav

@tk3369
Owner

tk3369 commented Apr 15, 2020

I am glad that the package is helpful. I also use it in a production system for my day job.

The process I mentioned above for reading metadata was introduced in PR #44 to fix a problem where the metadata are split across the head and tail pages of the file. I don't really like that fix because it makes reading large files slow (as described in #50) and hinders our ability to decompress on the fly. My hunch is that the original problem was an edge case anyway. Perhaps I should move that fix to a branch and do something different in master.

When I have more time, I will come back and take a look at these things.
