Apply filters when reading files #26
Having a declarative expression looks nice, but it seems a bit tricky to implement and not very flexible. Let's say we design a functional interface instead: the filter takes a Dict of (column_symbol => value) for the current row and returns a boolean indicating whether the row should be included in the results:
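A minimal sketch of such a filter (the name `myfilter` and the column names are only illustrative):

```julia
# A row filter is just a function: it receives a Dict of
# (column_symbol => value) for one row and returns true when
# that row should be included in the results.
function myfilter(row::Dict{Symbol,Any})
    row[:ID] in (1, 2, 3, 4, 10) && row[:VALUE1] > 0
end
```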
Then, you could do something like:
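Assuming a hypothetical `row_filter` keyword argument on `readsas` (not part of the current API):

```julia
# The reader would call `row_filter` once per row and keep only
# the rows for which it returns true.
result = readsas("data.sas7bdat", row_filter = myfilter)
```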
There will be a slight performance hit, as the reader will have to put the data into a Dict and then call the filter function. We can possibly reuse the Dict so it doesn't allocate like crazy.
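Roughly like this (the inner-loop names, `read_cell` and `keep_row!`, are made up for illustration):

```julia
# Hypothetical inner loop: one Dict is allocated up front and
# overwritten for every row, so filtering adds no per-row Dict
# allocation.
row = Dict{Symbol,Any}()
for i in 1:nrows
    for (j, name) in enumerate(column_names)
        row[name] = read_cell(handler, i, j)    # hypothetical accessor
    end
    row_filter(row) && keep_row!(results, row)  # hypothetical sink
end
```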
This is a little tricky to get right. The code currently pre-allocates all column arrays upfront, since we know how many rows are in the file after reading the header. Because the number of rows that will pass the filter is unknown, we would have to allocate small arrays and grow them as we retrieve more data that satisfies the filter conditions. Growing one row at a time would be a huge performance hit; growing in chunks (or exponentially) still requires truncating the unused cells at the end. We may need a heuristic based on the number of rows (N) and/or the size (S) of the file. If S is large (say, > 1 GiB), allocate 20% (i.e. N/5) and grow by 20% at a time; that limits the number of reallocations to at most four. If S is not large, just allocate 100%. At the end, reallocate one last time to shrink the arrays to the rows actually kept.
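A sketch of that heuristic, using the thresholds suggested above (function names are illustrative, not actual package internals):

```julia
const LARGE_FILE = 2^30   # 1 GiB threshold from the heuristic above

# Initial capacity: 20% of the row count for large files,
# 100% otherwise.
initial_capacity(nrows, filesize) =
    filesize > LARGE_FILE ? cld(nrows, 5) : nrows

# Grow a column by another 20% of the row count when it fills up;
# starting at 20%, this caps the number of reallocations at four.
grow!(col::Vector, nrows) =
    resize!(col, min(length(col) + cld(nrows, 5), nrows))

# One final reallocation to drop the unused cells at the end.
shrink!(col::Vector, nkept) = resize!(col, nkept)
```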
We might also revisit a streaming read mode, i.e. one where rows are iterated; then this issue would not come up inside this package at all.
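Something like this, perhaps (the `SASLib.rows` iterator is purely hypothetical; nothing like it exists yet):

```julia
using Dates  # for Date

# Hypothetical streaming API: rows come out one at a time, so the
# caller filters as it goes and the package never needs to guess
# how many rows will pass.
for row in SASLib.rows("data.sas7bdat")      # hypothetical iterator
    Date(1960) <= row[:DATE] <= Date(2000) || continue
    process(row)                             # user-defined consumer
end
```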
I like the idea of a streaming read mode, perhaps integrating with Queryverse. I don't know how to do that yet, but I can revisit it once the TableTraits/Tables dust is settled 🙂
Yes, let me just reiterate that I'd love to use this package as the backend for StatFiles.jl! I'm swamped right now and first need to get everything up and running on 0.7/1.0 before I think about new stuff, but I'd love to see that kind of integration.
No problem! 0.7/1.0 is definitely the priority. Thanks for your hard work and contribution to the Julia ecosystem. |
I think it should be possible to include filters via keyword args in the `readsas`/`open` function. For example:

```julia
readsas("data.sas7bdat",
        include_columns = [:DATE, :ID, :VALUE1],
        filters = [Date(1960) <= :DATE <= Date(2000),
                   :ID in (1, 2, 3, 4, 10),
                   :VALUE1 > 0])
```
It doesn't feel very efficient to read in a large data set and then filter away half of the observations. My guess is that it has to be much faster to just iterate line by line (or chunk by chunk) and apply the filters as the file is read; see the sketch below.
In practice, this would be much more beneficial (IMO) than trying to eke out further optimizations that each speed up reading by only a few percent.
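Roughly what I have in mind, with a made-up chunked API (`read_chunks` and its `chunksize` keyword are assumptions, not real):

```julia
# Hypothetical chunked read: apply the filters while each chunk is
# decoded, so discarded rows never sit in memory for the whole file.
results = Any[]
for chunk in read_chunks("data.sas7bdat"; chunksize = 100_000)  # hypothetical
    for row in chunk
        row[:VALUE1] > 0 && push!(results, row)
    end
end
```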