
Apply filters when reading files #26

Open
tbeason opened this issue Feb 2, 2018 · 6 comments

@tbeason
Contributor

tbeason commented Feb 2, 2018

I think it should be possible to include filters via keyword args in the readsas/open function. For example:

readsas("data.sas7bdat", include_columns=[:DATE, :ID, :VALUE1], filters=[Date(1960) <= :DATE <= Date(2000), :ID in (1, 2, 3, 4, 10), :VALUE1 > 0])

It doesn't feel very efficient to read in a large data set and then filter away half of the observations. My guess is that it has to be much faster to just iterate line by line (or chunk by chunk) and apply the filters as the file is read.

In practice, this would be much more beneficial (IMO) than trying to edge out a few more performance improvements that each speed up reading by a few percent.

@tk3369
Owner

tk3369 commented Feb 11, 2018

Having a declarative expression looks nice, but it seems a bit tricky to implement and not very flexible.

Let's say we design a functional interface that takes a Dict of (column_symbol => value) pairs and returns a Bool indicating whether the row should be included in the results:

f(x::Dict)::Bool

Then, you could do something like:

rowfilter(x) = x[:DATE] >= Date(1960) && x[:DATE] <= Date(2000) && x[:ID] in (1,2,3,4,5)
readsas("data.sas7bdat", filter_fn=rowfilter)

There will be a slight performance hit as the reader will have to put the data into a Dict and then call the filter function. We can possibly reuse the Dict so it doesn't allocate like crazy.
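
For illustration, here is a minimal sketch of how the reader loop might apply such a callback while reusing a single Dict. The read_row_values! helper and the filter_fn keyword are hypothetical placeholders, not the current API:

function read_filtered(column_symbols, nrows; filter_fn = row -> true)
    row = Dict{Symbol,Any}()                    # reused for every row to limit allocations
    kept = Vector{Dict{Symbol,Any}}()
    for i in 1:nrows
        empty!(row)
        read_row_values!(row, column_symbols, i)    # hypothetical: decode row i into `row`
        filter_fn(row) || continue                  # skip rows the callback rejects
        push!(kept, copy(row))                      # copy because `row` is reused on the next iteration
    end
    return kept
end

In the real reader the kept rows would go into column vectors rather than a Vector of Dicts; the point is just the reuse of the row Dict and the early continue.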

@tk3369
Owner

tk3369 commented Apr 1, 2018

This is a little tricky to get right:

The code currently pre-allocates all column arrays upfront, as we know how many rows are in the file after reading the header. Since the number of rows being returned is unknown, we would have to allocate small arrays and grow them as we retrieve more data that satisfy the filter conditions. If we do it one row at a time then it will be a huge performance hit. If we grow in chunks (or exponentially) then we would still need to truncate the unused cells at the end.

We may need to use some heuristic again depending on the number of rows (N) and/or the size (S) of the file. If S is large (say, > 1 GiB) then allocate 20% (i.e. N/5) and grow it 20% at a time. That limits the number of reallocations to 4. If S is not large, then just allocate 100%. At the end, reallocate one last time to smaller arrays.
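
To make the heuristic concrete, a rough sketch (the 1 GiB threshold and 20% growth factor are just the numbers from this comment; the function names are illustrative only):

initial_capacity(nrows, filesize) = filesize > 2^30 ? cld(nrows, 5) : nrows   # large file: start at N/5, otherwise allocate 100%

grow!(col::Vector, nrows) = resize!(col, min(length(col) + cld(nrows, 5), nrows))   # add another 20% of N, capped at N

shrink!(col::Vector, nkept) = resize!(col, nkept)   # final pass: drop the unused cells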

@tk3369 tk3369 changed the title Suggestion: apply filters when reading files Apply filters when reading files Apr 1, 2018
@davidanthoff
Contributor

If you push! into a vector, it will actually allocate memory in chunks. I think whenever it runs out of memory for a vector, it will allocate one with twice the memory (that ratio might be wrong). So if you just push!, it should be a relatively efficient algorithm by default.

We might also revisit a streaming read mode, i.e. where rows are iterated; then this issue would not come up inside this package at all.
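
To sketch that push!-based variant (same hypothetical read_row_values! helper as above; the column set is just an example):

using Dates    # only for the example Date column

function collect_filtered(nrows; filter_fn = row -> true)
    dates = Date[]                      # start empty; push! grows the buffers geometrically
    values = Float64[]
    row = Dict{Symbol,Any}()
    for i in 1:nrows
        empty!(row)
        read_row_values!(row, (:DATE, :VALUE1), i)   # hypothetical decoding helper
        filter_fn(row) || continue
        push!(dates, row[:DATE])
        push!(values, row[:VALUE1])
    end
    return (DATE = dates, VALUE1 = values)           # columns end up exactly the size of the kept rows
end

No final truncation pass is needed here, because the columns only ever hold rows that passed the filter.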

@tk3369
Owner

tk3369 commented Aug 15, 2018

I like the idea of streaming read mode, perhaps integrating with Queryverse. Don't know how to do that yet. I can revisit that once the TableTraits/Tables dust is settled 🙂

@davidanthoff
Contributor

Yes, let me just reiterate that I'd love to use this package as the backend for StatFiles.jl! I'm swamped right now and first need to get everything up and running on 0.7/1.0 before I think about new stuff, but I'd love to see that kind of integration.

@tk3369
Owner

tk3369 commented Aug 26, 2018

No problem! 0.7/1.0 is definitely the priority. Thanks for your hard work and contribution to the Julia ecosystem.
