
Apply filters when reading files #26

Open
tbeason opened this issue Feb 2, 2018 · 6 comments

@tbeason
Contributor

tbeason commented Feb 2, 2018

I think it should be possible to include filters via keyword args in the readsas/open function. For example:

readsas("data.sas7bdat", include_columns=[:DATE, :ID, :VALUE1], filters=[Date(1960) <= :DATE <= Date(2000), :ID in (1, 2, 3, 4, 10), :VALUE1 > 0])

It doesn't feel very efficient to read in a large data set and then filter away half of the observations. My guess is that it has to be much faster to just iterate line by line (or chunk by chunk) and apply the filters as the file is read.

In practice, this would be much more beneficial (IMO) than trying to edge out a few more performance improvements that each speed up reading by a few percent.

@tk3369
Owner

tk3369 commented Feb 11, 2018

Having a declarative expression looks nice, but it seems a bit tricky to implement and not very flexible.

Let's say we design a functional interface that takes a Dict of (column_symbol => value) pairs and returns a Bool indicating whether the row should be included in the results:

f(x::Dict)::Bool

Then, you could do something like:

rowfilter(x) = x[:DATE] >= Date(1960) && x[:DATE] <= Date(2000) && x[:ID] in (1,2,3,4,5)
readsas("data.sas7bdat", filter_fn=rowfilter)

There will be a slight performance hit as the reader will have to put the data into a Dict and then call the filter function. We can possibly reuse the Dict so it doesn't allocate like crazy.
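
For illustration, here is a minimal sketch of how the reader loop might apply such a callback while reusing a single Dict. The read_row_values! helper and the filter_fn keyword are hypothetical placeholders, not the current API:

function read_filtered(column_symbols, nrows; filter_fn = row -> true)
    row = Dict{Symbol,Any}()                    # reused for every row to limit allocations
    kept = Vector{Dict{Symbol,Any}}()
    for i in 1:nrows
        empty!(row)
        read_row_values!(row, column_symbols, i)    # hypothetical: decode row i into `row`
        filter_fn(row) || continue                  # skip rows the callback rejects
        push!(kept, copy(row))                      # copy because `row` is reused on the next iteration
    end
    return kept
end

In the real reader the kept rows would go into column vectors rather than a Vector of Dicts; the point is just the reuse of the row Dict and the early continue.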

@tk3369
Owner

tk3369 commented Apr 1, 2018

This is a little tricky to get right:

The code currently pre-allocates all column arrays upfront, as we know how many rows are in the file after reading the header. Since the number of rows being returned is unknown, we would have to allocate small arrays and grow them as we retrieve more data that satisfy the filter conditions. If we do it one row at a time then it will be a huge performance hit. If we grow in chunks (or exponentially) then we would still need to truncate the unused cells at the end.

We may need to use some heuristic again depending on the number of rows (N) and/or the size (S) of the file. If S is large (say, > 1 GiB) then allocate 20% (i.e. N/5) and grow it 20% at a time. That limits the number of reallocations to 4. If S is not large, then just allocate 100%. At the end, reallocate one last time to smaller arrays.
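
To make the heuristic concrete, a rough sketch (the 1 GiB threshold and 20% growth factor are just the numbers from this comment; the function names are illustrative only):

initial_capacity(nrows, filesize) = filesize > 2^30 ? cld(nrows, 5) : nrows   # large file: start at N/5, otherwise allocate 100%

grow!(col::Vector, nrows) = resize!(col, min(length(col) + cld(nrows, 5), nrows))   # add another 20% of N, capped at N

shrink!(col::Vector, nkept) = resize!(col, nkept)   # final pass: drop the unused cells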

@tk3369 tk3369 changed the title Suggestion: apply filters when reading files Apply filters when reading files Apr 1, 2018
@davidanthoff
Contributor

If you push! into a vector, it will actually allocate memory in chunks. I think whenever it runs out of memory for a vector, it will allocate one with twice the memory (that ratio might be wrong). So if you just push!, it should be a relatively efficient algorithm by default.

We might also revisit a streaming read mode, i.e. where rows are iterated; then this issue would not come up inside this package at all.
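
To sketch that push!-based variant (same hypothetical read_row_values! helper as above; the column set is just an example):

using Dates    # only for the example Date column

function collect_filtered(nrows; filter_fn = row -> true)
    dates = Date[]                      # start empty; push! grows the buffers geometrically
    values = Float64[]
    row = Dict{Symbol,Any}()
    for i in 1:nrows
        empty!(row)
        read_row_values!(row, (:DATE, :VALUE1), i)   # hypothetical decoding helper
        filter_fn(row) || continue
        push!(dates, row[:DATE])
        push!(values, row[:VALUE1])
    end
    return (DATE = dates, VALUE1 = values)           # columns end up exactly the size of the kept rows
end

No final truncation pass is needed here, because the columns only ever hold rows that passed the filter.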

@tk3369
Owner

tk3369 commented Aug 15, 2018

I like the idea of streaming read mode, perhaps integrating with Queryverse. Don't know how to do that yet. I can revisit that once the TableTraits/Tables dust is settled 🙂

@davidanthoff
Contributor

Yes, let me just reiterate that I'd love to use this package as the backend for StatFiles.jl! I'm swamped right now and first need to get everything up and running on 0.7/1.0 before I think about new stuff, but I'd love to see that kind of integration.

@tk3369
Owner

tk3369 commented Aug 26, 2018

No problem! 0.7/1.0 is definitely the priority. Thanks for your hard work and contribution to the Julia ecosystem.
