Proposal: `Pair` iterator to create tables with keys and to implement groupby/groupreduce #139

piever · 2018-03-23T16:14:46Z

I've thought more about how to implement all the various IndexedTables operations in an "iterator" way, and it seems to me that collect_columns is not quite sufficient as most operations will be iterating both keys and values. Even for map it would be nice to have a way to denote that some columns of the output will be primary.

Here is my proposal:

Special case collect_columns to the case where the eltype is Pair{T1, T2} where {T1<:Tup, T2}, which would be collected as a Pair of Columns (or a Pair of Columns and Array if T2 is not a Tup. This is not too hard to do, one just needs to overload Base.setindex!, Base.push!, Base.eltype for Pair{<:Columns, <:Any} (which I hope is not type piracy as we own Columns)
If necessary, add some methods convert(::NextTable, ::Pair{<:Columns, <:Any}) and convert(::NDSparse, ::Pair{<:Columns, <:Any})
Reimplement groupby, groupreduce and join using iterators of Pairs of Tup and collect_columns
In map accept functions that return Pair of Tup, in which case collect as table with primary keys
In the table(iterator) constructor, accept iterators that iterate Pair of Tup, in which case collect as table with primary keys

The text was updated successfully, but these errors were encountered:

davidanthoff · 2018-03-23T19:46:25Z

Yeeeesss! I was just going to suggest this a couple of days ago (again, actually, see queryverse/IterableTables.jl#26, which I believe is the same idea?).

davidanthoff · 2018-03-24T03:07:24Z

Although, I think the iterators shouldn't return pairs, I think they should just continue to return named tuples, I think otherwise the syntax for most operations would get really weird. So I'm a huge fan of having collect style stuff accept pairs, but not to iterate pairs.

piever · 2018-03-24T10:59:19Z

Yes, that's the same as your issue from last year :)

I also agree that one should iterate NamedTuples, not pairs... In terms of implementation it may be useful to have a ColumnsPair object (though possibly not exported for now) that would iterate pairs and be used to collect this pairs, plus methods:

convert(NextTable, ColumnsPair)
convert(NDSparse, ColumnsPair)

that do not copy the data.

davidanthoff · 2018-03-25T02:39:45Z

Yes, that is a cool idea. Another option once 0.7 is out would be to have a custom row type that implements getproperty, so essentially behaves like a named tuple, but also contains information about indices. When you for example just filter, one could recover the index information from the iterated rows.

shashi · 2018-03-26T06:45:24Z

Special case collect_columns to the case where the eltype is Pair{T1, T2} where {T1<:Tup, T2}, which would be collected as a Pair of Columns (or a Pair of Columns and Array if T2 is not a Tup. This is not too hard to do, one just needs to overload Base.setindex!, Base.push!, Base.eltype for Pair{<:Columns, <:Any} (which I hope is not type piracy as we own Columns)

I think it helps to go back to the purpose of Columns: to provide Columnar storage for array of structs.

So I think we should use Columns{T<:Pair, S<:Pair} type where the column vectors themselves are stored in a Pair of vectors (or Pair of Columns -- seems like this would make the implementation simpler.). Of course the current restriction that T<:Tup is mostly to say we haven't thought about other uses of Columns yet.

If necessary, add some methods convert(::NextTable, ::Pair{<:Columns, <:Any}) and convert(::NDSparse, ::Pair{<:Columns, <:Any})

this would become convert(::NextTable, Columns{<:Pair}) then.

In map accept functions that return Pair of Tup, in which case collect as table with primary keys

Sounds fine. However I agree that iteration should avoid Pair (in other words, map should get NamedTuple as it does now and not Pair by default). It's a bit unfortunate that we already use => for selectors, or else we could have had a selector Keys() => Values() which iterates key value pairs. But I wouldn't worry too much about it now.

In the table(iterator) constructor, accept iterators that iterate Pair of Tup, in which case collect as table with primary keys

Sounds good.

Thanks for opening this great issue!

piever · 2018-03-26T10:33:50Z

Good point, Columns{<:Pair, <:Pair} makes more sense than Pair{<:Columns, <:Columns} and solves the type piracy issue, that's definitely the way to go.

piever · 2018-03-26T11:31:17Z

I wonder what should c.columns be for c::Columns{<:Pair, <:Pair}: do we want it to be Pair{Columns, Columns} or Pair{Tuple of vectors, Tuple of vectors}? Somehow I feel that the first option would make it easier to implement most methods as an obvious extension of the methods for Columns{<:Tup, <:Tup}.

EDIT: ups, I've just seen that you're recommending the same above... We should probably go for the Pair of Columns storage rather than Pair of Tuple of Vectors

shashi · 2018-03-26T12:21:48Z

Good question. Yeah it seems Pair{Columns, Columns} makes some things a little more natural. E.g. getindex will just work if implemented with map on c.columns.

piever · 2018-03-26T12:58:38Z

The only annoyance is that for some reason there is no method map(f, p::Pair), see #10936, but we can just define map_pair(f, p::Pair) = f(p.first) => f(p.second) and equal to map otherwise.

UPDATE: after some experimenting it seems that storing as a pair of Columns simplifies things considerably. I'll try to add at least enough methods to it to make sure that one can use it to run collect_columns on a pair iterator. We should maybe only start documenting it when we're sure that all methods that run on Columns give sensible results for Columns of pairs.

shashi · 2018-04-02T05:02:59Z

Can be closed now right?

piever · 2018-04-02T08:13:11Z

I think it's still missing the NDSparse constructor from the Pair iterable. Should be straightforward but I'm only unsure whether it should be implemented as NDSparse or ndsparse as now I see a mixture of both (the TableTraits integration used NDSparse, but other things seem to use ndsparse)

piever · 2018-04-03T10:42:21Z

Superseeded by #151

piever mentioned this issue Mar 26, 2018

Define and collect Columns{Pair} #140

Merged

piever mentioned this issue Apr 2, 2018

move group algorithms to collect_columns technique #150

Merged

piever closed this as completed Apr 3, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: `Pair` iterator to create tables with keys and to implement groupby/groupreduce #139

Proposal: `Pair` iterator to create tables with keys and to implement groupby/groupreduce #139

piever commented Mar 23, 2018

davidanthoff commented Mar 23, 2018

davidanthoff commented Mar 24, 2018

piever commented Mar 24, 2018

davidanthoff commented Mar 25, 2018

shashi commented Mar 26, 2018

piever commented Mar 26, 2018

piever commented Mar 26, 2018 •

edited

Loading

shashi commented Mar 26, 2018

piever commented Mar 26, 2018 •

edited

Loading

shashi commented Apr 2, 2018

piever commented Apr 2, 2018

piever commented Apr 3, 2018

Proposal: Pair iterator to create tables with keys and to implement groupby/groupreduce #139

Proposal: Pair iterator to create tables with keys and to implement groupby/groupreduce #139

Comments

piever commented Mar 23, 2018

davidanthoff commented Mar 23, 2018

davidanthoff commented Mar 24, 2018

piever commented Mar 24, 2018

davidanthoff commented Mar 25, 2018

shashi commented Mar 26, 2018

piever commented Mar 26, 2018

piever commented Mar 26, 2018 • edited Loading

shashi commented Mar 26, 2018

piever commented Mar 26, 2018 • edited Loading

shashi commented Apr 2, 2018

piever commented Apr 2, 2018

piever commented Apr 3, 2018

Proposal: `Pair` iterator to create tables with keys and to implement groupby/groupreduce #139

Proposal: `Pair` iterator to create tables with keys and to implement groupby/groupreduce #139

piever commented Mar 26, 2018 •

edited

Loading

piever commented Mar 26, 2018 •

edited

Loading