Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Pair iterator to create tables with keys and to implement groupby/groupreduce #139

Closed
piever opened this issue Mar 23, 2018 · 12 comments

Comments

@piever
Copy link
Collaborator

piever commented Mar 23, 2018

I've thought more about how to implement all the various IndexedTables operations in an "iterator" way, and it seems to me that collect_columns is not quite sufficient as most operations will be iterating both keys and values. Even for map it would be nice to have a way to denote that some columns of the output will be primary.

Here is my proposal:

  • Special case collect_columns to the case where the eltype is Pair{T1, T2} where {T1<:Tup, T2}, which would be collected as a Pair of Columns (or a Pair of Columns and Array if T2 is not a Tup. This is not too hard to do, one just needs to overload Base.setindex!, Base.push!, Base.eltype for Pair{<:Columns, <:Any} (which I hope is not type piracy as we own Columns)

  • If necessary, add some methods convert(::NextTable, ::Pair{<:Columns, <:Any}) and convert(::NDSparse, ::Pair{<:Columns, <:Any})

  • Reimplement groupby, groupreduce and join using iterators of Pairs of Tup and collect_columns

  • In map accept functions that return Pair of Tup, in which case collect as table with primary keys

  • In the table(iterator) constructor, accept iterators that iterate Pair of Tup, in which case collect as table with primary keys

@davidanthoff
Copy link
Contributor

Yeeeesss! I was just going to suggest this a couple of days ago (again, actually, see queryverse/IterableTables.jl#26, which I believe is the same idea?).

@davidanthoff
Copy link
Contributor

Although, I think the iterators shouldn't return pairs, I think they should just continue to return named tuples, I think otherwise the syntax for most operations would get really weird. So I'm a huge fan of having collect style stuff accept pairs, but not to iterate pairs.

@piever
Copy link
Collaborator Author

piever commented Mar 24, 2018

Yes, that's the same as your issue from last year :)

I also agree that one should iterate NamedTuples, not pairs... In terms of implementation it may be useful to have a ColumnsPair object (though possibly not exported for now) that would iterate pairs and be used to collect this pairs, plus methods:

convert(NextTable, ColumnsPair)
convert(NDSparse, ColumnsPair)

that do not copy the data.

@davidanthoff
Copy link
Contributor

Yes, that is a cool idea. Another option once 0.7 is out would be to have a custom row type that implements getproperty, so essentially behaves like a named tuple, but also contains information about indices. When you for example just filter, one could recover the index information from the iterated rows.

@shashi
Copy link
Collaborator

shashi commented Mar 26, 2018

Special case collect_columns to the case where the eltype is Pair{T1, T2} where {T1<:Tup, T2}, which would be collected as a Pair of Columns (or a Pair of Columns and Array if T2 is not a Tup. This is not too hard to do, one just needs to overload Base.setindex!, Base.push!, Base.eltype for Pair{<:Columns, <:Any} (which I hope is not type piracy as we own Columns)

I think it helps to go back to the purpose of Columns: to provide Columnar storage for array of structs.

So I think we should use Columns{T<:Pair, S<:Pair} type where the column vectors themselves are stored in a Pair of vectors (or Pair of Columns -- seems like this would make the implementation simpler.). Of course the current restriction that T<:Tup is mostly to say we haven't thought about other uses of Columns yet.

If necessary, add some methods convert(::NextTable, ::Pair{<:Columns, <:Any}) and convert(::NDSparse, ::Pair{<:Columns, <:Any})

this would become convert(::NextTable, Columns{<:Pair}) then.

In map accept functions that return Pair of Tup, in which case collect as table with primary keys

Sounds fine. However I agree that iteration should avoid Pair (in other words, map should get NamedTuple as it does now and not Pair by default). It's a bit unfortunate that we already use => for selectors, or else we could have had a selector Keys() => Values() which iterates key value pairs. But I wouldn't worry too much about it now.

In the table(iterator) constructor, accept iterators that iterate Pair of Tup, in which case collect as table with primary keys

Sounds good.

Thanks for opening this great issue!

@piever
Copy link
Collaborator Author

piever commented Mar 26, 2018

Good point, Columns{<:Pair, <:Pair} makes more sense than Pair{<:Columns, <:Columns} and solves the type piracy issue, that's definitely the way to go.

@piever
Copy link
Collaborator Author

piever commented Mar 26, 2018

I wonder what should c.columns be for c::Columns{<:Pair, <:Pair}: do we want it to be Pair{Columns, Columns} or Pair{Tuple of vectors, Tuple of vectors}? Somehow I feel that the first option would make it easier to implement most methods as an obvious extension of the methods for Columns{<:Tup, <:Tup}.

EDIT: ups, I've just seen that you're recommending the same above... We should probably go for the Pair of Columns storage rather than Pair of Tuple of Vectors

@shashi
Copy link
Collaborator

shashi commented Mar 26, 2018

Good question. Yeah it seems Pair{Columns, Columns} makes some things a little more natural. E.g. getindex will just work if implemented with map on c.columns.

@piever
Copy link
Collaborator Author

piever commented Mar 26, 2018

The only annoyance is that for some reason there is no method map(f, p::Pair), see #10936, but we can just define map_pair(f, p::Pair) = f(p.first) => f(p.second) and equal to map otherwise.

UPDATE: after some experimenting it seems that storing as a pair of Columns simplifies things considerably. I'll try to add at least enough methods to it to make sure that one can use it to run collect_columns on a pair iterator. We should maybe only start documenting it when we're sure that all methods that run on Columns give sensible results for Columns of pairs.

@shashi
Copy link
Collaborator

shashi commented Apr 2, 2018

Can be closed now right?

@piever
Copy link
Collaborator Author

piever commented Apr 2, 2018

I think it's still missing the NDSparse constructor from the Pair iterable. Should be straightforward but I'm only unsure whether it should be implemented as NDSparse or ndsparse as now I see a mixture of both (the TableTraits integration used NDSparse, but other things seem to use ndsparse)

@piever
Copy link
Collaborator Author

piever commented Apr 3, 2018

Superseeded by #151

@piever piever closed this as completed Apr 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants