Feedback from package authors #2
Bunch of updates, new examples in the README. Added support for SplitApplyCombine @andyferris (though not tested yet) |
Again sorry for the pings I just get overly excited about programming lol |
Ok so I'm basically happy with what's here. I've got a chaining mechanism, a lazy call mechanism, and several model query verbs. I've got extremely unoptimized methods for these verbs on NamedTables. My hope is that people will import these methods and optimize them for various data structures, so that essentially DataFramesMeta could just hold the DataFrames methods and QueryOperators could just hold the Enumerable methods. |
Hi @bramtayl, It's reassuring, in many ways, to see so many of us interested in the same set of problems. :) And it's great to be excited! What I didn't really get (until maybe a hint in your most recent comment) was what the overall design philosophy was. Some kind of introduction in the README or documentation would probably be quite useful. Similarly, an example that shows how it all comes together. I'm taking this package as a convenient interface to manipulate Anyway, I'm happy to discuss - let us know how we can be useful.
Can you elaborate on this?
I'm not sure what NamedTables refers to, exactly? (I do love that you are thinking of these functions as "verbs". I've come to think of SplitApplyCombine as providing verbs, TypedTables as providing nouns and AcceleratedArrays as providing "adverbs". I'm not sure if that is silly?) |
NamedTables was a typo for NamedTuples. So basically this is just a straightforward port of dplyr from R. The methods only work on NamedTuples right now, but I think that they could work on just about anything vaguely tabular. QueryOperators has a set of unexported verbs with methods on Enumerables, and DataFramesMeta has a set of unexported verbs with methods on DataFrames. If instead both packages just extended the methods in this package, then we could have one uniform tabular query interface? Previous iterations of this package had actually sat down and created these bridges. So for example, I had:

group(data::Enumerable, n::Nameless) =
    QueryOperators.groupby(data::Enumerable, n.f, n.expression)
group(data::AbstractDataFrame, n) =
    DataFrames.groupby(data, n)
group(data, n) =
    SplitApplyCombine.group(n, data)

These are gone from the package at the moment by Occam's razor. But, for example, with a bit of elbow grease, I think most if not all of QueryOperators could be refactored as methods of LightQuery verbs. SplitApplyCombine is a bit trickier cause it's not explicitly indicated for tabular data... I haven't taken a look at JuliaDB yet cause last time I checked it wasn't working on 1.0, but I'm pretty sure it could be integrated in a similar way. |
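To make the "one verb, many backends" idea concrete without pulling in any of the packages named above, here is a minimal sketch. Everything in it is hypothetical: `group` is a stand-in verb with a slow generic fallback, and `ColumnTable` is an invented column-oriented container playing the role of a backend that adds its own optimized method. Note that the backend method returns row indices rather than rows, purely to illustrate a backend-specific shortcut.

```julia
# Generic fallback: works on any iterable of rows. Deliberately unoptimized.
function group(rows, key)
    groups = Dict{Any, Vector{Any}}()
    for row in rows
        push!(get!(() -> Any[], groups, key(row)), row)
    end
    return groups
end

# Hypothetical column-oriented container standing in for a real table type.
struct ColumnTable{C <: NamedTuple}
    columns::C
end

# Backend-specific method: scans a single column and records row indices
# per distinct value, instead of materializing every row.
function group(table::ColumnTable, name::Symbol)
    column = getproperty(table.columns, name)
    groups = Dict{eltype(column), Vector{Int}}()
    for (index, value) in enumerate(column)
        push!(get!(() -> Int[], groups, value), index)
    end
    return groups
end

rows = [(a = 1, b = "x"), (a = 2, b = "y"), (a = 1, b = "z")]
by_a = group(rows, row -> row.a)

table = ColumnTable((a = [1, 2, 1], b = ["x", "y", "z"]))
index_by_a = group(table, :a)
```

Callers use the same verb name either way; dispatch on the first argument picks the implementation.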
OK - thanks for the explanation.
It would be good to share the same fundamental operations, yes. I strongly feel that these should eventually become operations in
Yes, indeed, my personal approach so far has been to see what generic operations are good for doing relational algebra. For example, if you take the textbook definition of a relation as a collection of (named) tuples, port that to Julia and say that this is any object that supports If we can fill this out with all the |
I'm with you that row-wise operations are basically reducible to Base iterators. I'm still pretty sure you need specific interfaces for tabular data (or at least, they make things easier). Certainly select, remove, transform, and based_on are all specific to tabular data. And if you want to just be able to just say So bottom line: relational algebra standard library: yes! This is kinda what I tried to set out to do here but I think it would be better to be a stdlib (or at least something more official) |
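As a rough illustration of the claim that row-wise operations reduce to Base iterators, the sketch below treats a table as a plain `Vector` of `NamedTuple`s and uses only Base functions. The data and the comments mapping each step to a dplyr-style verb are assumptions for illustration, not code from any of the packages discussed.

```julia
rows = [(name = "a", x = 1), (name = "b", x = 5), (name = "c", x = 3)]

# "where": keep rows matching a predicate (lazy).
kept = Iterators.filter(row -> row.x > 2, rows)

# "transform": add a derived field to each row (lazy generator).
widened = (merge(row, (y = 2 * row.x,)) for row in kept)

# "order_by": materialize, then sort on a field.
result = sort(collect(widened), by = row -> row.x)
```

What Base does not give you directly is the column-name sugar: every step above needs a hand-written closure over `row`.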
Yes, indeed - specific interfaces for tabular data will make things much, much more usable. I guess I am thinking these interfaces should act as syntax sugar. Functions which take symbol names to identify columns would simply create closures or whatever and call higher-order functions like |
Ok, in that case, how about this for a proposal:

struct Keys{Names} end
Keys(names::Symbol...) = Keys{names}()
Keys(:a, :b, :c)

You could pass in Keys instead of an anonymous function into innerjoin, group, orderby, etc. and use dispatch to get the desired tabular data specific method? |
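A minimal sketch of how that dispatch could look, assuming rows are `NamedTuple`s. The `Keys` type is the one proposed above; `select_row` and `group_by` are hypothetical names invented here, not functions from LightQuery or any other package.

```julia
struct Keys{Names} end
Keys(names::Symbol...) = Keys{names}()

# Pick the named fields out of a single row as a smaller NamedTuple.
select_row(row::NamedTuple, ::Keys{Names}) where {Names} = NamedTuple{Names}(row)

# Generic method: `key` is any function of a row.
function group_by(rows, key)
    groups = Dict{Any, Vector{eltype(rows)}}()
    for row in rows
        push!(get!(() -> eltype(rows)[], groups, key(row)), row)
    end
    return groups
end

# Tabular method, chosen by dispatch when a `Keys` is passed instead of a function.
group_by(rows, columns::Keys) = group_by(rows, row -> select_row(row, columns))

rows = [(a = 1, b = 1, c = "x"), (a = 1, b = 2, c = "y"), (a = 1, b = 1, c = "z")]
by_columns = group_by(rows, Keys(:a, :b))
by_function = group_by(rows, row -> row.c)
```

Here the `Keys` method only builds a closure and delegates, but it is also the hook where a column-aware optimization could be added later.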
So then here's what happens if you start trying to delete functions from LightQuery:

Don't really need any more:
as_rows/as_columns/pretty: probably better suited to a specifically built tabular data interface like TypedTables
where: filter + columnwise optimization

I think still useful:
select/remove: really functions which should exist in Base, but can stay here for now? rename would be great here too. |
Sure, something exactly like that. I have Ideally we'd make a more powerful |
True, but different functions would want to do different things with different columns. Like groupby would just want to select the columns. orderby would want to select the columns and then run isless. inner_join would want to select the columns and then test for inequality. So I think a dedicated Keys struct makes sense? |
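One way to read this point: the column selection is a shared first step, and each verb then does something different with the selected values. The sketch below assumes the `Keys` type proposed earlier in the thread; `key_values`, `order_by`, and `same_key` are hypothetical helper names used only for illustration.

```julia
struct Keys{Names} end
Keys(names::Symbol...) = Keys{names}()

# Shared step: pull the key columns out of one row as a plain tuple.
key_values(row, ::Keys{Names}) where {Names} =
    map(name -> getproperty(row, name), Names)

# order_by: select the key columns, then let `sort` compare the tuples with `isless`.
order_by(rows, columns::Keys) = sort(rows, by = row -> key_values(row, columns))

# Join predicate: select the key columns on both sides, then compare them.
same_key(left, right, columns::Keys) =
    isequal(key_values(left, columns), key_values(right, columns))

rows = [(a = 2, b = 1), (a = 1, b = 2), (a = 1, b = 1)]
sorted = order_by(rows, Keys(:a, :b))
matched = same_key((a = 1, b = 2), (a = 1, c = 3), Keys(:a))
```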
Yes. (Note that I think TypedTables and SplitApplyCombine currently have the mechanics for columnar optimization for all the items on your list, so long as you are grouping or joining by just one column). |
I think the different steps could be composed? For example
So you do something like |
Ok, well I've got a (probably not constant inferable) version of rename now:

export rename
"""
    rename(data, renames...)

```jldoctest
julia> using LightQuery

julia> rename((a = 1, b = 2), :a => :c)
(b = 2, c = 1)
```
"""
function rename(data::NamedTuple, renames...)
    olds = map(pair -> pair.first, renames)
    merge(
        remove(data, olds...),
        NamedTuple{map(pair -> pair.second, renames)}(select(data, olds...)...)
    )
end

So what would you think about registering a request in Base for dedicated select, delete, and rename methods on NamedTuples (with someone over there working some constant propagation magic) Then I can just keep transform, based_on, and the two macros here as a tiny package? |
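For readers without LightQuery at hand, here is a self-contained sketch of the three NamedTuple operations using only Base. These `select`, `remove`, and `rename` definitions are stand-ins written for illustration; they are not LightQuery's implementations and make no attempt at the constant propagation discussed here.

```julia
# Keep only the named fields, in the order requested.
select(data::NamedTuple, names::Symbol...) = NamedTuple{names}(data)

# Drop the named fields, keeping the remaining ones in their original order.
function remove(data::NamedTuple, names::Symbol...)
    kept = filter(name -> !(name in names), keys(data))
    return NamedTuple{kept}(data)
end

# Rename fields given `old => new` pairs; renamed fields move to the end.
function rename(data::NamedTuple, renames::Pair{Symbol, Symbol}...)
    olds = map(first, renames)
    news = map(last, renames)
    return merge(remove(data, olds...), NamedTuple{news}(values(select(data, olds...))))
end

selected = select((a = 1, b = 2, c = 3), :c, :a)
removed = remove((a = 1, b = 2, c = 3), :b)
renamed = rename((a = 1, b = 2), :a => :c)
```

Because the field names are passed as runtime values, the result types are generally not inferable, which is exactly the gap a Base-level implementation would need to close.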
y_selector = x_selector |
Ultimately, yes. I suggest we first make a mini-interface for manipulation of objects with properties - what @quinnj calls the "PropertyAccessable" interface. We can do this in a small package with lots of prototyping to reduce churn in
Yeah, I'd love to work more on that... for example a natural inner join should be very easy to write (as easy as matrix multiplication). To give an idea of what I want, LINQ has a pretty front-end syntax that slightly resembles SQL (that we can implement with macros, as you want here, and as done in Query.jl and in DataFramesMeta.jl) and C# lowering just transforms these to normal method calls. My The important thing about LINQ is that it also works on non-tabular data... you can use all these methods to traverse XML and JSON and whatever data structures you have at hand. There are no assumptions of columns, or of named tuples, or any of that, and hence the methods in |
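To show what "very easy to write" might mean in practice, here is a naive natural inner join over two collections of `NamedTuple` rows. The function names are hypothetical and the nested-loop strategy is an assumption chosen for brevity (it costs one comparison per pair of rows); it is not how any of the packages mentioned implement joins.

```julia
# Field names that appear in both rows.
shared_names(l::NamedTuple, r::NamedTuple) = filter(name -> name in keys(r), keys(l))

# Two rows agree when every shared field holds an equal value.
agree(l, r) = all(name -> isequal(l[name], r[name]), shared_names(l, r))

# Natural inner join: merge every pair of rows that agree on their shared fields.
natural_join(left, right) = [merge(l, r) for l in left for r in right if agree(l, r)]

people = [(id = 1, name = "Ann"), (id = 2, name = "Bob")]
orders = [(id = 1, item = "pen"), (id = 1, item = "ink"), (id = 3, item = "cup")]
joined = natural_join(people, orders)
```

If the two sides share no field names, `agree` is always true and the result degenerates to a cross join.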
Ok cool I'm on board |
😄 |
Also +1 on natural joins |
Ok, so based off of this feedback, I've: removed all the rowwise functions (where, order_by, chunk_by, ungroup, inner_join) name, rename How does that look? What else would a property accessible interface need? |
I really want
I'm going to be swamped today, I'll have a dive in when I'm able. |
I think this all looks great, but what I don't understand is how this is different from the Query.jl/QueryOperators.jl design, in a broad sense? For example, https://github.com/queryverse/QueryOperators.jl/blob/master/src/operators.jl is where I've defined the basic query operators (or verbs) for quite a while, and then the whole idea of having different backends, an iterator based fallback implementation that works with not just tables but anything etc. is all what has been the core design of Query.jl for a couple of years now. I would really love to collaborate on all of this, but at the same time I would also very much not like to start from scratch, but ideally just evolve the existing implementations in Query/QueryOperators to gain new functionality. If there are some fundamental limitations in the design over there, it would be great to hear about them. |
I've greatly reduced the scope of the package at Andy's suggestion. At this point it is just 1) a basic interface for operations on a single NamedTuple and 2) a couple of useful macros for Query-ing. So this package is perfect for interfacing with QueryOperators; for example, this could work:

using DataFrames: DataFrame
using Query: query
import QueryOperators
using LightQuery
@> DataFrame(a = [1, 2, 3], b = [1.0, 2.0, 3.0]) |>
    query(_) |>
    QueryOperators.map(
        (@_ transform(_, c = @_ _.a + _.b))
        _
    ) |>
    collect(_)

QueryOperators.map will just need a |
And if you wanted to simplify the syntax a bit, you could just add a QueryOperators.transform convenience function which does the above:
Or just overload and reexport the transform that's here for ::Enumerable |
I've updated the package so it does constant propagation (mostly with a bunch of |
I fully agree we should agree on a common minimal API for these operations so that they can be used with any data structure.
Cc: @piever for JuliaDBMeta |
JuliaDB and JuliaDBMeta have just been ported to Julia 1.0. In general most things are row wise there, so macros for working with You may want to check https://github.com/JuliaData/TableOperations.jl, which is an attempt at implementing queries directly in terms of the |
I'm really looking forward to that one. |
Oooh TableOperations looks exciting. It looks a little less fully featured than I would like, though. Are there more functions planned (e.g. remove, rename, based_on, gather, spread, etc.)? |
I think that's where you come in :) More seriously, I planned to contribute some things but don't really have the resources right now. From what I understand @quinnj just put up a proof of concept to have a place where we can gather all the various implementations of things that can be expressed purely in this |
I mean that's kinda what I did here too. I'm happy to pitch in wherever I can. Down with dplyr. |
We've been working for a while on adding similar things to Query.jl/QueryOperators.jl, see queryverse/Query.jl#209 and queryverse/Query.jl#213. We originally created NamedTupleUtilities to hold all the utilities that make |
Then maybe it makes sense to merge the NamedTuples stuff from here into NamedTupleUtilities? Then this package would just sink back into just holding two macros. |
@andyferris did you get a chance to take another look? @piever I took a second look at TableOperations and it seems like it doesn't really quite pass Occam's razor. Why can't named-tuple-like structures simply overload the methods here (or if not here, then wherever standardized named tuple operations will live)? |
I'm not sure I follow, but I think the idea is that a row of a table is whatever object implements You may choose as a return type for your functions whichever object implements the interface (it can always be a |
Ok, got it, so then what I really need to do is go through the NamedTuple operations in LightQuery and see if I can reduce them to only use getproperty and propertynames, then PR the results to TableOperations? |
Ok, well I did a bunch of refactoring to get LightQuery to only use getproperty and propertynames. Still having two constant prop issues (one in rename, one in unname for structs) that I think are really things for Base to work on. |
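As a small illustration of what restricting yourself to `getproperty` and `propertynames` buys, the sketch below works on plain structs as well as `NamedTuple`s. `select_properties`, `to_named_tuple`, and the `Point` struct are hypothetical names for this example only, not the functions in LightQuery.

```julia
# Keep the named properties of any object that supports `getproperty`.
select_properties(data, names::Symbol...) =
    NamedTuple{names}(map(name -> getproperty(data, name), names))

# Convert any object that supports `propertynames` into a NamedTuple.
function to_named_tuple(data)
    names = propertynames(data)
    return NamedTuple{names}(map(name -> getproperty(data, name), names))
end

# A plain struct, used here as an example of a non-NamedTuple "row".
struct Point
    x::Int
    y::Int
end

from_struct = select_properties(Point(1, 2), :y)
whole_struct = to_named_tuple(Point(1, 2))
from_tuple = select_properties((a = 1, b = 2), :b, :a)
```

The names arrive as runtime values here, so type inference of the result depends on constant propagation, which is the kind of issue described in the comment above.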
Oops, forgot to push. It's up now. |
Got stalled here JuliaData/Tables.jl#47 |
@andyferris I put up a fuller version and I'm excited about it |
@bramtayl Sorry, I unfortunately haven't had much time for Julia in the last month. I will say that what you've got looks useful. I'm sure it's a collection of tools that help you get stuff done :) For example, (To explain why I emphasised useful: I've been avoiding writing anything too useful because I'm mostly trying to pursue/understand the right abstractions, which for me at least is a very slow process...) |
Useful sounds good to me. Hopefully the experiment here can help you figure out what the "right" abstractions are. |
So this is still experimental, but I'm dying for feedback from package authors. I hope this isn't rude. @davidanthoff @nalimilan I updated the readme to show how one macro can replace all of DataFramesMeta and Query. The syntax is slightly less terse but much more flexible.