add join #31
Conversation
Thanks for tackling this. A working implementation is better than no implementation. Still, I see what you mean when you call it ugly; there is a reason […]. An inner join can actually be relatively simple and clean (untested code):

```
def __next__(self):
    cdef PyObject *obj = NULL
    if not self.matches:
        while obj is NULL:
            self.right = next(self.rightseq)
            key = self.rightkey(self.right)
            obj = PyDict_GetItem(self.d, key)
        # copy before reversing so the list stored in self.d isn't mutated
        self.matches = list(<object>obj)
        self.matches.reverse()
    match = self.matches.pop()
    return (match, self.right)
```

I don't know if it's a good idea or faster to use `PyIter_Next` instead:

```
def __next__(self):
    cdef PyObject *obj
    obj = PyIter_Next(self.matches)
    if obj is not NULL:
        match = <object>obj
    else:
        # StopIteration is not considered an exception here.
        # What other exceptions can occur when iterating a list?
        # The following check is needed in general, however, and
        # is easy to forget.
        obj = PyErr_Occurred()
        if obj is not NULL:
            raise <object>obj
        while obj is NULL:
            self.right = next(self.rightseq)
            key = self.rightkey(self.right)
            obj = PyDict_GetItem(self.d, key)
        self.matches = iter(<object>obj)
        match = next(self.matches)
    return (match, self.right)
```

Using […]. I haven't considered outer joins yet, but maybe you can use something from here. It would be okay for […].
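As a cross-check of the control flow, here is the same inner-join state machine in pure Python, with the NULL checks and casts stripped away. This is illustrative only; `InnerJoin` is a made-up name, not a cytoolz class:

```python
class InnerJoin:
    """Pure-Python shape of the inner-join iterator sketched above."""
    def __init__(self, leftkey, leftseq, rightkey, rightseq):
        # self.d plays the role of the Cython class's dict of left matches.
        self.d = {}
        for left in leftseq:
            self.d.setdefault(leftkey(left), []).append(left)
        self.rightkey = rightkey
        self.rightseq = iter(rightseq)
        self.matches = []
        self.right = None

    def __iter__(self):
        return self

    def __next__(self):
        if not self.matches:
            matches = None
            while matches is None:
                self.right = next(self.rightseq)  # StopIteration ends the join
                matches = self.d.get(self.rightkey(self.right))
            self.matches = list(matches)  # copy so pop() doesn't mutate self.d
            self.matches.reverse()
        match = self.matches.pop()
        return (match, self.right)

# e.g. list(InnerJoin(len, ['a', 'bb'], len, ['c', 'dd'])) == [('a', 'c'), ('bb', 'dd')]
```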
Hmm, the version that iterates over […]:

```
def __next__(self):
    cdef PyObject *obj = NULL
    if self.i == len(self.matches):
        while obj is NULL:
            self.right = next(self.rightseq)
            key = self.rightkey(self.right)
            obj = PyDict_GetItem(self.d, key)
        self.matches = <object>obj
        self.i = 0
    match = <object>PyList_GET_ITEM(self.matches, self.i)  # skip error checking
    self.i += 1
    return (match, self.right)
```

This is beginning to split hairs.
A right outer join is also pretty simple:

```
def __next__(self):
    cdef PyObject *obj
    if self.i == len(self.matches):
        self.right = next(self.rightseq)
        key = self.rightkey(self.right)
        obj = PyDict_GetItem(self.d, key)
        if obj is NULL:
            return (self.left_default, self.right)
        self.matches = <object>obj
        self.i = 0
    match = <object>PyList_GET_ITEM(self.matches, self.i)  # skip error checking
    self.i += 1
    return (match, self.right)
```

Left outer join and full outer join are the more involved cases, and it gets pretty complicated when any of the joins are merged into a single function. Having the joins defined separately may be simpler to understand, because inner join and right outer join are so much easier to understand on their own. Some code repetition may be avoided by using the same initialization routine defined in a parent class. Subclassing from a single C extension class is straightforward. Heh, I hope you're not souring on Cython, @mrocklin. You always seem to tackle ugly, difficult cases!
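The extra bookkeeping a left outer join needs can be seen in a pure-Python sketch (hypothetical code, not this PR's implementation): after `rightseq` is exhausted, the left elements whose keys were never seen still have to be emitted with `right_default`, which is what the `d_items`/`seen_keys` loop quoted below deals with:

```python
from collections import defaultdict

def left_outer_join_sketch(leftkey, leftseq, rightkey, rightseq, right_default=None):
    d = defaultdict(list)   # key -> left elements, like self.d in the classes above
    for left in leftseq:
        d[leftkey(left)].append(left)
    seen_keys = set()
    for right in rightseq:
        key = rightkey(right)
        if key in d:
            seen_keys.add(key)
            for match in d[key]:
                yield (match, right)
    # The phase inner and right outer joins don't need: flush unmatched lefts.
    for key, matches in d.items():
        if key not in seen_keys:
            for match in matches:
                yield (match, right_default)
```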
```
key, matches = next(self.d_items)
while key in self.seen_keys and matches:
    key, matches = next(self.d_items)
self.key = key
```
`self.key` appears to be unnecessary.
I'm not sure that this is worth it.
OK, I've pushed up a version with different […].
I've changed my old join to match the full outer join and then put in a few checks to make it cover all cases. We now have two viable options: splitting the joins by type, or having one mega-join. Thoughts welcome.
Have you ever done line profiling in Cython? I'm curious how much overhead the `if` checks take up. I'm also generally interested in where the bottlenecks are.
I have not done line profiling in Cython yet. Doing so would have added too much time to the initial development of `cytoolz`. The thing I don't like about the mega-join version is its use of recursion for inner joins and left outer joins. The maximum recursion depth can be exceeded when there are many consecutive keys from `rightseq` that have no match. I'm not too concerned about optimizing […].
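To make the recursion concern concrete, here is a hedged pure-Python sketch (not the actual mega-join code): a `__next__` that recurses to skip unmatched right elements uses one stack frame per miss, so a long run of misses can hit Python's recursion limit, while the loop form runs in constant stack space:

```python
class SkipMisses:
    """Illustrative only: advancing past right elements with no match."""
    def __init__(self, d, rightkey, rightseq):
        self.d = d                      # key -> list of left matches
        self.rightkey = rightkey
        self.rightseq = iter(rightseq)

    def next_recursive(self):
        right = next(self.rightseq)     # StopIteration ends the join
        matches = self.d.get(self.rightkey(right))
        if matches is None:
            # One stack frame per unmatched element; roughly 1000
            # consecutive misses exceed CPython's default recursion limit.
            return self.next_recursive()
        return matches, right

    def next_iterative(self):
        while True:                     # same logic, constant stack depth
            right = next(self.rightseq)
            matches = self.d.get(self.rightkey(right))
            if matches is not None:
                return matches, right
```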
```
cdef class _inner_join(_join):
    def __iter__(self):
        self.matches = ()
```
Why is this necessary?
Isn't. Removed. It was old cruft that hung around.
One big case will be when […].
```
In [1]: from toolz.curried import *

In [2]: import toolz

In [3]: import cytoolz

In [4]: small = [(i, str(i)) for i in range(100)] * 10

In [5]: big = pipe([110]*10000, map(range), concat, list)

In [6]: from cytoolz.itertoolz import _consume

In [7]: timeit _consume(toolz.join(0, small, identity, big))
1 loops, best of 3: 1.12 s per loop

In [8]: timeit _consume(cytoolz.itertoolz.join(0, small, identity, big))
1 loops, best of 3: 521 ms per loop
```

First, hurrah, we got the magical 2x speedup. Second, why can't I import […]?
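Aside: `_consume` here is a helper that exhausts an iterable, so the `timeit` lines measure producing every joined pair rather than merely building the lazy iterator. Presumably it is equivalent to something like:

```python
def _consume(seq):
    # Pull every item and discard it; forces a lazy iterator to do all its work.
    for _ in seq:
        pass
```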
Cool, and this is with the mega-join version, right?
Add it to […].
I've been considering how best to do this with regard to both the mega version and the split version. The mega version would probably just use […].
It might be wise to try this out first on […].
BTW, what's your schedule like these days? I'm curious how aggressively I need to budget my time here before SciPy.
Agreed. Three subclasses--one for callable key, one for single index, and one for list of indices--will be needed for each function. This does not appeal to our Pythonic sensibilities. It is more boilerplate than we typically like to have, and the result for […]. Here is the approach I'm considering:

```
cdef class _join:
    cdef object _rightkey
    ...

cdef class _right_outer_join(_join):
    ...

cdef class _right_outer_join_key(_right_outer_join):
    cdef object rightkey
    # This must match the signature of `__cinit__` in `_join`
    def __cinit__(self,
                  object leftkey, object leftseq,
                  object rightkey, object rightseq,
                  object left_default=no_default,
                  object right_default=no_default):
        self.rightkey = rightkey

cdef class _right_outer_join_index(_right_outer_join):
    cdef inline object rightkey(self, object item):
        return item[self._rightkey]

cdef class _right_outer_join_indices(_right_outer_join):
    cdef object rightkey(self, object item):
        # make tuple
        ...
```

While not exactly pretty, it does achieve efficiency while avoiding duplicating most code. Hmm, I should make sure this works.
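A pure-Python analogue of the dispatch idea (names hypothetical, not the cytoolz API): normalize the key spec once, up front, so the per-element work never re-checks whether the key is a callable, a single index, or a list of indices:

```python
from operator import itemgetter

def make_keyfn(key):
    if callable(key):
        return key
    if isinstance(key, list):
        # Multiple indices: build a tuple of fields per element.
        return itemgetter(*key)
    # Single index.
    return itemgetter(key)

assert make_keyfn(1)((10, 20)) == 20
assert make_keyfn([0, 1])((10, 20, 30)) == (10, 20)
assert make_keyfn(len)("abc") == 3
```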
A quick groupby profile. We see that getting items with a lambda takes around 20% of the time in the toolz case and 35% in the cytoolz case.

```
In [1]: from toolz import groupby

In [2]: data = [(i, i % 10) for i in range(1000000)]

In [3]: prun -s cumulative groupby(lambda x: x[1], data)
         2000014 function calls in 0.489 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        1    0.015    0.015    0.489    0.489  <string>:1(<module>)
        1    0.300    0.300    0.475    0.475  itertoolz.py:56(groupby)
  1000000    0.113    0.000    0.113    0.000  <string>:1(<lambda>)
  1000000    0.061    0.000    0.061    0.000  {method 'append' of 'list' objects}
       10    0.000    0.000    0.000    0.000  itertoolz.py:81(<lambda>)
        1    0.000    0.000    0.000    0.000  {callable}
        1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}

In [4]: from cytoolz import groupby

In [5]: prun -s cumulative groupby(lambda x: x[1], data)
         1000003 function calls in 0.284 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        1    0.015    0.015    0.284    0.284  <string>:1(<module>)
        1    0.168    0.168    0.269    0.269  {cytoolz.itertoolz.groupby}
  1000000    0.101    0.000    0.101    0.000  <string>:1(<lambda>)
        1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}
```
Oh, I didn't know that cytoolz supported function inlining. That changes things. Do we need separate classes? Any chance we can move functions around dynamically at runtime? :)
Wait, no, I take it back. Significantly more gains seem possible. These results break the intuition I gained above. I'm now less confident in my ability to infer from profiling results.

```
In [9]: prun -s cumulative groupby(lambda x: x[1], data)
         1000003 function calls in 0.290 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        1    0.015    0.015    0.290    0.290  <string>:1(<module>)
        1    0.171    0.171    0.276    0.276  {cytoolz.itertoolz.groupby}
  1000000    0.104    0.000    0.104    0.000  <string>:1(<lambda>)
        1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}

In [10]: prun -s cumulative groupby(itemgetter(1), data)
         3 function calls in 0.099 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        1    0.015    0.015    0.099    0.099  <string>:1(<module>)
        1    0.084    0.084    0.084    0.084  {cytoolz.itertoolz.groupby}
        1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}
```
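In hindsight the asymmetry makes sense: `itemgetter` is implemented in C, so the profiler never sees a Python-level call for it, while every `lambda` call pays for a Python frame. The two key functions are behaviorally interchangeable; the gap above is pure call overhead:

```python
from operator import itemgetter

row = (7, "seven")
# Same result, different call machinery: itemgetter(1) stays in C,
# while the lambda enters the Python interpreter on every call.
assert itemgetter(1)(row) == (lambda x: x[1])(row) == "seven"
```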
Pretty flexible. I completely overdid it over Labor Day weekend: I overloaded my allergies, which had me out of commission for nearly two weeks (note to self: don't go on long bike rides when farmers are plowing fields!). I still have a lingering cough, but at least I can sleep now. Anyway, I've been catching up on other things, and life is returning to normal. So. Schedule. My talk on functional programming in Python with PyToolz was accepted for PyOhio. This is in 6-7 weeks, and I will begin drafting it soon. Heh, I also plan to give a Lightning Talk on […]. In other words, I'll be more active than I have been in recent weeks, and we can push to develop a version of […].
Sorry to hear about the allergies. I had a mercifully brief incident earlier this week. It's alarming how bad life gets when swallowing becomes painful. Great about the PyOhio talk! Let me know if I can help in any way.
Most of my concern about breaking out into multiple implementations was that I didn't have inlining in mind, and so was picturing a dozen parallel implementations of […].
Interesting profiling. How do […]?
C functions are not first class citizens, although pointers to C functions should work :-P
See #33 for further benchmark discussion.
Oh, right, we should simply use […].
Thanks! I'll definitely post the slides for review long before the conference. I'm wondering how much I should use (not copy/paste) from existing documentation and presentations. The examples from them are great. I think first I will need to develop the story arc for my presentation, then thread in suitable content from existing material. I want this talk to be my own, but it would be stupid to ignore available resources.
@eriknw thoughts on merging this (once I handle conflicts) without adding the more efficient bits? I'd like to publish the streaming analytics blogpost soon and might not have time to take care of the inlined solution here.
Merging a working version soon sounds great, and maybe I can explore the inline version once other issues are addressed. How soon do you want to publish the blogpost? I think we should push for a release of […].
The two things I would like […].
Also, you may want to wait until I merge #38 before resolving conflicts.
```
Conflicts:
    cytoolz/itertoolz.pyx
    cytoolz/tests/test_itertoolz.py
```
Conflicts resolved.
Excellent. Merging so we can move on to other things. We can explore the more efficient approach later.
This is ugly and likely inefficient. Help. `toolz.join` has some pretty complex `yield` stuff going on. The work here is pretty much just writing down state explicitly to handle all of that. It's pretty ugly, and I assume it's not any more efficient than the pure-Python solution. Tips appreciated.
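For readers new to the function being ported: `toolz.join` is a streaming relational join that pairs elements of two sequences whose keys, given as callables or as tuple indices, match. A small example of the semantics under discussion:

```python
from toolz import join

friends = [('Alice', 'Edith'), ('Alice', 'Zhao'), ('Edith', 'Alice')]
cities  = [('Alice', 'NYC'), ('Edith', 'Paris'), ('Zhao', 'Beijing')]

# Match element 1 of each left tuple against element 0 of each right tuple.
result = list(join(1, friends, 0, cities))
assert (('Alice', 'Edith'), ('Edith', 'Paris')) in result
assert (('Alice', 'Zhao'), ('Zhao', 'Beijing')) in result
```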