AutoMap initialization from np.ndarray appears to be much slower than from list (#6)
Comments
Internally, […] For lists, this is a simple […]:

```
>>> keys_list = list(range(10_000))
>>> keys_array = np.arange(10_000)
>>> %timeit list(keys_list)
21.5 µs ± 14 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit list(keys_array)
639 µs ± 7.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

I think the rest of the difference can be attributed to the hashing speed of a Python […]:

```
>>> %timeit set(keys_list)
107 µs ± 982 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> keys_array_list = list(keys_array)
>>> %timeit set(keys_array_list)
234 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

(Note that […])
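The hashing gap shown above can be reproduced with the stdlib `timeit` module. This is a sketch, not part of the original benchmarks; absolute numbers will vary by machine, but the ranking should hold:

```python
import timeit

import numpy as np

keys_list = list(range(10_000))            # Python ints
keys_array_list = list(np.arange(10_000))  # np.int64 scalars

# the two kinds of keys hash to the same values...
assert hash(keys_list[42]) == hash(keys_array_list[42])
assert set(keys_list) == set(keys_array_list)

# ...but building a hash table from NumPy scalars is noticeably slower
t_int = timeit.timeit(lambda: set(keys_list), number=100)
t_np = timeit.timeit(lambda: set(keys_array_list), number=100)
print(f"Python ints: {t_int:.4f}s  NumPy scalars: {t_np:.4f}s")
```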
I'm guessing that they (Pandas) store their keys as a NumPy array under-the-hood. Copying those is even faster than lists, since you just do the […]
Thanks for these thoughts. Independent of what […]
Yeah, that seems like a valuable avenue for exploration!
Hi @brandtbucher: I have some ideas for how to optimize the scenario of creating […]

My idea is to have AutoMap use […] Then, we implement a new […] What I am not sure about is whether this new […]

Of course, since we will be using the NumPy API this will also make the entire package dependent on NumPy, but I am familiar with what is necessary from ArrayKit. Let me know your thoughts; happy to have a chat if easier to discuss.
I'm still open to letting people explore this (especially since StaticFrame is our […]). For example, we benefit greatly from the knowledge that the […]

I'm also not entirely convinced of the benefit (something that an implementation and performance numbers alone can probably answer). We still have to build a hash table out of something, and without concrete Python objects, it seems to me that we either need to: […]
My current view is that this is a ton of work (and likely a slowdown for the non-NumPy cases) for something that probably won't perform much better than just having the caller do […]

TL;DR: Feel free to attempt an implementation, but I don't personally think it's worth our time, and I likely won't merge it if it adds significant complexity for negligible gain!
Many thanks, @brandtbucher, for your thoughts on this. As the overwhelming use case in StaticFrame is initialization of FrozenAutoMaps from immutable NumPy arrays, I still think this is worth exploring. Of course, I absolutely do not want significant complexity for negligible gain!

Your point about the additional cost of comparing and hashing NumPy scalars is certainly relevant. But I wonder if, instead of creating NumPy scalars, we can use the underlying byte data (for non-object dtypes) instead. This is kind of a shortcut to what you refer to as "intimate knowledge of how to compute hash and equality for the full spectrum of NumPy dtypes", but it is actually quite simple and performant to access and use for a known (non-object) dtype. I have not thought through all the implications; this is just preliminary speculation. I will let you know if I can make any progress on this.
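The byte-data idea can be sketched in pure Python. This is not AutoMap's implementation, just a minimal illustration of keying a table by each element's raw bytes; `build_byte_table` and `lookup` are hypothetical names:

```python
import numpy as np

def build_byte_table(arr: np.ndarray) -> dict:
    # key each element by its raw bytes; only valid for non-object dtypes,
    # where every element occupies a fixed-size chunk of the buffer
    assert arr.dtype != np.dtype(object)
    size = arr.dtype.itemsize
    raw = arr.tobytes()
    return {raw[i * size:(i + 1) * size]: i for i in range(len(arr))}

def lookup(table: dict, dtype: np.dtype, value) -> int:
    # a query must first be converted to the table's dtype so that its
    # bytes match the stored keys -- this shifts work to every access
    return table[np.asarray(value, dtype=dtype).tobytes()]

keys = np.arange(10, dtype=np.int64)
table = build_byte_table(keys)
assert lookup(table, keys.dtype, 7) == 7
```

Note how the per-access dtype conversion in `lookup` embodies exactly the "shifting work from initialization to element access" trade-off discussed below.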
It does seem attractive to just say "each n-length chunk of bytes is a key" and define simple(ish?) hash and equality functions for them, but that only works when compared to other chunks of bytes. As soon as types are involved, it gets much nastier. For example, how do I know if a (Python or NumPy) integer compares equal to a chunk of bytes in my floating-point array? If the answer is "convert it to a scalar of the correct dtype first, and use those bytes", then it seems like we're just shifting work from the initialization to every element access. AutoMap already goes to great lengths to avoid performing any extra work (allocations, equality comparisons between Python objects, etc.) during element access, and we still only beat Python's […]

If we're okay limiting element accesses to NumPy scalars of the exact same dtype, then sure, it could work and probably be performant. But that assumes that your users are okay with keeping any value that could potentially be looked up as a key in the mapping (strings, ints, etc.) as NumPy scalars, or else suffering a surprising performance hit when indexing. But now I'm weighing this additional burden on the user against the cost of just having SF do […]
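The cross-type equality problem above is easy to demonstrate concretely. A sketch, assuming a float64 key array and an integer query:

```python
import numpy as np

arr = np.array([1.0, 2.0, 3.0])  # float64 keys
raw = arr.tobytes()

# Python considers the int 2 and the float 2.0 equal...
assert 2 == 2.0

# ...but their raw byte representations differ across dtypes, so a
# byte-keyed table must convert every query to the table's dtype first
int_bytes = np.asarray(2, dtype=np.int64).tobytes()
float_bytes = np.asarray(2, dtype=arr.dtype).tobytes()
assert int_bytes != float_bytes

# after conversion, the query's bytes match element 1 of the array
assert raw[8:16] == float_bytes
```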
As a mostly-unrelated performance side note, somebody shared a talk on the super interesting design of Google's SwissTable, which has similar design goals to ours but uses SIMD instructions in really interesting ways. I've been thinking about prototyping an implementation of it for AutoMap, but haven't found the time. So this situation could certainly be improved if I (or somebody else) picks up the work.
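For reference, the core SwissTable idea is to keep a separate array of one-byte hash fragments ("control bytes") that can be scanned cheaply before touching the actual keys; the real design scans 16 control bytes at once with SIMD. A toy Python sketch of the metadata idea only (no SIMD, no resizing, hypothetical names; assumes the table is never filled to capacity):

```python
class MiniSwiss:
    EMPTY = 0x80  # sentinel control byte; real entries use values 0..127

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.ctrl = bytearray([self.EMPTY] * capacity)  # one byte per slot
        self.slots = [None] * capacity                  # (key, value) pairs

    def _split_hash(self, key):
        h = hash(key)
        # h1 picks the starting slot; h2 is a 7-bit fragment stored in ctrl
        return (h >> 7) % self.capacity, h & 0x7F

    def insert(self, key, value):
        i, h2 = self._split_hash(key)
        while self.ctrl[i] != self.EMPTY:
            if self.ctrl[i] == h2 and self.slots[i][0] == key:
                self.slots[i] = (key, value)  # update existing key
                return
            i = (i + 1) % self.capacity  # linear probe
        self.ctrl[i] = h2
        self.slots[i] = (key, value)

    def get(self, key):
        i, h2 = self._split_hash(key)
        while self.ctrl[i] != self.EMPTY:
            # cheap one-byte check filters out most slots before the
            # (potentially expensive) full key comparison
            if self.ctrl[i] == h2 and self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % self.capacity
        raise KeyError(key)
```

The win is that most probe steps compare a single byte rather than calling `__eq__` on Python objects, and the contiguous control-byte array is what SIMD can scan in bulk.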
Another note: if the list copy is shown to be expensive, a simple solution could be to provide an alternate constructor that just skips it and grabs a reference.
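The alternate-constructor idea can be illustrated with a toy class; this is not AutoMap's actual API, just a sketch of the copy-versus-reference trade-off:

```python
class FrozenMap:
    """Toy illustration of a no-copy alternate constructor (hypothetical API)."""

    def __init__(self, keys):
        # default path: defensive copy, so later mutation of the caller's
        # list cannot corrupt the mapping
        self._keys = list(keys)
        self._table = {k: i for i, k in enumerate(self._keys)}

    @classmethod
    def from_trusted(cls, keys):
        # alternate constructor: skip the copy and grab a reference;
        # the caller promises never to mutate `keys` afterwards
        self = cls.__new__(cls)
        self._keys = keys
        self._table = {k: i for i, k in enumerate(keys)}
        return self

    def __getitem__(self, key):
        return self._table[key]

keys = ["a", "b", "c"]
m = FrozenMap.from_trusted(keys)
assert m["b"] == 1
assert m._keys is keys  # no copy was made
```

As noted in the reply below, holding a reference to a caller-owned mutable list is only safe if the caller honors the no-mutation contract, which is why it fits a clearly named "trusted" constructor rather than the default.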
Yes, I considered your point about how to compare non-NumPy types with a byte-encoded table and agree, it will likely end up just shifting the work... but maybe that is worth it. And I also thought about just holding a reference to the passed list, but having a lingering reference to a mutable object outside of the AutoMap seems too sketchy!

As far as the cost of the call […]:

```
>>> keys_array = np.arange(10_000)
>>> %timeit sf.Index(keys_array)
260 µs ± 3.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit automap.FrozenAutoMap(keys_array.tolist())
247 µs ± 5.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit keys_array.tolist()
166 µs ± 4.27 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> keys_list = keys_array.tolist()
>>> %timeit automap.FrozenAutoMap(keys_list)
85.3 µs ± 3.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

Further, and relating to some of your other comments, it might even be better to sacrifice some lookup performance for instantiation performance. As StaticFrame inevitably creates many derived containers (each with indices), we might end up spending more time creating indices than looking up values in them.
Wow, that is expensive. I had no idea it accounted for two-thirds of the total creation time! I'm still wary of the impact on lookup performance, but you have convinced me that […]
Updating the examples shown above. Current AutoMap on master: […]

With the changes of branch #16: […]
In investigating StaticFrame performance I observed that initializing an `AutoMap` or `FrozenAutoMap` from an `np.ndarray` appears to be an order of magnitude slower than using a Python `list`. Even adding the overhead of `tolist()` on the array delivers better overall performance. Should we only deliver lists to these constructors, or is there something we can optimize in the implementation of initialization from arrays?

Notice also how the Pandas `Int64Index` has inverse characteristics (lists are slower than arrays), while its initialization from an array is nearly an order of magnitude faster than `AutoMap` for lists.

I wonder if `AutoMap` can take advantage of receiving immutable NumPy arrays from StaticFrame to avoid copying keys entirely.