I am sharing my experience and looking for recommendations for indexing into dense tiledb.Arrays with arbitrary integer indices (np.ndarray).
Background
I am considering tiledb as a unified data backend for a pipeline that can work either on streaming data from disk or in-memory data (mem://). Each step operates on an immutable input and creates a new (collection of) 2D tiledb.Array. Some of our operations select rows/columns based on some (possibly cheap) criterion. To avoid unnecessary creation of arrays, I played around with wrapping the tiledb.Arrays in a view class with an optional selection (either a slice or np.ndarray). My hope was to minimize creation of new tiledb.Array instances and instead utilize those views as much as possible and defer calling tiledb.Array.__getitem__ or tiledb.Array.multi_index until I actually work with data (instead of just subsetting). Doing this, I ran into a scenario where getting the data out of tiledb.Arrays became extremely slow. I was able to reproduce this with the following example:
1. Create a 10k by 10k np.ndarray.
2. Copy it into a mem:// tiledb.Array.
3. Access the array via multi_index[np.arange(10k), np.arange(10k)] (fast).
4. Access the array via multi_index[np.arange(10k)[::2], np.arange(10k)[::2]] (slow).
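To make the view idea from the Background concrete, here is a minimal sketch of such a wrapper. It uses a plain np.ndarray as a stand-in for the tiledb.Array, and the class name LazyView2D and its methods select and materialize are hypothetical, not part of tiledb. Selections are only composed as index arrays. The backend is read once, in materialize.

```python
import numpy as np


class LazyView2D:
    """Hypothetical lazy row/column view; not part of tiledb."""

    def __init__(self, backend, rows=None, cols=None):
        # Assumption: backend is 2D, has .shape, and supports NumPy-style indexing.
        self.backend = backend
        n_rows, n_cols = backend.shape
        # Selections are stored as integer index arrays into the backend.
        self.rows = np.arange(n_rows) if rows is None else np.asarray(rows)
        self.cols = np.arange(n_cols) if cols is None else np.asarray(cols)

    @property
    def shape(self):
        return (len(self.rows), len(self.cols))

    def select(self, rows=slice(None), cols=slice(None)):
        # Compose the new selection with the existing one; no data is read.
        return LazyView2D(self.backend, self.rows[rows], self.cols[cols])

    def materialize(self):
        # Only here is the backend touched: rows first, then columns.
        return self.backend[self.rows][:, self.cols]


data = np.arange(36).reshape(6, 6)
view = LazyView2D(data).select(rows=slice(None, None, 2)).select(cols=np.array([1, 4]))
result = view.materialize()
```

With a real tiledb.Array as the backend, materialize would have to call multi_index or __getitem__ instead, which is where the slowdown described here shows up.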
Minimal Working Example
I have a minimal working example with requirements.txt here. Example output of the Python script:
data.dtype=dtype('int64') data.size=100000000
step=None ref 521ms
step=None indices 535ms
step=None indices split 825ms
step=None slices 579ms
step=None slices split 881ms
step=2 ref 523ms
step=2 indices 22383ms
step=2 indices split 1194ms
step=2 slices 16637ms
step=2 slices split 933ms
Observation:
tiledb.Array.multi_index[np.ndarray, np.ndarray] can be extremely slow. This is surprising because the full-range query (step=None) is about 40 times faster (535ms vs. 22383ms), even though it loads four times as much data.
I can probably get around this with tiledb.Array.multi_index[np.ndarray, slice(None)][""][:, np.ndarray]. This seems to be reasonably fast and is likely compatible with our access pattern.
I figure that my example here is probably near worst case performance. I tried to debug into tiledb code as much as possible, but I was not able to follow into anything implemented in C++. I am not sure what would need to change on the tiledb.Array implementation to optimize such a use case as presented here.
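The workaround can be illustrated with plain NumPy. This sketch assumes that multi_index[rows, cols] selects the cross product of the given rows and columns (my understanding, not verified against the tiledb implementation), which corresponds to np.ix_ in NumPy. It shows that reading full rows and subsetting columns in memory gives the same result, at the cost of reading more elements than are kept.

```python
import numpy as np

data = np.arange(100).reshape(10, 10)
rows = np.arange(10)[::2]
cols = np.arange(10)[::2]

# Direct cross-product selection of rows x cols.
direct = data[np.ix_(rows, cols)]

# Two-step variant: read full rows, then subset columns in memory.
full_rows = data[rows, :]
two_step = full_rows[:, cols]

# Overhead of the two-step variant: elements read vs. elements kept.
read_ratio = full_rows.size / two_step.size
```

For the step=2 case, the two-step variant reads twice as many elements as it keeps, which seems an acceptable price given the timings above.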