(feat): read_lazy for whole AnnData lazy-loading + xarray reading + read_elem_as_dask -> read_elem_lazy #1247
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1247 +/- ##
==========================================
- Coverage 86.99% 85.20% -1.79%
==========================================
Files 40 45 +5
Lines 6053 6457 +404
==========================================
+ Hits 5266 5502 +236
- Misses 787 955 +168
68fcd2b to 6165f07
…ocstring" This reverts commit 79d3fdc.
Please make a PR into the notebook repo, there are some things broken in the notebook that I’d like to properly review.
Also I’m still not a fan of the exports from private namespaces, but I’m OK with it if there are at least minimal docstrings for them.
This is very close! Thanks for taking on this huge thing!
experimental.backed._lazy_arrays.MaskedArray
experimental.backed._lazy_arrays.CategoricalArray
experimental.backed._xarray.Dataset2D
Yeah, I think they only work for param docs. I should get around to checking if I can extend it to work for pure Sphinx as well.
assert index.dtype != int, msg
from ..experimental.backed._compat import xr

# TODO: why is this here? All tests pass without it and it seems at the minimum not strict enough.
asserts in runtime code are purely there to make debugging easier in case the asserts fail.
Hmm, so it’s not possible that _normalize_index ever gets called with the wrong type of index as a result of user action?
If that’s impossible, feel free to remove. Otherwise we should probably update this check to a TypeError or so.
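A minimal sketch of what replacing the assert with a TypeError could look like. The helper name `ensure_label_index` and the error message are hypothetical and not taken from the PR; only the idea of rejecting a numeric index with an explicit exception comes from the comment above. One reason to prefer an exception is that `assert` statements are skipped when Python runs with `-O`, so they cannot guard against user input.

```python
import pandas as pd


def ensure_label_index(index: pd.Index) -> pd.Index:
    """Hypothetical check: reject numeric indices with an explicit error.

    Unlike an assert, this still runs under `python -O` and gives the
    user an exception type they can catch.
    """
    dtype = index.dtype
    if pd.api.types.is_integer_dtype(dtype) or pd.api.types.is_float_dtype(dtype):
        msg = f"Expected an index of string labels, got dtype {dtype!r}."
        raise TypeError(msg)
    return index


# A label index passes through unchanged.
labels = ensure_label_index(pd.Index(["gene_a", "gene_b"]))

# A numeric index raises TypeError instead of tripping an assert.
try:
    ensure_label_index(pd.Index([0, 1, 2]))
except TypeError as err:
    error_message = str(err)
```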
    dtype = "object"
else:
    dtype = col.dtype.numpy_dtype
# TODO: get good chunk size?
small reminder in case you have an idea for this TODO
tests/test_read_lazy.py
Outdated
# remote has object dtype, need to convert back for integers booleans etc.
def correct_extension_dtype_differences(remote: pd.DataFrame, memory: pd.DataFrame):
Why does remote have object dtypes? Can that be fixed ahead of time? Will that affect users or is that just here in the tests?
Also, for clarity, please call it fix_extension_dtype_differences or unify_extension_dtypes and make the comment a docstring. I was unsure if correct was meant as a verb or something like make_correct_...
I will change the name and add a better comment, since it's only used for concat. There is no way of concatenating the lazy pandas adapter arrays we have, so in order to work with them, we present a dask array wrapping each one when they are concatenated. Practically, this doesn't change anything in terms of IO, but it means that when we make the round trip, we lose the original data type. We could maybe store the original data type in uns or something, but:
- People concatenating remote/large datasets probably won't be reading them into memory very often.
- This gets complicated quickly with mixed data types. If you have two similarly named columns with different numerical data types, we need to start dealing with upcasting/downcasting.
Feels like a cross-that-bridge-when-we-come-to-it issue.
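To make the round-trip problem concrete, here is a rough sketch of the kind of test helper discussed above, using the suggested name unify_extension_dtypes. The body is an assumption about how such a helper might work, not the PR's actual code: it casts each column of the remote frame back to the extension dtype of the matching in-memory column before the two are compared.

```python
import pandas as pd


def unify_extension_dtypes(
    remote: pd.DataFrame, memory: pd.DataFrame
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Cast `remote` columns to the extension dtypes found in `memory`.

    Sketch only: assumes both frames share the same column names.
    """
    remote = remote.copy()  # leave the caller's frame untouched
    for col in memory.columns:
        dtype = memory[col].dtype
        if pd.api.types.is_extension_array_dtype(dtype):
            remote[col] = remote[col].astype(dtype)
    return remote, memory


# In-memory frame with a nullable integer column (an extension dtype).
memory = pd.DataFrame(
    {"counts": pd.array([1, 2, 3], dtype="Int64"), "score": [0.5, 1.5, 2.5]}
)
# The same data after a lossy round trip: the integer column came back as object.
remote = pd.DataFrame(
    {"counts": pd.Series([1, 2, 3], dtype=object), "score": [0.5, 1.5, 2.5]}
)

fixed_remote, _ = unify_extension_dtypes(remote, memory)
pd.testing.assert_frame_equal(fixed_remote, memory)
```

Because the helper only touches columns whose in-memory dtype is an extension dtype, plain NumPy-backed columns such as `score` are compared as they are.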
tests/test_read_lazy.py
Outdated
return request.param


@pytest.fixture(params=[True, False], scope="session")
why did you mark this as solved?
Co-authored-by: Philipp A. <[email protected]>
@flying-sheep Here is the notebook PR: scverse/anndata-tutorials#20. It seems to run OK for me, though. Could you post your error?
This PR is a lighter weight version of #947 that involves using the original AnnData object as the class to hold obs and var xr.Dataset.
.obs and .var with backed="r" mode #981