
[Bug] ArrowInvalid: cannot construct ChunkedArray from empty vector and omitted type #3633

Open
nick-youngblut opened this issue Jan 27, 2025 · 7 comments

@nick-youngblut

Describe the bug

The following error is raised when running tiledbsoma.io.from_anndata:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[28], line 2
      1 # ingest new data
----> 2 tiledbsoma.io.from_anndata(
      3     db_uri,
      4     adata,
      5     measurement_name="RNA",
      6     registration_mapping=rd,
      7 )

File ~/miniforge3/envs/tiledb/lib/python3.12/site-packages/tiledbsoma/io/ingest.py:567, in from_anndata(experiment_uri, anndata, measurement_name, context, platform_config, obs_id_name, var_id_name, X_layer_name, raw_X_layer_name, ingest_mode, use_relative_uri, X_kind, registration_mapping, uns_keys, additional_metadata)
    557 _maybe_ingest_uns(
    558     measurement,
    559     anndata.uns,
   (...)
    562     **ingest_platform_ctx,
    563 )
    565 # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    566 # MS/meas/VAR
--> 567 with _write_dataframe(
    568     _util.uri_joinpath(measurement_uri, "var"),
    569     conversions.obs_or_var_to_tiledb_supported_array_type(anndata.var),
    570     id_column_name=var_id_name,
    571     # Layer existence is pre-checked in the registration phase
    572     axis_mapping=jidmaps.var_axes[measurement_name],
    573     **ingest_platform_ctx,
    574 ) as var:
    575     _maybe_set(measurement, "var", var, use_relative_uri=use_relative_uri)
    577 # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    578 # MS/meas/X/DATA

File ~/miniforge3/envs/tiledb/lib/python3.12/site-packages/tiledbsoma/io/ingest.py:1287, in _write_dataframe(df_uri, df, id_column_name, ingestion_params, additional_metadata, platform_config, context, axis_mapping)
   1284 df[SOMA_JOINID] = np.asarray(axis_mapping.data, dtype=np.int64)
   1285 df.set_index(SOMA_JOINID, inplace=True)
-> 1287 return _write_dataframe_impl(
   1288     df,
   1289     df_uri,
   1290     id_column_name,
   1291     shape=axis_mapping.get_shape(),
   1292     ingestion_params=ingestion_params,
   1293     additional_metadata=additional_metadata,
   1294     original_index_metadata=original_index_metadata,
   1295     platform_config=platform_config,
   1296     context=context,
   1297 )

File ~/miniforge3/envs/tiledb/lib/python3.12/site-packages/tiledbsoma/io/ingest.py:1329, in _write_dataframe_impl(df, df_uri, id_column_name, shape, ingestion_params, additional_metadata, original_index_metadata, platform_config, context)
   1325     if id_column_name is None:
   1326         # Nominally, nil id_column_name only happens for uns append and we do not append uns,
   1327         # which is a concern for our caller. This is a second-level check.
   1328         raise ValueError("internal coding error: id_column_name unspecified")
-> 1329     arrow_table = _extract_new_values_for_append(df_uri, arrow_table, context)
   1331 try:
   1332     # Note: tiledbsoma.io creates dataframes with soma_joinid being the one
   1333     # and only index column.
   1334     domain = ((0, shape - 1),)

File ~/miniforge3/envs/tiledb/lib/python3.12/site-packages/tiledbsoma/io/ingest.py:1222, in _extract_new_values_for_append(df_uri, arrow_table, context)
   1218 try:
   1219     with _factory.open(
   1220         df_uri, "r", soma_type=DataFrame, context=context
   1221     ) as previous_soma_dataframe:
-> 1222         return _extract_new_values_for_append_aux(
   1223             previous_soma_dataframe, arrow_table
   1224         )
   1226 except DoesNotExistError:
   1227     return arrow_table

File ~/miniforge3/envs/tiledb/lib/python3.12/site-packages/tiledbsoma/io/ingest.py:1182, in _extract_new_values_for_append_aux(previous_soma_dataframe, arrow_table)
   1173         column = pa.chunked_array(
   1174             [chunk.dictionary_decode() for chunk in column.chunks]
   1175         )
   1177     elif is_cat(old_field) and not is_cat(new_field):
   1178         # Convert from non-categorical to categorical.  Note:
   1179         # libtiledbsoma already merges the enum mappings, e.g if the
   1180         # storage has red, yellow, & green, but our new data has some
   1181         # yellow, green, and orange.
-> 1182         column = pa.chunked_array(
   1183             [chunk.dictionary_encode() for chunk in column.chunks]
   1184         )
   1186     fields_dict[name] = column
   1187 arrow_table = pa.Table.from_pydict(fields_dict)

File ~/miniforge3/envs/tiledb/lib/python3.12/site-packages/pyarrow/table.pxi:1537, in pyarrow.lib.chunked_array()

File ~/miniforge3/envs/tiledb/lib/python3.12/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/miniforge3/envs/tiledb/lib/python3.12/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowInvalid: cannot construct ChunkedArray from empty vector and omitted type

To Reproduce

import os
import tempfile

import tiledbsoma
import tiledbsoma.io

import scanpy as sc

input_path = "/home/nickyoungblut/dev/tmp/tiledb/2025-01-24_23-55-08/STAR/SRX21101392/Gene/filtered"
srx_accession = "SRX21101392"

# Read 10x mtx data
adata = sc.read_10x_mtx(
    input_path,
    var_names="gene_ids",  
    make_unique=True  
)

# add SRX column
adata.obs["SRX_accession"] = [srx_accession] * len(adata.obs)

# create tiledb soma db
work_dir = "."  # placeholder: set to your actual working directory
db_dir = os.path.join(work_dir, "MY_DATABASE")

## read from temp location and write to tiledb
if os.path.exists(db_dir):
    db_uri = db_dir
else:
    # write adata file to temp location
    temp_dir = tempfile.mkdtemp()
    h5ad_file = os.path.join(temp_dir, "adata.h5ad")
    adata.write_h5ad(h5ad_file)

    ## create db
    db_uri = tiledbsoma.io.from_h5ad(
        db_dir,
        input_path=h5ad_file,
        measurement_name="RNA",
    )


# append more data
input_path = "/home/nickyoungblut/dev/tmp/tiledb/2025-01-22_01-10-09/STAR/SRX24099779/Gene/filtered"
srx_accession = "SRX24099779"

# Read 10x mtx data
adata = sc.read_10x_mtx(
    input_path,
    var_names="gene_ids",  
    make_unique=True  
)

# add SRX column
adata.obs["SRX_accession"] = [srx_accession] * len(adata.obs)

# register
rd = tiledbsoma.io.register_anndatas(
    db_uri,
    [adata],
    measurement_name="RNA",
    obs_field_name="obs_id",
    var_field_name="var_id",
)

# apply resize
with tiledbsoma.Experiment.open(db_uri) as exp:
    tiledbsoma.io.resize_experiment(
        exp.uri, 
        nobs=rd.get_obs_shape(), 
        nvars=rd.get_var_shapes()
    )

# ingest new data into the db
tiledbsoma.io.from_anndata(
    db_uri,
    adata,
    measurement_name="RNA",
    registration_mapping=rd,
)

Versions (please complete the following information):

  • TileDB-SOMA version: 1.15.4
  • Language and language version (e.g. Python 3.9, R 4.3.2): Python 3.12.8
  • OS (e.g. MacOS, Ubuntu Linux): Ubuntu

Additional context

The data is ingested despite the error, so as a workaround I have to use:

from pyarrow import ArrowInvalid

try:
    tiledbsoma.io.from_anndata(
        db_uri,
        adata,
        measurement_name="RNA",
        registration_mapping=rd,
    )
except ArrowInvalid:
    pass

I'm wondering if the issue is due to the 2nd dataset's gene set matching the 1st perfectly, leaving zero rows to add (hence the empty vector).
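
For reference, the PyArrow failure can be reproduced in isolation (a minimal sketch, independent of tiledbsoma):

import pyarrow as pa

# An empty ChunkedArray is fine when a type is given explicitly:
ok = pa.chunked_array([], type=pa.string())

# With no chunks and no type, PyArrow has nothing to infer from, and it
# raises the same error as in the traceback above:
try:
    pa.chunked_array([])
except pa.ArrowInvalid as e:
    print(e)  # cannot construct ChunkedArray from empty vector and omitted type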

@johnkerl johnkerl self-assigned this Jan 27, 2025
@johnkerl
Member

I'm wondering if the issue is due to the 2nd dataset gene set matching perfectly with the 1st gene set, so there are zero rows to add (empty vector).

Hi @nick-youngblut! It's definitely the case that there aren't new rows to add -- given you're registering with

    obs_field_name="obs_id",
    var_field_name="var_id",

and your

# add SRX column
adata.obs["SRX_accession"] = [srx_accession] * len(adata.obs)

didn't modify obs_id, so there are no new rows in obs.

So that's one issue -- if you want to add more obs rows, either mutate the new rows' obs_id, or use a different column name (other than obs_id) to tell us which column is the ID column for obs.
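
For example (a hypothetical sketch, not a tiledbsoma API -- assuming you want each dataset's cells treated as distinct), you could prefix the barcodes with the accession before registering:

# Hypothetical sketch: make the obs IDs unique per dataset by prefixing
# the cell barcodes with the SRX accession, so append-mode ingest sees
# genuinely new rows.
adata.obs_names = [f"{srx_accession}_{barcode}" for barcode in adata.obs_names]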

The other issue though -- you shouldn't be getting the ChunkedArray error message we're giving you. This feels like a bug on our part. I did try a simple ingest-then-register-then-ingest-again with the exact same (non-10X) dataset and didn't get the error you did, so there must be something a bit corner-casey going on here.

I'll investigate.

@johnkerl
Member

A third possible issue: append-mode ingest is intended for adding more data with the same column schema. Does adata.obs["SRX_accession"] exist in your original (before-the-append) data from 10X?
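
A quick way to check (a sketch; adata_old and adata_new are hypothetical names for the original and new datasets, and the helper is mine):

# Sketch: verify the obs column schemas match before attempting an append.
def obs_schemas_match(adata_old, adata_new) -> bool:
    old = adata_old.obs.dtypes.sort_index()
    new = adata_new.obs.dtypes.sort_index()
    return old.index.equals(new.index) and bool((old == new).all())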

@nick-youngblut
Author

Thanks @johnkerl for the detailed feedback!

Does adata.obs["SRX_accession"] exist in your original (before-the-append) data from 10X?

It does. Still, I'll double-check.

I'll also do some investigation on my end. If you need the data, I can provide it, since it's already published on the SRA (hence the SRX accessions).

@johnkerl
Member

@nick-youngblut yes, if it's not too much to ask, having access to the data would indeed be super-helpful 🙏

@nick-youngblut
Author

I haven't been able to reproduce this issue. I'm not sure why it happened, but it hasn't happened since. 🤷

@cbrueffer

cbrueffer commented Jan 31, 2025

I'm seeing the same issue (after fixing #3641) in a similar write-then-append scenario. All datasets have the same obs column schema and the same gene IDs (padded across datasets, which introduces NaNs in X and in the var columns for the padded genes). I haven't been able to come up with minimal datasets yet, but I'll post here once I have them.

@cbrueffer

cbrueffer commented Jan 31, 2025

Some updates: using the code linked in the PR above (rewritten to ingest via from_anndata), I'm testing with two datasets, both of which have 40405 genes. Doing write-then-append with the full 40405 genes, I get the ChunkedArray error. If I subset both datasets to 40404 genes, things work. Could this be an off-by-one problem somewhere?
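
For reference, the subsetting in that test looks roughly like this (a sketch; adata_a and adata_b are placeholder names, and the one-gene trim is arbitrary):

# Sketch: drop the last gene from both datasets so each has 40404 vars
# instead of 40405 before the write-then-append test.
adata_a = adata_a[:, :40404].copy()
adata_b = adata_b[:, :40404].copy()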

Update: it worked for those two files, but appending others subset in the same way started failing with the ChunkedArray issue again.
