[python] Enhance Arrow-to-Pandas nullability conveyance for .to_pandas() output
Overview
@ivirshup, working on the cellxgene-census builder, has some new dataframes which have nullable-int columns.
He finds that when such data are read from TileDB-SOMA to Arrow using arrow_table = soma_dataframe.read().concat(), nullability of the int columns is conveyed correctly, but when a user then does pandas_dataframe = arrow_table.to_pandas(), nullability of the int columns is lost. This is a Pandas/Arrow issue, but the current Arrow-to-Pandas behavior for nullable-int columns is documented here, so it is probably not considered a bug.
The ask here is that TileDB-SOMA (proactively and defensively) work around this by enhancing the Arrow tables we create on soma_dataframe.read(), so that when users go on to convert those to Pandas format in their workflows, nullability of integer columns is conveyed from Arrow to Pandas as desired.
Context: nullability in Pandas, Arrow, and TileDB
There are Pandas dataframes, Arrow tables, and TileDB arrays. (It's easy to conflate the second and third since TileDB-SOMA uses Arrow schema in its data model, but there are tiledb-schema-to-arrow-schema and arrow-schema-to-tiledb-schema functions within the libtiledbsoma codebase.)
Nullability in Pandas:
There is no separate nullability flag to see when printing .dtypes on a Pandas dataframe; rather, the dtype name itself encodes it: capital-I Int64 is the nullable extension type, lowercase int64 is the non-nullable NumPy type
If a Pandas dataframe has a nullable Pandas int column, you will see <NA> values in that column when you print the Pandas dataframe
If a Pandas dataframe has a column that should be int but contains missing values without a nullable dtype, the column is stored as float, and you will see NaN values in that column when you print the Pandas dataframe
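A quick sketch of that dtype distinction, using pandas alone:

```python
import pandas as pd

# Nullable extension dtype: capital-I Int64; the missing value prints as <NA>
nullable = pd.Series([1, 2, None], dtype="Int64")

# Plain NumPy path: a None among ints coerces the column to float64,
# and the missing value prints as NaN
coerced = pd.Series([1, 2, None])

assert str(nullable.dtype) == "Int64"
assert str(coerced.dtype) == "float64"
```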
Nullability in Arrow tables:
Within the nanoarrow API, nullability can be indicated using a bit-flag on a given column, or via Arrow-table metadata in the form of a key-value pair "columnname": "nullable"
If an Arrow table has a nullable column, you can see this one of two ways in the pyarrow API:
Print arrow_table.schema -- non-nullable fields will say not null; nullable fields won't
More explicitly, print out arrow_table.schema.field("field_name").nullable
Nullability in TileDB arrays:
You can see these one of two ways:
Open the array with tiledb-py's tiledb.open and print A.schema and look for nullable=True or nullable=False
Open the dataframe using tiledbsoma and print sdf.schema -- this is Arrow schema, so, as above, non-nullable fields will say not null; nullable fields won't
Context: conversions
Conversion from Pandas to Arrow:
E.g. arrow_from_pandas = pa.Table.from_pandas(pandas_df)
The output Arrow table will have a metadata key b'pandas' whose value is a JSON-encoded string (see the example below)
Conversion from Arrow to Pandas:
E.g. pandas_from_arrow = arrow_table.to_pandas()
If the input Arrow table has a "pandas" metadata key-value pair, then that is consulted: a column that should be int will come back as int, with null in it
If the input Arrow table does not have a "pandas" metadata key-value pair, then the output Pandas dataframe will have float columns where there should be ints, with NaN in them -- THIS IS THE ISSUE (see below)
Conversion from Arrow to TileDB:
I.e. tiledbsoma.DataFrame.create
As noted in TileDB-SOMA PR 2868, we consult the Arrow schema's per-column nullable bit-flag, as well as the Arrow schema's per-table metadata (if present) in the form of {"column_name": "nullable"}
We do in all cases correctly map Arrow-table nullability into TileDB-array nullability -- this is not broken
We do not copy any Arrow-table metadata into TileDB-array metadata
In particular, we do not copy the "pandas" metadata from the input Arrow table's metadata to the output TileDB-SOMA DataFrame's metadata
Conversion from TileDB to Arrow:
I.e. tiledbsoma.DataFrame.read
We do in all cases correctly map TileDB-array nullability to Arrow-table nullability -- this is not broken
Here is an example of the various Pandas-to-Arrow and Arrow-to-Pandas issues involved here:
import pyarrow as pa
import pandas as pd
# ----------------------------------------------------------------
# Pandas has nullable integer types:
pandas_df = pd.DataFrame({"a": pd.Series([1, 2, None], dtype=pd.Int64Dtype())})
# >>> pandas_df.dtypes
# a Int64
# dtype: object
# >>> pandas_df
# a
# 0 1
# 1 2
# 2 <NA>
# ----------------------------------------------------------------
# As does Arrow:
arrow_table = pa.Table.from_pydict({"a": [1, 2, None]})
# >>> arrow_table.schema
# a: int64
# >>> arrow_table.schema.field('a').nullable
# True
# >>> arrow_table
# pyarrow.Table
# a: int64
# ----
# a: [[1,2,null]]
# >>> arrow_table.schema.metadata
# (nothing prints)
# ----------------------------------------------------------------
# But the to_pandas method on Arrow tables typically does not preserve the nullable type:
pandas_from_arrow = arrow_table.to_pandas()
# >>> pandas_from_arrow.dtypes
# a float64
# dtype: object
# >>> pandas_from_arrow
# a
# 0 1.0
# 1 2.0
# 2 NaN
# ----------------------------------------------------------------
# But if that table was created from a Pandas dataframe, it will:
arrow_from_pandas = pa.Table.from_pandas(pandas_df)
# >>> arrow_from_pandas.schema
# a: int64
# -- schema metadata --
# pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 358
# >>> arrow_from_pandas.schema.metadata
# {b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 3, "step": 1}],
# "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type":
# "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a",
# "pandas_type": "int64", "numpy_type": "Int64", "metadata": null}], "creator": {"library": "pyarrow",
# "version": "19.0.0"}, "pandas_version": "2.2.3"}'}
# >>> arrow_from_pandas
# pyarrow.Table
# a: int64
# ----
# a: [[1,2,null]]
# >>> arrow_from_pandas.schema.field('a').nullable
# True
# ----------------------------------------------------------------
# Round-tripping:
pandas_from_arrow_from_pandas = arrow_from_pandas.to_pandas()
# >>> pandas_from_arrow_from_pandas.dtypes
# a Int64
# dtype: object
# >>> pandas_from_arrow_from_pandas
# a
# 0 1
# 1 2
# 2 <NA>
Specific issue
@ivirshup's issue is the following:
He has a Pandas dataframe with nullable-int columns
He converts that to Arrow, and the resulting Arrow table's nullability is correctly conveyed from Pandas to Arrow (as described above)
He ingests that to TileDB-SOMA, and the resulting soma dataframe's nullability is correctly conveyed from Arrow to TileDB (as described above)
A user later reads the TileDB-SOMA dataframe (as Arrow), and the resulting Arrow table's nullability is correctly conveyed from TileDB to Arrow (as described above)
The user then does .to_pandas() on that Arrow table, and the nullability is not correctly conveyed from Arrow to Pandas (as described above -- please see "THIS IS THE ISSUE" above)
This is not a TileDB-SOMA bug
Strictly speaking it's a Pandas/Arrow issue, but @ivirshup asserts it's likely to remain that way. Specifically, the current behavior is documented here, so it is probably not considered a bug.
Possible solutions
One solution considered
What we can do to help work around that Pandas/Arrow issue:
When we convert Arrow Table to soma DataFrame (at TileDB-SOMA create), copy over all Arrow-table metadata -- which will get the key-value pair with key b'pandas' and value which is a JSON-encoded string
When we convert TileDB-SOMA DataFrame to Arrow Table (at TileDB-SOMA read), copy all array metadata that does not start with "soma_" to the Arrow table we create
Then, when the user does .to_pandas() on the Arrow table they got from our .read() method, they'll get the intended nullabilities
Other issues / questions:
One is: At TileDB-SOMA DataFrame.create, given a key named "pandas" (or "k"), should the TileDB-array metadata key also be named "pandas" / "k" / etc.? Or should it get some prefix? (I don't care either way; we just need to decide)
A second one is: suppose there is data which did not come from Pandas -- e.g. an Arrow table created directly, as with pa.Table.from_pydict in the example above.
In this case, the TileDB-array metadata will not have any metadata with key "pandas". Then, in our .read() method, we could do this:
If the TileDB array metadata has a key "pandas", copy that key-value pair to the output Arrow Table
If it does not have a key "pandas", we should create one
Another solution considered
A much simpler proposal from @nguyenv:
When we convert Arrow Table to soma DataFrame (at TileDB-SOMA create), don't copy any metadata from Arrow to TileDB
When we convert TileDB-SOMA DataFrame to Arrow Table (at TileDB-SOMA read), don't copy any metadata from TileDB to Arrow
Benefits of this second solution:
See also
More about nullability tracking in TileDB-SOMA more broadly: #2858.