
[python] Enhance Arrow-to-Pandas nullability conveyance for .to_pandas() output #3642

johnkerl opened this issue Jan 30, 2025 · 0 comments
johnkerl commented Jan 30, 2025

Overview

@ivirshup , working on the cellxgene-census builder, has some new dataframes which have nullable-int columns.

He finds that when such data are read from TileDB-SOMA to Arrow using arrow_table = soma_dataframe.read().concat(), nullability of the int columns is conveyed correctly, but when a user then does pandas_dataframe = arrow_table.to_pandas(), nullability of the int columns is lost. This is a Pandas/Arrow issue, but the current Arrow-to-Pandas behavior for nullable-int columns is documented here, so it is probably not considered a bug.

The ask here is for TileDB-SOMA to (proactively and defensively) work around this by enhancing the Arrow tables we create on soma_dataframe.read(), so that when users go on to convert those to Pandas format in their workflows, nullability of integer columns is conveyed from Arrow to Pandas as desired.

Context: nullability in Pandas, Arrow, and TileDB

There are Pandas dataframes, Arrow tables, and TileDB arrays. (It's easy to conflate the second and third since TileDB-SOMA uses Arrow schema in its data model, but there are tiledb-schema-to-arrow-schema and arrow-schema-to-tiledb-schema functions within the libtiledbsoma codebase.)

Nullability in Pandas:

  • You can't see an explicit nullability flag by printing .dtypes on a Pandas dataframe (though nullable integer dtypes print capitalized, e.g. Int64 vs. int64)
  • If a Pandas dataframe has a nullable Pandas int column, you can see <NA> values in that column when you print the Pandas dataframe
  • If ints with missing values land in a non-nullable (plain NumPy) column, Pandas falls back to float64 and you see NaN values in that column when you print the Pandas dataframe
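A minimal illustration of the NA-vs-NaN distinction (a sketch using stock pandas, independent of TileDB-SOMA):

```python
import pandas as pd

# Nullable extension dtype: the missing value displays as <NA>
nullable = pd.Series([1, 2, None], dtype="Int64")

# Plain constructor: ints with a missing value fall back to float64,
# and the missing value displays as NaN
fallback = pd.Series([1, 2, None])
```

Printing nullable shows <NA> in the last row; printing fallback shows NaN, and its dtype is float64.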

Nullability in Arrow tables:

  • Within the nanoarrow API, nullability can be expressed using a bit-flag on a given column, or via Arrow-table metadata in the form of a key-value pair "columnname": "nullable"
  • If an Arrow table has a nullable column, you can see this one of two ways in the pyarrow API:
    • Print arrow_table.schema -- non-nullable fields will say not null; nullable fields won't
    • More explicitly, print out arrow_table.schema.field("field_name").nullable

Nullability in TileDB arrays:

  • You can see this one of two ways:
    • Open the array with tiledb-py's tiledb.open and print A.schema and look for nullable=True or nullable=False
    • Open the dataframe using tiledbsoma and print sdf.schema -- this is Arrow schema, so, as above, non-nullable fields will say not null; nullable fields won't

Context: conversions

Conversion from Pandas to Arrow:

  • E.g. arrow_from_pandas = pa.Table.from_pandas(pandas_df)
  • The output Arrow table will have a metadata key b'pandas' with JSON-encoded metadata value like this:
{
  "pandas": {
    "index_columns": [
      {
        "kind": "range",
        "name": null,
        "start": 0,
        "stop": 3,
        "step": 1
      }
    ],
    "column_indexes": [
      {
        "name": null,
        "field_name": null,
        "pandas_type": "unicode",
        "numpy_type": "object",
        "metadata": {
          "encoding": "UTF-8"
        }
      }
    ],
    "columns": [
      {
        "name": "a",
        "field_name": "a",
        "pandas_type": "int64",
        "numpy_type": "Int64",
        "metadata": null
      }
    ],
    "creator": {
      "library": "pyarrow",
      "version": "19.0.0"
    },
    "pandas_version": "2.2.3"
  }
}

Conversion from Arrow to Pandas:

  • E.g. pandas_from_arrow = arrow_table.to_pandas()
  • If the input Arrow table has a "pandas" metadata key-value pair, that is consulted: columns that should be nullable ints come back as Pandas Int64, with <NA> in them
  • If the input Arrow table does not have a "pandas" metadata key-value pair, the output Pandas dataframe will have float columns where there should be ints, with NaN in them -- THIS IS THE ISSUE (see below)

Conversion from Arrow to TileDB:

  • I.e. tiledbsoma.DataFrame.create
  • As noted in TileDB-SOMA PR 2868, we consult the Arrow schema's per-column nullable bit-flag, as well as the Arrow schema's per-table metadata (if present) in the form of {"column_name": "nullable"}
  • We do in all cases correctly map Arrow-table nullability into TileDB-array nullability -- this is not broken
  • We do not copy any Arrow-table metadata into TileDB-array metadata
  • In particular, we do not copy the "pandas" metadata from the input Arrow table's metadata to the output TileDB-SOMA DataFrame's metadata

Conversion from TileDB to Arrow:

  • I.e. tiledbsoma.DataFrame.read
  • We correctly map TileDB-array nullability to Arrow-table nullability in all cases -- this is not broken

Here is an example of the various Pandas-to-Arrow and Arrow-to-Pandas issues involved here:

import pyarrow as pa
import pandas as pd

# ----------------------------------------------------------------
# Pandas has nullable integer types:
pandas_df = pd.DataFrame({"a": pd.Series([1, 2, None], dtype=pd.Int64Dtype())})

# >>> pandas_df.dtypes
# a    Int64
# dtype: object

# >>> pandas_df
#       a
# 0     1
# 1     2
# 2  <NA>

# ----------------------------------------------------------------
# As does Arrow:
arrow_table = pa.Table.from_pydict({"a": [1, 2, None]})

# >>> arrow_table.schema
# a: int64

# >>> arrow_table.schema.field('a').nullable
# True

# >>> arrow_table
# pyarrow.Table
# a: int64
# ----
# a: [[1,2,null]]

# >>> arrow_table.schema.metadata
# (prints nothing -- the metadata is None)

# ----------------------------------------------------------------
# But the to_pandas method on Arrow tables does not, by default, preserve the nullable type:
pandas_from_arrow = arrow_table.to_pandas()

# >>> pandas_from_arrow.dtypes
# a    float64
# dtype: object

# >>> pandas_from_arrow
#      a
# 0  1.0
# 1  2.0
# 2  NaN

# ----------------------------------------------------------------
# But if that table was created from a Pandas dataframe, it will:
arrow_from_pandas = pa.Table.from_pandas(pandas_df)

# >>> arrow_from_pandas.schema
# a: int64
# -- schema metadata --
# pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 358

# >>> arrow_from_pandas.schema.metadata
# {b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 3, "step": 1}],
# "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type":
# "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a",
# "pandas_type": "int64", "numpy_type": "Int64", "metadata": null}], "creator": {"library": "pyarrow",
# "version": "19.0.0"}, "pandas_version": "2.2.3"}'}

# >>> arrow_from_pandas
# pyarrow.Table
# a: int64
# ----
# a: [[1,2,null]]

# >>> arrow_from_pandas.schema.field('a').nullable
# True

# ----------------------------------------------------------------
# Round-tripping:
pandas_from_arrow_from_pandas = arrow_from_pandas.to_pandas()

# >>> pandas_from_arrow_from_pandas.dtypes
# a    Int64
# dtype: object

# >>> pandas_from_arrow_from_pandas
#       a
# 0     1
# 1     2
# 2  <NA>

Specific issue

@ivirshup's issue is the following:

  • He has Pandas data with nullable-int columns
  • He converts that to Arrow, and the resulting Arrow table's nullability is correctly conveyed from Pandas to Arrow (as described above)
  • He ingests that to TileDB-SOMA, and the resulting soma dataframe's nullability is correctly conveyed from Arrow to TileDB (as described above)
  • A user later reads the TileDB-SOMA dataframe (as Arrow), and the resulting Arrow table's nullability is correctly conveyed from TileDB to Arrow (as described above)
  • The user then does .to_pandas() on that Arrow table, and the nullability is not correctly conveyed from Arrow to Pandas (as described above -- please see "THIS IS THE ISSUE" above)
  • This is not a TileDB-SOMA bug
  • Strictly speaking it's a Pandas/Arrow issue, but @ivirshup asserts it's likely to remain that way. Specifically, the current behavior is documented here, so it is probably not considered a bug.

Diagrammatically:

pandas nullability ----conveyed---> arrow-table nullability ---conveyed---> tiledbsoma dataframe nullability

pandas nullability <-NOT CONVEYED-- arrow-table nullability <--conveyed---- tiledbsoma dataframe nullability

Possible solutions

One solution considered

What we can do to help work around that Pandas/Arrow issue:

  • When we convert Arrow Table to soma DataFrame (at TileDB-SOMA create), copy over all Arrow-table metadata -- which will get the key-value pair with key b'pandas' and value which is a JSON-encoded string
  • When we convert TileDB-SOMA DataFrame to Arrow Table (at TileDB-SOMA read), copy all array metadata that does not start with "soma_" to the Arrow table we create
  • Then, when the user does .to_pandas() on the Arrow table they got from our .read() method, they'll get the intended nullabilities

Other issues / questions:

One is: at TileDB-SOMA DataFrame.create, given an Arrow metadata key named "pandas" (or "k"), should the TileDB-array metadata key also be "pandas" / "k" / etc.? Or should it get some prefix? (I don't care either way; we just need to decide.)

A second one is: suppose there is data which did not come from Pandas, like this:

                                    arrow-table nullability ---conveyed---> tiledbsoma dataframe nullability

pandas nullability <-NOT CONVEYED-- arrow-table nullability <--conveyed---- tiledbsoma dataframe nullability

In this case, the TileDB-array metadata will not have any metadata with key "pandas".

Then, on our .read() method, we could do this:

  • If the input tiledb array has metadata key "pandas", then copy that key-value pair to the output Arrow Table
  • But if the input tiledb array does not have metadata key "pandas", then, we should create one

Another solution considered

A much simpler proposal from @nguyenv :

  • When we do arrow -> tiledb (our create), don't copy any metadata from Arrow to TileDB
    • So, no codemods will be needed in the write path (i.e. the census builder)
  • When we go arrow <- tiledb (our read), don't copy any metadata from TileDB to Arrow
    • But do freshly compute a pandas key-value pair in the output Arrow-table schema
    • That helps us for the case when data did originate with Pandas, and when it did not

Benefits of this second solution:

  • It's simpler
  • It only affects the read path, i.e. we needn't delay the census-builder

See also

More about nullability tracking in TileDB-SOMA more broadly: #2858.
