
read_csv fails on schema argument when columns is also provided #14227

Closed
2 tasks done
mcrumiller opened this issue Feb 2, 2024 · 4 comments
Labels: A-io-csv (Area: reading/writing CSV files), bug (Something isn't working), invalid (A bug report that is not actually a bug), python (Related to Python Polars)

Comments

@mcrumiller (Contributor) commented Feb 2, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
from datetime import date

pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": ["a", "b", "c"],
        "c": [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3)],
    }
).write_csv("test.csv")

df = pl.read_csv("test.csv", columns=["a", "c"], schema={"a": pl.Int32, "c": pl.Date})

Log output

polars.exceptions.ComputeError: could not parse `a` as dtype `date` at column 'c' (column number 2)

Issue description

When the `columns` parameter is specified, the `schema` dictionary is not matched to the selected columns by name; instead, its entries are applied positionally to the file's original columns.

Expected behavior

The schema should be applied only to the columns selected via `columns`.

Installed versions

--------Version info---------
Polars:               0.20.6
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.11.7 (tags/v3.11.7:fa7a6f2, Dec  4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           0.3.2
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
numpy:                1.26.2
openpyxl:             3.1.2
pandas:               2.1.4
pyarrow:              14.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.23
xlsx2csv:             0.8.1
xlsxwriter:           3.1.9
@mcrumiller added the `bug` (Something isn't working), `needs triage` (Awaiting prioritization by a maintainer), and `python` (Related to Python Polars) labels on Feb 2, 2024
@mcrumiller (Contributor, Author)

If this is intended behavior, we should rename `schema` to `file_schema` to indicate that it must define all the columns in the file, regardless of which columns are requested.

@david-waterworth

@mcrumiller In my opinion it should behave as you originally expected, i.e. the schema should apply only to the selected columns. I generally treat CSV files as external and out of my control, so I want to ensure that they contain at least the columns I need (defined by `columns`) and that those can be coerced to the types I expect (defined by `schema`). But I'm never surprised when someone adds extra columns, and I wouldn't want my code to break in that case, since it's impossible to predict in advance what someone might add.

I also think this is inconsistent: I'm pretty sure `pl.from_dicts(data, schema=schema)` will silently ignore fields that appear in a dict but aren't defined in the schema? Again, I prefer that behavior, since I'm getting dicts from a REST API that can change without warning; as long as they don't remove fields I use or change their types, I don't care.

@Julian-J-S (Contributor)

You are right, this is confusing. `read_csv` is becoming a huge parameter "monster" where many parameters influence each other and it is often not clear what the result will be 😅

`schema` usually means the complete file schema (see also PySpark): it overrides/ignores the header and sets the specified names and types in order.

So for a CSV with header "a,b,c" and a schema (c: type, b: type, a: type), the CSV header is ignored(!) and the names (c, b, a) and types are applied in the order the columns appear in the file.

The interaction between `schema` and other parameters like `columns`, `new_columns`, or `dtypes` should imo either be documented very precisely or NOT be allowed 🚫! 🤓
Many operations can be done using polars expressions after reading the csv.

@stinodego (Member)

There is no bug here. You should be using schema_overrides instead of schema. The error message could be better though.

See #15431 (comment)

Closing this one.

@stinodego closed this as not planned on Jun 8, 2024
@stinodego added the `invalid` (A bug report that is not actually a bug) label and removed the `needs triage` (Awaiting prioritization by a maintainer) label on Jun 8, 2024