
read_csv fails on schema argument when columns is also provided #14227

Closed
2 tasks done
mcrumiller opened this issue Feb 2, 2024 · 4 comments
Labels: A-io-csv (Area: reading/writing CSV files), bug (Something isn't working), invalid (A bug report that is not actually a bug), python (Related to Python Polars)

Comments

@mcrumiller (Contributor) commented Feb 2, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
from datetime import date

pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": ["a", "b", "c"],
        "c": [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3)],
    }
).write_csv("test.csv")

df = pl.read_csv("test.csv", columns=["a", "c"], schema={"a": pl.Int32, "c": pl.Date})

Log output

polars.exceptions.ComputeError: could not parse `a` as dtype `date` at column 'c' (column number 2)

Issue description

When the `columns` parameter is specified, the `schema` dictionary is not matched to the selected columns by name; instead, its entries are applied positionally to the file's original columns.

Expected behavior

The schema should be applied only to the columns selected via `columns`.

Installed versions

--------Version info---------
Polars:               0.20.6
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.11.7 (tags/v3.11.7:fa7a6f2, Dec  4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           0.3.2
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
numpy:                1.26.2
openpyxl:             3.1.2
pandas:               2.1.4
pyarrow:              14.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.23
xlsx2csv:             0.8.1
xlsxwriter:           3.1.9
@mcrumiller added the `bug` (Something isn't working), `needs triage` (Awaiting prioritization by a maintainer), and `python` (Related to Python Polars) labels on Feb 2, 2024
@mcrumiller (Contributor, Author)

If this is intended behavior, we should rename `schema` to `file_schema` to indicate that it must define all the columns in the file, regardless of which columns are requested.

@david-waterworth

@mcrumiller In my opinion it should behave as you originally expected, i.e. the schema should apply only to the selected columns. I generally treat CSV files as external and out of my control, so I want to ensure that they contain at least the columns I need (defined by `columns`) and that those can be coerced to the types I expect (defined by `schema`). But I'm never surprised when someone adds extra columns, and I wouldn't want my code to break in that case, since it's impossible to predict in advance what someone might add.

I also think this is inconsistent: I'm pretty sure `pl.from_dicts(data, schema=schema)` will silently ignore fields that appear in a dict but aren't defined in the schema? Again, I prefer that behavior, since I'm getting dicts from a REST API that can change without warning; as long as they don't remove fields I use or change their types, I don't care.

@Julian-J-S (Contributor)

You are right, this is confusing. `read_csv` is becoming a huge parameter "monster" where many parameters influence each other and it is often not clear what the result will be 😅

`schema` usually means the complete file schema (see also PySpark): it overrides/ignores the header and sets the specified names and types in order.

So for a CSV with header "a,b,c" and a schema (c: type, b: type, a: type), the CSV header is ignored(!) and the names (c, b, a) and types are applied in the order the columns appear in the file.

The interaction between `schema` and other parameters like `columns`, `new_columns`, or `dtypes` should imo either be documented very precisely or NOT be allowed 🚫! 🤓
Many operations can be done using polars expressions after reading the csv.

@stinodego (Member)

There is no bug here. You should be using schema_overrides instead of schema. The error message could be better though.

See #15431 (comment)

Closing this one.

@stinodego closed this as not planned on Jun 8, 2024
@stinodego added the `invalid` (A bug report that is not actually a bug) label and removed the `needs triage` (Awaiting prioritization by a maintainer) label on Jun 8, 2024