CSV parsing: ComputeError #15854

Open

CameronBieganek opened this issue Apr 23, 2024 · 3 comments
Labels
A-io-csv (Area: reading/writing CSV files), bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

CameronBieganek commented Apr 23, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Use the following CSV file:

"serial_number","data_date","data_latitude","data_longitude","ign_status","is_power_on","is_zone_1_active","is_zone_1_door_open","unit_mode_detail","engine_hours","electrical_hours","engine_rpm","voltage","ambient_temperature","set_point_1","discharge_air_1","return_air_1","power_off_description","system_operating_mode","zone_1_control_condition"
"6001320386",2021-10-11 20:02:47.000,35.464762,-97.542528,false,False,,False,,6359,0,,13.57,,,,,Countdown,,

And the following Python script:

import polars as pl

schema = {
    "serial_number": pl.Utf8,
    "data_date": pl.Datetime,
    "data_latitude": pl.Float64,
    "data_longitude": pl.Float64,
    "ign_status": pl.Boolean,
    "is_power_on": pl.Boolean,
    "is_zone_1_active": pl.Boolean,
    "is_zone_1_door_open": pl.Boolean,
    "unit_mode_detail": pl.Utf8,
    "system_operating_mode": pl.Utf8,
    "zone_1_control_condition": pl.Utf8,
    "power_off_description": pl.Utf8,
    "engine_hours": pl.Float64,
    "electrical_hours": pl.Float64,
    "engine_rpm": pl.Float64,
    "voltage": pl.Float64,
    "ambient_temperature": pl.Float64,
    "set_point_1": pl.Float64,
    "discharge_air_1": pl.Float64,
    "return_air_1": pl.Float64
}

data = pl.read_csv("test.csv", schema=schema)

Output:

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
File ~\projects\polars_env\csv_parsing_bug.py:28
      3 import polars as pl
      5 schema = {
      6     "serial_number": pl.Utf8,
      7     "data_date": pl.Datetime,
   (...)
     25     "return_air_1": pl.Float64
     26 }
---> 28 data = pl.read_csv("test.csv", schema=schema)

File ~\projects\polars_env\venv\Lib\site-packages\polars\_utils\deprecation.py:134, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    129 @wraps(function)
    130 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    131     _rename_keyword_argument(
    132         old_name, new_name, kwargs, function.__name__, version
    133     )
--> 134     return function(*args, **kwargs)

File ~\projects\polars_env\venv\Lib\site-packages\polars\_utils\deprecation.py:134, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    129 @wraps(function)
    130 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    131     _rename_keyword_argument(
    132         old_name, new_name, kwargs, function.__name__, version
    133     )
--> 134     return function(*args, **kwargs)

File ~\projects\polars_env\venv\Lib\site-packages\polars\_utils\deprecation.py:134, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    129 @wraps(function)
    130 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    131     _rename_keyword_argument(
    132         old_name, new_name, kwargs, function.__name__, version
    133     )
--> 134     return function(*args, **kwargs)

File ~\projects\polars_env\venv\Lib\site-packages\polars\io\csv\functions.py:416, in read_csv(source, has_header, columns, new_columns, separator, comment_prefix, quote_char, skip_rows, dtypes, schema, null_values, missing_utf8_is_empty_string, ignore_errors, try_parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, use_pyarrow, storage_options, skip_rows_after_header, row_index_name, row_index_offset, sample_size, eol_char, raise_if_empty, truncate_ragged_lines, decimal_comma)
    404         dtypes = {
    405             new_to_current.get(column_name, column_name): column_dtype
    406             for column_name, column_dtype in dtypes.items()
    407         }
    409 with prepare_file_arg(
    410     source,
    411     encoding=encoding,
   (...)
    414     storage_options=storage_options,
    415 ) as data:
--> 416     df = _read_csv_impl(
    417         data,
    418         has_header=has_header,
    419         columns=columns if columns else projection,
    420         separator=separator,
    421         comment_prefix=comment_prefix,
    422         quote_char=quote_char,
    423         skip_rows=skip_rows,
    424         dtypes=dtypes,
    425         schema=schema,
    426         null_values=null_values,
    427         missing_utf8_is_empty_string=missing_utf8_is_empty_string,
    428         ignore_errors=ignore_errors,
    429         try_parse_dates=try_parse_dates,
    430         n_threads=n_threads,
    431         infer_schema_length=infer_schema_length,
    432         batch_size=batch_size,
    433         n_rows=n_rows,
    434         encoding=encoding if encoding == "utf8-lossy" else "utf8",
    435         low_memory=low_memory,
    436         rechunk=rechunk,
    437         skip_rows_after_header=skip_rows_after_header,
    438         row_index_name=row_index_name,
    439         row_index_offset=row_index_offset,
    440         sample_size=sample_size,
    441         eol_char=eol_char,
    442         raise_if_empty=raise_if_empty,
    443         truncate_ragged_lines=truncate_ragged_lines,
    444         decimal_comma=decimal_comma,
    445     )
    447 if new_columns:
    448     return _update_columns(df, new_columns)

File ~\projects\polars_env\venv\Lib\site-packages\polars\io\csv\functions.py:559, in _read_csv_impl(source, has_header, columns, separator, comment_prefix, quote_char, skip_rows, dtypes, schema, null_values, missing_utf8_is_empty_string, ignore_errors, try_parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_index_name, row_index_offset, sample_size, eol_char, raise_if_empty, truncate_ragged_lines, decimal_comma)
    555         raise ValueError(msg)
    557 projection, columns = parse_columns_arg(columns)
--> 559 pydf = PyDataFrame.read_csv(
    560     source,
    561     infer_schema_length,
    562     batch_size,
    563     has_header,
    564     ignore_errors,
    565     n_rows,
    566     skip_rows,
    567     projection,
    568     separator,
    569     rechunk,
    570     columns,
    571     encoding,
    572     n_threads,
    573     path,
    574     dtype_list,
    575     dtype_slice,
    576     low_memory,
    577     comment_prefix,
    578     quote_char,
    579     processed_null_values,
    580     missing_utf8_is_empty_string,
    581     try_parse_dates,
    582     skip_rows_after_header,
    583     parse_row_index_args(row_index_name, row_index_offset),
    584     sample_size=sample_size,
    585     eol_char=eol_char,
    586     raise_if_empty=raise_if_empty,
    587     truncate_ragged_lines=truncate_ragged_lines,
    588     decimal_comma=decimal_comma,
    589     schema=schema,
    590 )
    591 return wrap_df(pydf)

ComputeError: could not parse `Countdown` as dtype `f64` at column 'set_point_1' (column number 18)

The current offset in the file is 457 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `Countdown` to the `null_values` list.

Original error: ```remaining bytes non-empty```

Installed versions

--------Version info---------
Polars:               0.20.22
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                <not installed>
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
CameronBieganek (Author) commented Apr 23, 2024

Note that scan_csv works, like this:

data = pl.scan_csv("test.csv", schema=schema)

...where the file and the schema dictionary are the same as above. I'm guessing the read_csv error occurs because the column order in the schema does not match the column order in the CSV? Normally I would expect the order of entries in a dictionary to be immaterial, although, technically, the built-in dictionary has preserved insertion order since Python 3.6 (and it is guaranteed since Python 3.7).

I have a very similar issue open already. Basically, this comes down to very poor error messages when the schema argument is involved. Not to mention, the docstring entry for schema could be more explicit about the requirements, e.g. that the order of entries in the dictionary must match the order of the columns in the file.
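
A possible workaround in the meantime (my own sketch; the csv-module header read is illustrative, not anything Polars requires) is to rebuild the schema dict in the file's column order before calling read_csv, since plain dicts preserve insertion order:

import csv

import polars as pl

# Read just the header row so the schema can mirror the file's column order.
with open("test.csv", newline="") as f:
    header = next(csv.reader(f))

# Rebuild the schema dict (from the example above) in file order.
ordered_schema = {name: schema[name] for name in header}

data = pl.read_csv("test.csv", schema=ordered_schema)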

cmdlineluser (Contributor) commented Apr 23, 2024

That is odd.

Just a visualization of how schema is treated differently in read_csv and scan_csv:

import tempfile
import polars as pl

f = tempfile.NamedTemporaryFile()
f.write(b"""
A,B
1,2
""".strip())
f.seek(0)

pl.read_csv(f.name, schema={"B": pl.String, "A": pl.Int32})
# shape: (1, 2)
# ┌─────┬─────┐
# │ B   ┆ A   │
# │ --- ┆ --- │
# │ str ┆ i32 │
# ╞═════╪═════╡
# │ 1   ┆ 2   │
# └─────┴─────┘

pl.scan_csv(f.name, schema={"B": pl.String, "A": pl.Int32}).collect()
# shape: (1, 2)
# ┌─────┬─────┐
# │ A   ┆ B   │
# │ --- ┆ --- │
# │ i32 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ 2   │
# └─────┴─────┘
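
One way to sidestep the ordering behaviour entirely (a sketch, not an official recommendation) is to pass the types via the dtypes argument instead of schema; dtypes entries are matched by column name, so their dict order should not matter. Using the same temp file, the expected result would be:

pl.read_csv(f.name, dtypes={"B": pl.String, "A": pl.Int32})
# shape: (1, 2)
# ┌─────┬─────┐
# │ A   ┆ B   │
# │ --- ┆ --- │
# │ i32 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ 2   │
# └─────┴─────┘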

[Update]: It seems #11723 contains a mention of it (found in the redesign issue).

bradfordlynch (Contributor) commented
Ran into this issue as well. It is particularly surprising because dicts are accepted for the schema argument, yet most callers do not treat key order as significant. I tried various fixes before realizing that it was the order of the keys that was causing my problems. I've created a PR to improve the documentation until this is fixed. For reference, here is a minimal demonstration of the issue:

from io import StringIO

import polars as pl

csv = """A,B
1,"foo"
3,"bar"
"""

buf = StringIO(csv)

# Works fine
schema_good = {"A": pl.Int64, "B": pl.String}
pl.read_csv(buf, schema=schema_good)

# Raises ComputeError
buf.seek(0)
schema_bad = {"B": pl.String, "A": pl.Int64}
pl.read_csv(buf, schema=schema_bad)
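
As a stopgap (a sketch of my own, not from the Polars docs), the out-of-order dict can be rebuilt to follow the file's header order before the read:

# Rebuild schema_bad in the file's column order ("A" before "B").
buf.seek(0)
schema_fixed = {name: schema_bad[name] for name in ["A", "B"]}
pl.read_csv(buf, schema=schema_fixed)  # behaves like schema_good above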
