read_csv does not return columns in the order specified by columns parameter #13066

Open
2 tasks done
mcrumiller opened this issue Dec 15, 2023 · 10 comments
Labels
A-io-csv Area: reading/writing CSV files bug Something isn't working P-medium Priority: medium python Related to Python Polars

Comments

@mcrumiller
Contributor

mcrumiller commented Dec 15, 2023

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
from io import StringIO

csv = (
    "a,b,c\n"
    "1,2,3\n"
    "1,2,3\n"
)

df = pl.read_csv(StringIO(csv), columns=["b", "a", "c"])
print(df)
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 1   ┆ 2   ┆ 3   │
└─────┴─────┴─────┘

Issue description

When the columns parameter is specified to select a subset of columns, the returned column order corresponds to the original frame's order, not the order specified by columns.

Related Issues: #11186, #11535.

Expected behavior

The columns should be returned in the requested order.

Installed versions

--------Version info---------
Polars:               0.19.19
Index type:           UInt32 
Platform:             Windows-10-10.0.19045-SP0
Python:               3.11.2 (tags/v3.11.2:878ead1, Feb  7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  0.4.0
cloudpickle:          <not installed>
connectorx:           0.3.2
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
matplotlib:           3.7.1
numpy:                1.26.1
openpyxl:             3.1.2
pandas:               2.1.1
pyarrow:              11.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.7
xlsx2csv:             0.8.1
xlsxwriter:           3.0.9
@mcrumiller mcrumiller added bug Something isn't working python Related to Python Polars labels Dec 15, 2023
@stinodego stinodego added the accepted Ready for implementation label Dec 16, 2023
@github-project-automation github-project-automation bot moved this to Ready in Backlog Dec 16, 2023
@stinodego
Member

Related to #13040

Happy to accept a fix for this one.

@romanovacca
Contributor

romanovacca commented Dec 22, 2023

I want to pick this one up!
Do we want a simple solution on the Python side? Something like a

df.select(columns)

in read_csv would solve it.

But I believe fixing it on the Rust side would be better, agree?

@stinodego
Member

But I believe fixing it on the Rust side would be better, agree?

Yes! If this code path leads to the Rust side, it would have to be fixed there.

@stinodego stinodego added P-medium Priority: medium and removed accepted Ready for implementation labels Jan 12, 2024
@alexander-beedie alexander-beedie added the A-io-csv Area: reading/writing CSV files label Jan 23, 2024
@stinodego
Member

As I noted in the related PR, the offending code seems to be here:

if let Some(cols) = columns {
    let mut prj = Vec::with_capacity(cols.len());
    for col in cols {
        let i = schema.try_index_of(&col)?;
        prj.push(i);
    }
    // update null values with projection
    if let Some(nv) = null_values.as_mut() {
        nv.apply_projection(&prj);
    }
    projection = Some(prj);
}
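
To make the snippet easier to follow, here is a rough pure-Python sketch of what the loop computes: one schema index per requested column, in the requested order. The function name build_projection is made up for illustration. The second half is an assumption, not something confirmed in this thread: it shows how the requested order would be lost if a later stage reads the projected indices in file order.

```python
def build_projection(schema, columns):
    """Mimic the loop above: look up each requested column's index in the schema."""
    return [schema.index(name) for name in columns]


schema = ["a", "b", "c"]
requested = ["b", "a", "c"]

prj = build_projection(schema, requested)
print(prj)  # [1, 0, 2]

# ASSUMPTION (not confirmed in this thread): a later stage walks the
# projected indices in ascending file order, which discards the requested order.
in_file_order = [schema[i] for i in sorted(prj)]
# Honoring the projection as given would preserve the requested order.
in_requested_order = [schema[i] for i in prj]

print(in_file_order)       # ['a', 'b', 'c']
print(in_requested_order)  # ['b', 'a', 'c']
```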

@stinodego
Member

This should ideally be fixed by making read_csv actually call scan_csv under the hood, and then simply doing .select(columns).collect().

@cbrnr

cbrnr commented Aug 9, 2024

Fixing this would actually make selecting and renaming columns during reading much easier, because with pl.scan_csv() I have to do:

cols = ["A", "F", "C", "E", "D", "B"]
cols_new = ["a", "f", "c", "e", "d", "b"]
df = (
    pl.scan_csv(logfile)
    .select(cols)
    .rename(dict(zip(cols, cols_new)))
)

But this could be much simpler:

df = pl.read_csv(logfile, columns=cols, new_columns=cols_new)

Unless of course I'm missing something.

@mcrumiller
Contributor Author

mcrumiller commented Aug 9, 2024

@cbrnr you are not missing anything; that's the intent, it's just a bit messy right now.

Furthermore, the order shouldn't matter, i.e. you should be able to do:

cols = ["D", "B", "A"]  # notice out of order
cols_new = ["d", "b", "a"]
df = pl.read_csv(logfile, columns=cols, new_columns=cols_new)

...but this is currently broken.

@cbrnr

cbrnr commented Aug 9, 2024

Yes, I know that the column order is currently broken in pl.read_csv(), and it would be very nice if it worked! I just wasn't sure whether the pl.scan_csv() way was too verbose, because I literally started using Polars yesterday (so I'm never sure whether I'm just missing the idiomatic way).

@mcrumiller
Contributor Author

@stinodego was saying this is how it will/should work "under the hood"--meaning you could continue to use read_csv, but the underlying logic would be more consistent/reliable.

@cbrnr

cbrnr commented Aug 9, 2024

Yes, this would be great indeed! Meanwhile, I've switched to scan_csv with the more verbose selecting and renaming, not a big deal, but I just wanted to mention another use case which would be enabled by fixing read_csv.

Projects
Status: Ready
5 participants