# Proposal: Re-design `columns`, `new_columns`, `schema`, `dtypes` in `read_csv` #15431

## Comments
Just a couple comments. I'm skeptical of the value of scanning a BytesIO object. I mean, it's in memory; just read it. If it's taking up enough memory that copying it to DF form makes you go OOM, then you're not going to have much memory for queries anyway; just save a tempfile and scan that. Of course, if someone wants to do it then I'm all for having more features rather than fewer, but it seems like really high-hanging fruit.
@ritchie46 Hi, if you are available, could you please take a look at this proposal? I'm more than happy to help contribute.
Selecting via index is sometimes the only way to select. Just to stay simple, you are ruining the day of so many clients that may need this. Simplicity by omission is evil. I would want to combine selecting columns by indices with …
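For illustration, index-based selection over a CSV is easy to express; here is a minimal stdlib sketch of the use case this comment argues for (`read_columns_by_index` is a hypothetical helper, not Polars):

```python
import csv
import io

# Hypothetical helper: select CSV columns by positional index,
# the use case the comment above argues for.
def read_columns_by_index(text: str, indices: list[int]) -> dict[str, list[str]]:
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    return {header[i]: [row[i] for row in data] for i in indices}

table = read_columns_by_index("a,b,c\n1,2,3\n4,5,6\n", [0, 2])
print(table)  # {'a': ['1', '4'], 'c': ['3', '6']}
```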
We want to take a serious look at these parameters and make the necessary changes, but will probably not get to it for the 1.0.0 release.

So I believe there is no need for a redesign, but we do have a whole host of bugs to fix here. There are already issues for these bugs, so I am closing this issue.
There have been various problems with these parameters. I'm willing to contribute, but I think more things need to be specified before starting work. Here is the proposal.
## List of related issues

- Returned DataFrame's column order did not follow the `columns` parameter. (`read_csv` does not return columns in the order specified by `columns` parameter #13066)
- Interaction between `dtypes` and `schema` is undefined and confusing. (`read_csv`: `dtypes` not working and very confusing #14385; `dtypes` contains columns not in the schema #15605)
- Whether `columns` or `new_columns` are used in `dtypes` is not specified. (Whether `columns` or `new_columns` are used in the `schema` and `dtypes` parameters in `read_csv` #13764)
- `scan_csv` ignores the `new_columns` and `dtypes` arguments when the `Time` type is used in `dtypes`. (#11535)
- `read_csv`'s `schema` argument doesn't work with `new_columns`. (#11186)
- Inconsistent behavior of the `schema` argument between `read_csv` and `scan_csv`. (Consistency around the behavior of the `schema` argument across the API #11723)
- Lack of clear documentation of `schema`, leading to confusion. (`read_csv` fails on `schema` argument when `columns` is also provided #14227)

## Current behaviour
### `schema`

- `schema` will be used in the order of the original dict.
- If a dtype is not supported, an error is raised instead of converting.

### `dtypes` and `schema`

- If `schema` is provided and `dtypes` is passed as a `list` of dtypes, `dtypes` will overwrite `schema`.
- If `schema` is provided and `dtypes` is passed as a map, `dtypes` does nothing. (not expected)
- If `schema` is NOT provided, the schema will be inferred and `dtypes` will overwrite it.

### `new_columns`

- `new_columns` replaces the column names at the very end.
- If `new_columns` is provided, its names are the ones used in `dtypes`
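The interactions described above can be modeled in a few lines of plain Python (a sketch of the described precedence only, with a hypothetical `resolve_schema` helper; not Polars internals):

```python
# Sketch of the precedence described above (not Polars code):
# - schema given + dtypes as a list -> dtypes overwrites schema entries
# - schema given + dtypes as a dict -> dtypes is silently ignored
# - no schema                       -> inferred schema, dtypes overwrites it
def resolve_schema(schema, dtypes, inferred):
    base = dict(schema) if schema is not None else dict(inferred)
    if dtypes is None:
        return base
    if isinstance(dtypes, list):
        for name, dtype in zip(base, dtypes):
            base[name] = dtype
    elif schema is None:  # dict form only takes effect when schema is absent
        base.update(dtypes)
    return base

inferred = {"a": "Int64", "b": "String"}
print(resolve_schema({"a": "Int64", "b": "Int64"}, {"b": "String"}, inferred))
# -> {'a': 'Int64', 'b': 'Int64'}  (dict dtypes ignored when schema is given)
```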
## Intention of the parameters
Users' intentions when using these parameters can be:

- `schema` and `dtypes`: the schema of the output is expected.
- `columns`: …
- `new_columns`: …

## Proposal
### Changes in parameters

- Rename `new_columns` (`read_csv`) to `rename_columns`.
- Remove `with_new_columns` and introduce the `rename_columns` mentioned above in `scan_csv`.
- Add `columns` to `scan_csv`. Since `columns` will interact with `rename_columns` and `dtypes`, I think we should add it.
- Remove `schema`, since the other parameters can replace it entirely.
- Allow a single dtype in `dtypes`, meaning it applies to all columns. (Allow read_csv to set a single dtype for all columns, or all but certain columns #13226)
- Support `Mapping[str, PolarsDataType]` in `dtypes`; a key refers to the name of a renamed column (i.e. in the final DataFrame).
- Support `Mapping[str, str]` in `rename_columns`.
- Support `Mapping[int, PolarsDataType]` / `Mapping[int, str]` with index keys in `dtypes` and `rename_columns`.
- Support `Callable[[str], str]` and `Callable[[Sequence[str]], Sequence[str]]` in `rename_columns`.
  - `Callable[[str], str]` is convenient for tasks like adding a prefix or suffix, or lowercasing names, e.g. `lambda x: x + '_suffix'`.
  - `Callable[[Sequence[str]], Sequence[str]]` is the original form of `with_new_columns` in the current `scan_csv`. If the user needs the index of the columns, this form is better, e.g. `lambda cols: [f'column_{i}' for i, _ in enumerate(cols)]`.
- When a `Sequence` or `Mapping[int, _]` is passed for `dtypes` or `rename_columns`, the index means the index in the `DataFrame` before row-index-column insertion (which means that if `columns` is provided, it follows the order specified in `columns`).

So the final function signature will be like:
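The issue's original snippet is not preserved in this copy. As a rough, non-authoritative sketch of what the proposed signature might look like under the changes above (names, types, and defaults here are all my assumptions):

```python
from typing import Callable, Mapping, Sequence, Union

# Hypothetical sketch only; not the actual Polars API. The parameter set
# mirrors the proposal above: `columns` selects and orders, `rename_columns`
# renames, `dtypes` assigns types -- and `schema` is gone.
def scan_csv(
    source,
    *,
    has_header: bool = True,
    columns: Union[Sequence[int], Sequence[str], None] = None,
    rename_columns: Union[
        Sequence[str],
        Mapping[str, str],
        Mapping[int, str],
        Callable[[str], str],
        Callable[[Sequence[str]], Sequence[str]],
        None,
    ] = None,
    dtypes: Union[
        Mapping[str, object],  # keys are *renamed* column names
        Mapping[int, object],  # keys are pre-row-index column indices
        Sequence[object],
        object,                # a single dtype applied to all columns
        None,
    ] = None,
):
    ...
```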
## New process pipeline

### Scan period

1. Get the CSV column names.
   - If `has_header`, the CSV column names are read from the first row.
   - If not `has_header`, the CSV column names are `f'column_{n+1}'` (original behavior).
2. Get the final order of columns according to `columns`, if provided.
   - If `columns` is provided, get the CSV column index of each column.
   - Current information should be like: …
3. Rename column names according to `rename_columns`.
4. Inject dtype information according to `dtypes`, if present.
   - Current information should be like: …
5. Infer the dtype where `dtype` is `None` (not provided in `dtypes`) from the CSV, according to `csv_col_idx`.
6. Insert the row index column if needed.

Now the scan period ends, and we have a schema like: …

### Read period

`csv_col_idx` …
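The scan-period steps above can be simulated in plain Python (a sketch of the proposed resolution order, with hypothetical names such as `plan_scan`; not Polars internals):

```python
# Simulation of the proposed scan-period pipeline (hypothetical helper,
# not Polars code). Each entry keeps its original csv_col_idx so the
# read period can pull the right raw column.
def plan_scan(csv_names, columns=None, rename_columns=None, dtypes=None):
    # Steps 1-2: resolve selection and final order.
    if columns is not None:
        order = [(csv_names.index(c), c) for c in columns]
    else:
        order = list(enumerate(csv_names))
    plan = [{"csv_col_idx": i, "name": n, "dtype": None} for i, n in order]
    # Step 3: rename.
    if rename_columns:
        for col in plan:
            col["name"] = rename_columns.get(col["name"], col["name"])
    # Step 4: inject dtypes (keys refer to the *renamed* names).
    if dtypes:
        for col in plan:
            col["dtype"] = dtypes.get(col["name"])
    # Step 5: infer where dtype is still None (stubbed as "inferred").
    for col in plan:
        if col["dtype"] is None:
            col["dtype"] = "inferred"
    return plan

plan = plan_scan(
    ["a", "b", "c"],
    columns=["c", "a"],
    rename_columns={"c": "c2"},
    dtypes={"c2": "Int64"},
)
print(plan)
# [{'csv_col_idx': 2, 'name': 'c2', 'dtype': 'Int64'},
#  {'csv_col_idx': 0, 'name': 'a', 'dtype': 'inferred'}]
```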
## Other behaviour changes

### Stabilize the output schema

In order to stabilize the output schema, raise an error when a column named in `dtypes`, `rename_columns` or `columns` does not occur in the CSV. E.g. …

## How the new design fits users' intention
- `dtypes`: when the `schema` of the output is expected, use `dtypes` and `columns`. We don't really need `schema` here.
- `columns`: …
- `rename_columns`: `dtypes` now uses the new column names after processing, according to `rename_columns`.
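The stricter validation proposed under "Stabilize the output schema" could look like this (plain-Python sketch with a hypothetical `validate_references` helper, not Polars code):

```python
# Sketch of the proposed stricter validation: referencing a column that
# does not exist in the CSV raises instead of being silently ignored.
def validate_references(csv_names, columns=(), dtypes=(), rename_columns=()):
    known = set(csv_names)
    for name in [*columns, *dtypes, *rename_columns]:
        if name not in known:
            raise ValueError(f"column {name!r} not found in CSV header {csv_names}")

try:
    validate_references(["a", "b"], dtypes=["c"])
except ValueError as e:
    print(e)  # column 'c' not found in CSV header ['a', 'b']
```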
Other issues
Those issue can be handled together.
read_csv_batched
skips columns and ignoresbatch_size
ifdtypes
partially providedread_csv_batched
skips columns and ignoresbatch_size
ifdtypes
partially provided #9056pl.read_csv_batched()
fails ifdtypes
is provided and not all columns are used.pl.read_csv_batched()
fails ifdtypes
is provided and not all columns are used #9654scan_csv
does not raise when schema length does not match data.Unify
read_csv
andscan_csv
functionsread
andscan
functions #13040scan_csv
read_csv
scan_csv
#7287columns
parameterBytesIO
/StringIO
as inputscan_csv
#4950, Support BytesIO, StringIO etc. in scan_csv() #12617read_csv
#10706with_column_names
:Callable[[list[str]], list[str]]
new_columns
:Sequence[str]
with_column_names
raise error while working withrow_count_name
inscan_csv
Allow multiple positional arguments for
pl.scan_csv()
pl.scan_csv()
#12622Allow using single dtype in
dtypes
which means all columns should be this dtype. (pyo3_runtime.PanicException: python function failed: PyErr { type: <class 'TypeError'>, value: TypeError("'list' object is not callable"), traceback: None } #15484
Slice for the first rows is slow for CSV file with hundred columns and millions rows #11157
Related issue:
df.assert_schema(expected_schema)
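A minimal sketch of what a `df.assert_schema(expected_schema)` helper could do, written here over a plain dict schema (hypothetical, not an existing Polars API):

```python
# Hypothetical assert_schema helper over a plain dict schema; the issue
# floats `df.assert_schema(expected_schema)` as an API idea.
def assert_schema(actual: dict, expected: dict) -> None:
    if actual != expected:
        raise AssertionError(f"schema mismatch: {actual} != {expected}")

assert_schema({"a": "Int64"}, {"a": "Int64"})  # passes silently
```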