
Proposal: Re-design columns, new_columns, schema, dtypes in read_csv #15431

Closed

CanglongCl opened this issue Apr 2, 2024 · 7 comments
Labels
A-api (Area: changes to the public API) · A-io-csv (Area: reading/writing CSV files) · enhancement (New feature or an improvement of an existing feature) · reference (Reference issue for recurring topics)

Comments

@CanglongCl
Contributor

CanglongCl commented Apr 2, 2024

There have been various problems with these parameters. I'm willing to contribute, but I think a few things need to be specified before work starts. Here is the proposal.

List of related issues

Current behaviour

schema

  • Both names and dtypes in schema are used, in the order of the original dict.
    • Names from the CSV header are ignored. (This can be confusing.)
    • Dtypes are not inferred.
  • If a dtype is not supported, an error is raised instead of converting.

dtypes and schema

  • If schema is provided and dtypes is passed as a list of dtypes, dtypes overwrites schema.
  • If schema is provided and dtypes is passed as a mapping, dtypes does nothing. (Not expected; see the example below.)
  • If schema is NOT provided, the schema is inferred and dtypes overwrites it.
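
For illustration, a minimal reproduction of the reported mapping case (illustrative only; the exact behavior depends on the Polars version current at the time):

import io
import polars as pl

csv = io.StringIO("a,b\n1,2\n")

# Reported behavior: when `schema` is given, a `dtypes` mapping is
# silently ignored instead of overriding the schema.
df = pl.read_csv(csv, schema={"a": pl.Int64, "b": pl.Int64}, dtypes={"a": pl.Float64})
# df.schema is reportedly {"a": Int64, "b": Int64}; the mapping had no effect.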

new_columns

  • new_columns replaces the column names at the end.
  • If new_columns is provided, those names are the ones used in dtypes.

Intention of the parameters

Users' intentions when using these parameters can be:

  • schema and dtypes:
    • Specify the data type for a specific column.
    • Ensure the output schema is as expected.
  • columns:
    • Select some columns from the CSV.
      • Select by name of the original columns
      • Select by column index
    • Select columns in a specific order.
  • new_columns:
    • Rename columns.
    • Avoid ugly column names in other parameters.

Proposal

Changes in parameters

  • Rename new_columns (read_csv) to rename_columns.
    • 'new' usually refers to creating new columns; 'rename' is more accurate.
  • Deprecate with_new_columns in scan_csv and introduce the rename_columns mentioned above.
  • Bring columns to scan_csv.
  • Deprecate schema, since the other parameters can replace it entirely.
    • See how the new parameters resolve the users' intentions below.
  • Allow a single dtype in dtypes, meaning it applies to all columns. (Allow read_csv to set a single dtype for all columns, or all but certain columns #13226)
  • Specify that when Mapping[str, PolarsDataType] is used in dtypes, the key refers to the renamed column name (in the final DataFrame).
  • Allow Mapping[str, str] in rename_columns.
  • Allow Mapping[int, PolarsDataType]/Mapping[int, str] with index keys in dtypes and rename_columns.
  • Allow Callable[[str], str] and Callable[[Sequence[str]], Sequence[str]] in rename_columns.
    • Callable[[str], str] is convenient for tasks like adding a prefix or suffix, or lowercasing names. E.g. lambda x: x + '_suffix'.
    • Callable[[Sequence[str]], Sequence[str]] is the original form of with_new_columns in the current scan_csv. If the user needs column indices, this form is better. E.g. lambda cols: [f'column_{i}' for i, _ in enumerate(cols)]
  • If a Sequence or Mapping[int, _] is passed for dtypes or rename_columns, the index means the index in the DataFrame before the row index column is inserted (so if columns is provided, it follows the order specified in columns).

So the final function signature would look like:

def read_csv(
    ...,
    columns: Sequence[int] | Sequence[str] | None = None,
    rename_columns: Sequence[str] | Callable[[str], str] | Callable[[Sequence[str]], Sequence[str]] | Mapping[str, str] | Mapping[int, str] | None = None,
    ...,
    dtypes: Mapping[str, PolarsDataType] | Sequence[PolarsDataType] | Mapping[int, PolarsDataType] | PolarsDataType | None = None,
    # No `schema` anymore
    ...,
) -> pl.DataFrame:
    ...
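
A hypothetical call under the proposed signature might look like this (none of these parameters exist in this form; the names and semantics are only those sketched above):

import polars as pl

# Hypothetical usage of the proposed parameters (illustration only):
df = pl.read_csv(
    "data.csv",
    columns=[1, 0],                            # select by index, in this order
    rename_columns=lambda name: name.lower(),  # per-name callable form
    dtypes={"aaa": pl.String},                 # keys refer to the *renamed* columns
)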

New process pipeline

Scan period

  1. Get the CSV column names.

    • If has_header, the CSV column names are read from the first row.
    • If not has_header, the CSV column names are f'column_{n+1}' (original behavior).
  2. Get the final order of columns according to columns, if provided.

    • If columns is provided, get the CSV column index of each column.
    • If columns is not provided, the original order is maintained.

    The information at this point should look like:

    [
      {
        column_name: "aaa", 
        csv_col_idx: 1
      }, 
      {
        column_name: "bbb", 
        csv_col_idx: 0
      }, 
    ]
  3. Rename column names according to rename_columns.

  4. Inject dtype information according to dtypes, if present.

    The information now looks like:

    [
      {
        column_name: "aaa", 
        csv_col_idx: 1, 
        dtype: String
      }, 
      {
        column_name: "bbb", 
        csv_col_idx: 0, 
        dtype: None
      }, 
    ]
  5. Infer the dtype from the CSV (using csv_col_idx) wherever dtype is None, i.e. not provided in dtypes:

    [
      {
        column_name: "aaa", 
        csv_col_idx: 1, 
        dtype: String
      }, 
      {
        column_name: "bbb", 
        csv_col_idx: 0, 
        dtype: Int32
      }, 
    ]
  6. Insert the row index column if needed.

Now the scan period ends, and we have a schema like:

[
  {
    column_name: "aaa", 
    dtype: String
  }, 
  {
    column_name: "bbb", 
    dtype: Int32
  }, 
]
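
As a rough illustration, here is a minimal Python sketch of the scan-period resolution described above. ResolvedColumn and resolve_scan_schema are hypothetical names; dtype inference (step 5) and the row index column (step 6) are omitted, and only the mapping forms of rename_columns and dtypes are handled:

from collections.abc import Mapping, Sequence
from dataclasses import dataclass

@dataclass
class ResolvedColumn:
    column_name: str
    csv_col_idx: int
    dtype: object | None = None  # a Polars dtype, or None until inferred

def resolve_scan_schema(
    csv_names: Sequence[str],
    columns: Sequence[int] | Sequence[str] | None = None,
    rename_columns: Mapping[str, str] | None = None,
    dtypes: Mapping[str, object] | None = None,
) -> list[ResolvedColumn]:
    # Step 2: final order according to `columns` (indices or names);
    # unknown references raise, per the "stabilize the output schema" rule.
    if columns is None:
        order = list(range(len(csv_names)))
    elif all(isinstance(c, int) for c in columns):
        order = list(columns)  # IndexError below if out of range
    else:
        order = [list(csv_names).index(c) for c in columns]  # ValueError if missing

    resolved = [ResolvedColumn(csv_names[i], i) for i in order]

    # Step 3: rename (mapping form; the callable forms are analogous).
    if rename_columns:
        for col in resolved:
            col.column_name = rename_columns.get(col.column_name, col.column_name)

    # Step 4: inject user-specified dtypes, keyed by the *renamed* names.
    if dtypes:
        for col in resolved:
            col.dtype = dtypes.get(col.column_name)

    return resolved

# Example, matching the walk-through above:
cols = resolve_scan_schema(
    csv_names=["bbb", "aaa"],
    columns=["aaa", "bbb"],
    dtypes={"aaa": "String"},
)
# -> [ResolvedColumn('aaa', 1, 'String'), ResolvedColumn('bbb', 0, None)]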

Read Period

  1. For each column, read from the CSV according to csv_col_idx.
  2. If a dtype cast is needed, cast after reading.
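
For instance, step 2 corresponds to something like the following with the public API (illustration only; in the reader itself the cast would happen internally):

import polars as pl

# Illustration: values are materialized with the parsed dtype first,
# then cast to the requested dtype after reading.
s = pl.Series("bbb", [1, 2, 3])  # as read from the CSV (Int64)
s = s.cast(pl.Float64)           # cast applied after reading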

Other behaviour changes

Stabilize the output schema

In order to stabilize the output schema, raise an error when a column referenced in dtypes, rename_columns, or columns does not occur (see the sketch after this list), e.g.:

  • index out of range
  • column name not found at the current stage
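
A minimal sketch of such a validation pass (validate_references is a hypothetical helper, named only for illustration):

from collections.abc import Sequence

def validate_references(
    csv_names: Sequence[str],
    columns: Sequence[int] | Sequence[str] | None,
) -> None:
    # Fail loudly instead of silently producing a different output schema.
    for c in columns or []:
        if isinstance(c, int) and not (0 <= c < len(csv_names)):
            raise IndexError(f"column index {c} out of range for {len(csv_names)} columns")
        if isinstance(c, str) and c not in csv_names:
            raise ValueError(f"column {c!r} not found in the CSV header")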

How the new design fits users' intentions

  • Specify the data type for a specific column.
    • by dtypes
  • Ensure the output schema is as expected.
    • by specifying both dtypes and columns. We don't really need schema here.
  • Select some columns from the CSV.
    • by columns
  • Select columns in a specific order.
    • by columns
  • Rename columns.
    • by rename_columns
  • Avoid ugly column names in other parameters.
    • dtypes now uses the new column names, after processing according to rename_columns

Other issues

Those issues can be handled together.

Related issue:

@deanm0000
Collaborator

Just a couple comments.

I think scan_csv ought to stay trim and not have more parameters introduced to it. For example, pl.scan_csv(path).select(my_columns) doesn't seem so onerous that we also need pl.scan_csv(path, columns=my_columns).

I think schema shouldn't be deprecated, but maybe there should be a check so that if schema is given and columns or dtypes are also given, it raises an error. Maybe it already does this; I don't know offhand.

I'm skeptical of the value of scanning a BytesIO object. I mean, it's in memory; just read it. If it's taking up enough memory that copying it to DataFrame form makes you go OOM, then you're not going to have much memory for queries anyway; just save a tempfile and scan that. Of course, if someone wants to do it then I'm all for having more features rather than fewer, but it seems like really high-hanging fruit.

@CanglongCl
Contributor Author

CanglongCl commented Apr 2, 2024

I think schema shouldn't be deprecated, but maybe there should be a check so that if schema is given and columns or dtypes are also given, it raises an error. Maybe it already does this; I don't know offhand.

Currently, schema just means "do not infer the schema; use mine", and further behavior is undefined. In the new implementation we don't need it, since the new design of dtypes + columns + new_columns can completely replace it without any performance loss. The new design also ensures the output schema is stable (see the "Stabilize the output schema" section).

@CanglongCl
Contributor Author

I think scan_csv ought to stay trim and not have more parameters introduced to it. For example, pl.scan_csv(path).select(my_columns) doesn't seem so onerous that we also need pl.scan_csv(path, columns=my_columns).

One advantage of adding columns is that we can select via index instead of column names. But you are right: without columns, we can achieve the same result with select. However, making read_csv and scan_csv consistent is also good for users switching between the two APIs.

@CanglongCl
Contributor Author

@ritchie46 Hi, if you're available, could you please take a look at this proposal? I'm more than happy to help contribute.

@caniko

caniko commented Apr 27, 2024

One advantage of adding columns is that we can select via index instead of column names. But you are right: without columns, we can achieve the same result with select. However, making read_csv and scan_csv consistent is also good for users switching between the two APIs.

Selecting via index is sometimes the only way to select. Just to stay simple, you are ruining the day of so many clients who may need this. Simplicity by omission is evil.

I would want to combine selecting columns by indices with schema, which is not (and should not be) possible with the existing lazy API.

@stinodego stinodego added this to the 1.0.0 milestone May 25, 2024
@stinodego stinodego added the accepted Ready for implementation label May 26, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog May 26, 2024
@stinodego stinodego moved this from Ready to Next in Backlog May 26, 2024
@stinodego stinodego added needs decision Awaiting decision by a maintainer and removed accepted Ready for implementation labels May 26, 2024
@stinodego stinodego removed this from the 1.0.0 milestone May 27, 2024
@stinodego
Member

We want to take a serious look at these parameters and make the necessary changes, but will probably not get to it for the 1.0.0 release.

@stinodego
Member

stinodego commented Jun 8, 2024

dtypes has been renamed to schema_overrides. This brings it in line with the other read/scan methods. The intended behavior of schema + schema_overrides is documented here:
#11723 (comment)

new_columns should be ignored if schema is also provided.
If new_columns and schema_overrides are both provided, schema_overrides should take effect after new_columns has been resolved.

columns should take effect at the very end. It's like calling scan_csv and then calling .select(columns).collect() on it.

So I believe there is no need for a redesign, but we do have a whole host of bugs to fix here. There are already issues for these bugs, so I am closing this issue.
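
For concreteness, a small example of the intended interaction described above (illustrative; several of the underlying bugs were still open at the time, so actual behavior may differ):

import io
import polars as pl

csv = io.StringIO("a,b\n1,2\n")

# new_columns renames positionally first; schema_overrides then keys off
# the *new* names; columns acts like a final .select().
df = pl.read_csv(
    csv,
    new_columns=["x", "y"],
    schema_overrides={"x": pl.Float64},
    columns=["y", "x"],
)
# Intended result: columns ("y", "x") with dtypes (Int64, Float64).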

@stinodego stinodego added the reference Reference issue for recurring topics label Jun 8, 2024
@github-project-automation github-project-automation bot moved this from Next to Done in Backlog Jun 8, 2024