Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(rust, python): add encoding parameter to write_csv #10326

Closed
wants to merge 1 commit into from

Conversation

borchero
Copy link
Contributor

@borchero borchero commented Aug 7, 2023

Motivation

In some circumstances, a CSV file should have an encoding other than utf-8. An example use case is a CSV file that is used to bulk-insert data into SQL Server where a more efficient single-byte encoding (e.g. Windows-1252) allows for smaller columns.

Changes

  • Add a new encoding parameter to write_csv which allows for all values that encoding_rs supports
  • Add a new TranscodingWriter which implements the Write trait and allows to transcode a UTF-8 stream into another character encoding
  • Add a simple test case to the Python tests

Benchmark

Encoding to anything but UTF-8 is rather slow. Transcoding itself incurs a performance penalty of ~50% (i.e. if explicitly transcoding from UTF-8 to UTF-8). Encoding to Windows-1252 takes 8x longer than encoding to UTF-8.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Aug 7, 2023
@ritchie46
Copy link
Member

But we don't support encodings in the reader. This means we could only write, but not round trip.

@orlp
Copy link
Collaborator

orlp commented Aug 7, 2023

I think this adds complexity we don't really need or want. We store everything as UTF-8 internally, so encoding to anything else is going to incur a re-encoding step regardless. Since this is the case there isn't really a performance argument as to why this should happen inside Polars instead of it being done by the caller. As @ritchie46 mentions this also opens up a can of worms for round-tripping, and I think we really don't want to get into the minefield of input encodings.

@borchero
Copy link
Contributor Author

borchero commented Aug 7, 2023

This means we could only write, but not round trip.

This is true, but would also be sufficient for the motivation I outlined above (not to say that this is desirable...)

I think we really don't want to get into the minefield of input encodings

Maybe I'm not familiar enough with encoding but can't we just build a streaming reader just like the writer is implemented in this PR? 😄

Since this is the case there isn't really a performance argument as to why this should happen inside Polars

I'm unsure how to prevent having the entire CSV file in memory when the caller has to take care of it. Any idea? Happy to do that instead if I don't double my memory usage 👀

@orlp
Copy link
Collaborator

orlp commented Aug 7, 2023

I'm unsure how to prevent having the entire CSV file in memory when the caller has to take care of it. Any idea? Happy to do that instead if I don't double my memory usage 👀

In Rust, you create a Writer that does the re-encoding on-the-fly, and pass that as the inner writer for CsvWriter.

In Python you pass an object with a write(b) method (as well as dummy read and seek methods) as the first argument to write_csv that does the re-encoding on the fly as well.

In both cases you need to be careful around UTF-8 codepoint boundaries, the buffer with which write gets called might end in the middle of a codepoint.

@borchero
Copy link
Contributor Author

borchero commented Aug 8, 2023

All right, we'll implement it on the client-side, let's see how fast that is 😬

@borchero borchero closed this Aug 8, 2023
@borchero borchero deleted the csv-encoding branch August 8, 2023 14:09
@aisayko
Copy link

aisayko commented Aug 11, 2023

#9433

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants