
feat: Implement LazyFrame.sink_ndjson #10786

Merged

26 commits merged on Nov 20, 2023

Conversation

fernandocast
Contributor

@fernandocast fernandocast commented Aug 29, 2023

Closes #10762

Implements a new sink_json method to support the JSON format in streaming mode.

crates/polars-lazy/src/frame/mod.rs

// if we don't allow threads and we have udfs trying to acquire the gil from different
// threads we deadlock.
py.allow_threads(|| {
Contributor

Also, this here is duplicated for every sink.

Could we instead have a single generic sink method that takes an instance of a BatchedWriter?
That way we could add more sink types much more easily.

WDYT?
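A rough sketch of that suggestion, assuming a hypothetical BatchedWriter trait and a generic sink loop; none of these names are the actual Polars API, and the sketch moves strings instead of DataFrame chunks so it stays self-contained:

```rust
use std::io::Write;

/// Hypothetical batched-writer abstraction; each output format implements it.
trait BatchedWriter {
    fn write_batch(&mut self, rows: &[String]) -> std::io::Result<()>;
    fn finish(&mut self) -> std::io::Result<()>;
}

/// An NDJSON-style implementation: one JSON document per line.
struct NdjsonWriter<W: Write> {
    inner: W,
}

impl<W: Write> BatchedWriter for NdjsonWriter<W> {
    fn write_batch(&mut self, rows: &[String]) -> std::io::Result<()> {
        for row in rows {
            writeln!(self.inner, "{row}")?;
        }
        Ok(())
    }

    fn finish(&mut self) -> std::io::Result<()> {
        self.inner.flush()
    }
}

/// The single generic sink loop: formats differ only in the writer passed in.
fn sink(writer: &mut dyn BatchedWriter, batches: &[Vec<String>]) -> std::io::Result<()> {
    for batch in batches {
        writer.write_batch(batch)?;
    }
    writer.finish()
}

fn main() -> std::io::Result<()> {
    let mut buf: Vec<u8> = Vec::new();
    {
        let mut writer = NdjsonWriter { inner: &mut buf };
        sink(
            &mut writer,
            &[vec![r#"{"a":1}"#.to_string()], vec![r#"{"a":2}"#.to_string()]],
        )?;
    }
    print!("{}", String::from_utf8(buf).unwrap());
    Ok(())
}
```

Under this design, adding a new sink type would only require a new BatchedWriter implementation; the generic sink loop itself stays untouched.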

Contributor

Also agree on this one; I will try to refactor this code as well.

Contributor

After looking at the code, I don't think we can do much on this part 😞. There are only 2-3 lines of duplicated code because of the py.allow_threads call; it would be possible to remove the duplication if pyo3 implemented an allow_threads attribute so that the whole function executes with the GIL released.
It would be nice to be able to pass a BatchedWriter instance, but then we would be mixing API code and task-execution code. I think it is a better design to keep a separation of concerns that lets the API and the task code evolve independently.

Contributor

@svaningelgem svaningelgem Sep 4, 2023

I'll think about it a bit more tomorrow and come back if I think of a good way to handle it.
A link for my own benefit: Parallelism

Contributor

Thanks for looking into it. Nothing came to mind when I started pondering deadlocks, the GIL and such :-)

assert_frame_equal(df, expected)


def test_sink_json_should_support_with_options(io_files_path: Path, tmp_path: Path) -> None:
Contributor

Nothing is being tested in this method?

Contributor

Sorry, we are still finishing the testing 😄

@fernandocast fernandocast marked this pull request as ready for review September 5, 2023 00:41
@stinodego stinodego changed the title from "Feature sink json" to "feat: Implement LazyFrame.sink_json" Sep 5, 2023
@github-actions github-actions bot added the enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars) and rust (Related to Rust Polars) labels Sep 5, 2023
@fernandocast
Contributor Author

fernandocast commented Sep 6, 2023

Hi everyone,

This pull request is ready. We have two failing checks; however, they are not in the files we modified for this feature.

Any feedback or comments are welcome.

@abealcantara
Contributor

abealcantara commented Sep 7, 2023

Some tests started failing after merging with main; we are going to fix them.

/// # Panics
/// Panics if the chunks in the given [`DataFrame`] are not aligned.
fn write_batch(&mut self, df: &DataFrame) -> PolarsResult<()> {
let fields = df.iter().map(|s| s.field().to_arrow()).collect::<Vec<_>>();
Contributor

Personally, I would make a class in between that handles all the generic parts shared between JsonWriter and JsonLinesBatchedWriter. In the final classes, I would just implement an on_block_write and an on_finish.

My 2ct :)
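In Rust, that "class in between" idea could be sketched as a trait whose default method carries the shared batching loop, with each concrete writer only supplying the two hooks. Everything below (JsonSink, JsonLinesWriter, the hook names) is an illustrative assumption, not the actual Polars code:

```rust
use std::io::Write;

/// Shared logic lives in the default method; concrete writers implement
/// only the two hooks. Illustrative sketch, not the Polars implementation.
trait JsonSink {
    fn on_batch_write(&mut self, rows: &[String]) -> std::io::Result<()>;
    fn on_finish(&mut self) -> std::io::Result<()>;

    /// The generic part: iterate over all batches, then finalize.
    fn sink_all(&mut self, batches: &[Vec<String>]) -> std::io::Result<()> {
        for batch in batches {
            self.on_batch_write(batch)?;
        }
        self.on_finish()
    }
}

/// JSON-lines writer: one document per line, only a flush to finalize.
struct JsonLinesWriter<W: Write>(W);

impl<W: Write> JsonSink for JsonLinesWriter<W> {
    fn on_batch_write(&mut self, rows: &[String]) -> std::io::Result<()> {
        for row in rows {
            writeln!(self.0, "{row}")?;
        }
        Ok(())
    }

    fn on_finish(&mut self) -> std::io::Result<()> {
        self.0.flush()
    }
}

fn main() -> std::io::Result<()> {
    let mut buf: Vec<u8> = Vec::new();
    JsonLinesWriter(&mut buf).sink_all(&[vec![r#"{"x":1}"#.to_string()]])?;
    print!("{}", String::from_utf8(buf).unwrap());
    Ok(())
}
```

A second writer (e.g. one producing a single JSON array) would reuse sink_all unchanged and only override the hooks, which is the template-method shape the comment describes.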

Contributor

@abealcantara abealcantara Sep 11, 2023

Yeah, we were thinking of something similar; that's why we implemented the BatchedWriter trait in the json module. But it might be better to expose the SinkWriter trait (which is part of the pipeline execution), or maybe a struct with some common code as you mentioned, to each of the IO modules, so that each module implements this interface and we can remove all the implementations currently in file_sink.rs. Of course, this creates an extra dependency, as the IO modules would then depend on a trait from the execution core, but I think it makes sense since we would only be exposing an interface, or a collection of interfaces, that each module needs to implement. Please let me know your thoughts on this 😄

Contributor

Maybe you can chime in on #11056? That is very much in line with what you are saying here, just on a broader scale.

I just asked Ritchie for his opinion there, but with your knowledge of the framework, you might have more concrete ideas on how to handle it?


@stinodego
Member

I'll review this more in-depth later, but I already noticed a doc entry on the Python side is missing. So writing it down now before I forget :)

@fernandocast
Contributor Author

> I'll review this more in-depth later, but I already noticed a doc entry on the Python side is missing. So writing it down now before I forget :)

Hi @stinodego, I hope you are doing well.
We were wondering if we could help add this feature to the Polars documentation. We would appreciate it if you could explain what that process looks like, or share a link on how to contribute to the documentation.

Regarding this PR, is there anything else missing before it can be merged?

Collaborator

Since we have the JsonFormat enum, I think it may be easier to use if the batched writers were combined & accepted a JsonFormat as an input.
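Such a combined writer might look roughly like the sketch below. This is a hedged illustration over strings rather than DataFrame chunks, and BatchedJsonWriter with its methods is a hypothetical name, not the actual Polars type; only the JsonFormat enum idea comes from the discussion:

```rust
use std::io::Write;

/// Output flavor selector, mirroring the idea of the JsonFormat enum.
enum JsonFormat {
    /// One JSON array containing all rows.
    Json,
    /// One JSON document per line (NDJSON).
    JsonLines,
}

/// A single batched writer that dispatches on the format.
struct BatchedJsonWriter<W: Write> {
    inner: W,
    format: JsonFormat,
    rows_written: usize,
}

impl<W: Write> BatchedJsonWriter<W> {
    fn new(inner: W, format: JsonFormat) -> Self {
        Self { inner, format, rows_written: 0 }
    }

    fn write_batch(&mut self, rows: &[String]) -> std::io::Result<()> {
        for row in rows {
            match self.format {
                JsonFormat::Json => {
                    // Open the array on the first row, separate with commas after.
                    let prefix = if self.rows_written == 0 { "[" } else { "," };
                    write!(self.inner, "{prefix}{row}")?;
                }
                JsonFormat::JsonLines => writeln!(self.inner, "{row}")?,
            }
            self.rows_written += 1;
        }
        Ok(())
    }

    fn finish(&mut self) -> std::io::Result<()> {
        if matches!(self.format, JsonFormat::Json) {
            // Close the array (emit an empty one if nothing was written).
            if self.rows_written == 0 {
                write!(self.inner, "[")?;
            }
            write!(self.inner, "]")?;
        }
        self.inner.flush()
    }
}

fn main() -> std::io::Result<()> {
    let mut buf: Vec<u8> = Vec::new();
    {
        let mut w = BatchedJsonWriter::new(&mut buf, JsonFormat::Json);
        w.write_batch(&[r#"{"a":1}"#.to_string(), r#"{"a":2}"#.to_string()])?;
        w.finish()?;
    }
    assert_eq!(String::from_utf8(buf).unwrap(), r#"[{"a":1},{"a":2}]"#);
    Ok(())
}
```

One writer type then serves both sinks; the caller picks the behavior by passing the format value instead of choosing between two writer structs.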

Contributor Author

Fixed it, could you check again?

@@ -155,6 +161,101 @@ where
}
}

pub trait BatchedWriter<W: Write> {
Collaborator

Can we reuse the SinkWriter trait here instead of adding a new trait?

Contributor Author

Since we changed the implementation to accept a JsonFormat as an input, the BatchedWriter trait is no longer necessary.

Contributor

Just as a comment: it would be great if we could expose SinkWriter for use by other crates. That way we could move all the SinkWriter implementations to the corresponding modules. 😄

Comment on lines 195 to 198
let fields = df.iter().map(|s| s.field().to_arrow()).collect::<Vec<_>>();
let batches = df
.iter_chunks()
.map(|chunk| Ok(Box::new(chunk_to_struct(chunk, fields.clone())) as ArrayRef));
Collaborator

couldn't we use into_struct here instead?

Contributor Author

We tried, but the resulting implementation did not seem very transparent. Do you have an example?

@fernandocast fernandocast force-pushed the feature_sink_json branch 2 times, most recently from a863625 to 2e4e179 Compare September 27, 2023 22:19
Member

@ritchie46 ritchie46 left a comment


Thank you for your patience @fernandocast. It is going in now. :)

@ritchie46 ritchie46 merged commit 1bcdf00 into pola-rs:main Nov 20, 2023
26 checks passed
@fernandocast
Contributor Author

> Thank you for your patience @fernandocast. It is going in now. :)

My pleasure, @ritchie46. My friend and I are convinced of the potential that Polars has.

Successfully merging this pull request may close these issues.

Implement sink_json to reduce memory consumption
6 participants