
Processing large datasets in a streaming manner #9772

Answered by OwenKephart
OwenKephart asked this question in Q&A

While a bit convoluted, it is possible to use generators for this purpose, provided you use a custom IOManager. You can't yield the chunks directly from your asset or op (Dagster only permits yielding structured events such as AssetObservation or Output from the body of an op), but you can return a generator function, which is then consumed inside the IOManager itself. A quick example of this pattern looks something like the following:

from dagster import IOManager


class StreamingIOManager(IOManager):

    def handle_output(self, context, obj) -> None:
        # obj is a generator function; iterating it here streams
        # each chunk straight to disk without holding them all in memory
        with open("my_output.txt", "wb") as f:
            for chunk in obj():
                f.write(chunk)

    def load_input(self, context):
        # Counterpart sketch: hand downstream ops a generator function
        # so they can stream the data back chunk by chunk as well
        def _chunks():
            with open("my_output.txt", "rb") as f:
                while chunk := f.read(8192):
                    yield chunk
        return _chunks
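To make the data flow concrete, here is a minimal, framework-free sketch of the same pattern: the op body returns a generator function rather than the chunks themselves, and the IOManager-like consumer is the only place the generator is actually iterated. The names `produce_chunks` and `consume` are illustrative stand-ins, not part of the Dagster API.

```python
import os
import tempfile

def produce_chunks():
    """Stands in for an op/asset body: returns a generator FUNCTION,
    so no chunk is materialized until the consumer iterates it."""
    def _chunks():
        for i in range(3):
            # In practice each chunk might be a block read from a large source
            yield f"chunk-{i}\n".encode()
    return _chunks

def consume(obj, path):
    """Stands in for StreamingIOManager.handle_output: call the
    generator function and stream each chunk straight to disk."""
    with open(path, "wb") as f:
        for chunk in obj():
            f.write(chunk)

path = os.path.join(tempfile.mkdtemp(), "my_output.txt")
consume(produce_chunks(), path)
with open(path, "rb") as f:
    print(f.read().decode())
```

The key design point is that the generator function crosses the op/IOManager boundary unevaluated, so memory usage stays bounded by the chunk size rather than the dataset size.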
