

Consider omitting unchunked dimensions from Key objects created with DatasetToChunks #43

Open
shoyer opened this issue May 6, 2022 · 1 comment

shoyer commented May 6, 2022

Currently we have (from https://xarray-beam.readthedocs.io/en/latest/read-write.html, where ds and print_summary are defined):

import apache_beam as beam
import xarray_beam as xbeam

with beam.Pipeline() as p:
    p | xbeam.DatasetToChunks(ds, chunks={'time': 1000}) | beam.MapTuple(print_summary)
Key(offsets={'lat': 0, 'lon': 0, 'time': 0}, vars=None)
  with <xarray.Dataset data_vars=['air'] dims={'lat': 25, 'time': 1000, 'lon': 53}>
Key(offsets={'lat': 0, 'lon': 0, 'time': 1000}, vars=None)
  with <xarray.Dataset data_vars=['air'] dims={'lat': 25, 'time': 1000, 'lon': 53}>
Key(offsets={'lat': 0, 'lon': 0, 'time': 2000}, vars=None)
  with <xarray.Dataset data_vars=['air'] dims={'lat': 25, 'time': 920, 'lon': 53}>

Should we instead omit lat and lon from these keys? This is less explicit but also more flexible: e.g., if these dimensions are replaced entirely with different dimensions, the keys don't need to be updated.
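For illustration, under this proposal the keys from the example above might look like the following (hypothetical output, assuming offsets for the unchunked lat and lon dimensions are simply dropped):

Key(offsets={'time': 0}, vars=None)
Key(offsets={'time': 1000}, vars=None)
Key(offsets={'time': 2000}, vars=None)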

copybara-service bot pushed a commit that referenced this issue Aug 31, 2022
This allows for splitting datasets across variables even when those variables have different dimensions. See the new integration test for a concrete use-case, resembling real model output.

Fixes #43

PiperOrigin-RevId: 471347485
copybara-service bot pushed a commit that referenced this issue Aug 31, 2022
This allows for splitting datasets across variables even when those variables have different dimensions. See the new integration test for a concrete use-case, resembling real model output.

Also revise the warning message in the README to be a bit friendlier.

Fixes #43

PiperOrigin-RevId: 471347485
copybara-service bot pushed a commit that referenced this issue Sep 1, 2022
There are two major internal changes:
1. Key objects from DatasetToChunks can now include different dimensions for different variables when using split_vars=True. This makes it easier to handle large datasets with many variables and different chunking per variable.
2. Inputs inside the DatasetToChunks pipeline can now be sharded across many tasks. This is important for scalability to large datasets, especially with this change, because the above refactor increases the number of inputs by the number of variables when split_vars=True. Otherwise, we can run into performance issues on the machine launching the pipeline when the number of inputs goes into the millions (e.g., slow speed, out of memory).

See the new integration test for a concrete use-case, resembling real model output.

Also revise the warning message in the README to be a bit friendlier.

Fixes #43

PiperOrigin-RevId: 471347485
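For context, a minimal sketch of the split_vars=True option in use (assuming the same ds and print_summary as in the example above):

import apache_beam as beam
import xarray_beam as xbeam

# With split_vars=True, each chunk carries a single variable; per this
# change, its Key's offsets cover only that variable's own dimensions.
with beam.Pipeline() as p:
    p | xbeam.DatasetToChunks(ds, chunks={'time': 1000}, split_vars=True) | beam.MapTuple(print_summary)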
shoyer reopened this Sep 3, 2022

shoyer commented Sep 3, 2022

One of my original motivations for this is obviated by #50, which now allows DatasetToChunks to handle variables even if they don't include any chunked dimensions.

It's still an open question whether this change would make Xarray-Beam more usable or not.

If we do not make this change, we could potentially enforce the invariant that key.offsets.keys() == dataset.dims.keys(), which might be convenient when writing new transforms.
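A minimal sketch of what enforcing that invariant might look like inside a transform (hypothetical helper, not part of Xarray-Beam; assumes key.offsets is a mapping from dimension name to integer offset):

def check_key_matches_dataset(key, dataset):
    # Hypothetical invariant: the key carries an offset for every
    # dimension of the dataset, and for nothing else.
    if set(key.offsets) != set(dataset.dims):
        raise ValueError(
            f'key offsets {sorted(key.offsets)} do not match '
            f'dataset dimensions {sorted(dataset.dims)}'
        )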

alxmrs pushed a commit to alxmrs/xarray-beam that referenced this issue Jan 25, 2023
There are two major internal changes:
1. Key objects from DatasetToChunks can now include different dimensions for different variables when using split_vars=True. This makes it easier to handle large datasets with many variables and different chunking per variable.
2. Inputs inside the DatasetToChunks pipeline can now be sharded across many tasks. This is important for scalability to large datasets, especially with this change, because the above refactor increases the number of inputs by the number of variables when split_vars=True. Otherwise, we can run into performance issues on the machine launching the pipeline when the number of inputs goes into the millions (e.g., slow speed, out of memory).

See the new integration test for a concrete use-case, resembling real model output.

Also revise the warning message in the README to be a bit friendlier.

Fixes google#43

PiperOrigin-RevId: 471948735