
Improved efficiency of weather-mv bq in terms of time and cost. #473

Merged: 10 commits merged into main on Sep 7, 2024

Conversation

@mahrsee1997 (Collaborator) commented on Sep 2, 2024:

Stats:

Tested on a single-file dataset (longitude: 1440; latitude: 721; level: 13; time: 1; number of data variables: 10):

```
<xarray.Dataset>
Dimensions:                  (longitude: 1440, latitude: 721, level: 13, time: 1)
Coordinates:
  * longitude                (longitude) float32 0.0 0.25 0.5 ... 359.5 359.8
  * latitude                 (latitude) float32 -90.0 -89.75 ... 89.75 90.0
  * level                    (level) int32 50 100 150 200 ... 700 850 925 1000
  * time                     (time) timedelta64[ns] 06:00:00
    datetime                 (time) datetime64[ns] ...
Data variables:
    10m_u_component_of_wind  (time, latitude, longitude) float32 ...
    10m_v_component_of_wind  (time, latitude, longitude) float32 ...
    2m_temperature           (time, latitude, longitude) float32 ...
    mean_sea_level_pressure  (time, latitude, longitude) float32 ...
    geopotential             (time, level, latitude, longitude) float32 ...
    specific_humidity        (time, level, latitude, longitude) float32 ...
    temperature              (time, level, latitude, longitude) float32 ...
    u_component_of_wind      (time, level, latitude, longitude) float32 ...
    v_component_of_wind      (time, level, latitude, longitude) float32 ...
    vertical_velocity        (time, level, latitude, longitude) float32 ...
```

Total number of rows to be ingested into BQ: 13,497,120 (1440 * 721 * 13 * 1)

Ran on DataflowRunner.
machine_type: n1-standard1; 4 GB RAM; 100 GB HDD.

| Branch | Time taken | Cost | Autoscaled max to |
| --- | --- | --- | --- |
| main | 48 min | $9.33 | 625 workers |
| mv-opitimization | 36 min | $0.10 | 6 workers |

Note: In our development project, we have 1000 workers (with no resource restrictions). However, in a real-world scenario, users might not have this many workers, so the time and cost with the main branch would have been significantly higher.

Approach:

  • We calculate latitude, longitude, geo_point, and geo_polygon information upfront and dump it to a Parquet file so that we do not need to recompute it every time we process a set of files (see the first sketch after this list).
  • Previously, we created indexes across all the index dimensions (e.g., lat, lon, time, level) and then selected rows from the dataset based on these coordinates. This resulted in a high number of I/O calls.
  • Now, we only create indexes across all the index dimensions except for latitude and longitude, thereby reducing the number of coordinates and, consequently, the number of I/O calls.
  • We use pandas DataFrame and its methods to generate rows instead of iterating over each row with a for loop.
  • Using --rows_chunk_size <chunk-size>, users can control the number of rows loaded into memory for processing, depending on their system's memory (see the second sketch, after the assumption below).
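
To make the first bullet concrete, here is a minimal sketch of precomputing the geo grid and caching it as Parquet. The function name `build_geo_grid`, the column names, and the output path are illustrative assumptions, not the PR's actual code:

```python
# A rough sketch of computing per-cell geography once and caching it as
# Parquet, so later ingestions of files on the same grid can reuse it.
# Names (build_geo_grid, the columns, the output path) are illustrative only.
import itertools
import json

import numpy as np
import pandas as pd


def build_geo_grid(latitudes, longitudes, lat_step, lon_step) -> pd.DataFrame:
    half_lat, half_lon = lat_step / 2, lon_step / 2
    rows = []
    for lat, lon in itertools.product(latitudes, longitudes):
        lat, lon = float(lat), float(lon)
        # GeoJSON point at the cell center.
        geo_point = json.dumps({'type': 'Point', 'coordinates': [lon, lat]})
        # GeoJSON polygon covering the grid cell (pole clamping and
        # antimeridian wrap-around are omitted for brevity).
        geo_polygon = json.dumps({'type': 'Polygon', 'coordinates': [[
            [lon - half_lon, lat - half_lat],
            [lon + half_lon, lat - half_lat],
            [lon + half_lon, lat + half_lat],
            [lon - half_lon, lat + half_lat],
            [lon - half_lon, lat - half_lat],
        ]]})
        rows.append((lat, lon, geo_point, geo_polygon))
    return pd.DataFrame(rows, columns=['latitude', 'longitude', 'geo_point', 'geo_polygon'])


# Computed once per grid (0.25 degrees here), then reused for every file.
geo_df = build_geo_grid(np.arange(-90, 90.25, 0.25), np.arange(0, 360, 0.25), 0.25, 0.25)
geo_df.to_parquet('geo_grid.parquet')  # in practice this would live in a bucket
```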

Assumption: At minimum, enough memory is available to load all the data variables for the full (lat × lon) grid at a single combination of the remaining indexed dimensions (everything apart from lat & lon) at once. I think we can make this assumption because, for a 0.1-degree resolution dataset (3600 × 1800) with 51 data variables, only about 9 GiB of RAM is required.
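
Roughly, the indexing and chunking described above could look like the following sketch. `generate_row_chunks` and its details are assumptions for illustration, not the PR's actual code:

```python
# A rough sketch of the chunked row generation: index only over the
# non-lat/lon dimensions, pull each (lat, lon) slab in one selection, and
# let pandas flatten it into rows.
import itertools

import pandas as pd
import xarray as xr


def generate_row_chunks(ds: xr.Dataset, geo_df: pd.DataFrame, rows_chunk_size: int):
    """Yield DataFrames of rows, one (lat, lon) slab at a time, in bounded chunks."""
    # Assumes every remaining dimension (e.g. time, level) has a coordinate.
    other_dims = [d for d in ds.dims if d not in ('latitude', 'longitude')]
    index_values = [ds[d].values for d in other_dims]
    for combo in itertools.product(*index_values):
        # One selection loads the whole (lat, lon) grid for all data variables,
        # instead of one I/O call per (lat, lon, time, level) coordinate.
        df = ds.sel(dict(zip(other_dims, combo))).to_dataframe().reset_index()
        # Attach the precomputed geo columns; the cached grid must come from
        # the same coordinates so the float join keys match exactly.
        df = df.merge(geo_df, on=['latitude', 'longitude'])
        # Hand rows downstream in memory-bounded chunks (cf. --rows_chunk_size).
        for start in range(0, len(df), rows_chunk_size):
            yield df.iloc[start:start + rows_chunk_size]


# Illustrative usage (paths are placeholders):
# ds = xr.open_dataset('era5_sample.nc')
# geo_df = pd.read_parquet('geo_grid.parquet')
# for chunk in generate_row_chunks(ds, geo_df, rows_chunk_size=1_000_000):
#     ...  # e.g., convert chunk.to_dict('records') and write to BigQuery
```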

P.S.: This draws on learnings from the ARCO-ERA5 to BQ ingestion.

Commits:

* Commented out 'gcloud alpha storage' for copying the file (not working on Dataflow).
* Fixed decompression logic while opening a file (was failing when opening a GCS bucket file).
* Added logic for handling errors if rasterio is unable to open the file.
* We calculate latitude, longitude, geo_point, and geo_polygon information upfront and dump it to a CSV file so that we do not need to process it every time we process a set of files.
* Previously, we created indexes across all the index dimensions (e.g., lat, lon, time, level) and then selected rows from the dataset based on these coordinates. This resulted in a high number of I/O calls.
* Now, we only create indexes across all the index dimensions except for latitude and longitude, thereby reducing the number of coordinates and, consequently, the number of I/O calls.
* We use pandas DataFrame and its methods to generate rows instead of iterating over each row with a for loop.
* Introduced a flag to control how many rows are loaded into memory for processing.
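
For reference, an illustrative invocation using the new chunk-size flag. The bucket, table, and flag values are placeholders; flag names other than --rows_chunk_size follow the pre-existing weather-mv CLI and should be double-checked against the docs:

```bash
weather-mv bq --uris "gs://your-bucket/*.nc" \
    --output_table "your-project.your_dataset.your_table" \
    --temp_location "gs://your-bucket/tmp" \
    --runner DataflowRunner \
    --rows_chunk_size 1000000
```
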
@mahrsee1997 marked this pull request as ready for review on September 3, 2024, 07:07
@alxmrs (Collaborator) left a comment:

I really like this new approach, it’s very smart. The stats look great, I’m impressed.

I think it may be nice to revisit vectorization later, but the optimizations in I/O will have a greater effect.

@mahrsee1997 requested a review from @alxmrs on September 4, 2024, 14:40
@alxmrs (Collaborator) commented on Sep 5, 2024:

Since I’m not employed at Google right now, I can’t in good conscience give this an approval. I think @fredzyda would be a better person to decide if this should be merged. I will say, this patch looks good to me.

@fredzyda (Collaborator) left a comment:

Looks good. Very nice improvement in runtime and cost.

@mahrsee1997 (Collaborator, Author) commented:

Thanks @alxmrs and @fredzyda for the review!

@mahrsee1997 merged commit 862d003 into main on Sep 7, 2024
7 checks passed