
Improved efficiency of weather-mv bq in terms of time and cost. #473

Merged: 10 commits merged into main on Sep 7, 2024

Conversation

@mahrsee1997 (Collaborator) commented on Sep 2, 2024:

Stats:

Tested on a single-file dataset (longitude: 1440; latitude: 721; level: 13; time: 1; number of data variables: 10):

```
<xarray.Dataset>
Dimensions:                  (longitude: 1440, latitude: 721, level: 13, time: 1)
Coordinates:
  * longitude                (longitude) float32 0.0 0.25 0.5 ... 359.5 359.8
  * latitude                 (latitude) float32 -90.0 -89.75 ... 89.75 90.0
  * level                    (level) int32 50 100 150 200 ... 700 850 925 1000
  * time                     (time) timedelta64[ns] 06:00:00
    datetime                 (time) datetime64[ns] ...
Data variables:
    10m_u_component_of_wind  (time, latitude, longitude) float32 ...
    10m_v_component_of_wind  (time, latitude, longitude) float32 ...
    2m_temperature           (time, latitude, longitude) float32 ...
    mean_sea_level_pressure  (time, latitude, longitude) float32 ...
    geopotential             (time, level, latitude, longitude) float32 ...
    specific_humidity        (time, level, latitude, longitude) float32 ...
    temperature              (time, level, latitude, longitude) float32 ...
    u_component_of_wind      (time, level, latitude, longitude) float32 ...
    v_component_of_wind      (time, level, latitude, longitude) float32 ...
    vertical_velocity        (time, level, latitude, longitude) float32 ...
```

Total number of rows to be ingested into BQ: 13,497,120 (1440 * 721 * 13 * 1)

Ran on DataflowRunner.
machine_type: n1-standard1; 4 GB RAM; 100 GB HDD.

| Branch | Time taken | Cost | Autoscaled max to |
| --- | --- | --- | --- |
| main | 48 min | $9.33 | 625 workers |
| mv-opitimization | 36 min | $0.10 | 6 workers |

Note: In our development project, we have 1000 workers (with no resource restrictions). However, in a real-world scenario, users might not have this many workers, so the time and cost with the main branch would have been significantly higher.

Approach:

  • We calculate latitude, longitude, geo_point, and geo_polygon information upfront and dump it to a Parquet file so that we do not need to recompute it every time we process a set of files (see the first sketch after this list).
  • Previously, we created indexes across all the index dimensions (e.g., lat, lon, time, level) and then selected rows from the dataset based on these coordinates. This resulted in a high number of I/O calls.
  • Now, we only create indexes across all the index dimensions except for latitude and longitude, thereby reducing the number of coordinates and, consequently, the number of I/O calls.
  • We use pandas DataFrame and its methods to generate rows instead of iterating over each row with a for loop.
  • Using --rows_chunk_size <chunk-size>, users can control the number of rows loaded into memory for processing, depending on their system's memory (see the second sketch, after the assumption below).
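
To make the first bullet concrete, here is a minimal sketch of precomputing the geo grid and caching it as Parquet. The function name `build_geo_grid`, the column names, and the output path are illustrative assumptions, not the PR's actual code:

```python
# A rough sketch of computing per-cell geography once and caching it as
# Parquet, so later ingestions of files on the same grid can reuse it.
# Names (build_geo_grid, the columns, the output path) are illustrative only.
import itertools
import json

import numpy as np
import pandas as pd


def build_geo_grid(latitudes, longitudes, lat_step, lon_step) -> pd.DataFrame:
    half_lat, half_lon = lat_step / 2, lon_step / 2
    rows = []
    for lat, lon in itertools.product(latitudes, longitudes):
        lat, lon = float(lat), float(lon)
        # GeoJSON point at the cell center.
        geo_point = json.dumps({'type': 'Point', 'coordinates': [lon, lat]})
        # GeoJSON polygon covering the grid cell (pole clamping and
        # antimeridian wrap-around are omitted for brevity).
        geo_polygon = json.dumps({'type': 'Polygon', 'coordinates': [[
            [lon - half_lon, lat - half_lat],
            [lon + half_lon, lat - half_lat],
            [lon + half_lon, lat + half_lat],
            [lon - half_lon, lat + half_lat],
            [lon - half_lon, lat - half_lat],
        ]]})
        rows.append((lat, lon, geo_point, geo_polygon))
    return pd.DataFrame(rows, columns=['latitude', 'longitude', 'geo_point', 'geo_polygon'])


# Computed once per grid (0.25 degrees here), then reused for every file.
geo_df = build_geo_grid(np.arange(-90, 90.25, 0.25), np.arange(0, 360, 0.25), 0.25, 0.25)
geo_df.to_parquet('geo_grid.parquet')  # in practice this would live in a bucket
```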

Assumption: At minimum, enough memory is available to load all the data variables for the full (lat × lon) grid at a single combination of the remaining indexed dimensions (everything apart from lat & lon) at once. I think we can make this assumption because, for a 0.1-degree resolution dataset (3600 × 1800) with 51 data variables, only about 9 GiB of RAM is required.
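
Roughly, the indexing and chunking described above could look like the following sketch. `generate_row_chunks` and its details are assumptions for illustration, not the PR's actual code:

```python
# A rough sketch of the chunked row generation: index only over the
# non-lat/lon dimensions, pull each (lat, lon) slab in one selection, and
# let pandas flatten it into rows.
import itertools

import pandas as pd
import xarray as xr


def generate_row_chunks(ds: xr.Dataset, geo_df: pd.DataFrame, rows_chunk_size: int):
    """Yield DataFrames of rows, one (lat, lon) slab at a time, in bounded chunks."""
    # Assumes every remaining dimension (e.g. time, level) has a coordinate.
    other_dims = [d for d in ds.dims if d not in ('latitude', 'longitude')]
    index_values = [ds[d].values for d in other_dims]
    for combo in itertools.product(*index_values):
        # One selection loads the whole (lat, lon) grid for all data variables,
        # instead of one I/O call per (lat, lon, time, level) coordinate.
        df = ds.sel(dict(zip(other_dims, combo))).to_dataframe().reset_index()
        # Attach the precomputed geo columns; the cached grid must come from
        # the same coordinates so the float join keys match exactly.
        df = df.merge(geo_df, on=['latitude', 'longitude'])
        # Hand rows downstream in memory-bounded chunks (cf. --rows_chunk_size).
        for start in range(0, len(df), rows_chunk_size):
            yield df.iloc[start:start + rows_chunk_size]


# Illustrative usage (paths are placeholders):
# ds = xr.open_dataset('era5_sample.nc')
# geo_df = pd.read_parquet('geo_grid.parquet')
# for chunk in generate_row_chunks(ds, geo_df, rows_chunk_size=1_000_000):
#     ...  # e.g., convert chunk.to_dict('records') and write to BigQuery
```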

P.S.: This draws on learnings from the ARCO-ERA5 to BQ ingestion.

Commits:

* Commented out 'gcloud alpha storage' for copying the file (not working on Dataflow).
* Fixed decompression logic while opening a file (was failing when opening a GCS bucket file).
* Added logic for handling errors if rasterio is unable to open the file.
* We calculate latitude, longitude, geo_point, and geo_polygon information upfront and dump it to a CSV file so that we do not need to process it every time we process a set of files.
* Previously, we created indexes across all the index dimensions (e.g., lat, lon, time, level) and then selected rows from the dataset based on these coordinates. This resulted in a high number of I/O calls.
* Now, we only create indexes across all the index dimensions except for latitude and longitude, thereby reducing the number of coordinates and, consequently, the number of I/O calls.
* We use pandas DataFrame and its methods to generate rows instead of iterating over each row with a for loop.
* Introduced a flag to control how many rows are loaded into memory for processing.
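
For reference, an illustrative invocation using the new chunk-size flag. The bucket, table, and flag values are placeholders; flag names other than --rows_chunk_size follow the pre-existing weather-mv CLI and should be double-checked against the docs:

```bash
weather-mv bq --uris "gs://your-bucket/*.nc" \
    --output_table "your-project.your_dataset.your_table" \
    --temp_location "gs://your-bucket/tmp" \
    --runner DataflowRunner \
    --rows_chunk_size 1000000
```
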
@mahrsee1997 marked this pull request as ready for review on September 3, 2024, 07:07
@alxmrs (Collaborator) left a comment:

I really like this new approach, it’s very smart. The stats look great, I’m impressed.

I think it may be nice to revisit vectorization later, but the optimizations in I/O will have a greater effect.

@mahrsee1997 requested a review from @alxmrs on September 4, 2024, 14:40
@alxmrs (Collaborator) commented on Sep 5, 2024:

Since I’m not employed at Google right now, I can’t in good conscience give this an approval. I think @fredzyda would be a better person to decide if this should be merged. I will say, this patch looks good to me.

@fredzyda (Collaborator) left a comment:

Looks good. Very nice improvement in runtime and cost.

@mahrsee1997 (Collaborator, Author) commented:

Thanks @alxmrs and @fredzyda for the review!

@mahrsee1997 merged commit 862d003 into main on Sep 7, 2024
7 checks passed