Improved efficiency of weather-mv bq in terms of time and cost. #473
Conversation
* Commented out 'gcloud alpha storage' for copying the file (not working on Dataflow).
* Fixed decompression logic while opening a file (it was failing when opening a GCS bucket file).
* Added logic for handling errors when rasterio is unable to open the file (see the sketch below).
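A minimal sketch of how the decompression check and the rasterio error handling described above could look. This is an illustration, not the actual weather-mv code: `maybe_decompress` and `open_with_rasterio` are hypothetical helper names.

```python
import gzip
import logging
import shutil
import tempfile

import rasterio
from rasterio.errors import RasterioIOError

logger = logging.getLogger(__name__)

GZIP_MAGIC = b'\x1f\x8b'


def maybe_decompress(local_path: str) -> str:
    """Return a readable path, gunzipping to a temp file first if the file is gzip-compressed."""
    with open(local_path, 'rb') as f:
        is_gzipped = f.read(2) == GZIP_MAGIC
    if not is_gzipped:
        return local_path
    uncompressed = tempfile.NamedTemporaryFile(delete=False).name
    with gzip.open(local_path, 'rb') as src, open(uncompressed, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return uncompressed


def open_with_rasterio(local_path: str):
    """Open a file with rasterio; log and return None instead of crashing the worker."""
    try:
        return rasterio.open(maybe_decompress(local_path))
    except RasterioIOError as e:
        logger.error('rasterio could not open %s: %s', local_path, e)
        return None
```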
* We calculate latitude, longitude, geo_point, and geo_polygon information upfront and dump it to a CSV file, so we do not need to recompute it every time we process a set of files.
* Previously, we created indexes across all index dimensions (e.g., lat, lon, time, level) and then selected rows from the dataset based on these coordinates, which resulted in a high number of I/O calls.
* Now, we only create indexes across the index dimensions other than latitude and longitude, reducing the number of coordinates and, consequently, the number of I/O calls.
* We use a pandas DataFrame and its methods to generate rows instead of iterating over each row with a for loop (see the sketch after this list).
* Introduced a flag to control how many rows are loaded into memory for processing.
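To illustrate the idea, here is a hedged sketch (not the actual weather-mv implementation) of precomputing the per-grid-point geo columns once and then attaching them to each (time, level) slice with vectorized pandas operations instead of a per-row loop. The function names, column names (`latitude`, `longitude`, `time`, `level`, `geo_point`), and dimension layout are assumptions for this example.

```python
import json

import numpy as np
import pandas as pd
import xarray as xr


def build_geo_table(ds: xr.Dataset) -> pd.DataFrame:
    """Compute lat/lon and GeoJSON point columns once; this table can be cached to CSV."""
    lats = ds['latitude'].values
    lons = ds['longitude'].values
    lat_grid, lon_grid = np.meshgrid(lats, lons, indexing='ij')
    geo = pd.DataFrame({
        'latitude': lat_grid.ravel(),
        'longitude': lon_grid.ravel(),
    })
    geo['geo_point'] = [
        json.dumps({'type': 'Point', 'coordinates': [lon, lat]})
        for lat, lon in zip(geo['latitude'], geo['longitude'])
    ]
    return geo


def rows_for_slice(ds: xr.Dataset, geo: pd.DataFrame, time, level) -> pd.DataFrame:
    """Select a single (time, level) slice and attach the precomputed geo columns."""
    sliced = ds.sel(time=time, level=level)    # only the non-lat/lon dimensions are indexed
    df = sliced.to_dataframe().reset_index()   # vectorized row generation, no Python for loop
    # Merge on the exact lat/lon values coming from the same coordinate arrays.
    return df.merge(geo, on=['latitude', 'longitude'])
```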
I really like this new approach, it’s very smart. The stats look great, I’m impressed.
I think it may be nice to revisit vectorization later, but the optimizations in I/O will have a greater effect.
Since I’m not employed at Google right now, I can’t in good conscience give this an approval. I think @fredzyda would be a better person to decide if this should be merged. I will say, this patch looks good to me.
Looks good. Very nice improvement in runtime and cost.
Stats:
Tested on a single-file dataset (longitude: 1440; latitude: 721; level: 13; time: 1; data variables: 10):
Total number of rows to be ingested into BQ: 13,497,120 (1440 × 721 × 13 × 1)
Ran on DataflowRunner.
machine_type: n1-standard-1; 4 GB RAM; 100 GB HDD.
Note: In our development project, we have 1000 workers (with no resource restrictions). However, in a real-world scenario, users might not have this many workers, so the time and cost with the main branch would have been significantly higher.
Approach:
Using the `--rows_chunk_size <chunk-size>` flag, users can control the number of rows loaded into memory for processing, depending on their system's memory.
Assumption: at least enough memory is available to load all the data variables for a full (lat × lon) slice, at a single value of each remaining indexed dimension (apart from lat & lon), at once. I think this is a safe assumption because, for a 0.1-resolution dataset (3600 × 1800) with 51 data variables, only 9 GiB of RAM is required.
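A minimal sketch of how such a chunk-size limit could be applied, assuming a helper like the hypothetical `rows_for_slice` from the earlier sketch; `iter_row_chunks` and `load_chunk_to_bigquery` are illustrative names, not the actual flag's implementation.

```python
import pandas as pd


def iter_row_chunks(df: pd.DataFrame, rows_chunk_size: int):
    """Yield successive chunks of at most `rows_chunk_size` rows,
    so only one chunk's worth of rows is materialized at a time."""
    for start in range(0, len(df), rows_chunk_size):
        yield df.iloc[start:start + rows_chunk_size]


# Example usage (hypothetical loader):
# slice_df = rows_for_slice(ds, geo, time=t, level=lvl)
# for chunk in iter_row_chunks(slice_df, rows_chunk_size=1_000_000):
#     load_chunk_to_bigquery(chunk)
```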
P.S.: This draws on learnings from the ARCO-ERA5 to BQ ingestion.