
Skip writing ts metadata since it causes timeout #464

Merged

rajeee merged 3 commits into develop from skip_dask_metadata on Aug 13, 2024
Conversation

@rajeee (Contributor) commented on Aug 13, 2024

Pull Request Description

Dask parquet metadata files are auxiliary files that can help dask speed up read operations on a folder containing a large number of files. However, postprocessing has been crashing when trying to write this file. This PR removes that step so that postprocessing can complete successfully.
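For context, a minimal sketch of the step being removed, using a tiny synthetic dataset in place of the real timeseries output (the `ts_parquet` path and toy dataframe are illustrative, not BuildStockBatch's actual paths): dask's `create_metadata_file` reads every file's parquet footer and aggregates them into a single `_metadata` file at the dataset root, and that aggregation is what overflows at ~70k files.

```python
# Sketch of the removed step, on a tiny local dataset ("ts_parquet" is an
# illustrative path). create_metadata_file collects each file's parquet
# footer and writes one aggregated _metadata file at the dataset root.
import glob

import pandas as pd
import dask.dataframe as dd
from dask.dataframe.io.parquet import create_metadata_file

# Write a small partitioned parquet dataset with no _metadata file.
df = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=4)
df.to_parquet("ts_parquet", write_metadata_file=False)

# The call made by postprocessing.write_metadata_files (per the traceback
# below). At ~70k files, the aggregated footer overflows thrift's internal
# buffer, which is the OSError this PR avoids by skipping the step.
create_metadata_file(
    sorted(glob.glob("ts_parquet/*.parquet")),
    root_dir="ts_parquet",
    engine="pyarrow",
)
```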

Example of crash:

INFO:2024-08-13 08:39:11:buildstockbatch.postprocessing:Gathered 70441 files. Now writing _metadata
/kfs2/shared-projects/buildstock/envs/dev-pwhite-20240808/lib/python3.11/site-packages/distributed/client.py:3362: UserWarning: Sending large graph of size 16.78 MiB.
This may cause some slowdown.
Consider loading the data with Dask directly
or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.
 warnings.warn(
Traceback (most recent call last):
 File "<frozen runpy>", line 198, in _run_module_as_main
 File "<frozen runpy>", line 88, in _run_code
 File "/kfs2/shared-projects/buildstock/envs/dev-pwhite-20240808/lib/python3.11/site-packages/buildstockbatch/hpc.py", line 978, in <module>
   main()
 File "/kfs2/shared-projects/buildstock/envs/dev-pwhite-20240808/lib/python3.11/site-packages/buildstockbatch/utils.py", line 120, in run_with_error_capture
   return func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
 File "/kfs2/shared-projects/buildstock/envs/dev-pwhite-20240808/lib/python3.11/site-packages/buildstockbatch/hpc.py", line 962, in main
   batch.process_results()
 File "/kfs2/shared-projects/buildstock/envs/dev-pwhite-20240808/lib/python3.11/site-packages/buildstockbatch/hpc.py", line 670, in process_results
   super().process_results(*args, **kwargs)
 File "/kfs2/shared-projects/buildstock/envs/dev-pwhite-20240808/lib/python3.11/site-packages/buildstockbatch/base.py", line 954, in process_results
   postprocessing.combine_results(fs, self.results_dir, self.cfg, do_timeseries=do_timeseries)
 File "/kfs2/shared-projects/buildstock/envs/dev-pwhite-20240808/lib/python3.11/site-packages/buildstockbatch/postprocessing.py", line 613, in combine_results
   write_metadata_files(fs, ts_dir, partition_columns)
 File "/kfs2/shared-projects/buildstock/envs/dev-pwhite-20240808/lib/python3.11/site-packages/buildstockbatch/postprocessing.py", line 392, in write_metadata_files
   create_metadata_file(concat_files, root_dir=parquet_root_dir, engine="pyarrow", fs=fs)
 File "/kfs2/shared-projects/buildstock/envs/dev-pwhite-20240808/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py", line 1193, in create_metadata_file
   out = out.compute(**compute_kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/kfs2/shared-projects/buildstock/envs/dev-pwhite-20240808/lib/python3.11/site-packages/dask/base.py", line 376, in compute
   (result,) = compute(self, traverse=False, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/kfs2/shared-projects/buildstock/envs/dev-pwhite-20240808/lib/python3.11/site-packages/dask/base.py", line 662, in compute
   results = schedule(dsk, keys, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/kfs2/shared-projects/buildstock/envs/dev-pwhite-20240808/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py", line 1912, in aggregate_metadata
    meta.write_metadata_file(fil)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "pyarrow/_parquet.pyx", line 1073, in pyarrow._parquet.FileMetaData.write_metadata_file
 File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Couldn't serialize thrift: Internal buffer size overflow when requesting a buffer of size 4294967329
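
Since `_metadata` is only a read-time optimization, the timeseries folder remains readable after this change: dask lists the parquet files itself and infers the schema from a file footer. A hedged sketch of reading such a folder (same illustrative `ts_parquet` path as above; `ignore_metadata_file` additionally guards against a stale `_metadata` left behind by an earlier run):

```python
# Reading works without _metadata: dask discovers the parquet files on its
# own and infers the schema from a footer. ignore_metadata_file also skips
# any stale _metadata left over from a run that predates this fix.
import dask.dataframe as dd

ts = dd.read_parquet(
    "ts_parquet",              # illustrative path from the sketch above
    engine="pyarrow",
    ignore_metadata_file=True,
)
print(ts.npartitions)
```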

Checklist

Not all may apply

  • Code changes (must work)
  • Tests exercising your feature/bug fix (check coverage report on Checks -> BuildStockBatch Tests -> Artifacts)
  • Coverage has increased or at least not decreased. Update minimum_coverage in .github/workflows/coverage.yml as necessary.
  • All other unit and integration tests passing
  • Update validation for project config yaml file changes
  • Update existing documentation
  • Run a small batch run on Kestrel/Eagle to make sure it all works if you made changes that will affect Kestrel/Eagle
  • Add to the changelog_dev.rst file and propose migration text in the pull request

@rajeee requested a review from nmerket on August 13, 2024 at 19:14
@rajeee merged commit fe1ba37 into develop on Aug 13, 2024
6 checks passed
@rajeee deleted the skip_dask_metadata branch on August 13, 2024 at 20:44