Refactor Python Concatenation Scripts + Add Dask Templates #347
Conversation
I really like these scripts; I think they're going to be very useful. I just have a few comments and questions. Other than that, I think this is all great.
Agreed, this is a great refactor. Along the lines of Helena's comments, could you update the wiki documentation to be a bit clearer about how to use these? For example, it isn't clear from the example code in the wiki whether the default behavior is to automatically loop through all the slices for a given snapshot in a directory (i.e., is the code somehow checking for the number of processes, or is that an argument?), or whether your comment there referred to looping through all snapshots.
@helenarichie, I'll work on unifying the API and structure between the two. Good catch; I did them at separate times and didn't realize how different they are. @evaneschneider and @helenarichie, I agree that the naming of the files is a bit confusing; it's something I found confusing when I first saw them. We currently have six unique scripts for concatenating data files, in varying states and without consistency. I think I'll open an issue for unifying them all under a more common structure and API, but I have a few questions:
@helenarichie, I think I've addressed your comments except for the naming. Could you look over it again?
@evaneschneider, regarding the wiki: are you talking about the section at the end of the "Outputs" page? I didn't write that; I think Alwin did. His code in cat.py…
I wrote cat.py and that section of the wiki so that each solo function detects the number of processes without an argument: the code figures out the number of processes on its own. The while loop in the second section goes through all snapshots (i.e., output indices *.h5), assuming the output interval for that data type is 1.
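For anyone following along, here's a minimal sketch of the kind of auto-detection described above. It assumes Cholla's per-rank naming convention (`<snapshot>.h5.<rank>`); the function name and directory layout are hypothetical illustrations, not the actual cat.py code.

```python
# Hypothetical sketch of auto-detecting the number of MPI processes from
# the per-rank output files, assuming names like "0.h5.0", "0.h5.1", ...
import glob
import os

def detect_num_processes(directory, snapshot=0):
    """Count the per-rank files for one snapshot; that count equals the
    number of processes the simulation was run with."""
    pattern = os.path.join(directory, f"{snapshot}.h5.*")
    return len(glob.glob(pattern))
```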
That sounds fine to me. I'll merge this in once the documentation has been updated.
These changes all look great to me!
Adds a CLI to cat_slice.py and removes all the hardcoded variables. Also adds a new internal function, `cat_slice`, that can be imported into other scripts and used from there, including in parallel with Dask.
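As a rough illustration of that import-and-parallelize workflow (a sketch only; the exact `cat_slice` keyword arguments here are assumptions, not the real signature):

```python
# Sketch: concatenate many output times in parallel with Dask. The
# arguments passed to cat_slice are illustrative guesses.
import dask
from cat_slice import cat_slice  # the new importable function

tasks = [
    dask.delayed(cat_slice)(output_number=n,
                            source_directory="raw/",
                            output_directory="cat/")
    for n in range(100)  # one task per output time
]
dask.compute(*tasks)  # run the tasks in parallel
```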
One template for using Dask on a single machine and one for use on a distributed system, specifically the OLCF systems Andes, Crusher, and Frontier.
The two scripts now have nearly identical CLIs and structure.
Delete cat_projection.py, as it has been superseded by concat_2d_data.py.
Also removed cat_rotated_projection.py, as it is now superseded by the enhanced functionality of concat_2d_data.py.
All the functionality of cat.py is now available in concat_2d_data.py and concat_3d_data.py. Marked the concatenation files as executable.
This is ready to review. There have been some significant updates since the last review, so, if you have time, I'd like @helenarichie to review it again. A brief summary of the changes is below.
This all looks fantastic! I noticed that part of a docstring might be incomplete, but otherwise this is ready to go. I took a look at the documentation as well, and I think it's clear and has everything it needs.
`cat_slice.py`, `cat_projection.py`, and `cat_rotated_projection.py` Refactor
In a similar vein to my refactor of `cat_dset_3d.py` a while ago, I refactored these three scripts to be easier to use without editing and to allow people to import them into their own scripts. They're now a single file that works for all three output types. I added a CLI and an internal function, `concat_2d_dataset`, that concatenates a single output time; this means the function can easily be called in parallel with a tool like Dask when concatenating many slices. I also eliminated the intermediate arrays so that we won't run out of memory when concatenating. There are also additional options for compression, what datatype to save as, etc.
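To make the compression and datatype options concrete, here's a hedged sketch of the kind of HDF5 write they enable; the function and argument names are illustrative, not the script's actual interface.

```python
# Sketch: writing a concatenated dataset with optional compression and a
# chosen on-disk datatype via h5py. Names here are illustrative only.
import h5py
import numpy as np

def write_field(outfile, name, data, compression="gzip", dtype=np.float32):
    # Cast to the requested on-disk datatype, then compress on write.
    with h5py.File(outfile, "a") as f:
        f.create_dataset(name, data=data.astype(dtype),
                         compression=compression)
```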
`cat_dset_3d.py` Refactor
A minor refactor here. Moved the code for concatenating a single output time into its own function (`concat_3d_dataset`) so that it can be called in parallel with Dask like the 2D version. Also added arguments for compression, what datatype to save as, and which fields to skip.
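For illustration, a direct call might look like the following; every keyword name here (and the example field name) is an assumption based on the options described above, not the verified signature.

```python
# Hedged sketch: calling the refactored 3D concatenation function directly.
# The keyword names and the "GasEnergy" field are assumptions.
from cat_dset_3d import concat_3d_dataset

concat_3d_dataset(output_number=10,          # which output time to concatenate
                  source_directory="raw/",   # per-rank files live here
                  output_directory="cat/",   # concatenated file goes here
                  compression="gzip",        # optional HDF5 compression
                  dtype="float32",           # on-disk datatype
                  skip_fields=["GasEnergy"]) # fields to leave out
```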
`cat_particles.py` Refactor
Another refactor to bring the particle concatenation script in line with the 2D and 3D concatenation scripts.
Dask Templates
Added two templates for using Dask: one for running on a single machine and one for a distributed machine. The latter also includes a Slurm script that works on Andes and Crusher, and presumably Frontier as well, though I haven't tested that. If you want an example of Dask in practice on Cholla data, see the `dask-local-runner.py` and `dask-andes-runner.py` scripts in my JSI Talk repo.
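For a rough sense of what the single-machine template sets up, here is a sketch assuming a standard dask.distributed configuration; the worker counts and the submitted work are placeholders, and the distributed template would instead connect a Client to a scheduler launched by the Slurm script.

```python
# Sketch of a single-machine Dask setup along the lines of the local
# template; all numbers here are placeholders.
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=4, threads_per_worker=1)
    client = Client(cluster)
    # Submit per-snapshot concatenation work here, e.g. with
    # client.map over the output numbers, then gather the results.
    client.close()
    cluster.close()
```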