Refactor Python Concatenation Scripts + Add Dask Templates #347
Conversation
I really like these scripts; I think they're going to be very useful. I just have a few comments and questions. Other than that, I think this is all great.
Agreed, this is a great refactor. Along the lines of Helena's comments, could you update the wiki documentation to be a bit clearer about how to use these? For example, it isn't clear from the example code in the wiki whether the default behavior is to automatically loop through all the slices for a given snapshot in a directory (i.e., is the code somehow checking for the number of processes, or is that an argument?), or whether your comment there referred to looping through all snapshots.
@helenarichie, I'll work on unifying the API and structure between the two. Good catch; I did them at separate times and didn't realize how different they are. @evaneschneider and @helenarichie, I agree that the naming of the files is a bit confusing; it's something I found confusing when I first saw them. We currently have six unique scripts for concatenating data files, in varying states and without consistency. I think I'll open an issue for unifying them all under a more common structure and API, but I have a few questions:
@helenarichie, I think I've addressed your comments except for the naming. Could you look over it again?
@evaneschneider, regarding the wiki: are you talking about the section at the end of the "Outputs" page? I didn't write that; I think Alwin did. His code in cat.py…
I wrote cat.py and that section of the wiki so that each solo function detects the number of processes without an argument: the code figures out the number of processes on its own. The while loop in the second section goes through all snapshots (i.e., output indices *.h5), assuming the output interval for that data type is 1.
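For anyone following along, here's a minimal sketch of the kind of auto-detection described above. It assumes Cholla's per-rank naming convention (`<snapshot>.h5.<rank>`); the function name and directory layout are hypothetical illustrations, not the actual cat.py code.

```python
# Hypothetical sketch of auto-detecting the number of MPI processes from
# the per-rank output files, assuming names like "0.h5.0", "0.h5.1", ...
import glob
import os

def detect_num_processes(directory, snapshot=0):
    """Count the per-rank files for one snapshot; that count equals the
    number of processes the simulation was run with."""
    pattern = os.path.join(directory, f"{snapshot}.h5.*")
    return len(glob.glob(pattern))
```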
That sounds fine to me. I'll merge this in once the documentation has been updated.
These changes all look great to me!
Adds a CLI to cat_slice.py and removes all the hardcoded variables. Also adds a new internal function, `cat_slice`, that can be imported into other scripts and used from there, including in parallel with Dask.
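As a rough illustration of that import-and-parallelize workflow (a sketch only; the exact `cat_slice` keyword arguments here are assumptions, not the real signature):

```python
# Sketch: concatenate many output times in parallel with Dask. The
# arguments passed to cat_slice are illustrative guesses.
import dask
from cat_slice import cat_slice  # the new importable function

tasks = [
    dask.delayed(cat_slice)(output_number=n,
                            source_directory="raw/",
                            output_directory="cat/")
    for n in range(100)  # one task per output time
]
dask.compute(*tasks)  # run the tasks in parallel
```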
One template for using Dask on a single machine and one for use on a distributed system, specifically the OLCF systems Andes, Crusher, and Frontier.
The two scripts now have nearly identical CLIs and structure.
Delete cat_projection.py, as it has been superseded by concat_2d_data.py.
Also removed cat_rotated_projection.py, as it is now superseded by the enhanced functionality of concat_2d_data.py.
All the functionality of cat.py is now available in concat_2d_data.py and concat_3d_data.py. Marked the concatenation files as executable.
This is ready to review. There have been some significant updates since the last review, so, if you have time, I'd like @helenarichie to review it again. A brief summary of the changes is below.
This all looks fantastic! I noticed that part of a docstring might be incomplete, but otherwise this is ready to go. I took a look at the documentation as well, and I think it's clear and has everything it needs.
`cat_slice.py`, `cat_projection.py`, and `cat_rotated_projection.py` Refactor
In a similar vein to my refactor of `cat_dset_3d.py` a while ago, I refactored these three scripts to be easier to use without editing and to allow people to import them into their own scripts. They're now a single file that works for all three output types. I added a CLI and an internal function, `concat_2d_dataset`, that concatenates a single output time; this means the function can easily be called in parallel with a tool like Dask when concatenating many slices. I also eliminated the intermediate arrays so that we won't run out of memory when concatenating. There are also additional options for compression, what datatype to save as, etc.
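To make the compression and datatype options concrete, here's a hedged sketch of the kind of HDF5 write they enable; the function and argument names are illustrative, not the script's actual interface.

```python
# Sketch: writing a concatenated dataset with optional compression and a
# chosen on-disk datatype via h5py. Names here are illustrative only.
import h5py
import numpy as np

def write_field(outfile, name, data, compression="gzip", dtype=np.float32):
    # Cast to the requested on-disk datatype, then compress on write.
    with h5py.File(outfile, "a") as f:
        f.create_dataset(name, data=data.astype(dtype),
                         compression=compression)
```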
`cat_dset_3d.py` Refactor
A minor refactor here. Moved the code for concatenating a single output time into its own function (`concat_3d_dataset`) so that it can be called in parallel with Dask like the 2D version. Also added arguments for compression, what datatype to save as, and which fields to skip.
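For illustration, a direct call might look like the following; every keyword name here (and the example field name) is an assumption based on the options described above, not the verified signature.

```python
# Hedged sketch: calling the refactored 3D concatenation function directly.
# The keyword names and the "GasEnergy" field are assumptions.
from cat_dset_3d import concat_3d_dataset

concat_3d_dataset(output_number=10,          # which output time to concatenate
                  source_directory="raw/",   # per-rank files live here
                  output_directory="cat/",   # concatenated file goes here
                  compression="gzip",        # optional HDF5 compression
                  dtype="float32",           # on-disk datatype
                  skip_fields=["GasEnergy"]) # fields to leave out
```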
`cat_particles.py` Refactor
Another refactor to bring the particle concatenation script in line with the 2D and 3D concatenation scripts.
Dask Templates
Added two templates for using Dask: one for running on a single machine and one for a distributed machine. The latter also includes a Slurm script that works on Andes and Crusher, and presumably Frontier as well, though I haven't tested that. If you want an example of Dask in practice on Cholla data, see the `dask-local-runner.py` and `dask-andes-runner.py` scripts in my JSI Talk repo.
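For a rough sense of what the single-machine template sets up, here is a sketch assuming a standard dask.distributed configuration; the worker counts and the submitted work are placeholders, and the distributed template would instead connect a Client to a scheduler launched by the Slurm script.

```python
# Sketch of a single-machine Dask setup along the lines of the local
# template; all numbers here are placeholders.
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=4, threads_per_worker=1)
    client = Client(cluster)
    # Submit per-snapshot concatenation work here, e.g. with
    # client.map over the output numbers, then gather the results.
    client.close()
    cluster.close()
```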