
Ergonomic improvements to concatenation scripts #403

Merged
merged 4 commits into from
Jul 8, 2024

Conversation

mabruzzo
Collaborator

@mabruzzo mabruzzo commented Jul 3, 2024

This introduces a few changes to the concatenation scripts. These are all surface-level changes (they don't affect the output format -- that is still something I want to address).

There are essentially 3 changes:

  1. First, I introduced a new script called concat.py that can be used to concatenate more than one kind of dataset at a time (3D cubes or multiple kinds of 2D datasets).
  2. Next, I replaced --concat-outputs with --snaps (this is a subjective improvement).
  • The main difference has to do with specifying ranges of snapshots: with --snaps, a range has the form START:STOP[:STEP] (just like a Python slice). For comparison, with --concat-outputs you would specify a range as FIRST-LAST (LAST is included in the range).
  • You can still use --concat-outputs, but you get a deprecation warning.
  • I definitely want to hear if people dislike this approach.
  3. Finally, I made it possible to automatically detect the number of files that need to be concatenated. You can still specify --num-processes, but you get a warning about it.
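For readers unfamiliar with the slice-style syntax, here is a minimal sketch of how a START:STOP[:STEP] spec could be parsed into a range of snapshot numbers. This is an illustrative helper, not the actual implementation in concat.py; the function name and exact semantics (STOP exclusive, like a Python slice) are assumptions based on the description above.

```python
def parse_snaps(spec: str) -> range:
    """Parse a --snaps spec like "0:101:5" into a range of snapshot numbers.

    Accepts a single number ("7"), START:STOP, or START:STOP:STEP.
    STOP is exclusive, mirroring Python slice semantics.
    """
    parts = spec.split(":")
    if len(parts) == 1:
        # a single snapshot number, e.g. "7"
        n = int(parts[0])
        return range(n, n + 1)
    if len(parts) in (2, 3):
        start, stop = int(parts[0]), int(parts[1])
        step = int(parts[2]) if len(parts) == 3 else 1
        return range(start, stop, step)
    raise ValueError(f"invalid snapshot spec: {spec!r}")
```

For example, `parse_snaps("0:10:3")` yields snapshots 0, 3, 6, 9, whereas the old FIRST-LAST form (e.g. `0-9`) would include the endpoint.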

I also haven't really touched the particle-concatenation script. Based on how code is shared, change number 2 will affect that script. But propagating change 3 would take some additional work.

I'm definitely open to any kind of feedback!

In a subsequent PR, it's my intention to modify things so that we can optionally use multiprocessing or mpi4py to speed up concatenation (I have an idea on how to do that in a minimally invasive manner that won't affect hypothetical dask-usage).

@evaneschneider
Collaborator

This all seems fine to me, but I'd like @helenarichie or @bcaddy to take a look, since I think they are the main ones who have been using the scripts in their existing form.

@helenarichie
Collaborator

helenarichie commented Jul 3, 2024

I think this looks good! I like having the default behavior be that you don't have to specify the number of processes. It might be nice to also do that with the snapshot numbers at some point. I also don't have any issues with changing the way the snapshot numbers are specified.

I am having a little bit of trouble figuring out which arguments are required/what the default behavior is and was wondering if you could clarify. For example, if I run this and my input directory contains 2D and 3D snapshots will it just go ahead and concatenate all of them unless I tell it to omit certain types of output? Sorry if I'm missing something obvious!

It would also probably be good to update the documentation with these changes.

@mabruzzo
Collaborator Author

mabruzzo commented Jul 3, 2024

> I think this looks good! I like having the default behavior be that you don't have to specify the number of processes.

And it is less error-prone!

> It might be nice to also do that with the snapshot numbers at some point.

I think that could definitely be nice! One of my goals here is achieving more sensible default behavior with fewer arguments (it could be cool to do something similar with the -s or -o flag).

My next primary concern is introducing some level of parallelism.

> I am having a little bit of trouble figuring out which arguments are required/what the default behavior is and was wondering if you could clarify. For example, if I run this and my input directory contains 2D and 3D snapshots will it just go ahead and concatenate all of them unless I tell it to omit certain types of output? Sorry if I'm missing something obvious!

The behavior of concat_3d_data.py and concat_2d_data.py is unchanged in this regard.

For concat.py you need to opt in for each type of output with the --kind parameter. Consequently,

  • --kind 3D just does 3D data
  • --kind slice just does slices (--kind slice-xy,xz will omit data from the yz datasets).
  • --kind proj just does axis-aligned projections
  • --kind rot_proj just does rotated projections
  • --kind 3D slice proj rot_proj would do all of the above

Once we add support for concatenating particles, I might support --all-kinds to handle all types of supported datasets.
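The opt-in --kind behavior described above could be sketched as follows. This is a hypothetical illustration of how the argument values might expand into a selection of dataset kinds (including sub-selections like slice-xy,xz); the names and structure are assumptions, not concat.py's actual internals.

```python
# Kinds named in the discussion above; illustrative only.
KNOWN_KINDS = {"3D", "slice", "proj", "rot_proj"}

def expand_kinds(kind_args):
    """Expand a list of --kind arguments into a mapping of
    kind -> sub-selection (or None for "all variants").

    A sub-selection like "slice-xy,xz" limits slices to the
    listed planes (omitting, e.g., the yz datasets).
    """
    selected = {}
    for arg in kind_args:
        base, _, subs = arg.partition("-")
        if base not in KNOWN_KINDS:
            raise ValueError(f"unknown kind: {base}")
        # an empty sub-selection means "all planes/variants"
        selected[base] = subs.split(",") if subs else None
    return selected
```

For instance, `expand_kinds(["3D", "slice-xy,xz"])` would select all 3D data plus only the xy and xz slices.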

> It would also probably be good to update the documentation with these changes.

I'm happy to do that. I might hold off until I have your approval/this is merged.

@bcaddy
Collaborator

bcaddy commented Jul 3, 2024

This is great. I'm totally for it.

@helenarichie
Collaborator

Ah, I see. I had gotten myself a little confused before... I appreciate the clarification!

I think this is good to go!

@evaneschneider
Collaborator

I'll go ahead and merge this, @mabruzzo feel free to update the wiki when you get a chance.

@evaneschneider evaneschneider merged commit 43f381b into cholla-hydro:dev Jul 8, 2024
11 checks passed