-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NERSC SFAPI Reconstruction Flow #44
Conversation
…orchestration/flows/bl832/tomography_hpc.py). Linting orchestration/nersc.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work.
I added a number of review comments. This is a really nice start, and makes the nersc-specific code easy to see and follow. If my comments were a little picky, it's because the code was easy to follow.
tomocagrahpy_hpt.py should be split into sepearate modules, something like: tomography_hpc.py, nersc.py and alcf.py so that no more code than necessary is imported (e.g. when running an ALCF job, you don't need to import NerscClient.)
Also, what about job_controller.py
rather than tomography_hpc.py
? Since this is under flows/bl832, tomography is implied. And we might some day write the job controllers for local (non-HPC) jobs.
…ig.py, renamed tomography_hpc.py to job_controller.py. Added back nersc.py, which now just uses job_controller (need to add back in the other data transfer/pruning logic, etc). Inject NERSCClient into the nersc controller init. Removed references to export controls. Fixed typing for Config832 references (and pass into the init). Turned create_nersc_client() into a static method in the class. Moved load_dotenv to the top of the module. Consoldated reconstrut() and and build_multi_resolution() in the nersc implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This
…c.py. Updated get_controller() in job_controller.py to take in an HPC enum that stores the corresponding Controller implementation for that facility.
…ath and run respective tasks. Revert to debug queue (at least for now) since the queue time is orders of magnitude short than the preempt queue (want to look into the interactive queue). Moved NERSCTomographyHPCController back into job_controller.py due to circular imports. Fixed Enum implementation.
…ows/bl832/nersc.py. Removed the ALCF controlelr from job_controller, in anticipation of that being defined in alcf.py when I refactor that code.
Now that you have a nice clean |
…it for the job to complete before moving onto the next step. An OK workaround for this: NERSC/sfapi_client#93
Sounds good to me, I will work on adding unit tests. I suspect we can replace references to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do the following after discussion:
-
add a big comment to the NerscClient saying that it's deprecated and will be removed when we refactor the ptychography code
-
Remove initialization of Config832 and sfapi Client from NERSCTomogrpahyHPCController, passing them down from flows to tasks as parameters
-
Create unit tests for NERSCTomogrpahyHPCController methods
…l be removed. Removed init of COnfig832 and sfapi_client from NERSCTomographyHPCCOntroller, and instead inject them (no longer optional). Updated unit tests in
|
… with up to date python installations
I met with Raja and we came up with a few next steps to wrap this up:
|
…. Updated nersc.py to use the realtime queue, which now works with minimal delay. Updated recon/multires jobs to use pscratch for temporary file storage/writing, after copying data from the source at /global/cfs/.../8.3.2/raw/. The next steps include copying the reconstructions to the appropriate /global/cfs/.../8.3.2/scratch/ directory, copying to data832, and scheduling pruning.
…ed nersc832_alsdev_pscratch endpoint in the config, and added a nersc_prune_pool and prefect deployment.
… move data within the nersc reconstruction flow.
…ERSC flow to use those. Updated the Prefect deployment script for the NERSC implementation. Updated endpoints in config.yml to include pscratch/scratch and pscratch/raw for alsdev. Verified pruning code is working as expected.
…FTomographyHPCController in olcf.py, to be addressed in future git issues.
…le implementations. Replaced BaseEndpoint(Protocol) with a more generic TransferEndpoint(ABC). Updated FileSystemEndpoint to include the missing root_path. Fixed SimpleTransferController class and verified that it works. Updated the names of both GlobusTransferController and SimpleTransferController. Bound the generic Endpoint to TransferEndpoint. Updated inits in job_controllers and transfer_controllers to use the super().init(config) notation. Added the built-in warnings module for orchestration/nersc.py for deprecation messages. Added to_dict and from_dict methods to the GlobusEndpoint class to address potential serialization/deserialization issues (although I haven't encountered them yet with Prefect).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last tiny comment
Setting this up as a draft PR for now while I iron out implementation details to match the ALCF workflow.
These two steps are currently working via sfapi and the container Raja built (registry.nersc.gov/als/tomorecon_nersc_mpi_hdf5):
There are still at least a few things that need to be addressed before it is production ready:
new_832_file_flow
andalcf_recon_flow
expect (still using a legacy version of reconstruction.py that expects an input.txt file)