shared cache: parallelize the scavenge operation #494
The (kind of) good news is that scavenge does appear to work, but not optimally when the cache is shared. When the cache is shared, all instances of scr_copy end up copying the same set of files.
Here is a conversation with good notes between me and @adammoody concerning the topic:

[10:36 AM] McFadden, Marty
[10:40 AM] McFadden, Marty
[12:21 PM] Moody, Adam T. Having said that, I think we should actually look at making scr_copy an MPI job. It would be really helpful in this case. Also, systems today are more tolerant of running MPI in an allocation with failed nodes, and our node failure detection is pretty decent. I think I'd like to keep the pdsh + serial scr_copy method as a plan B in our back pocket in case there is some situation where we need it. Maybe that just means keeping a branch with the code somewhere if not actually installing it. That wouldn't help much with shared cache, but it'd be a nice fallback option for node-local storage users in case they are on a system where the above limitations show up.
[12:58 PM] McFadden, Marty I wonder whether there is anything that scr_copy can read in from the pattern of filemap_X to indicate whether the .scr directory is on a shared cache? I think that we want to avoid having N copies of scr_copy each copying the same set of files from a shared cache, even when it is being run serially.
[1:10 PM] Moody, Adam T.
[1:11 PM] McFadden, Marty
[1:12 PM] McFadden, Marty
[1:13 PM] McFadden, Marty
[1:16 PM] Moody, Adam T.
[1:16 PM] McFadden, Marty It seems like we would need to have a new (MPI) program that will build a globally viewable file map with file names and rank lists for each file. Or maybe that is what scr_copy+mpi would do. But what you've said about MPI (sometimes) not working within a broken allocation concerns me. This is the only time that this stuff will be run, right?
[1:22 PM] Moody, Adam T. Something a bit more general than just listing that the cache is global would be to describe the storage topology somehow, say in the flush file or somewhere else. For example, we could list the nodes that are designated as "storage leaders", i.e., nodes that had a process that was rank=0 of its store descriptor communicator. For node-local storage, we'd list every compute node. For global cache, we'd just list the first node. Then we could modify scavenge to only launch scr_copy processes on the storage leader nodes. Or, taking that a step further for higher copy performance, we could spell out full storage groups, launch scr_copy on all nodes, and then divide files up among members of each storage group. Taking the long view, ultimately we want all/most/many of the surviving compute nodes to help during the copy operation in the scavenge. When writing a checkpoint, each compute node will have saved some data, so the dataset size scales as O(N), where N = number of compute nodes that were used in the job. We want the scavenge to scale similarly. For example, launching a single scr_copy on a single node would not be scalable, since then we have a single compute node responsible for copying O(N) data, like 1 node copying data saved by 4000 nodes. The problem with a global cache is that we have to figure out a way to coordinate that effort.
[1:47 PM] McFadden, Marty If it could be, then perhaps this could be made to work with pdsh/scr_copy, as each scr_copy could open a read-only copy of the flush.scr file and could do its work based upon which node it was running on. Or perhaps the scr_scavenge script could create a copy of the flush.scr file that it does some post-processing on. Assigns nodes to files and such. I say this because scr_scavenge knows which nodes are left to do any work.
[1:52 PM] Moody, Adam T.
[1:54 PM] McFadden, Marty
[2:02 PM] Moody, Adam T.
[2:04 PM] McFadden, Marty I also have a silly HPC question. How does one deal with 4000 nodes if they are uniquely named with various string sizes and each has very few (if any) similar characters (like "Freddy2" and "Bill3")? Or is that simply not allowed and administrators would NEVER let that happen? I'm assuming that they are named in a sane way like "rzgenie01-NNN".
[2:08 PM] Moody, Adam T. Another challenge that comes to mind here is what happens if there is some undetected failed node that is responsible for some portion of the file. We'll need a way to know that portion failed to copy. I guess we could gather and process return codes from all nodes that we launch on, or maybe we have each of those processes write some additional known data somewhere else that we can check later. LLNL uses well-formed node names that can be grouped together. However, there are HPC centers that use node names that really look like random strings (no simple way to compress them).
[2:11 PM] McFadden, Marty Whoa! Really!?!? (node names). Yikes!! Do I need to worry about that now? Do our instantiations of pdsh work with 10's of 1000's of those?
[2:13 PM] Moody, Adam T.
[2:14 PM] McFadden, Marty
[2:14 PM] Moody, Adam T.
[2:15 PM] McFadden, Marty Yeah, I've run into odd names in SCR and in feedback on other projects. Rather than just a cluster name and number like we do, some sites encode physical location information into their node names, like row number, rack number, rack height, slot number, etc. That helps their ops team find a node just given the name, but it leads to names like "row5-rack6-slot10".
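To make the "storage leaders" idea from the conversation above concrete, here's a rough sketch of how scavenge might pick which nodes to launch scr_copy on, assuming the flush file recorded each store's member nodes and whether the store is global. The `store_desc` structure is purely illustrative, not SCR's actual flush file format:

```c
#include <stdio.h>

/* Hypothetical record of one store, as scavenge might read it from the
 * flush file: the member compute nodes and whether the store is a
 * global (shared) file system. */
struct store_desc {
    const char **nodes;   /* member compute nodes */
    int num_nodes;
    int is_global;        /* 1 = shared cache, 0 = node-local */
};

/* Print the storage leaders: every node for node-local storage, but
 * only the first node (rank 0 of the store communicator) for a
 * global cache. */
void print_leaders(const struct store_desc *s)
{
    int count = s->is_global ? 1 : s->num_nodes;
    for (int i = 0; i < count; i++) {
        printf("%s\n", s->nodes[i]);
    }
}

int main(void)
{
    const char *nodes[] = { "node1", "node2", "node3", "node4" };
    struct store_desc local  = { nodes, 4, 0 };
    struct store_desc global = { nodes, 4, 1 };
    printf("node-local leaders:\n");   print_leaders(&local);
    printf("global-cache leaders:\n"); print_leaders(&global);
    return 0;
}
```

For node-local storage every node is a leader, so this degenerates to today's behavior; for a global cache only one node would launch scr_copy, which avoids the duplicate copies but gives up parallelism. The partitioning sketches further down address that.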
The scavenge operation assumes files are cached in node-local storage. After the run stops, the scavenge script launches the scr_copy executable on each compute node via pdsh. On each node, this executable runs as a serial process. It reads the SCR cache index in the control directory (/dev/shm) to get the cache location for a specified dataset. It then scans that cache directory and copies every application file and every SCR metadata/redundancy file from cache to the prefix directory.
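For reference, the per-node copy amounts to something like the following sketch. The cache path would really come from the cache index in /dev/shm, the paths are hypothetical, and the copy itself is stubbed out with a printf; this is not scr_copy's actual code:

```c
#include <dirent.h>
#include <stdio.h>

/* Scan the dataset's cache directory and copy every file it finds:
 * application files plus SCR metadata and redundancy files. */
void copy_dataset(const char *cache_dir, const char *prefix_dir)
{
    DIR *d = opendir(cache_dir);
    if (d == NULL) {
        return;
    }
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.') {
            continue;  /* skip . and .. */
        }
        /* A real implementation would copy the file here. */
        printf("copy %s/%s -> %s/\n", cache_dir, e->d_name, prefix_dir);
    }
    closedir(d);
}

int main(void)
{
    copy_dataset("/dev/shm/cache/dataset.1", "/p/lustre/prefix/dataset.1");
    return 0;
}
```

With a shared cache, every node scanning and copying the same directory is exactly the "step on each other" problem described next.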
When the cache is a global file system, this will need to be fixed. I'm not sure offhand whether this will only copy files through a single compute node, or whether every compute node will try to copy every file and they will step on each other. Either case is bad. Ideally, we want to parallelize this copy operation so that multiple compute nodes help with the copy, but each copies a distinct set of files.
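One coordination-free way to get distinct sets would be a deterministic assignment: every scr_copy instance hashes each file name to a node index and copies only the files it owns. A minimal sketch, where the file list and node count are hypothetical stand-ins for what scr_scavenge or the flush file would provide:

```c
#include <stdio.h>

/* djb2 string hash: every node computes the same value for a given
 * file name, so each file gets exactly one owner with no coordination. */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;
    for (; *s != '\0'; s++) {
        h = h * 33 + (unsigned char)*s;
    }
    return h;
}

int main(void)
{
    const char *files[] = { "rank_0.ckpt", "rank_1.ckpt",
                            "rank_2.ckpt", "rank_3.ckpt" };
    int num_files = 4;
    int num_nodes = 2;

    /* The outer loop just simulates the nodes; in practice each
     * scr_copy would run the inner loop once with its own index. */
    for (int node = 0; node < num_nodes; node++) {
        for (int i = 0; i < num_files; i++) {
            if (hash_str(files[i]) % (unsigned long)num_nodes
                    == (unsigned long)node) {
                printf("node %d copies %s\n", node, files[i]);
            }
        }
    }
    return 0;
}
```

The catch, as noted in the chat above, is detecting when an assigned node silently fails to copy its share; we'd still need to gather return codes or verify the results afterward.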
If the redundancy scheme is SINGLE, perhaps we can skip copying the redundancy data files?
Also, the application file may be a single, large file. To be efficient, we'd need to transfer it in parallel by assigning different segments to different compute nodes. Is there a way to do that with a bunch of independent scr_copy tasks, or do we need to turn scavenge into an MPI job?
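If scavenge did become an MPI job, splitting a single large file across ranks is straightforward. A sketch, assuming one rank per surviving compute node; the paths are hard-coded for illustration, and a real version would take the file list from the filemap and create destination directories first:

```c
#include <mpi.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, ranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ranks);

    /* Hypothetical paths; a real version reads them from the filemap. */
    const char *src = "/shared/cache/dataset.1/app.dat";
    const char *dst = "/p/lustre/prefix/dataset.1/app.dat";

    struct stat st;
    if (stat(src, &st) != 0) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Give each rank one contiguous segment of the file. */
    off_t chunk  = (st.st_size + ranks - 1) / ranks;
    off_t offset = (off_t)rank * chunk;
    off_t end    = offset + chunk > st.st_size ? st.st_size : offset + chunk;

    int in  = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT, 0600);
    if (in < 0 || out < 0) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Copy this rank's byte range; pread/pwrite at distinct offsets
     * keep ranks from interfering even though all of them write to
     * the same destination file. */
    char buf[1 << 16];
    while (offset < end) {
        size_t n = (size_t)(end - offset);
        if (n > sizeof(buf)) {
            n = sizeof(buf);
        }
        ssize_t r = pread(in, buf, n, offset);
        if (r <= 0 || pwrite(out, buf, (size_t)r, offset) != r) {
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        offset += r;
    }
    close(in);
    close(out);

    /* Make sure every segment landed before declaring success. */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```

With independent scr_copy tasks instead of MPI, each task would need its offset range passed on the command line, which scr_scavenge could compute the same way; we'd lose the final barrier, though, and would have to check per-task exit codes to know the whole file made it.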