digits Slurm #1435
base: master
merge with upstream master
Wow, that's a lot of code. Do you have any documentation for how to use this? Any suggestions for how to review 20K lines of code in 100 commits?
Working on getting the documentation ready. Most of the changes revolve around the task.py file and the subclasses of task. The workload managers are detected by the cluster factory: Slurm is detected by checking whether the Slurm home is set. The workload manager can then be selected via the GUI under the info tab, and the selection is stored in the cluster_factory. In the task class, the run method has been edited to check whether the task should be run via a workload manager or just normally. If a workload manager is selected, the cluster_manager is called and the task's arguments are edited by the selected cluster_manager. There are also some changes to make sure no tasks have missing variables for Slurm, other changes to capture the variables Slurm needs, and changes to the GUI to take the commands. Unfortunately I no longer have access to a Slurm cluster to continue work on this, so I would not be able to test any further changes to my fork. As of my final commit, all the tests you have on Travis were passing using Slurm. Feel free to ask any questions.
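For reviewers, the flow described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the class and function names (`SlurmManager`, `ClusterFactory`, `build_task_args`) and the `SLURM_HOME` environment variable name are assumptions made for the example. `srun` and its `--gres=gpu:N` option are standard Slurm, but the arguments the PR actually generates may differ.

```python
import os


class SlurmManager(object):
    """Hypothetical workload manager wrapper (names are illustrative only)."""
    name = "slurm"

    @staticmethod
    def detected(environ):
        # Assumption: "slurm home is set" means an env var such as SLURM_HOME.
        return bool(environ.get("SLURM_HOME"))

    def wrap_args(self, args, gpus=0):
        # Prefix the task's command line so it is launched through srun.
        prefix = ["srun"]
        if gpus:
            prefix.append("--gres=gpu:%d" % gpus)
        return prefix + list(args)


class ClusterFactory(object):
    """Detects available workload managers and remembers the selected one."""

    def __init__(self, environ=None):
        environ = os.environ if environ is None else environ
        self.available = {}
        for manager_cls in (SlurmManager,):
            if manager_cls.detected(environ):
                self.available[manager_cls.name] = manager_cls()
        self.selected = None  # None means "run tasks normally"

    def select(self, name):
        # In the PR this choice comes from the GUI; here it is a plain call.
        if name is not None and name not in self.available:
            raise ValueError("workload manager not detected: %s" % name)
        self.selected = name

    def get_manager(self):
        if self.selected is None:
            return None
        return self.available[self.selected]


def build_task_args(args, factory, gpus=0):
    """Return the command line a task's run method would execute."""
    manager = factory.get_manager()
    if manager is None:
        return list(args)  # no workload manager selected: run as before
    return manager.wrap_args(args, gpus=gpus)
```

With no manager selected the arguments pass through unchanged. After `factory.select("slurm")`, the same task is launched through `srun`, which is the point where a different workload manager could be plugged in.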
That's pretty impressive! We'll try to make some time to review this.
Great to see that you guys are going to take the time to review this. Can't wait to see it integrated into the main branch 👍🏽
I have found the cause of the huge number of lines changed, fixed it, and synced my fork. The number of lines changed is now about 2,000. This is still inflated by some line-ending issues that made git think I rewrote a few files. I will run our tests again to make sure everything is still passing after the sync.
After the sync, all tests are still passing in our Slurm environment. Travis is only failing dist and lint (from some files in examples); all other tests seem to be passing fine.
Updated NVIDIA DIGITS to work with the Slurm workload manager. It should be easy to extend to other workload managers if someone has a system to develop with.