Create test cluster with H100 #922
Comments
@tssala23 is the plan to create a whole new cluster for testing, or to use one of our existing clusters?
@joachimweyl I guess either could work. The existing clusters are currently in use by other people, so I could opt to create a new one. Unless we added one to the test cluster, but I am not able to add or remove nodes from that; only Justin can, from my understanding.
If all the other test clusters are in use then we might as well spin up a new one.
How long are you thinking this cluster will take to spin up?
Not long, same process as all other clusters.
@tssala23 please research which clusters are in use and for what purposes. If one is available, please use it and update this issue to add the H100s to that cluster. Otherwise, please continue with the process of spinning up a new cluster: "Dublin" or "Damascus"?
Albany - Isaiah was using it for GPU metrics and performance work
This is the spreadsheet where we track the testbed usage. @dystewart should be keeping it updated so you can track it, @joachimweyl, along with me and others. ;-) It is not yet clear to me how the H100s will be allocated to clusters. I believe Wayne said he wanted some for OpenStack. What Taj is describing here would, I believe, be our first attempt to run H100s with our existing production versions of the OCP/RHOAI software. In addition, we would also want to try them with the newest versions of both. Ideally that happens in the second production cluster; if not, we will need a test cluster for it. Jason, Ahmed and Sanjay will also need access to H100s. Their systems will likely be bare metal (at least at first) and not clusters. We will want at least two of the first H100s to go to them for this testing, and they will need to use the new switches for these tests with RoCE. We will also need to test the new ACM and observability software on some cluster with updated software and H100s. This testing might be able to coexist with one of the other test clusters if planned carefully (that is what the existing spreadsheet assumed). If a second production cluster is approved (soon), then we will need to forecast H100s for that too.
@hpdempsey @joachimweyl
Motivation
We want to test that H100s are fully operational in our existing OCP setup.
Completion Criteria
Ensure we are able to run workloads with and without OAI.
Description
Completion dates
Desired - 20YY-MM-DD
Required - TBD