Create test cluster with H100 #922

Open
3 tasks
tssala23 opened this issue Feb 5, 2025 · 9 comments

@tssala23

tssala23 commented Feb 5, 2025

Motivation

We want to test that H100s are fully operational in our existing OCP setup

Completion Criteria

Ensure we are able to run workloads with and without OAI

Description

  • Create a cluster using ESI nodes, including the new H100 nodes
  • Configure the cluster
  • Run OpenShift tests on the cluster

Completion dates

Desired - 20YY-MM-DD
Required - TBD

@tssala23 tssala23 self-assigned this Feb 5, 2025
@joachimweyl
Contributor

@tssala23 is the plan to create a whole new cluster for testing or to just use one of our already standing clusters to test?

@tssala23
Author

@joachimweyl I guess either could work. The existing clusters are currently in use by other people, so I could opt to create a new one. Alternatively, we could add one to the test cluster, but I am not able to add or remove nodes there; from my understanding, only Justin can.

@joachimweyl
Contributor

If all the other test clusters are in use then we might as well spin up a new one.

@joachimweyl
Contributor

How long are you thinking this cluster will take to spin up?

@tssala23
Author

Not long; it is the same process as for all the other clusters.
But thinking about it, it will probably be easier to attach it to one of the clusters built with ESI, like Albany or Barcelona. There's also Cairo, which Dylan is building. That way we don't have to worry about DNS records.

@joachimweyl
Contributor

@tssala23 please research which ones are in use and for what purposes. If one is available, please use it and update this issue to say the H100s will be added to that cluster. Otherwise, please continue with the process of spinning up a new "Dublin" or "Damascus"?

@tssala23
Author

Albany - Isaiah was using it for GPU metrics and performance work
Barcelona - Danni, for distributed InstructLab
Cairo - Waiting for DNS, but I don't think it has been assigned.
So thinking about it, we likely won't need to create a new one, as I don't think Cairo has a purpose yet. I'm sure I could also share one of the other two, though I do like the name "Damascus".
I will check with Danni and Isaiah about their usage.

@hpdempsey

This is the spreadsheet where we track the testbed usage. @dystewart should be keeping it updated so you can track @joachimweyl along with me and others. ;-)

It is not yet clear to me how the H100s will be allocated to clusters. I believe Wayne said he wanted some for OpenStack. What Taj is describing here would, I believe, be our first attempt to run H100s with our existing production versions of the OCP/RHOAI software. In addition, we would also want to try them with the newest versions of both. Ideally that happens in the second production cluster, but if not, we will need a test cluster for it.

Jason, Ahmed and Sanjay will also need access to H100s. They will likely be bare metal (at least at first) and not clusters. We will want at least two of the first H100s to go to them for this testing, and they will need to use the new switches for these tests with RoCE. We will also need to test the new ACM and observability software on some cluster with updated software and H100s. This testing might be able to coexist with one of the other test clusters if planned carefully (that is what the existing spreadsheet assumed).

If there is a second production cluster approved (soon), then we will need to forecast H100s for that too.
Since we don't know yet how many H100s will survive burn-in (do we?), I don't know how many we get to allocate where and how Hakan intends to network them for the various opportunities, given the limited number of switches.
Many questions without answers yet, but I just wanted to raise them here.

@tssala23
Author

@hpdempsey @joachimweyl
Isaiah is still using Albany, and Danni is mostly finished with Barcelona (she would like to keep the cluster up so she can reference its current configuration, but said she could also save this locally).
With that said, I think what Dylan and I do with the H100s can coexist with the plan to "test the new ACM and observability software on some cluster with updated software and H100s as well".
We will definitely need to create a new test cluster (if the new production one doesn't exist), as none of the current ones are running the latest software (we could also upgrade one, but that is probably more hassle than it's worth). So that could mean either tearing down Barcelona and re-creating it, which is nice because we can reuse the same DNS records and storage allocation, or creating a completely new one.
