Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ON HOLD] Unit Tests for CPX mode #1319

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from
Open

Conversation

saurabhAMD
Copy link
Contributor

Details

Work item: "Internal"

What were the changes?
Enabling Unit Tests to run for CPX mode.

Why were the changes made?
Currently UT tests all possible numbers of GPUs up to 16. If same scheme is used to test CPX mode, UT will take forever to complete.

How was the outcome achieved?

  1. Detect if system is in CPX mode.
  2. If yes, only test pow2 number of GPUs.
  3. Fit GPUs into minimal number of sockets. Unique ID used to detect the OAM for the gpu. Implementation logic: GPUs mapped to sockets associated with a greater number of GPUs are given higher preference for assignment.

Additional Documentation:
None.

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

Comment on lines +164 to +201
int getDeviceMode (bool *cpxMode){
// Prepare parent->child pipe
int pipefd[2];
if (pipe(pipefd) == -1)
{
ERROR("Unable to create parent->child pipe for getting the device mode\n");
return TEST_FAIL;
}
pid_t pid = fork();
if (0 == pid)
{
bool iscpxMode = false;
try {
std::string log = execCommand("rocm-smi --showcomputepartition");
bool foundCPX = log.find("CPX") != std::string::npos;
if (foundCPX) {
iscpxMode = true;
}
} catch (const std::exception& e) {
std::cerr << "Error: " << e.what() << std::endl;
return 1;
}
if (write(pipefd[1], &iscpxMode, sizeof(iscpxMode)) != sizeof(iscpxMode)) return TEST_FAIL;
close(pipefd[0]);
close(pipefd[1]);
exit(EXIT_SUCCESS);
}
else {
int status;
if (read(pipefd[0], cpxMode, sizeof(*cpxMode)) != sizeof(*cpxMode)) return TEST_FAIL;
waitpid(pid, &status, 0);
assert(!status);
close(pipefd[0]);
close(pipefd[1]);
}
return TEST_SUCCESS;
return 0;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why rocm-smi is run in a fork process?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we cannot use HIP call prior to launching unless it is inside another child process. You can refer to line 207 for this comment. The same procedure is used for other features such as finding the architecture of the system as in line 211.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your case, I think popen already forks and creates a pipe internally, so you may not need your own fork. If so, getDeviceMode and getDevicePriority can be simplified.

}
if(isGfx94) {
bool cpxMode = false;
getDeviceMode(&cpxMode);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return value of getDeviceMode is never used, so if it can be simplified as I've suggested, you could rename it to isDeviceCpxMode and return bool.

@corey-derochie-amd corey-derochie-amd changed the title Unit Tests for CPX mode [ON HOLD] Unit Tests for CPX mode Sep 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants