Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CL/HIER: Add allgatherv #1050

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

nsarka
Copy link
Collaborator

@nsarka nsarka commented Dec 2, 2024

This PR adds CL/HIER allgatherv and updates the allgatherv gtest to test non-contiguous dst buffers. It is meant to be used in conjunction with my TL/SHM gatherv implementation.

The algorithm is:

  • Node-level gatherv, all sendbufs on the node are put into a scratch buffer
  • Leader-level allgatherv on the scratch buffer, the result is put into the dst buffer
  • Node-level in-place bcast on the dst buffer
  • If the buffer is non-contig, unpack by memcpying the packed buffer into the right displacements

Comparison Data

2 nodes, 32 PPN on lego-grace cg (1 socket, 72 cores/socket)

Size UCC after PR (cl_hier + tl_shm + tl_ucp) UCC before PR (cl_basic + tl_ucp) HCOLL %-Gain, After vs. HCOLL %-Gain, Old vs. HCOLL %-Gain, Before vs. After
1 7.37 19.88 7.41 0.54% -168.29% 62.93%
2 7.81 19.75 7.47 -4.55% -164.39% 60.46%
4 7.62 19.79 7.65 0.39% -158.69% 61.50%
8 8.35 19.78 8.13 -2.71% -143.30% 57.79%
16 9.49 19.8 9.44 -0.53% -109.75% 52.07%
32 9.34 22.91 9.67 3.41% -136.92% 59.23%
64 9.17 22.96 10.03 8.57% -128.91% 60.06%
128 10.47 27.55 10.88 3.77% -153.22% 62.00%
256 12.68 33.37 12.46 -1.77% -167.82% 62.00%
512 19.57 180 25.31 22.68% -611.18% 89.13%
1024 25.38 189.21 308.33 91.77% 38.63% 86.59%
2048 38.86 198.27 311.55 87.53% 36.36% 80.40%

@nsarka nsarka self-assigned this Dec 2, 2024
@nsarka nsarka force-pushed the nsarka/cl-hier-allgatherv branch from 9e34573 to c126e93 Compare December 2, 2024 15:32
@nsarka nsarka force-pushed the nsarka/cl-hier-allgatherv branch 2 times, most recently from 198da51 to 862cad6 Compare December 2, 2024 15:46
@x41lakazam x41lakazam mentioned this pull request Dec 2, 2024
@nsarka nsarka force-pushed the nsarka/cl-hier-allgatherv branch 2 times, most recently from 0774618 to 1298163 Compare December 6, 2024 22:23
@swx-jenkins3
Copy link

Can one of the admins verify this patch?

@manjugv
Copy link
Contributor

manjugv commented Dec 11, 2024

@aamirshafi Please review

@nsarka nsarka force-pushed the nsarka/cl-hier-allgatherv branch from 0305f14 to 30f0ce1 Compare January 23, 2025 16:29
if (ucc_unlikely(cl_team->node_leaders == NULL)) {
cl_team->node_leaders = ucc_malloc(sizeof(ucc_rank_t) * team_size);
if (!cl_team->node_leaders) {
cl_error(team->context->lib, "Could not allocate node_leaders array");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return error

}
cl_team->leader_list = ucc_malloc(sizeof(ucc_rank_t) * ldr_sbgp_size);
if (!cl_team->node_leaders) {
cl_error(team->context->lib, "Could not allocate leader_list array");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return error

args.args.src.info.mem_type = UCC_MEMORY_TYPE_HOST;
args.args.dst.info_v.displacements = leader_disps;
args.args.dst.info_v.counts = leader_counts;
args.args.dst.info_v.buffer = args_old.args.dst.info_v.buffer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here dst.Info_v.buffer size might be not enough to hold data because of gaps

cl_team->top_sbgp != UCC_HIER_SBGP_NODE) {
args = args_old;
args.args.coll_type = UCC_COLL_TYPE_BCAST;
args.args.root = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to leader rank id

0, UCC_MSG_MAX, UCC_CL_HIER_DEFAULT_SCORE,
ucc_cl_hier_allgatherv_init, cl_team);
if (UCC_OK != status) {
cl_error(lib, "faild to add range to score_t");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo failed

args->dst.info_v.datatype);
ucc_ee_executor_t *exec;
ucc_status_t status;
int i;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int -> ucc_rank_t

}

schedule->super.status = UCC_INPROGRESS;
schedule->super.super.status = UCC_INPROGRESS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to set super.super.status


out:
schedule->super.status = st;
schedule->super.super.status = st;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't set it, it may lead to races with multiple threads

ucc_rank_t i;

for (i = 0; i < *n_tasks; i++) {
ucc_ee_executor_task_t *etask = tasks[i];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

define variable at the beginning of function

Comment on lines +11 to +13
ucc_schedule_t *schedule = ucc_derived_of(task, ucc_schedule_t);
ucc_cl_hier_schedule_t *cl_schedule = ucc_derived_of(schedule,
ucc_cl_hier_schedule_t);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ucc_schedule_t *schedule = ucc_derived_of(task, ucc_schedule_t);
ucc_cl_hier_schedule_t *cl_schedule = ucc_derived_of(schedule,
ucc_cl_hier_schedule_t);
ucc_cl_hier_schedule_t *cl_schedule = ucc_derived_of(task,
ucc_cl_hier_schedule_t);

@@ -1,48 +1,55 @@
#
# Copyright (c) 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2025

return cl_team->node_leaders[team_rank];
}

UCC_CL_HIER_PROFILE_FUNC(ucc_status_t, ucc_cl_hier_allgatherv_init,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sergei-Lebedev can you elaborate, is this something we simply don't support in allgatherv?

@@ -0,0 +1,15 @@
/**
* Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2025

@@ -1,5 +1,5 @@
/**
* Copyright (c) 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* Copyright (c) 2020-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2025

@@ -1,5 +1,5 @@
/**
* Copyright (c) 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* Copyright (c) 2020-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2025

@@ -1,5 +1,5 @@
/**
* Copyright (c) 2020-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* Copyright (c) 2020-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants