-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCX 1.16.0 upgrade #10190
UCX 1.16.0 upgrade #10190
Conversation
Signed-off-by: Alessandro Bellina <[email protected]>
build |
build |
build |
build |
1 similar comment
build |
…ebug issue" This reverts commit 768f56d.
build |
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly nits, but I am not happy with pulling in an release candidate for UCX. Is there a timeline for when it will be fully released?
@@ -33,11 +33,11 @@ ARG UCX_VER | |||
ARG UCX_CUDA_VER | |||
ARG UCX_ARCH | |||
|
|||
RUN yum update -y && yum install -y wget bzip2 numactl-libs libgomp | |||
# note that libibmad is a temporary workaround for a missed dependency in ucx-1.16 rpm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an issue to fix this? Or to move to a non-rc version of UCX?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be included in the next RC for UCX. The UCX team is working on it here: openucx/ucx#9628.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small precision, libibmad
/libibumad
dep as just been removed by openucx/ucx#9625 and openucx/ucx#9627. other PRs are longer term effort.
RUN ls /usr/lib | ||
RUN mkdir /tmp/ucx_install && cd /tmp/ucx_install && \ | ||
wget https://github.com/openucx/ucx/releases/download/v$UCX_VER/ucx-$UCX_VER-centos8-mofed5-cuda$UCX_CUDA_VER-$UCX_ARCH.tar.bz2 && \ | ||
tar -xvf *.bz2 && \ | ||
rpm -i ucx-$UCX_VER*.rpm && \ | ||
rpm -i ucx-cuda-$UCX_VER*.rpm --nodeps && \ | ||
rpm -i `ls ucx-[0-9]*.rpm ucx-cuda-[0-9]*.rpm` --nodeps && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the rpm is not named consistently with the version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct :( It is an RC issue as well, when you drop the RC it works. That said, I figured this was a bit more robust, because we have nothing else in this directory /tmp/ucx_install
, so it's only UCX at this rate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK to me other than resolving the "should we depend on a release candidate now rather than wait" issue.
I understand this is a release candidate for UCX. I am ok using it since we are doing most of the testing on UCX 1.16-rc for our use case. |
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
This reverts commit 8ab25cd. Signed-off-by: Alessandro Bellina <[email protected]>
This reverts commit 8ab25cd. Signed-off-by: Alessandro Bellina <[email protected]>
This reverts commit 8ab25cd. Signed-off-by: Alessandro Bellina <[email protected]>
I'd like for us to merge and get this through CI. Technically, this is using UCX 1.16 at an RC level, which really impacts the jucx JAR. This was done to pick up an arm version of libucx. I have updated dockerfiles to install UCX 1.16-rc1 native bits, and as UCX releases this fully we will be able to remove the "rc" part of it.
Closes #7116.
Related UCX issue openucx/ucx#9523.