Make braket_container.py thread safe for CUDA-Q BYOC image #679

Open
sterseba opened this issue Feb 13, 2025 · 0 comments
Labels
enhancement New feature or request

sterseba commented Feb 13, 2025

Is your feature request related to a problem? Please describe.
The braket_container.py script that the CUDA-Q BYOC image uses to launch the user-provided algorithm script is not thread safe. This can create race conditions, in particular in the step that downloads, extracts, and makes available the customer code to be executed in the job. It is a problem specifically for multi-GPU workflows (both single- and multi-instance), an area where CUDA-Q can provide acceleration. The original script in the amazon-braket-containers repository does not take into account multiple processes running in an MPI context at all, whereas the script in this repository at least performs some basic handling of the MPI ranks here:

# Add wait time to resolve race condition
import time

rank = int(os.getenv("OMPI_COMM_WORLD_NODE_RANK", "0"))
time.sleep(rank)

However, this handling is both inefficient and ultimately not bulletproof: the fixed delay is not sufficient if, for example, downloading the user-provided algorithm code from S3 takes longer than expected.

Describe the solution you'd like
The script should be refactored for real thread safety.
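
One possible direction, as a minimal sketch rather than a concrete proposal for the actual script: guard the download-and-extract step with an exclusive file lock plus a completion marker, so that exactly one process per instance performs the setup and all others block until it has finished. The function name `run_once_per_node`, the lock and marker file names, and the `setup` callback are hypothetical and do not exist in braket_container.py. The sketch assumes that all ranks on an instance share a local (non-NFS) directory and that the container runs on Linux, where `fcntl.flock` is available.

```python
import fcntl
from pathlib import Path
from typing import Callable


def run_once_per_node(work_dir: str, setup: Callable[[], None]) -> bool:
    """Run `setup` at most once among all processes sharing `work_dir`.

    Returns True if this process performed the setup, and False if
    another process had already completed it.
    """
    root = Path(work_dir)
    root.mkdir(parents=True, exist_ok=True)
    lock_path = root / ".setup.lock"  # hypothetical file name
    done_path = root / ".setup.done"  # hypothetical file name

    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the exclusive lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if done_path.exists():
                # Another process already finished the setup.
                return False
            setup()  # e.g. download and extract the customer code
            # Written only after success, so a failed setup is retried
            # by the next process that acquires the lock.
            done_path.touch()
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

With this pattern, no rank waits longer than the setup actually takes, and a slow S3 download delays the other ranks instead of letting them proceed against a partially extracted directory.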

Describe alternatives you've considered
It would be even better, IMO, to improve the original script (https://github.com/amazon-braket/amazon-braket-containers/blob/main/src/braket_container.py) and copy it directly in the Dockerfile rather than duplicating it locally, e.g.:

FROM ...

# other instructions...

RUN git clone --depth=1 https://github.com/amazon-braket/amazon-braket-containers.git
RUN cp amazon-braket-containers/src/braket_container.py /opt/ml/code/braket_container.py
ENV SAGEMAKER_PROGRAM=braket_container.py