Tried running this example on SageMaker today: "Running RAPIDS Hyperparameter Experiments at Scale on Amazon SageMaker" (link)
Using RAPIDS 25.02, the first training job fails like this:
2025-02-11 20:16:47 Starting - Starting the training job...
2025-02-11 20:17:10 Starting - Preparing the instances for training...
Downloading - Downloading input data...
2025-02-11 20:22:20 Training - Training image download completed. Training in progress.
2025-02-11 20:22:20 Uploading - Uploading generated training model
@ entrypoint -> launching training script
Traceback (most recent call last):
  File "/opt/ml/code/rapids-higgs.py", line 58, in <module>
    main(args)
  File "/opt/ml/code/rapids-higgs.py", line 20, in main
    data = cudf.read_csv(data_dir + "HIGGS.csv", names=col_names, dtype=dtypes_ls)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/cudf/io/csv.py", line 257, in read_csv
    table_w_meta = plc.io.csv.read_csv(options)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "csv.pyx", line 631, in pylibcudf.io.csv.read_csv
  File "csv.pyx", line 649, in pylibcudf.io.csv.read_csv
RuntimeError: copy_if failed on 2nd step: cudaErrorIllegalAddress: an illegal memory access was encountered
jameslamb changed the title on Feb 11, 2025:
from: [AWS} SageMaker Higgs example fails on ml.p3.2xlarge: "an illegal memory access was encountered"
to: [AWS] SageMaker Higgs example fails on ml.p3.2xlarge: "an illegal memory access was encountered"
jameslamb changed the title on Feb 12, 2025:
from: [AWS] SageMaker Higgs example fails on ml.p3.2xlarge: "an illegal memory access was encountered"
to: [AWS] SageMaker: Higgs example fails on ml.p3.2xlarge: "an illegal memory access was encountered"
Reproducible Example
Follow https://docs.rapids.ai/deployment/nightly/examples/rapids-sagemaker-higgs/notebook/
Notes
This notebook was not tested with RAPIDS 24.10 or 24.12, because of #443
I'm not sure for how many RAPIDS releases this has been a problem. The issue might be as simple as "you need to use an instance with more memory".
That training job is launched here: deployment/source/examples/rapids-sagemaker-higgs/notebook.ipynb, line 575 in e3c471f
This cudf.read_csv() call is where it's failing: deployment/source/examples/rapids-sagemaker-higgs/rapids-higgs.py, line 20 in e3c471f
Limiting the number of rows cuDF tries to read in avoids this error and allows training to succeed.
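A minimal sketch of that workaround, assuming the row cap is passed through the `nrows` parameter of `cudf.read_csv()`. The helper names (`read_csv_kwargs`, `load_higgs`) and the `max_rows` argument are hypothetical and not part of the example script; the row count that avoids the failure is not stated here, so any value used is a guess to be tuned.

```python
def read_csv_kwargs(col_names, dtypes, max_rows=None):
    """Build keyword arguments for read_csv, optionally capping the rows read.

    Hypothetical helper: max_rows=None keeps the original behaviour
    (read the whole file); a positive integer is forwarded as nrows.
    """
    kwargs = {"names": col_names, "dtype": dtypes}
    if max_rows is not None:
        if max_rows <= 0:
            raise ValueError("max_rows must be a positive integer")
        kwargs["nrows"] = max_rows
    return kwargs


def load_higgs(data_dir, col_names, dtypes, max_rows=None):
    """Read HIGGS.csv with cuDF, reading at most max_rows rows if given."""
    # Imported lazily so the helper above can be used without a GPU.
    import cudf

    # Same path construction as the failing call in rapids-higgs.py.
    return cudf.read_csv(
        data_dir + "HIGGS.csv",
        **read_csv_kwargs(col_names, dtypes, max_rows),
    )
```

In the training script this would replace the failing call with something like `data = load_higgs(data_dir, col_names, dtypes_ls, max_rows=...)`, with the cap chosen by experiment. This only works around the error by reading less data; it does not explain why the full read fails.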