[AWS] SageMaker: Higgs example fails on ml.p3.2xlarge: "an illegal memory access was encountered" #521

Open
jameslamb opened this issue Feb 11, 2025 · 0 comments
Labels
bug Something isn't working cloud/aws Amazon Web Service cloud

jameslamb commented Feb 11, 2025

Description

Tried running this example on SageMaker today: "Running RAPIDS Hyperparameter Experiments at Scale on Amazon SageMaker" (link)

Using RAPIDS 25.02, the first training job fails like this:

2025-02-11 20:16:47 Starting - Starting the training job...
2025-02-11 20:17:10 Starting - Preparing the instances for training.
Downloading - Downloading input data.
2025-02-11 20:22:20 Training - Training image download completed. Training in progress.
2025-02-11 20:22:20 Uploading - Uploading generated training model
@ entrypoint -> launching training script
Traceback (most recent call last):
  File "/opt/ml/code/rapids-higgs.py", line 58, in <module>
    main(args)
  File "/opt/ml/code/rapids-higgs.py", line 20, in main
    data = cudf.read_csv(data_dir + "HIGGS.csv", names=col_names, dtype=dtypes_ls)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/cudf/io/csv.py", line 257, in read_csv
    table_w_meta = plc.io.csv.read_csv(options)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "csv.pyx", line 631, in pylibcudf.io.csv.read_csv
  File "csv.pyx", line 649, in pylibcudf.io.csv.read_csv
RuntimeError: copy_if failed on 2nd step: cudaErrorIllegalAddress: an illegal memory access was encountered

Reproducible Example

Follow https://docs.rapids.ai/deployment/nightly/examples/rapids-sagemaker-higgs/notebook/

Notes

This notebook was not tested with RAPIDS 24.10 or 24.12, because of #443

I'm not sure for how many RAPIDS releases this has been a problem. The issue might be as simple as "you need to use an instance with more memory".
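
For a rough sense of scale, here is a back-of-the-envelope estimate of the parsed table's size. It rests on assumptions not confirmed anywhere above: that HIGGS.csv is the full UCI dataset (11,000,000 rows, 1 label plus 28 features), that ml.p3.2xlarge exposes a single 16 GB GPU, and that dtypes_ls uses either float32 or float64 throughout. The helper name table_size_gb is made up for illustration.

```python
# Assumed figures (not taken from the failing job's logs).
N_ROWS = 11_000_000    # rows in the full UCI HIGGS dataset
N_COLS = 29            # 1 label column + 28 feature columns
GPU_MEMORY_GB = 16     # assumed GPU memory on ml.p3.2xlarge


def table_size_gb(n_rows, n_cols, bytes_per_value):
    """Size of a dense numeric table in GB (1 GB = 1e9 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e9


for name, width in (("float32", 4), ("float64", 8)):
    size = table_size_gb(N_ROWS, N_COLS, width)
    print(f"{name}: {size:.2f} GB ({size / GPU_MEMORY_GB:.0%} of GPU memory)")
```

Under these assumptions the parsed table is only a small fraction of GPU memory. That doesn't rule out memory pressure, since the CSV reader also has to hold the raw file text and intermediate buffers on the device, but it suggests the size of the final DataFrame alone isn't the explanation.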

That training job is launched here:

rapids_estimator.fit(inputs=s3_data_dir)

This cudf.read_csv() call is where it's failing:

data = cudf.read_csv(data_dir + "HIGGS.csv", names=col_names, dtype=dtypes_ls)

Limiting the number of rows cuDF tries to read avoids this error and allows training to succeed.

data = cudf.read_csv(data_dir + "HIGGS.csv", names=col_names, dtype=dtypes_ls, nrows=10_000)
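
If the full dataset is needed, one thing that might be worth trying is reading the file in row chunks and concatenating the pieces. This is an untested sketch, not a confirmed fix: I haven't verified that it avoids the illegal memory access. The helpers row_chunks and read_csv_in_chunks are hypothetical names, and total_rows has to be supplied by the caller (11,000,000 if the file is the full UCI dataset, which is an assumption).

```python
def row_chunks(total_rows, chunk_rows):
    """Yield (skiprows, nrows) pairs that together cover total_rows."""
    for start in range(0, total_rows, chunk_rows):
        yield start, min(chunk_rows, total_rows - start)


def read_csv_in_chunks(path, names, dtype, total_rows, chunk_rows):
    """Read a headerless CSV in row chunks and concatenate the results."""
    # Imported lazily so row_chunks can be exercised on a machine without a GPU.
    import cudf

    parts = [
        cudf.read_csv(path, names=names, dtype=dtype, skiprows=skip, nrows=n)
        for skip, n in row_chunks(total_rows, chunk_rows)
    ]
    return cudf.concat(parts, ignore_index=True)


# Hypothetical usage inside rapids-higgs.py:
# data = read_csv_in_chunks(
#     data_dir + "HIGGS.csv", col_names, dtypes_ls,
#     total_rows=11_000_000, chunk_rows=1_000_000,
# )
```

Each chunk still has to skip over the preceding rows, so this is slower than a single read. Its main use here would be narrowing down whether the failure depends on how many rows are parsed in one call.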