[AWS] SageMaker: Hyperparameter Optimization example fails on ml.p3.2xlarge: "std::bad_alloc: out_of_memory" #523

Open
jameslamb opened this issue Feb 12, 2025 · 0 comments
Labels: bug (Something isn't working), cloud/aws (Amazon Web Service cloud)
Description

Tried running this example on SageMaker today: "Deep Dive into Running Hyper Parameter Optimization on AWS SageMaker" (link in "Reproducible Example" below).

Using RAPIDS 25.02, the first training job fails like this when using ml.p3.2xlarge instances:

xgboost.core.XGBoostError: [02:14:02] /home/conda/feedstock_root/build_artifacts/xgboost-split_1733189075592/work/src/c_api/../common/device_helpers.cuh:393: Memory allocation error on worker 0: std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp

more traceback (click me)
Traceback (most recent call last):
  File "/opt/ml/code/train.py", line 75, in <module>
    train()
  File "/opt/ml/code/train.py", line 43, in train
    trained_model = ml_workflow.fit(X_train, y_train)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ml/code/MLWorkflow.py", line 90, in timed_execution_wrapper
    result = target_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ml/code/workflows/MLWorkflowMultiGPU.py", line 156, in fit
    xgboost_output = xgboost.dask.train(
                     ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 726, in inner_f
    return func(**kwargs)
           ^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 1090, in train
    return client.sync(
           ^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/distributed/utils.py", line 363, in sync
    return sync(
           ^^^^^
  File "/opt/conda/lib/python3.12/site-packages/distributed/utils.py", line 439, in sync
    raise error
  File "/opt/conda/lib/python3.12/site-packages/distributed/utils.py", line 413, in f
    result = yield future
             ^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
            ^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 1026, in _train_async
    result = await map_worker_partitions(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 551, in map_worker_partitions
    result = await client.compute(fut).result()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/distributed/client.py", line 410, in _result
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 535, in <lambda>
    lambda *args, **kwargs: [func(*args, **kwargs)],
    ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 981, in dispatched_train
    Xy, evals = _get_dmatrices(
    ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 923, in _get_dmatrices
    Xy = _dmatrix_from_list_of_parts(**train_ref, nthread=n_threads)
      ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 830, in _dmatrix_from_list_of_parts
    return _create_dmatrix(**kwargs)
      ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 815, in _create_dmatrix
    dmatrix = DMatrix(
      ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 726, in inner_f
    return func(**kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 890, in __init__
    self.set_info(
  File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 726, in inner_f
    return func(**kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 954, in set_info
    self.set_label(label)
    ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 1092, in set_label
    dispatch_meta_backend(self, label, "label", "float")
    ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/data.py", line 1355, in dispatch_meta_backend
    _meta_from_cupy_array(data, name, handle)
    ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/data.py", line 1312, in _meta_from_cupy_array
    _check_call(_LIB.XGDMatrixSetInfoFromInterface(handle, c_str(field), interface))
      ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 284, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
    ^^^^^^^^^^^^^^^^^
2025-02-11 20:16:47 Starting - Starting the training job...
2025-02-11 20:17:10 Starting - Preparing the instances for training.
... Downloading - Downloading input data.
2025-02-11 20:22:20 Training - Training image download completed. Training in progress.
2025-02-11 20:22:20 Uploading - Uploading generated training model
@ entrypoint -> launching training script
Traceback (most recent call last):
  File "/opt/ml/code/rapids-higgs.py", line 58, in <module>
    main(args)
  File "/opt/ml/code/rapids-higgs.py", line 20, in main
    data = cudf.read_csv(data_dir + "HIGGS.csv", names=col_names, dtype=dtypes_ls)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/cudf/io/csv.py", line 257, in read_csv
    table_w_meta = plc.io.csv.read_csv(options)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "csv.pyx", line 631, in pylibcudf.io.csv.read_csv
  File "csv.pyx", line 649, in pylibcudf.io.csv.read_csv
RuntimeError: copy_if failed on 2nd step: cudaErrorIllegalAddress: an illegal memory access was encountered

Reproducible Example

Follow https://docs.rapids.ai/deployment/nightly/examples/rapids-sagemaker-hpo/notebook/, using ml.p3.2xlarge instances for the training jobs.
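
For reference, a rough sketch of how a single training job gets launched against that instance type (the image URI, role, and S3 path are placeholders, not the exact values from the notebook):

```python
# Minimal sketch of launching one training job the way the HPO example does.
# The image URI, role ARN, and S3 path below are placeholders; the real values
# come from the linked notebook.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="<account>.dkr.ecr.us-east-1.amazonaws.com/rapids-sagemaker-hpo:latest",  # placeholder
    role="<SageMaker execution role ARN>",  # placeholder
    instance_type="ml.p3.2xlarge",  # 1x V100 (16GB GPU memory) -- the failing configuration
    instance_count=1,
    sagemaker_session=session,
)

# The job starts, then fails inside xgboost.dask.train() with
# "std::bad_alloc: out_of_memory" (full traceback above).
estimator.fit({"training": "s3://<bucket>/<dataset-prefix>/"})  # placeholder S3 path
```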

Notes

This notebook was not tested with RAPIDS 24.10 or 24.12 because of #443.

I'm not sure for how many RAPIDS releases this has been a problem. The issue might be as simple as "you need to use an instance with more memory".

The example does recommend using an ml.p3.8xlarge, but I could not get any in us-east-1 tonight (got errors saying "no capacity, retrying").

"recommended instance type : ml.p3.8xlarge \n",
"instance details : 4x GPUs [ V100 ], 64GB GPU memory, 244GB CPU memory\n"
