[AWS] SageMaker: Hyperparameter Optimization example fails on ml.p3.2xlarge: "std::bad_alloc: out_of_memory" #523

Open
jameslamb opened this issue Feb 12, 2025 · 0 comments
Labels: bug (Something isn't working), cloud/aws (Amazon Web Service cloud)
Description

Tried running this example on SageMaker today: "Deep Dive into Running Hyper Parameter Optimization on AWS SageMaker" (link in "Reproducible Example" below).

Using RAPIDS 25.02, the first training job fails like this when using ml.p3.2xlarge instances:

xgboost.core.XGBoostError: [02:14:02] /home/conda/feedstock_root/build_artifacts/xgboost-split_1733189075592/work/src/c_api/../common/device_helpers.cuh:393: Memory allocation error on worker 0: std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp

more traceback (click me)
Traceback (most recent call last):
  File "/opt/ml/code/train.py", line 75, in <module>
    train()
  File "/opt/ml/code/train.py", line 43, in train
    trained_model = ml_workflow.fit(X_train, y_train)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ml/code/MLWorkflow.py", line 90, in timed_execution_wrapper
    result = target_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ml/code/workflows/MLWorkflowMultiGPU.py", line 156, in fit
    xgboost_output = xgboost.dask.train(
                     ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 726, in inner_f
    return func(**kwargs)
           ^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 1090, in train
    return client.sync(
           ^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/distributed/utils.py", line 363, in sync
    return sync(
           ^^^^^
  File "/opt/conda/lib/python3.12/site-packages/distributed/utils.py", line 439, in sync
    raise error
  File "/opt/conda/lib/python3.12/site-packages/distributed/utils.py", line 413, in f
    result = yield future
             ^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
            ^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 1026, in _train_async
    result = await map_worker_partitions(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 551, in map_worker_partitions
    result = await client.compute(fut).result()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/distributed/client.py", line 410, in _result
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 535, in <lambda>
    lambda *args, **kwargs: [func(*args, **kwargs)],
    ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 981, in dispatched_train
    Xy, evals = _get_dmatrices(
    ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 923, in _get_dmatrices
    Xy = _dmatrix_from_list_of_parts(**train_ref, nthread=n_threads)
      ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 830, in _dmatrix_from_list_of_parts
    return _create_dmatrix(**kwargs)
      ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 815, in _create_dmatrix
    dmatrix = DMatrix(
      ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 726, in inner_f
    return func(**kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 890, in __init__
    self.set_info(
  File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 726, in inner_f
    return func(**kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 954, in set_info
    self.set_label(label)
    ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 1092, in set_label
    dispatch_meta_backend(self, label, "label", "float")
    ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/data.py", line 1355, in dispatch_meta_backend
    _meta_from_cupy_array(data, name, handle)
    ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/data.py", line 1312, in _meta_from_cupy_array
    _check_call(_LIB.XGDMatrixSetInfoFromInterface(handle, c_str(field), interface))
      ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 284, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
    ^^^^^^^^^^^^^^^^^
2025-02-11 20:16:47 Starting - Starting the training job...
2025-02-11 20:17:10 Starting - Preparing the instances for training.
... Downloading - Downloading input data.
2025-02-11 20:22:20 Training - Training image download completed. Training in progress.
2025-02-11 20:22:20 Uploading - Uploading generated training model
@ entrypoint -> launching training script
Traceback (most recent call last):
  File "/opt/ml/code/rapids-higgs.py", line 58, in <module>
    main(args)
  File "/opt/ml/code/rapids-higgs.py", line 20, in main
    data = cudf.read_csv(data_dir + "HIGGS.csv", names=col_names, dtype=dtypes_ls)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/cudf/io/csv.py", line 257, in read_csv
    table_w_meta = plc.io.csv.read_csv(options)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "csv.pyx", line 631, in pylibcudf.io.csv.read_csv
  File "csv.pyx", line 649, in pylibcudf.io.csv.read_csv
RuntimeError: copy_if failed on 2nd step: cudaErrorIllegalAddress: an illegal memory access was encountered

Reproducible Example

Follow https://docs.rapids.ai/deployment/nightly/examples/rapids-sagemaker-hpo/notebook/, using ml.p3.2xlarge instances for the training jobs.
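
For reference, a rough sketch of how a single training job gets launched against that instance type (the image URI, role, and S3 path are placeholders, not the exact values from the notebook):

```python
# Minimal sketch of launching one training job the way the HPO example does.
# The image URI, role ARN, and S3 path below are placeholders; the real values
# come from the linked notebook.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="<account>.dkr.ecr.us-east-1.amazonaws.com/rapids-sagemaker-hpo:latest",  # placeholder
    role="<SageMaker execution role ARN>",  # placeholder
    instance_type="ml.p3.2xlarge",  # 1x V100 (16GB GPU memory) -- the failing configuration
    instance_count=1,
    sagemaker_session=session,
)

# The job starts, then fails inside xgboost.dask.train() with
# "std::bad_alloc: out_of_memory" (full traceback above).
estimator.fit({"training": "s3://<bucket>/<dataset-prefix>/"})  # placeholder S3 path
```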

Notes

This notebook was not tested with RAPIDS 24.10 or 24.12 because of #443.

I'm not sure for how many RAPIDS releases this has been a problem. The issue might be as simple as "you need to use an instance with more memory".

The example does recommend using an ml.p3.8xlarge, but I could not get any in us-east-1 tonight (got errors saying "no capacity, retrying").

"recommended instance type : ml.p3.8xlarge \n",
"instance details : 4x GPUs [ V100 ], 64GB GPU memory, 244GB CPU memory\n"
