Tried running this example on SageMaker today: "Deep Dive into Running Hyper Parameter Optimization on AWS SageMaker" (link)
Using RAPIDS 25.02, the first training job fails like this when using ml.p3.2xlarge instances:
xgboost.core.XGBoostError: [02:14:02] /home/conda/feedstock_root/build_artifacts/xgboost-split_1733189075592/work/src/c_api/../common/device_helpers.cuh:393: Memory allocation error on worker 0: std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp
More traceback:
Traceback (most recent call last):
File "/opt/ml/code/train.py", line 75, in <module>
train()
File "/opt/ml/code/train.py", line 43, in train
trained_model = ml_workflow.fit(X_train, y_train)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ml/code/MLWorkflow.py", line 90, in timed_execution_wrapper
result = target_function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ml/code/workflows/MLWorkflowMultiGPU.py", line 156, in fit
xgboost_output = xgboost.dask.train(
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 726, in inner_f
return func(**kwargs)
^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 1090, in train
return client.sync(
^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/distributed/utils.py", line 363, in sync
return sync(
^^^^^
File "/opt/conda/lib/python3.12/site-packages/distributed/utils.py", line 439, in sync
raise error
File "/opt/conda/lib/python3.12/site-packages/distributed/utils.py", line 413, in f
result = yield future
^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/tornado/gen.py", line 766, in run
value = future.result()
^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 1026, in _train_async
result = await map_worker_partitions(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 551, in map_worker_partitions
result = await client.compute(fut).result()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/distributed/client.py", line 410, in _result
raise exc.with_traceback(tb)
File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 535, in <lambda>
lambda *args, **kwargs: [func(*args, **kwargs)],
^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 981, in dispatched_train
Xy, evals = _get_dmatrices(
^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 923, in _get_dmatrices
Xy = _dmatrix_from_list_of_parts(**train_ref, nthread=n_threads)
^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 830, in _dmatrix_from_list_of_parts
return _create_dmatrix(**kwargs)
^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 815, in _create_dmatrix
dmatrix = DMatrix(
^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 726, in inner_f
return func(**kwargs)
^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 890, in __init__
self.set_info(
File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 726, in inner_f
return func(**kwargs)
^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 954, in set_info
self.set_label(label)
^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 1092, in set_label
dispatch_meta_backend(self, label, "label", "float")
^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/xgboost/data.py", line 1355, in dispatch_meta_backend
_meta_from_cupy_array(data, name, handle)
^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/xgboost/data.py", line 1312, in _meta_from_cupy_array
_check_call(_LIB.XGDMatrixSetInfoFromInterface(handle, c_str(field), interface))
^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/xgboost/core.py", line 284, in _check_call
raise XGBoostError(py_str(_LIB.XGBGetLastError()))
^^^^^^^^^^^^^^^^^
2025-02-11 20:16:47 Starting - Starting the training job...
2025-02-11 20:17:10 Starting - Preparing the instances for training.
Downloading - Downloading input data.
2025-02-11 20:22:20 Training - Training image download completed. Training in progress.
2025-02-11 20:22:20 Uploading - Uploading generated training model
@ entrypoint -> launching training script
Traceback (most recent call last):
File "/opt/ml/code/rapids-higgs.py", line 58, in <module>
main(args)
File "/opt/ml/code/rapids-higgs.py", line 20, in main
data = cudf.read_csv(data_dir + "HIGGS.csv", names=col_names, dtype=dtypes_ls)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/cudf/io/csv.py", line 257, in read_csv
table_w_meta = plc.io.csv.read_csv(options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "csv.pyx", line 631, in pylibcudf.io.csv.read_csv
File "csv.pyx", line 649, in pylibcudf.io.csv.read_csv
RuntimeError: copy_if failed on 2nd step: cudaErrorIllegalAddress: an illegal memory access was encountered
Reproducible Example
Follow https://docs.rapids.ai/deployment/nightly/examples/rapids-sagemaker-hpo/notebook/, using ml.p3.2xlarge instances for the training jobs.
Notes
This notebook was not tested with RAPIDS 24.10 or 24.12, because of #443
I'm not sure how many RAPIDS releases this has been a problem for. The issue might be as simple as "you need to use an instance with more memory".
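As a rough sanity check on the "more memory" theory (assuming HIGGS.csv here is the standard UCI HIGGS dataset: 11 million rows, 1 label plus 28 feature columns), the raw data alone is small, but peak usage during CSV parsing and DMatrix construction is a multiple of that:

```python
# Back-of-envelope GPU memory estimate for the HIGGS example.
# Assumes the standard UCI HIGGS dataset: 11M rows, 1 label + 28 features.
rows = 11_000_000
cols = 29
raw_bytes = rows * cols * 4  # float32
raw_gib = raw_bytes / 2**30
print(f"raw float32 footprint: {raw_gib:.2f} GiB")  # ~1.19 GiB

# read_csv holds the text buffer plus the parsed table at the same time,
# and xgboost keeps the input data alongside the DMatrix/quantile
# structures it builds, so peak usage can be several times the raw size --
# tight on the single 16 GiB V100 of an ml.p3.2xlarge, especially when
# HPO runs several trials against the same GPU.
```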
The example does recommend using an ml.p3.8xlarge, but I could not get any in us-east-1 tonight (got errors saying "no capacity, retrying").
deployment/source/examples/rapids-sagemaker-hpo/notebook.ipynb (lines 468 to 469 in e3c471f)
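When capacity for the recommended instance type is unavailable, one workaround is to fall back through a list of alternatives before constructing the estimator. A minimal sketch (the helper, the candidate list, and the fallback order are all my assumptions, not from the notebook; the result would be passed as `instance_type` to `sagemaker.estimator.Estimator`):

```python
# Hypothetical helper: pick the first instance type for which capacity
# could be obtained. The candidate list below is an assumption.
PREFERRED = ["ml.p3.8xlarge", "ml.g5.12xlarge", "ml.p3.2xlarge"]

def pick_instance_type(available, candidates=PREFERRED):
    """Return the first candidate that appears in `available`."""
    for candidate in candidates:
        if candidate in available:
            return candidate
    raise RuntimeError("no candidate instance type available")

# estimator = sagemaker.estimator.Estimator(
#     image_uri=image_uri, role=role, instance_count=1,
#     instance_type=pick_instance_type(available_types), ...)
```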