version 1.5.1
Python 3.8.10 conda environment with a python venv installation, set up per the AMPL README.md installation instructions
test_filesystem_perf_results.py fails on an Azure NC6s_v2 compute instance with a CUDA out-of-memory error
NC6s_v2 is a 1-GPU, 6-core SKU with 16GB GPU memory.
This appears to be an issue with garbage collection and/or a failure to release the CUDA cache in one or more of the tests and/or model_wrapper when they run within a single pytest process. Running test_AttentiveFP_results standalone, or running the suite with the --forked flag (using the pytest-forked plugin), succeeds.
The three cases below illustrate the behavior; a sketch of a possible workaround follows case (3).
(1) When the 8 tests in test_filesystem_perf_results.py are run in a single process with default pytest options (i.e., running pytest from within the compare_models/ directory), the run "hangs".
(2) When pytest is run with the -x flag, the run fails with a CUDA out-of-memory error, as shown below.
(3) When pytest is run with --forked (using the pytest-forked plugin), all tests pass.
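If the root cause is indeed CUDA memory that is never released between tests, one possible workaround (a minimal sketch only; the conftest.py placement and the fixture name release_cuda_memory are assumptions, not existing AMPL code) is an autouse pytest fixture that forces garbage collection and empties the PyTorch caching allocator after each test:

```python
# conftest.py (hypothetical, placed in compare_models/) -- sketch of a possible
# workaround, assuming the leaked memory is held by reference cycles or by the
# CUDA caching allocator rather than by live references in the test module.
import gc

import pytest
import torch


@pytest.fixture(autouse=True)
def release_cuda_memory():
    """Run each test, then reclaim unreachable objects and return the CUDA
    caching allocator's unused blocks so the next test starts clean."""
    yield
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```

This only helps if the memory is actually reclaimable; if an earlier test keeps a live reference to its model, torch.cuda.empty_cache() will not return that memory to the driver.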
# pytest -x invocation
(atomsci) (azureml_py38_PT_TF) azureuser@nc6sv2-dev2:~/localfiles/AMPL/atomsci/ddm/test/integrative/compare_models$ pytest -x
======================================== test session starts =========================================
platform linux -- Python 3.8.10, pytest-6.2.1, py-1.11.0, pluggy-0.13.1
rootdir: /home/azureuser/localfiles/AMPL/atomsci/ddm/test, configfile: pytest.ini
plugins: forked-1.6.0
collected 8 items
test_filesystem_perf_results.py ...F
============================================== FAILURES ==============================================
______________________________________ test_AttentiveFP_results ______________________________________
def test_AttentiveFP_results():
clean()
H1_curate()
json_f = 'jsons/reg_config_H1_fit_AttentiveFPModel.json'
> df1, df2, model_info = all_similar_tests(json_f, 'H1')
test_filesystem_perf_results.py:188:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test_filesystem_perf_results.py:107: in all_similar_tests
train_and_predict(json_f, prefix=prefix)
../delaney_Panel/test_delaney_panel.py:192: in train_and_predict
model.train_model()
../../../pipeline/model_pipeline.py:573: in train_model
self.model_wrapper = model_wrapper.create_model_wrapper(self.params, self.featurization, self.ds_client)
../../../pipeline/model_wrapper.py:221: in create_model_wrapper
return PytorchDeepChemModelWrapper(params, featurizer, ds_client)
../../../pipeline/model_wrapper.py:2330: in __init__
self.model = self.recreate_model()
../../../pipeline/model_wrapper.py:2355: in recreate_model
model = chosen_model(
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/deepchem/models/torch_models/attentivefp.py:276: in __init__
super(AttentiveFPModel, self).__init__(
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/deepchem/models/torch_models/torch_model.py:198: in __init__
self.model = model.to(device)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:907: in to
return self._apply(convert)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:578: in _apply
module._apply(fn)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:578: in _apply
module._apply(fn)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:578: in _apply
module._apply(fn)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:578: in _apply
module._apply(fn)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:578: in _apply
module._apply(fn)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:601: in _apply
param_applied = fn(param)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:905: in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
def _lazy_init():
global _initialized, _queued_calls
if is_initialized() or hasattr(_tls, 'is_initializing'):
return
with _initialization_lock:
# We be double-checked locking, boys! This is OK because
# the above test was GIL protected anyway. The inner test
# is for when a thread blocked on some other thread which was
# doing the initialization; when they get the lock, they will
# find there is nothing left to do.
if is_initialized():
return
# It is important to prevent other threads from entering _lazy_init
# immediately, while we are still guaranteed to have the GIL, because some
# of the C calls we make below will release the GIL
if _is_in_bad_fork():
raise RuntimeError(
"Cannot re-initialize CUDA in forked subprocess. To use CUDA with ""multiprocessing, you must use the 'spawn' start method")
if not hasattr(torch._C, '_cuda_getDeviceCount'):
raise AssertionError("Torch not compiled with CUDA enabled")
if _cudart is None:
raise AssertionError(
"libcudart functions unavailable. It looks like you have a broken build?")
# This function throws if there's a driver initialization error, no GPUs
# are found or any other error occurs
> torch._C._cuda_init()
E RuntimeError: CUDA error: out of memory
E CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
E For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/cuda/__init__.py:216: RuntimeError
---------------------------------------- Captured stdout call ----------------------------------------
num_model_tasks is deprecated and its value is ignored.
========================================== warnings summary ==========================================
integrative/compare_models/test_filesystem_perf_results.py::test_RF_results
integrative/compare_models/test_filesystem_perf_results.py::test_RF_results
integrative/compare_models/test_filesystem_perf_results.py::test_XGB_results
integrative/compare_models/test_filesystem_perf_results.py::test_XGB_results
/home/azureuser/localfiles/AMPL/atomsci/ddm/pipeline/featurization.py:1730: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
calc_smiles_feat_df[col] = calc_desc_df[col]
integrative/compare_models/test_filesystem_perf_results.py::test_RF_results
integrative/compare_models/test_filesystem_perf_results.py::test_RF_results
integrative/compare_models/test_filesystem_perf_results.py::test_XGB_results
integrative/compare_models/test_filesystem_perf_results.py::test_XGB_results
/home/azureuser/localfiles/AMPL/atomsci/ddm/pipeline/transformations.py:255: RuntimeWarning: invalid value encountered in true_divide
X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)
integrative/compare_models/test_filesystem_perf_results.py::test_NN_results
/home/azureuser/localfiles/AMPL/atomsci/ddm/pipeline/model_wrapper.py:2592: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
chkpt_dict = yaml.load(chkpt_in.read())
-- Docs: https://docs.pytest.org/en/stable/warnings.html
====================================== short test summary info =======================================
FAILED test_filesystem_perf_results.py::test_AttentiveFP_results - RuntimeError: CUDA error: out of...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
======================== 1 failed, 3 passed, 9 warnings in 319.56s (0:05:19) =========================
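To confirm whether earlier tests leak GPU memory into later ones, a simple diagnostic (a hypothetical helper, not part of test_filesystem_perf_results.py) is to print the PyTorch allocator statistics at the end of each test:

```python
# Hypothetical diagnostic helper: report how much CUDA memory is still
# allocated/reserved after a test finishes, to see whether usage grows
# monotonically across the 8 tests in test_filesystem_perf_results.py.
import torch


def report_gpu_memory(label):
    """Print allocated vs. reserved CUDA memory (MB) tagged with a test label."""
    if not torch.cuda.is_available():
        return
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"{label}: allocated={allocated:.1f} MB, reserved={reserved:.1f} MB")
```

For example, calling report_gpu_memory('test_NN_results') at the end of each test would show whether reserved memory keeps climbing until the AttentiveFP test fails during CUDA initialization.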