
integrative test test_filesystem_perf_results.py::test_AttentiveFP_results fails with GPU memory error on AzureML NC6sv2 compute SKU #217

Open
bmoxon opened this issue Apr 5, 2023 · 1 comment

bmoxon commented Apr 5, 2023

AMPL version 1.5.1
Python 3.8.10 conda environment, with a Python venv installation as per the AMPL README.md installation instructions

test_filesystem_perf_results.py fails on an Azure NC6s_v2 compute instance with a CUDA out-of-memory error.
NC6s_v2 is a 1-GPU, 6-core SKU with 16 GB of GPU memory.

This appears to be an issue with garbage collection and/or a failure to release the CUDA cache in one or more of the tests and/or model_wrapper when they run within a single pytest process. Running the test_AttentiveFP_results test standalone, or with the --forked flag (using the pytest-forked plugin), succeeds.
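One way to check that hypothesis (a sketch only, not verified on this SKU; the fixture name and its placement in a compare_models/ conftest.py are assumptions) is an autouse fixture that logs how much CUDA memory is still held after each test. If allocated/reserved memory climbs across the earlier tests, the single-process run is accumulating GPU memory before test_AttentiveFP_results starts.

# hypothetical compare_models/conftest.py snippet -- per-test CUDA memory report
import gc

import pytest
import torch

@pytest.fixture(autouse=True)
def report_cuda_memory(request):
    yield  # run the test body first
    if torch.cuda.is_available():
        gc.collect()  # drop unreachable objects before measuring
        allocated_mib = torch.cuda.memory_allocated() / 2**20
        reserved_mib = torch.cuda.memory_reserved() / 2**20
        print(f"{request.node.name}: CUDA allocated={allocated_mib:.0f} MiB, "
              f"reserved={reserved_mib:.0f} MiB")

Run with pytest -s so the per-test prints are not swallowed by output capture.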

The three cases below illustrate the behavior:

(1) When the 8 tests in test_filesystem_perf_results.py are run in a single process with a default pytest invocation (i.e. from within the compare_models/ directory), the test run "hangs".

(2) When pytest is run with the -x flag, the test fails with a CUDA out-of-memory error, as shown below.

(3) When pytest is run with --forked (using the pytest-forked plugin), all 8 tests pass, as shown below.

# standard pytest invocation
(atomsci) (azureml_py38_PT_TF) azureuser@nc6sv2-dev2:~/localfiles/AMPL/atomsci/ddm/test/integrative/compare_models$ pytest
======================================== test session starts =========================================
platform linux -- Python 3.8.10, pytest-6.2.1, py-1.11.0, pluggy-0.13.1
rootdir: /home/azureuser/localfiles/AMPL/atomsci/ddm/test, configfile: pytest.ini
plugins: forked-1.6.0
collected 8 items                                                                                    

test_filesystem_perf_results.py ...F<hangs here>
# pytest -x invocation
(atomsci) (azureml_py38_PT_TF) azureuser@nc6sv2-dev2:~/localfiles/AMPL/atomsci/ddm/test/integrative/compare_models$ pytest -x
======================================== test session starts =========================================
platform linux -- Python 3.8.10, pytest-6.2.1, py-1.11.0, pluggy-0.13.1
rootdir: /home/azureuser/localfiles/AMPL/atomsci/ddm/test, configfile: pytest.ini
plugins: forked-1.6.0
collected 8 items                                                                                    

test_filesystem_perf_results.py ...F

============================================== FAILURES ==============================================
______________________________________ test_AttentiveFP_results ______________________________________

    def test_AttentiveFP_results():
        clean()
        H1_curate()
        json_f = 'jsons/reg_config_H1_fit_AttentiveFPModel.json'
    
>       df1, df2, model_info = all_similar_tests(json_f, 'H1')

test_filesystem_perf_results.py:188: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test_filesystem_perf_results.py:107: in all_similar_tests
    train_and_predict(json_f, prefix=prefix)
../delaney_Panel/test_delaney_panel.py:192: in train_and_predict
    model.train_model()
../../../pipeline/model_pipeline.py:573: in train_model
    self.model_wrapper = model_wrapper.create_model_wrapper(self.params, self.featurization, self.ds_client)
../../../pipeline/model_wrapper.py:221: in create_model_wrapper
    return PytorchDeepChemModelWrapper(params, featurizer, ds_client)
../../../pipeline/model_wrapper.py:2330: in __init__
    self.model = self.recreate_model()
../../../pipeline/model_wrapper.py:2355: in recreate_model
    model = chosen_model(
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/deepchem/models/torch_models/attentivefp.py:276: in __init__
    super(AttentiveFPModel, self).__init__(
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/deepchem/models/torch_models/torch_model.py:198: in __init__
    self.model = model.to(device)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:907: in to
    return self._apply(convert)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:578: in _apply
    module._apply(fn)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:578: in _apply
    module._apply(fn)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:578: in _apply
    module._apply(fn)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:578: in _apply
    module._apply(fn)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:578: in _apply
    module._apply(fn)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:601: in _apply
    param_applied = fn(param)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:905: in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def _lazy_init():
        global _initialized, _queued_calls
        if is_initialized() or hasattr(_tls, 'is_initializing'):
            return
        with _initialization_lock:
            # We be double-checked locking, boys!  This is OK because
            # the above test was GIL protected anyway.  The inner test
            # is for when a thread blocked on some other thread which was
            # doing the initialization; when they get the lock, they will
            # find there is nothing left to do.
            if is_initialized():
                return
            # It is important to prevent other threads from entering _lazy_init
            # immediately, while we are still guaranteed to have the GIL, because some
            # of the C calls we make below will release the GIL
            if _is_in_bad_fork():
                raise RuntimeError(
                    "Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
                    "multiprocessing, you must use the 'spawn' start method")
            if not hasattr(torch._C, '_cuda_getDeviceCount'):
                raise AssertionError("Torch not compiled with CUDA enabled")
            if _cudart is None:
                raise AssertionError(
                    "libcudart functions unavailable. It looks like you have a broken build?")
            # This function throws if there's a driver initialization error, no GPUs
            # are found or any other error occurs
>           torch._C._cuda_init()
E           RuntimeError: CUDA error: out of memory
E           CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
E           For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/cuda/__init__.py:216: RuntimeError
---------------------------------------- Captured stdout call ----------------------------------------
num_model_tasks is deprecated and its value is ignored.
========================================== warnings summary ==========================================
integrative/compare_models/test_filesystem_perf_results.py::test_RF_results
integrative/compare_models/test_filesystem_perf_results.py::test_RF_results
integrative/compare_models/test_filesystem_perf_results.py::test_XGB_results
integrative/compare_models/test_filesystem_perf_results.py::test_XGB_results
  /home/azureuser/localfiles/AMPL/atomsci/ddm/pipeline/featurization.py:1730: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
    calc_smiles_feat_df[col] = calc_desc_df[col]

integrative/compare_models/test_filesystem_perf_results.py::test_RF_results
integrative/compare_models/test_filesystem_perf_results.py::test_RF_results
integrative/compare_models/test_filesystem_perf_results.py::test_XGB_results
integrative/compare_models/test_filesystem_perf_results.py::test_XGB_results
  /home/azureuser/localfiles/AMPL/atomsci/ddm/pipeline/transformations.py:255: RuntimeWarning: invalid value encountered in true_divide
    X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

integrative/compare_models/test_filesystem_perf_results.py::test_NN_results
  /home/azureuser/localfiles/AMPL/atomsci/ddm/pipeline/model_wrapper.py:2592: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
    chkpt_dict = yaml.load(chkpt_in.read())

-- Docs: https://docs.pytest.org/en/stable/warnings.html
====================================== short test summary info =======================================
FAILED test_filesystem_perf_results.py::test_AttentiveFP_results - RuntimeError: CUDA error: out of...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
======================== 1 failed, 3 passed, 9 warnings in 319.56s (0:05:19) =========================
# pytest --forked invocation

(atomsci) (azureml_py38_PT_TF) azureuser@nc6sv2-dev2:~/localfiles/AMPL/atomsci/ddm/test/integrative/compare_models$ pytest --forked
======================================== test session starts =========================================
platform linux -- Python 3.8.10, pytest-6.2.1, py-1.11.0, pluggy-0.13.1
rootdir: /home/azureuser/localfiles/AMPL/atomsci/ddm/test, configfile: pytest.ini
plugins: forked-1.6.0
collected 8 items                                                                                    

test_filesystem_perf_results.py ........                                                       [100%]

=================================== 8 passed in 443.37s (0:07:23) ====================================
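For environments where pytest-forked is not an option, a possible in-process mitigation (a sketch only, assuming the retained memory is held by unreachable objects and PyTorch's caching allocator rather than pinned by live module-level references) would be a cleanup fixture along these lines:

# hypothetical conftest.py cleanup fixture -- alternative to running with --forked
import gc

import pytest
import torch

@pytest.fixture(autouse=True)
def release_cuda_memory():
    yield  # run the test first
    gc.collect()  # free unreachable model/DeepChem objects
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached blocks to the driver

Note that torch.cuda.empty_cache() only releases memory PyTorch has cached but no longer uses; if something in the pipeline keeps a live reference to earlier models, that memory stays allocated and process isolation via --forked remains the reliable workaround.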
@stewarthe6 (Collaborator) commented

Hi, we're unable to test this directly, but we are looking into scaling down our tests. That may resolve the issue. Thank you for the feedback.
