Describe the bug
While working on #490, I found that if I have `bitsandbytes` installed in a GPU-enabled environment, I get an error when running `test_adaptive_compression`, which happens to be the only test that uses `TrainingAverager` under the hood.
I dug into it a bit, and the failure seems to be caused by a `CUDA error: initialization error` raised by PyTorch, which AFAIK shows up when we try to initialize the CUDA context twice. More specifically, it appears while the optimizer state is being initialized in `TrainingAverager`. My guess is that the context is created once when importing `bitsandbytes` and then again when using something (anything?) from GPU-enabled PyTorch later. We are sunsetting support for `TrainingAverager` anyway, but it's not obvious to me how to correctly migrate this test away from that class.
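For reference, here is a minimal, hypothetical sketch (not the actual hivemind code path) of the most common way I know of to trigger this exact PyTorch error: the CUDA context gets created in a parent process, and a forked child process then tries to use CUDA. Whether this is really what `bitsandbytes` + `TrainingAverager` do together is exactly the open question above.

```python
# Hypothetical repro of the error class, NOT the TrainingAverager code path:
# forking after the parent has initialized CUDA makes the child's CUDA calls
# fail with "CUDA error: initialization error".
import multiprocessing as mp

import torch


def child_uses_cuda():
    # The forked child inherits a CUDA context it cannot actually use.
    x = torch.ones(4, device="cuda")  # RuntimeError: CUDA error: initialization error
    print(x.sum())


if __name__ == "__main__":
    _ = torch.zeros(1, device="cuda")  # initializes the CUDA context in the parent
    p = mp.get_context("fork").Process(target=child_uses_cuda)
    p.start()
    p.join()
```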
To Reproduce
Install the environment on a GPU-enabled system and run `CUDA_LAUNCH_BLOCKING=1 pytest -s --full-trace tests/test_compression.py`. Then uninstall `bitsandbytes`, comment out the parts of `test_compression.py` that rely on it (mostly `test_tensor_compression`), and run the same command.
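For clarity, the two runs side by side (the `pip uninstall` line is my assumption about how the package was installed; the pytest command is verbatim from above):

```bash
# Run 1: bitsandbytes installed -> test_adaptive_compression fails
CUDA_LAUNCH_BLOCKING=1 pytest -s --full-trace tests/test_compression.py

# Run 2: bitsandbytes removed and the parts of tests/test_compression.py that
# depend on it (mostly test_tensor_compression) commented out -> passes
pip uninstall -y bitsandbytes
CUDA_LAUNCH_BLOCKING=1 pytest -s --full-trace tests/test_compression.py
```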
Environment