Getting RuntimeError: CUDA error: an illegal memory access was encountered with 3090s #4

murtaza-nasir · 2024-04-15T08:28:47Z

NVIDIA Open GPU Kernel Modules Version

this one

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 22.04.4 LTS

Kernel Release

6.5.0-27-generic

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

I am running on a stable kernel release.

Hardware: GPU

all (4x 3090)

Describe the bug

I installed this driver, and torch.cuda.can_device_access_peer(a, b) returns True for every pair of GPUs.
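For anyone who wants to repeat that check, here is a minimal sketch that queries every ordered pair of GPUs. The helper name `peer_access_report` is made up for illustration; only `torch.cuda.can_device_access_peer`, `torch.cuda.device_count` and `torch.cuda.is_available` come from PyTorch. Note that a True result only means the driver reports P2P capability. As this issue shows, transfers can still fail afterwards.

```python
from itertools import permutations
from typing import Callable, Dict, Tuple


def peer_access_report(
    num_devices: int,
    can_access: Callable[[int, int], bool],
) -> Dict[Tuple[int, int], bool]:
    """Return {(src, dst): bool} for every ordered pair of distinct devices.

    `can_access` is injected so the helper can be exercised without a GPU.
    """
    return {
        (src, dst): bool(can_access(src, dst))
        for src, dst in permutations(range(num_devices), 2)
    }


if __name__ == "__main__":
    try:
        import torch
    except ImportError:  # PyTorch is optional for this sketch
        torch = None

    if torch is not None and torch.cuda.is_available():
        report = peer_access_report(
            torch.cuda.device_count(), torch.cuda.can_device_access_peer
        )
        for (src, dst), ok in sorted(report.items()):
            print(f"GPU{src} -> GPU{dst}: {'OK' if ok else 'NO'}")
    else:
        print("PyTorch with CUDA is not available on this machine.")
```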

I get the following error when textgenwebui tries to load a model:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
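The error text suggests CUDA_LAUNCH_BLOCKING=1, which makes kernel launches synchronous so the stack trace points closer to the failing call. A rough sketch of launching a program with that variable set follows. The helper names and the `server.py` script in the usage comment are hypothetical and not part of any of the tools mentioned here.

```python
import os
import subprocess
from typing import Dict, List, Optional


def blocking_cuda_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set.

    The input mapping is not modified.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking_launches(argv: List[str]) -> int:
    """Run `argv` as a child process with synchronous CUDA kernel launches."""
    return subprocess.run(argv, env=blocking_cuda_env()).returncode


# Usage (server.py is a hypothetical entry point):
# run_with_blocking_launches(["python", "server.py"])
```

The variable has to be in the environment before the CUDA context is created, which is why the sketch sets it for a child process rather than changing it inside an already running one.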

Aphrodite also crashes when loading any model.

To Reproduce

I installed this driver on Ubuntu.

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

No response

dbateyko · 2024-04-15T20:55:12Z

FWIW, others are experiencing similar problems. I'm seeing the same error in text-generation-webui at inference (using the ExLlamav2 model loader) after enabling Resizable BAR on two 3090s, but before installing this driver. Could it be a problem in text-generation-webui? In any case, I'm following for a solution.

murtaza-nasir · 2024-04-15T21:27:30Z

> FWIW, others are experiencing similar problems. I'm seeing the same error in text-generation-webui at inference (using the ExLlamav2 model loader) after enabling Resizable BAR on two 3090s, but before installing this driver. Could it be a problem in text-generation-webui? In any case, I'm following for a solution.

If you're referring to that post on y-combinator, that is me. I got this error after installing this driver.

geohot · 2024-04-15T21:43:43Z

This is only tested on 4090s, no idea if it works on anything else.

Though if you don't have large BAR on your 3090s, I can confirm it won't work.

murtaza-nasir · 2024-04-15T22:38:54Z

> This is only tested on 4090s, no idea if it works on anything else.
>
> Though if you don't have large BAR on your 3090s, I can confirm it won't work.

I checked with lspci and all my GPUs show the 32G line, so I'm not sure why I'm getting this error. I'm on a fresh Ubuntu install. IOMMU isn't enabled in the Ubuntu GRUB settings, but I don't think I have disabled it in my BIOS yet. I'll try that and see whether it is the problem.

Edit: I disabled IOMMU in the BIOS but still see this error.
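The lspci check mentioned above can be scripted. The sketch below scans lspci-style text for `[size=...]` entries and flags a large BAR. The sample text is a made-up excerpt in the style of `lspci -v` output, not taken from this issue, and the 24 GiB threshold is an assumption based on the 3090's 24 GiB of memory. In practice the text would come from running `lspci -v` (often as root) on the machine in question.

```python
import re
from typing import List

# Matches entries such as "[size=32G]" or "[size=128]" (no unit means bytes).
_SIZE_RE = re.compile(r"\[size=(\d+)([KMGT]?)\]")
_UNIT = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def region_sizes(lspci_text: str) -> List[int]:
    """Return every region size found in the text, in bytes."""
    return [int(n) * _UNIT[u] for n, u in _SIZE_RE.findall(lspci_text)]


def has_large_bar(lspci_text: str, min_bytes: int = 24 * 1024**3) -> bool:
    """True if any region is at least `min_bytes` (assumed threshold: 24 GiB)."""
    return any(size >= min_bytes for size in region_sizes(lspci_text))


# Hypothetical excerpt in the style of `lspci -v`; the addresses are made up.
SAMPLE = """\
0a:00.0 VGA compatible controller: NVIDIA Corporation Device
	Memory at f4000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 6000000000 (64-bit, prefetchable) [size=32G]
	Memory at 6800000000 (64-bit, prefetchable) [size=32M]
"""

print(has_large_bar(SAMPLE))
```

Without Resizable BAR the prefetchable region is typically much smaller (for example 256M), in which case `has_large_bar` returns False.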

brthor · 2024-05-14T01:55:53Z

This is working for me with 3090s.

Didn't have to do anything but enable resizable BAR in the bios.

Ensure you have the correct driver version installed.

Low perf here is probably from the motherboard.

nvidia-smi

$ nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:0A:00.0 Off |                  N/A |
| 42%   37C    P0            115W /  350W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off |   00000000:0B:00.0 Off |                  N/A |
| 39%   33C    P0            115W /  350W |       0MiB /  24576MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

$ nvidia-smi topo -p2p rw
        GPU0    GPU1    
 GPU0   X       OK      
 GPU1   OK      X

p2pBandwidthLatencyTest

$ ./cuda-samples/bin/x86_64/linux/release/p2pBandwidthLatencyTest 
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 3090, pciBusID: a, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 3090, pciBusID: b, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

...

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 829.79   6.14 
     1   6.14 831.55 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1 
     0 821.94  13.18 
     1  13.18 832.81

NCCL

$ ./build/all_reduce_perf -b 1G -e 8G -f 2 -g 2

#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
  1073741824     268435456     float     sum      -1   117080    9.17    9.17      0   116952    9.18    9.18      0
  2147483648     536870912     float     sum      -1   234000    9.18    9.18      0   233994    9.18    9.18      0
  4294967296    1073741824     float     sum      -1   468088    9.18    9.18      0   467922    9.18    9.18      0
dce51d9dafe1:992:992 [1] NCCL INFO comm 0x55a73639d730 rank 0 nranks 2 cudaDev 0 busId a000 - Destroy COMPLETE
dce51d9dafe1:992:992 [1] NCCL INFO comm 0x55a7363a3530 rank 1 nranks 2 cudaDev 1 busId b000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 9.17687

vs. NCCL_P2P_DISABLE=1

$ NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 1G -e 8G -f 2 -g 2

...

# Avg bus bandwidth    : 7.17877
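To put the numbers above side by side, here is a small sketch that computes the ratio of P2P-enabled to P2P-disabled bandwidth from the figures posted in this comment. The function name `speedup` is just for illustration.

```python
def speedup(with_p2p: float, without_p2p: float) -> float:
    """Ratio of bandwidth with P2P enabled to bandwidth with it disabled."""
    return with_p2p / without_p2p


# GPU0 <-> GPU1 unidirectional copy, from p2pBandwidthLatencyTest (GB/s).
p2p_copy = speedup(13.18, 6.14)

# Average bus bandwidth from all_reduce_perf, default vs. NCCL_P2P_DISABLE=1.
nccl_allreduce = speedup(9.17687, 7.17877)

print(f"p2pBandwidthLatencyTest: {p2p_copy:.2f}x")
print(f"NCCL all_reduce_perf:    {nccl_allreduce:.2f}x")
```

So on this machine P2P roughly doubles raw copy bandwidth between the two cards, while the NCCL all-reduce gain is closer to 28%.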

t13m · 2024-05-30T07:12:38Z

Hi @brthor, how did you enable large BAR1 on the 3090s? Could you share your method if you don't mind? Or are there any tutorials or instructions anywhere? Thank you!

murtaza-nasir · 2024-05-30T07:23:13Z

> Hi @brthor, how did you enable large BAR1 on the 3090s? Could you share your method if you don't mind? Or are there any tutorials or instructions anywhere? Thank you!

Your GPU will have it if your motherboard supports it and you have it turned on.

t13m · 2024-05-30T07:28:18Z

So you turn it on in the motherboard's BIOS? Which motherboard are you using? Does the GPU vBIOS or firmware need to be updated?

murtaza-nasir · 2024-05-30T07:33:18Z

Yes, you just turn it on in the BIOS. Make sure both Above 4G Decoding and Resizable BAR support are enabled. My TR Zenith II Extreme has the option, and the GPUs show large BAR support. I also have an EPYC Supermicro H12SSL-i that doesn't have a Resizable BAR option in the BIOS, so the 3090s don't show it when checked.

t13m · 2024-05-30T07:36:24Z

It helps a lot, thank you!

brthor · 2024-05-30T18:28:31Z

@t13m First of all, Resizable BAR must be supported in the GPU's vBIOS. That has been the case with the 3090s I have.

If you don't have motherboard support you may be able to use https://github.com/xCuri0/ReBarUEFI

You can also try setting the NVIDIA kernel module option NVreg_EnableResizableBar=1 (module options go in a .conf file under /etc/modprobe.d/), but I didn't have success with this method.

scouzi1966 · 2024-06-05T15:36:46Z

I'm perplexed as to why this isn't more popular. Another question: could I mix a 4090 with a 3090? What would be the drawbacks? I would like to get the benefits of more memory vs more performance. Is performance the only downside in running a 3090/4090 combo?

murtaza-nasir · 2024-06-05T17:43:01Z

> I would like to get the benefits of more memory vs more performance. Is performance the only downside in running a 3090/4090 combo?

Yes, if you have a 4090 and just want more memory, a 3090 will do that. However, you would be stuck at the 3090's performance level. I would personally prefer to have 3x 3090s vs 1x 4090 and 1x 3090.

murtaza-nasir added the bug Something isn't working label Apr 15, 2024

hobodrifterdavid mentioned this issue Jun 28, 2024

Issue building DockerFile Wordcab/wordcab-transcribe#315

Open
