Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hang and crash when disconnecting on Windows #314

Open
ahabersaat opened this issue Jan 24, 2024 · 1 comment
Open

Hang and crash when disconnecting on Windows #314

ahabersaat opened this issue Jan 24, 2024 · 1 comment

Comments

@ahabersaat
Copy link

ahabersaat commented Jan 24, 2024

Hi.

I'm using libsmb2 on a Windows 11 laptop to connect to a Samba drive on a Linux embedded device. The connection happens through a USB3 cable, where the embedded device is detected as a network card on the laptop using RNDIS.

It has been working well so far, with the exception that if I unplug the USB cable during a transfer, then I have a crash when I delete the context later with smb2_destroy_context.

Now I tried to update the lib to the latest commit on master, as my prior version of libsmb2 was from June 2022.

Hang forever

But now with the latest new version, I get an infinite hang if I unplug the cable during a read operation using the sync API:

  • It stays stuck forever in the wait_for_reply function.
  • Adding a timeout does not change anything, because fd is never invalidated to -1 and the function does not return.
  • In smb2_service_fd, the polling raises flags POLLERR and POLLHUP. The POLLHUP block is never reached due to the goto in POLLERR block. However, in both cases I don't see anything that could invalidate the fd there.

Fix

EDIT: After checking the diff since June 2022, I see that smb2_service and smb2_service_fd now return a t_socket type instead of an int. But on Windows, this type is defined as unsigned, so returning -1 and checking for < 0 does not work, causing the function to not return and thus the infinite hang.
The faulty commit seems to be 4b1ef1f6bd09f96a9f0b78accfeac6e2c2766f34.

Crash

EDIT2: With the latest version and the hang fixed, if I unplug the USB cable while it's transferring a file, and then I close the SMB client and call smb2_destroy_context, then I get something looking like memory corruption, causing fully random crashes in the rest of the application. Commenting the line smb2_destroy_context in the destructor of my client fixes these crashes.
Any idea what could go wrong in that case ? This seems very difficult to find and fix (on Windows, without valgrind); I'll try to replace the calloc and free functions and put prints, but the nature of the code with multiple callbacks and generic void* types makes it extremely difficult to debug.

Fix

EDIT3: After two full days of analysis, the commit 0dc1cc795512c808b52ecc83f7cd3e530936655d which creates a special case for the close operation is the cause of my crash.
In my workflow, I read a file and then close it using smb2_close() (sync). However, if the USB network interface is removed during the transfer, the read operation fails, but the smb2_close() is still called to simplify the logic, expecting it to fail as well. In this smb2_close(), the wait_for_reply() call fails with code -1 due to an expected error in smb2_service(), also returning -1. However in that case, the callback remains set somewhere in a PDU that was queued. So calling the free() at the end of smb2_close() frees this memory that is still referenced in a PDU callback.
Then during the call to smb2_destroy_context when destroying the client, the PDUs in the queues are cancelled and all callback are called. But as the one for the close operation was already freed, it crashes !

What could be done to fix these issues? Reverting these 2 commits is maybe not the best fix, as one of them was added to fix a memory leak apparently.
In my case I'll fix locally already, because a leak is better than a crash.

@ahabersaat ahabersaat changed the title Hang when disconnecting on Windows Hang and crash when disconnecting on Windows Jan 25, 2024
@Wolf3s
Copy link
Contributor

Wolf3s commented Feb 6, 2024

Hi.

I'm using libsmb2 on a Windows 11 laptop to connect to a Samba drive on a Linux embedded device. The connection happens through a USB3 cable, where the embedded device is detected as a network card on the laptop using RNDIS.

It has been working well so far, with the exception that if I unplug the USB cable during a transfer, then I have a crash when I delete the context later with smb2_destroy_context.

Now I tried to update the lib to the latest commit on master, as my prior version of libsmb2 was from June 2022.

Hang forever

But now with the latest new version, I get an infinite hang if I unplug the cable during a read operation using the sync API:

  • It stays stuck forever in the wait_for_reply function.
  • Adding a timeout does not change anything, because fd is never invalidated to -1 and the function does not return.
  • In smb2_service_fd, the polling raises flags POLLERR and POLLHUP. The POLLHUP block is never reached due to the goto in POLLERR block. However, in both cases I don't see anything that could invalidate the fd there.

Fix

EDIT: After checking the diff since June 2022, I see that smb2_service and smb2_service_fd now return a t_socket type instead of an int. But on Windows, this type is defined as unsigned, so returning -1 and checking for < 0 does not work, causing the function to not return and thus the infinite hang. The faulty commit seems to be 4b1ef1f6bd09f96a9f0b78accfeac6e2c2766f34.

Crash

EDIT2: With the latest version and the hang fixed, if I unplug the USB cable while it's transferring a file, and then I close the SMB client and call smb2_destroy_context, then I get something looking like memory corruption, causing fully random crashes in the rest of the application. Commenting the line smb2_destroy_context in the destructor of my client fixes these crashes. Any idea what could go wrong in that case ? This seems very difficult to find and fix (on Windows, without valgrind); I'll try to replace the calloc and free functions and put prints, but the nature of the code with multiple callbacks and generic void* types makes it extremely difficult to debug.

Fix

EDIT3: After two full days of analysis, the commit 0dc1cc795512c808b52ecc83f7cd3e530936655d which creates a special case for the close operation is the cause of my crash. In my workflow, I read a file and then close it using smb2_close() (sync). However, if the USB network interface is removed during the transfer, the read operation fails, but the smb2_close() is still called to simplify the logic, expecting it to fail as well. In this smb2_close(), the wait_for_reply() call fails with code -1 due to an expected error in smb2_service(), also returning -1. However in that case, the callback remains set somewhere in a PDU that was queued. So calling the free() at the end of smb2_close() frees this memory that is still referenced in a PDU callback. Then during the call to smb2_destroy_context when destroying the client, the PDUs in the queues are cancelled and all callback are called. But as the one for the close operation was already freed, it crashes !

What could be done to fix these issues? Reverting these 2 commits is maybe not the best fix, as one of them was added to fix a memory leak apparently. In my case I'll fix locally already, because a leak is better than a crash.

Hey there, any news? Because first the commit that i replaced the t_socket was a attenpt to clear a warning also reverting one of them maybe in fact should be the best choice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants