Prevent SEGFAULTs on consecutive exec_command() invocations
#658
base: devel
Congratulations! One of the builds has completed. 🍾 You can install the built RPMs by following these steps:
Please note that the RPMs should be used only in a testing environment. |
1de5bc0 to 251bb6d
@Jakuje it looks like this is making some CI jobs get stuck: https://github.com/ansible/pylibssh/actions/runs/11910858999/job/33210235668?pr=658 / https://github.com/ansible/pylibssh/actions/runs/11910858999/job/33210009382?pr=658.
I restarted said jobs, but this is something to look into, as it'll probably make the CI flakier if merged.
@Jakuje the rawhide test failure log is similar to those in GHA: https://download.copr.fedorainfracloud.org/results/packit/ansible-pylibssh-658/fedora-rawhide-x86_64/08279621-python-ansible-pylibssh/builder-live.log.gz.
OTOH, it's also unstable on |
9bc197c to 9a6e7d6
So the failure @ https://github.com/ansible/pylibssh/actions/runs/11935006968/job/33265619653?pr=658#step:15:181 is referring to https://github.com/ansible/pylibssh/blob/6cbf2dd/src/pylibsshext/channel.pyx#L45. This is rather bizarre. How did it get in there?
a2fcf3c to 9a007ea
I have the feeling there is some issue in how the references are counted, or some timing/synchronization issue. The current code works for me locally as well as in manylinux, but in the rpm build on Fedora 40+ it fails already with the first command:
but on Fedora 39 it fails on the second one:
Seems like I can see the issue now. The problem is that the
Naively, I would expect the first two messages to come in the reverse order, but I do not see this mandated anywhere in the specs. OpenSSH sends this message basically asynchronously, when the child process dies. My suspicion is that this is a timing/scheduling issue: when OpenSSH sends the messages in this order, we do not process both the output and the exit code, and we close the channel before we could read the output. Pushed a change that should address this issue, I hope.
This looks like it did not help. Reading through the code and through the logs, another data point might be that the |
This sounds like it finally made it through. Now, there is just the broken ubi, which I likely broke with the removal of the pytest-forked.
The error in ubi looked like this:
(not sure if this is the relevant part though) It worked last month, but I do not see any logs from the ubi runs. Let's try with pytest-forked (with some cleanup of the commit series).
Now rpmfind is down, causing the ubi build failure ... probably giving up for today, but it should be in shape for review.
Protocol=2 is the default and the current implementation does not even support any other protocol version. Using this option just makes it confusing to readers. Signed-off-by: Jakub Jelen <[email protected]>
When the OpenSSH server runs in daemon mode, the logs are written to syslog. This is not suitable for tests as we cannot see what is going on there. The captured logs are only from the main sshd process and do not show any useful information about the connection and what we actually test. To get debug logs, we can either not run in daemon mode, in which case the server would accept only one connection, or use `-E` to write logs from all connections to the given file. Signed-off-by: Jakub Jelen <[email protected]>
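The logging setup described in this commit message could be wired into a test fixture roughly as follows. This is only a sketch: the helper name `build_sshd_command` and all paths are hypothetical, and the project's real fixture may pass different options. The flags themselves (`-f`, `-p`, `-E`, `-o LogLevel=...`) are standard sshd options.

```python
def build_sshd_command(sshd_path, config_path, log_path, port):
    """Return an argv list for a test sshd that logs to a file.

    Hypothetical helper for illustration; not taken from the pylibssh tests.
    """
    return [
        str(sshd_path),
        "-f", str(config_path),    # test-specific sshd_config
        "-p", str(port),           # listen on an unprivileged test port
        "-E", str(log_path),       # append logs to this file instead of syslog
        "-o", "LogLevel=DEBUG3",   # most verbose logging level
    ]


cmd = build_sshd_command("/usr/sbin/sshd", "/tmp/sshd_config", "/tmp/sshd.log", 2222)
```

The resulting list can be handed to `subprocess.Popen`, and the log file can be printed when a test fails.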
Even though PAM is required for the server to work correctly and integrate well with the OS, for testing it introduces needless complexity and it is much better not to use it. The tests run under a non-root user, so OpenSSH won't use PAM anyway. Signed-off-by: Jakub Jelen <[email protected]>
This simplifies debugging of the failed tests. With the previous changes, we already have debug logs from the server. This also prints debug logs from the client and allows us to align sent and received protocol messages. Signed-off-by: Jakub Jelen <[email protected]>
Previously, stack-local variables held the callback structure. This worked as long as the call stack was not overwritten by other function calls or as long as the server closed the channel very early. In other cases, the delayed close in libssh could result in calling the callbacks even after we believed the channel was freed, causing invalid memory access and crashes. The accompanying change to libssh was merged to avoid calling callbacks on channels the caller considers freed here: https://gitlab.com/libssh/libssh-mirror/-/merge_requests/549/ But we will be using older libssh versions for some time, so we need a workaround in pylibssh too. Fixes ansible#57 Signed-off-by: Jakub Jelen <[email protected]>
Surprisingly, the SSH protocol does not define if the exit-code notification can come before or after the data. Depending on the server, scheduler or current mood, the exit-code might arrive earlier than the data and if we close the channel right away, we will lose some output. This changes the code to correctly wait for the EOF before closing the channel and returning. Signed-off-by: Jakub Jelen <[email protected]>
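The ordering problem in this commit message can be reproduced with a small simulation. The `FakeChannel` class and both helper functions below are hypothetical stand-ins, not the pylibssh or libssh API. The sketch also assumes the fixed loop waits for both EOF and the exit status.

```python
from collections import deque


class FakeChannel:
    """Hypothetical stand-in for an SSH channel; not the pylibssh API."""

    def __init__(self, events):
        self._events = deque(events)
        self.output = b""
        self.exit_status = None
        self.eof = False

    def poll(self):
        """Process one pending message; return False when none are left."""
        if not self._events:
            return False
        kind, payload = self._events.popleft()
        if kind == "data":
            self.output += payload
        elif kind == "exit-status":
            self.exit_status = payload
        elif kind == "eof":
            self.eof = True
        return True


def run_until_exit_status(channel):
    """Buggy variant: return as soon as the exit status is known."""
    while channel.exit_status is None:
        if not channel.poll():
            break
    return channel.exit_status, channel.output


def run_until_eof(channel):
    """Fixed variant: keep polling until EOF and exit status were both seen."""
    while not (channel.eof and channel.exit_status is not None):
        if not channel.poll():
            break  # nothing left to process in this simulation
    return channel.exit_status, channel.output


# The server reports the exit status before the command output arrives.
events = [("exit-status", 0), ("data", b"hello\n"), ("eof", None)]
buggy_result = run_until_exit_status(FakeChannel(events))
fixed_result = run_until_eof(FakeChannel(events))
```

With this message order, the buggy variant returns an empty output, while the fixed variant returns the full output together with the exit status.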
Depending on the scheduler, ssh_channel_request_exec() might already process some of the data from the peer, and we could lose that data. By setting the callbacks before calling this function, we make sure they are correctly triggered on all the data. Signed-off-by: Jakub Jelen <[email protected]>
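Why the registration order matters can be shown with a toy model. `EagerChannel` is a hypothetical class whose `request_exec()` already delivers peer data, which mimics the behaviour described above. It is not the real libssh call sequence.

```python
class EagerChannel:
    """Hypothetical channel whose request_exec() may already process peer data."""

    def __init__(self):
        self._on_data = None

    def set_callbacks(self, on_data):
        self._on_data = on_data

    def request_exec(self, command):
        # Simulate the peer's output being processed while the exec
        # request is still being handled.
        self._deliver(b"early output\n")

    def _deliver(self, data):
        # Data arriving while no callback is registered is dropped.
        if self._on_data is not None:
            self._on_data(data)


def exec_callbacks_after(command):
    """Old order: request exec first, register callbacks afterwards."""
    received = []
    channel = EagerChannel()
    channel.request_exec(command)
    channel.set_callbacks(received.append)
    return b"".join(received)


def exec_callbacks_before(command):
    """New order: register callbacks first, then request exec."""
    received = []
    channel = EagerChannel()
    channel.set_callbacks(received.append)
    channel.request_exec(command)
    return b"".join(received)
```

In this model, only `exec_callbacks_before()` sees the early output. The old order silently drops it.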
Removes the failed marker as it works now. Signed-off-by: Jakub Jelen <[email protected]>
Ok, so now only the ubi is broken (and not because of the |
SUMMARY
The function `exec_command()` keeps the callbacks as a local variable before assigning them to the created channel. The channel is not guaranteed to be completely freed when `ssh_channel_free()` is called because there might be some leftover messages or responses to process (close confirmation, exit code ...).

Calling the `exec_command()` as done previously in the test from the same function without anything in between (except assert) will likely map the second function call to the same memory on the call stack so it was working most of the time. But calling it from different functions or contexts will likely change the call stack and processing of outstanding callbacks is more likely to result in addressing wrong memory location.

Likely fixes #57, #645 and #657
ISSUE TYPE
ADDITIONAL INFORMATION
I was not able to reproduce the issue locally so pushing to see if the CI will be able to crash.
This is also introducing memory leaks as the callback structure is never freed. We should probably store it somewhere in the Python code before returning to make sure it is not garbage collected (or can the Python GC track that the callback pointer is still stored on the libssh side?).