-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(nss): fix authd daemon hanging on shutdown #113
Conversation
959bf03
to
2b54c31
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #113 +/- ##
==========================================
- Coverage 89.36% 88.39% -0.97%
==========================================
Files 28 28
Lines 2003 2043 +40
==========================================
+ Hits 1790 1806 +16
- Misses 160 184 +24
Partials 53 53 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job figuring this out! I have some minor syntax comments that are quite easy to address.
We should hold the merge until we get it working and tested manually.
2b54c31
to
2ccedd5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, we tested it together and it works! The daemon is finally quitting when it's supposed to.
Well done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent research and finding! Ok, so it didn’t get out of scope implicitely and so, didn’t get dropped.
I have some questions though:
- in the current situration, it seems that using the GRPC call in a runtime is then useless, can we completely drop it? (but see next one first)
- the initial idea and fear was about creating a new GRPC connection for any nss calls. Did you log into a real user session, try command and graphical app and see how the session generally react on a low end machine doing all those NSS calls, each one creating a new GRPC connection? I’m a little bit afraid of the amount of closing/initial handshake and such.
The original idea of using the Tokio runtime was to let some connection happen, and then, after some idling time (like 5s without any NSS request, closing the NSS connection), I see that we don’t have the primitive to force closing the connection, but can we force dropping the object from scope? Do you think this is still achievable?
I thought of this as well, since it seems like a feature that doesn't really help us in this case as we want to remain single threaded. Unfortunately, tonic (the gRPC library) and tokio are closely entwined and there's no way to use the former in a non-async way, thus we have to do this "dance".
Good point, I updated my VM to be below the recommended system requirements (1 vCPU / 2 GB RAM) and I have to say the entire experience is really snappy - I don't think this is something we should worry about.
As mentioned above, I don't feel like over-optimizing at this point is really necessary given the seemingly low footprint of the gRPC handshake. That being said, this sounds like an improvement we might want to have at some point so I'm up for opening a card to track it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent! Thanks for testing on a real "slow" system :)
I agree if the perfs are correct, let’s not to pre-optimize. FYI, there is a card for it already (UDENG-1395), you can probably add some notes on current manual perf tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Despite everything being great and working, we can improve the error handling on the runtime creation calls (more details on the line comment).
I'm sorry for not thinking about this before, but I'm glad for spotting it otherwise we could have some really bad behaviors after release.
NSS clients keep the library loaded until the end of the session, e.g. if one runs a command like `sudo sleep 10`, the library won't be closed until the sleep finishes. This also applies to SSH and login sessions - easy to monitor by applying a inotify watch on the NSS library path. One side effect of this was that due to the way we were initializing our tokio runtime, gRPC clients would be left hanging until the end of the aforementioned session, terminating together with the runtime. We assumed that the gRPC connection would go out of scope and terminate after the NSS method returns, but it appears this was not the case. Leaving clients hanging meant that as long as there was at least one open session (be it either a logged in user, a running sudo command, or any other blocking third-party program interacting with NSS), the authd daemon wouldn't be able to gracefully stop in a quick manner, thus hanging the process and ultimately timing out after a period of time. Because tonic (our Rust gRPC library) doesn't provide a bespoke way of closing the client connection, we still need to rely on Rust to handle it for us, so the proposed way of doing it is to declare the tokio runtime inside the function call, which will make it go out of scope and terminate the client afer it returns. We initialize the runtime the same way, ensuring we remain single-threaded by using the current-thread scheduler.
We do not use this anymore, but we still need the crate to hook up the nss functions.
2ccedd5
to
d1b02a5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we are good for merging! Well done!
NSS clients keep the library loaded until the end of the session, e.g. if one runs a command like
sudo sleep 10
, the library won't be closed until the sleep finishes. This also applies to SSH and login sessions - easy to monitor by applying a inotify watch on the NSS library path.One side effect of this was that due to the way we were initializing our tokio runtime, gRPC clients would be left hanging until the end of the aforementioned session, terminating together with the runtime. We assumed that the gRPC connection would go out of scope and terminate after the NSS method returns, but it appears this was not the case.
Leaving clients hanging meant that as long as there was at least one open session (be it either a logged in user, a running sudo command, or any other blocking third-party program interacting with NSS), the authd daemon wouldn't be able to gracefully stop in a quick manner, thus hanging the process and ultimately timing out after a period of time.
Because tonic (our Rust gRPC library) doesn't provide a bespoke way of closing the client connection, we still need to rely on Rust to handle it for us, so the proposed way of doing it is to declare the tokio runtime inside the function call, which will make it go out of scope and terminate the client afer it returns. We initialize the runtime the same way, ensuring we remain single-threaded by using the current-thread scheduler.
Fixes UDENG-1598