Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(nss): fix authd daemon hanging on shutdown #113

Merged
merged 2 commits into from
Nov 27, 2023

Conversation

GabrielNagy
Copy link
Contributor

NSS clients keep the library loaded until the end of the session, e.g. if one runs a command like sudo sleep 10, the library won't be closed until the sleep finishes. This also applies to SSH and login sessions - easy to monitor by applying a inotify watch on the NSS library path.

One side effect of this was that due to the way we were initializing our tokio runtime, gRPC clients would be left hanging until the end of the aforementioned session, terminating together with the runtime. We assumed that the gRPC connection would go out of scope and terminate after the NSS method returns, but it appears this was not the case.

Leaving clients hanging meant that as long as there was at least one open session (be it either a logged in user, a running sudo command, or any other blocking third-party program interacting with NSS), the authd daemon wouldn't be able to gracefully stop in a quick manner, thus hanging the process and ultimately timing out after a period of time.

Because tonic (our Rust gRPC library) doesn't provide a bespoke way of closing the client connection, we still need to rely on Rust to handle it for us, so the proposed way of doing it is to declare the tokio runtime inside the function call, which will make it go out of scope and terminate the client afer it returns. We initialize the runtime the same way, ensuring we remain single-threaded by using the current-thread scheduler.

Fixes UDENG-1598

@GabrielNagy GabrielNagy force-pushed the fix-hanging-grpc-clients branch from 959bf03 to 2b54c31 Compare November 24, 2023 13:12
@codecov-commenter
Copy link

codecov-commenter commented Nov 24, 2023

Codecov Report

Attention: 24 lines in your changes are missing coverage. Please review.

Comparison is base (95a9a07) 89.36% compared to head (d1b02a5) 88.39%.

Files Patch % Lines
nss/src/group/mod.rs 50.00% 9 Missing ⚠️
nss/src/passwd/mod.rs 50.00% 9 Missing ⚠️
nss/src/shadow/mod.rs 50.00% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #113      +/-   ##
==========================================
- Coverage   89.36%   88.39%   -0.97%     
==========================================
  Files          28       28              
  Lines        2003     2043      +40     
==========================================
+ Hits         1790     1806      +16     
- Misses        160      184      +24     
  Partials       53       53              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@GabrielNagy GabrielNagy marked this pull request as ready for review November 24, 2023 14:38
@GabrielNagy GabrielNagy requested a review from a team as a code owner November 24, 2023 14:38
Copy link
Member

@denisonbarbosa denisonbarbosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job figuring this out! I have some minor syntax comments that are quite easy to address.

We should hold the merge until we get it working and tested manually.

nss/src/group/mod.rs Outdated Show resolved Hide resolved
nss/src/lib.rs Show resolved Hide resolved
@GabrielNagy GabrielNagy changed the title Fix authd daemon hanging on shutdown fix(nss): fix authd daemon hanging on shutdown Nov 24, 2023
@GabrielNagy GabrielNagy force-pushed the fix-hanging-grpc-clients branch from 2b54c31 to 2ccedd5 Compare November 24, 2023 16:43
Copy link
Member

@denisonbarbosa denisonbarbosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we tested it together and it works! The daemon is finally quitting when it's supposed to.

Well done!

Copy link
Member

@didrocks didrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent research and finding! Ok, so it didn’t get out of scope implicitely and so, didn’t get dropped.

I have some questions though:

  • in the current situration, it seems that using the GRPC call in a runtime is then useless, can we completely drop it? (but see next one first)
  • the initial idea and fear was about creating a new GRPC connection for any nss calls. Did you log into a real user session, try command and graphical app and see how the session generally react on a low end machine doing all those NSS calls, each one creating a new GRPC connection? I’m a little bit afraid of the amount of closing/initial handshake and such.

The original idea of using the Tokio runtime was to let some connection happen, and then, after some idling time (like 5s without any NSS request, closing the NSS connection), I see that we don’t have the primitive to force closing the connection, but can we force dropping the object from scope? Do you think this is still achievable?

@GabrielNagy
Copy link
Contributor Author

in the current situration, it seems that using the GRPC call in a runtime is then useless, can we completely drop it? (but see next one first)

I thought of this as well, since it seems like a feature that doesn't really help us in this case as we want to remain single threaded. Unfortunately, tonic (the gRPC library) and tokio are closely entwined and there's no way to use the former in a non-async way, thus we have to do this "dance".

Did you log into a real user session, try command and graphical app and see how the session generally react on a low end machine doing all those NSS calls, each one creating a new GRPC connection? I’m a little bit afraid of the amount of closing/initial handshake and such.

Good point, I updated my VM to be below the recommended system requirements (1 vCPU / 2 GB RAM) and I have to say the entire experience is really snappy - I don't think this is something we should worry about.

The original idea of using the Tokio runtime was to let some connection happen, and then, after some idling time (like 5s without any NSS request, closing the NSS connection), I see that we don’t have the primitive to force closing the connection, but can we force dropping the object from scope? Do you think this is still achievable?

As mentioned above, I don't feel like over-optimizing at this point is really necessary given the seemingly low footprint of the gRPC handshake. That being said, this sounds like an improvement we might want to have at some point so I'm up for opening a card to track it.

Copy link
Member

@didrocks didrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! Thanks for testing on a real "slow" system :)

I agree if the perfs are correct, let’s not to pre-optimize. FYI, there is a card for it already (UDENG-1395), you can probably add some notes on current manual perf tests.

Copy link
Member

@denisonbarbosa denisonbarbosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite everything being great and working, we can improve the error handling on the runtime creation calls (more details on the line comment).

I'm sorry for not thinking about this before, but I'm glad for spotting it otherwise we could have some really bad behaviors after release.

nss/src/passwd/mod.rs Outdated Show resolved Hide resolved
NSS clients keep the library loaded until the end of the session, e.g.
if one runs a command like `sudo sleep 10`, the library won't be closed
until the sleep finishes. This also applies to SSH and login sessions -
easy to monitor by applying a inotify watch on the NSS library path.

One side effect of this was that due to the way we were initializing our
tokio runtime, gRPC clients would be left hanging until the end of the
aforementioned session, terminating together with the runtime. We
assumed that the gRPC connection would go out of scope and terminate
after the NSS method returns, but it appears this was not the case.

Leaving clients hanging meant that as long as there was at least one
open session (be it either a logged in user, a running sudo command, or
any other blocking third-party program interacting with NSS), the authd
daemon wouldn't be able to gracefully stop in a quick manner, thus
hanging the process and ultimately timing out after a period of time.

Because tonic (our Rust gRPC library) doesn't provide a bespoke way of
closing the client connection, we still need to rely on Rust to handle
it for us, so the proposed way of doing it is to declare the tokio
runtime inside the function call, which will make it go out of scope and
terminate the client afer it returns. We initialize the runtime the same
way, ensuring we remain single-threaded by using the current-thread
scheduler.
We do not use this anymore, but we still need the crate to hook up the
nss functions.
@GabrielNagy GabrielNagy force-pushed the fix-hanging-grpc-clients branch from 2ccedd5 to d1b02a5 Compare November 27, 2023 11:37
Copy link
Member

@denisonbarbosa denisonbarbosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we are good for merging! Well done!

@GabrielNagy GabrielNagy merged commit ab26227 into main Nov 27, 2023
5 checks passed
@GabrielNagy GabrielNagy deleted the fix-hanging-grpc-clients branch November 27, 2023 12:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants