-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check enrollment status before authenticating agents #2861
Comments
Hi @joshdover why HTTP 500, not 401? I've been reading through the linked issue and noticed that the final HTTP code reaching Endpoint is 400 instead of 401 and the response body conveying the error document is missing https://github.com/elastic/endpoint-dev/issues/12694#issuecomment-1551773846 It was grabbed only from the Fleet logs. Is it missing only in telemetry, or Endpoint also doesn't receive it? The reason I was digging here is that I believe that Endpoint should immediately drop API keys which results in 401 and stop any communication attempts without an API key. I believe that the keys are generated on the stack, passed to endpoints via policy, thus invalid API key is not going to ever work again. The approach with HTTP 500 and throttling might hide real issues experienced by endpoints configured correctly which we would rather want to see and solve immediately, no? I've just checked ES in the cloud, it returns correct HTTP and response error document BTW, isn't typically DOS prevention, like throttling, implemented on the server/infra side? |
My reasoning was that if an API key is invalidated but Fleet's internal representation of the agent doesn't have the agent unenrolled ( Thinking about this again, I think that's probably the wrong call for two reasons:
I agree we should return 401s insstead. I'll update the issue accordingly.
Generally, yes that is also an option but I still think we should do the agent/endpoint-level throttling for two reasons:
Imagine a MSSP who has a customer churn, so they unenroll thousands of agents from Fleet, but the customer doesn't uninstall those agents from their hosts. The MSSP can't do much about this and configuring blocking at a proxy layer is more cumbersome since it could mean looking up thousands of unique API key IDs. The agent should be conservative enough to not keep making requests that are likely never to succeed. We also want to account for potential temporary bugs, which is why we are choosing to throttle rather than completely stop attempting these requests in case a bug is resolved (or state is rolled back). |
What I'm understanding from this issue is that we should change the The biggest issue I see with this is that the We can change the behaviour of |
I had overlooked that the Agent doesn't send their In that case, I think it's not worth changing the authentication logic and instead just update the response code to 401 to make reporting on this issue simpler. The main fix on the agent/endpoint side to throttle requests on 401s. |
In cases where Agents are unenrolled from the backend without being uninstalled on the host, Agent may continue to check in with Fleet Server or fetch artifacts from the API. In this case, we incur the costs of attempting to hash the agent's API key on each request. Though we have caching in place, on a cache miss this is an expensive operation.
We can avoid this overhead by first checking the Agent's enrollment status and avoid trying to auth the key if it's already unenrolled in https://github.com/elastic/fleet-server/blob/main/internal/pkg/api/auth.go
In addition, if the agent is enrolled but the API key is invalid, we should log an error so we can follow up on a potential bug or tampering of Fleet-managed API keys by the user.
We should return a 401 in either case, rather than a 400 to indicate an authentication issue rather than other types of bad requests.
The text was updated successfully, but these errors were encountered: