
Check enrollment status before authenticating agents #2861

Closed
joshdover opened this issue Aug 3, 2023 · 4 comments · Fixed by #3135
Labels
Team:Fleet Label for the Fleet team

Comments

@joshdover
Contributor

joshdover commented Aug 3, 2023

In cases where Agents are unenrolled from the backend without being uninstalled on the host, Agent may continue to check in with Fleet Server or fetch artifacts from the API. In this case, we incur the costs of attempting to hash the agent's API key on each request. Though we have caching in place, on a cache miss this is an expensive operation.

We can avoid this overhead by first checking the Agent's enrollment status in https://github.com/elastic/fleet-server/blob/main/internal/pkg/api/auth.go and not attempting to auth the key if the agent is already unenrolled.

In addition, if the agent is enrolled but the API key is invalid, we should log an error so we can follow up on a potential bug or tampering of Fleet-managed API keys by the user.

We should return a 401 in either case, rather than a 400, to indicate an authentication issue rather than some other type of bad request.
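
Roughly, the ordering I have in mind would look something like the sketch below. findAgentByID, authAPIKey, and the error names are placeholders for illustration, not the actual functions in auth.go.

```go
package sketch

import (
	"context"
	"errors"
	"log"
)

// Minimal stand-ins so this compiles on its own; the real types live in
// fleet-server's internal packages.
type Agent struct {
	ID     string
	Active bool
}

var (
	ErrAgentNotActive = errors.New("agent not active") // handler maps this to HTTP 401
	ErrInvalidAPIKey  = errors.New("invalid API key")  // handler maps this to HTTP 401, not 400
)

// authAgentSketch checks enrollment first, only hashes the API key for agents
// that are still active, and logs loudly if an enrolled agent shows up with an
// invalid key. findAgentByID and authAPIKey are placeholders for the existing
// lookups in internal/pkg/api/auth.go.
func authAgentSketch(
	ctx context.Context,
	id, rawKey string,
	findAgentByID func(context.Context, string) (*Agent, error),
	authAPIKey func(context.Context, string) error,
) (*Agent, error) {
	agent, err := findAgentByID(ctx, id) // cheap document lookup, usually cached
	if err != nil || agent == nil || !agent.Active {
		return nil, ErrAgentNotActive // unenrolled agent: skip the key hashing entirely
	}
	if err := authAPIKey(ctx, rawKey); err != nil {
		// Enrolled agent presenting an invalid key: possible bug or tampering, so log it.
		log.Printf("enrolled agent %s presented an invalid API key: %v", id, err)
		return nil, ErrInvalidAPIKey
	}
	return agent, nil
}
```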

@intxgo

intxgo commented Aug 9, 2023

Hi @joshdover, why HTTP 500 and not 401? I've been reading through the linked issue and noticed that the final HTTP code reaching Endpoint is 400 instead of 401, and that the response body conveying the error document is missing (https://github.com/elastic/endpoint-dev/issues/12694#issuecomment-1551773846); it was grabbed only from the Fleet logs. Is it missing only in telemetry, or does Endpoint also not receive it?

The reason I was digging here is that I believe Endpoint should immediately drop any API key that results in a 401 and stop any communication attempts without an API key. I believe the keys are generated on the stack and passed to Endpoints via policy, so an invalid API key is never going to work again. The approach with HTTP 500 and throttling might hide real issues experienced by correctly configured endpoints, which we would rather see and solve immediately, no?
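
To illustrate the rule I have in mind (Endpoint isn't written in Go and none of these names are real Endpoint interfaces, this is only a sketch):

```go
package sketch

import "net/http"

// apiKeyState is a hypothetical stand-in for however the client tracks its
// Fleet API key; it exists purely to make the proposed rule concrete.
type apiKeyState struct {
	key     string
	dropped bool // once true, no further requests are attempted with this key
}

// handleAuthStatus: a 401 means the key was invalidated on the stack and will
// never work again, so drop it and wait for a new key via policy; a 500 is
// treated as transient and the key is kept for a later retry.
func (s *apiKeyState) handleAuthStatus(status int) {
	switch status {
	case http.StatusUnauthorized:
		s.key = ""
		s.dropped = true
	case http.StatusInternalServerError:
		// transient server-side problem: keep the key, retry later
	}
}
```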

I've just checked ES in the cloud; it returns the correct HTTP code and error document in the response:
curl -v https://0a87896b773a417e8cabe46fa3bc2d5b.us-west1.gcp.cloud.es.io

cc @juliaElastic

BTW, isn't DoS prevention like throttling typically implemented on the server/infra side?

@joshdover
Contributor Author

joshdover commented Aug 9, 2023

My reasoning was that if an API key is invalidated but Fleet's internal representation of the agent doesn't mark it as unenrolled (active: false), then we have a bug that we need to fix, and I want that error to show up in our metrics in a way that we'll notice.

Thinking about this again, I think that's probably the wrong call for two reasons:

  1. The client should be told that its API key is likely never to work again, so it can trigger any throttling.
  2. We don't need to page engineers on such a bug if it does happen; we should log a message instead and follow up during normal business hours.

I agree we should return 401s instead. I'll update the issue accordingly.

BTW, isn't DoS prevention like throttling typically implemented on the server/infra side?

Generally, yes, that is also an option, but I still think we should do the agent/endpoint-level throttling for two reasons:

  • It's more difficult to block these requests before the application because agents are usually all going to have their own IP addresses.
  • This is an additional burden that self-managed customers have to manage. It's better if we can prevent this from becoming a problem in the first place under normal operation of the software.

Imagine an MSSP with a churning customer: they unenroll thousands of agents from Fleet, but the customer doesn't uninstall those agents from their hosts. The MSSP can't do much about this, and configuring blocking at a proxy layer is more cumbersome since it could mean looking up thousands of unique API key IDs.

The agent should be conservative enough not to keep making requests that are likely never to succeed. We also want to account for potential temporary bugs, which is why we are choosing to throttle rather than stop attempting these requests entirely, in case a bug is resolved (or state is rolled back).
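
Something like a capped exponential backoff is what I'm picturing; the helper and bounds below are illustrative only, not values taken from Agent or Endpoint:

```go
package sketch

import (
	"math/rand"
	"time"
)

// nextRetryDelay keeps retrying after repeated 401s (in case server-side state
// is rolled back) but backs off hard so unenrolled-but-not-uninstalled agents
// don't hammer Fleet Server.
func nextRetryDelay(consecutive401s int) time.Duration {
	const (
		base = 1 * time.Minute
		max  = 24 * time.Hour
	)
	delay := base << consecutive401s // 1m, 2m, 4m, ...
	if delay <= 0 || delay > max {   // <= 0 guards against shift overflow
		delay = max
	}
	// Up to 10% jitter so a fleet of orphaned agents doesn't retry in lockstep.
	jitter := time.Duration(rand.Int63n(int64(delay) / 10))
	return delay + jitter
}
```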

@michel-laterman
Contributor

What I'm understanding from this issue is that we should change the authAgent method to check the agent's enrollment status before calling authAPIKey, is that correct?
(and also log errors if a deactivated key is used for an active agent)

The biggest issue I see with this is that the id value passed to authAgent can be nil, as is the case in the handleArtifacts endpoint.
For these cases we use the agent ID stored with the key to retrieve the agent record and ensure it's active/valid (and compare it with the passed id if specified); this happens after the key has been checked and authentication with the key has been verified.

We can change the behaviour of authAgent to reduce extra calls in cases where id is specified (checkin and ack requests), but it won't reduce calls for the artifacts, file upload, or file delivery endpoints.
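
To make the ID-less case concrete (placeholder names and types, not the real authAgent code), the artifacts/file flow is necessarily key-first:

```go
package sketch

import (
	"context"
	"errors"
)

// Stand-in types; placeholder names only, not the real fleet-server signatures.
type Agent struct {
	ID     string
	Active bool
}

type keyInfo struct {
	AgentID string // the owning agent ID is retrieved via the key itself
}

var errUnauthorized = errors.New("unauthorized") // handler maps this to HTTP 401

// authAgentByKeyOnly shows the ID-less path (artifacts, file upload/delivery):
// the expensive key hash has to run first, because the key is the only way to
// find out which agent record to check for enrollment.
func authAgentByKeyOnly(
	ctx context.Context,
	rawKey string,
	authAPIKey func(context.Context, string) (*keyInfo, error),
	findAgentByID func(context.Context, string) (*Agent, error),
) (*Agent, error) {
	info, err := authAPIKey(ctx, rawKey) // hash cost is paid regardless
	if err != nil {
		return nil, errUnauthorized
	}
	agent, err := findAgentByID(ctx, info.AgentID)
	if err != nil || agent == nil || !agent.Active {
		return nil, errUnauthorized // the unenrolled agent is only detected after key auth
	}
	return agent, nil
}
```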

@joshdover
Contributor Author

I had overlooked that the Agent doesn't send its id in artifact requests. These are the requests where I've seen this problem happen the most.

In that case, I think it's not worth changing the authentication logic; we should instead just update the response code to 401 to make reporting on this issue simpler. The main fix is on the agent/endpoint side: throttle requests on 401s.
