
Check enrollment status before authenticating agents #2861

Closed
joshdover opened this issue Aug 3, 2023 · 4 comments · Fixed by #3135
Labels
Team:Fleet Label for the Fleet team

Comments

@joshdover
Contributor

joshdover commented Aug 3, 2023

In cases where Agents are unenrolled from the backend without being uninstalled on the host, Agent may continue to check in with Fleet Server or fetch artifacts from the API. In this case, we incur the costs of attempting to hash the agent's API key on each request. Though we have caching in place, on a cache miss this is an expensive operation.

We can avoid this overhead by first checking the Agent's enrollment status in https://github.com/elastic/fleet-server/blob/main/internal/pkg/api/auth.go and not attempting to auth the key if the agent is already unenrolled.

In addition, if the agent is enrolled but the API key is invalid, we should log an error so we can follow up on a potential bug or tampering of Fleet-managed API keys by the user.

We should return a 401 in either case, rather than a 400, to indicate an authentication issue rather than some other type of bad request.
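
Roughly, the ordering I have in mind would look something like the sketch below. findAgentByID, authAPIKey, and the error names are placeholders for illustration, not the actual functions in auth.go.

```go
package sketch

import (
	"context"
	"errors"
	"log"
)

// Minimal stand-ins so this compiles on its own; the real types live in
// fleet-server's internal packages.
type Agent struct {
	ID     string
	Active bool
}

var (
	ErrAgentNotActive = errors.New("agent not active") // handler maps this to HTTP 401
	ErrInvalidAPIKey  = errors.New("invalid API key")  // handler maps this to HTTP 401, not 400
)

// authAgentSketch checks enrollment first, only hashes the API key for agents
// that are still active, and logs loudly if an enrolled agent shows up with an
// invalid key. findAgentByID and authAPIKey are placeholders for the existing
// lookups in internal/pkg/api/auth.go.
func authAgentSketch(
	ctx context.Context,
	id, rawKey string,
	findAgentByID func(context.Context, string) (*Agent, error),
	authAPIKey func(context.Context, string) error,
) (*Agent, error) {
	agent, err := findAgentByID(ctx, id) // cheap document lookup, usually cached
	if err != nil || agent == nil || !agent.Active {
		return nil, ErrAgentNotActive // unenrolled agent: skip the key hashing entirely
	}
	if err := authAPIKey(ctx, rawKey); err != nil {
		// Enrolled agent presenting an invalid key: possible bug or tampering, so log it.
		log.Printf("enrolled agent %s presented an invalid API key: %v", id, err)
		return nil, ErrInvalidAPIKey
	}
	return agent, nil
}
```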

@intxgo

intxgo commented Aug 9, 2023

Hi @joshdover, why HTTP 500 and not 401? I've been reading through the linked issue and noticed that the final HTTP code reaching Endpoint is 400 instead of 401, and that the response body conveying the error document is missing (https://github.com/elastic/endpoint-dev/issues/12694#issuecomment-1551773846); it was grabbed only from the Fleet logs. Is it missing only in telemetry, or does Endpoint also not receive it?

The reason I was digging here is that I believe Endpoint should immediately drop any API key that results in a 401 and stop any communication attempts without an API key. I believe the keys are generated on the stack and passed to Endpoints via policy, so an invalid API key is never going to work again. The approach with HTTP 500 and throttling might hide real issues experienced by correctly configured endpoints, which we would rather see and solve immediately, no?
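
To illustrate the rule I have in mind (Endpoint isn't written in Go and none of these names are real Endpoint interfaces, this is only a sketch):

```go
package sketch

import "net/http"

// apiKeyState is a hypothetical stand-in for however the client tracks its
// Fleet API key; it exists purely to make the proposed rule concrete.
type apiKeyState struct {
	key     string
	dropped bool // once true, no further requests are attempted with this key
}

// handleAuthStatus: a 401 means the key was invalidated on the stack and will
// never work again, so drop it and wait for a new key via policy; a 500 is
// treated as transient and the key is kept for a later retry.
func (s *apiKeyState) handleAuthStatus(status int) {
	switch status {
	case http.StatusUnauthorized:
		s.key = ""
		s.dropped = true
	case http.StatusInternalServerError:
		// transient server-side problem: keep the key, retry later
	}
}
```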

I've just checked ES in the cloud; it returns the correct HTTP code and error document in the response:
curl -v https://0a87896b773a417e8cabe46fa3bc2d5b.us-west1.gcp.cloud.es.io

cc @juliaElastic

BTW, isn't DoS prevention like throttling typically implemented on the server/infra side?

@joshdover
Contributor Author

joshdover commented Aug 9, 2023

My reasoning was that if an API key is invalidated but Fleet's internal representation of the agent doesn't mark it as unenrolled (active: false), then we have a bug that we need to fix, and I want that error to show up in our metrics in a way that we'll notice.

Thinking about this again, I think that's probably the wrong call for two reasons:

  1. The client should be told that its API key is likely never to work again, so it can trigger any throttling.
  2. We don't need to page engineers on such a bug if it does happen; we should log a message instead and follow up during normal business hours.

I agree we should return 401s instead. I'll update the issue accordingly.

BTW, isn't DoS prevention like throttling typically implemented on the server/infra side?

Generally, yes, that is also an option, but I still think we should do the agent/endpoint-level throttling for two reasons:

  • It's more difficult to block these requests before the application because agents are usually all going to have their own IP addresses.
  • This is an additional burden that self-managed customers have to manage. It's better if we can prevent this from becoming a problem in the first place under normal operation of the software.

Imagine an MSSP with a churning customer: they unenroll thousands of agents from Fleet, but the customer doesn't uninstall those agents from their hosts. The MSSP can't do much about this, and configuring blocking at a proxy layer is more cumbersome since it could mean looking up thousands of unique API key IDs.

The agent should be conservative enough not to keep making requests that are likely never to succeed. We also want to account for potential temporary bugs, which is why we are choosing to throttle rather than stop attempting these requests entirely, in case a bug is resolved (or state is rolled back).
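
Something like a capped exponential backoff is what I'm picturing; the helper and bounds below are illustrative only, not values taken from Agent or Endpoint:

```go
package sketch

import (
	"math/rand"
	"time"
)

// nextRetryDelay keeps retrying after repeated 401s (in case server-side state
// is rolled back) but backs off hard so unenrolled-but-not-uninstalled agents
// don't hammer Fleet Server.
func nextRetryDelay(consecutive401s int) time.Duration {
	const (
		base = 1 * time.Minute
		max  = 24 * time.Hour
	)
	delay := base << consecutive401s // 1m, 2m, 4m, ...
	if delay <= 0 || delay > max {   // <= 0 guards against shift overflow
		delay = max
	}
	// Up to 10% jitter so a fleet of orphaned agents doesn't retry in lockstep.
	jitter := time.Duration(rand.Int63n(int64(delay) / 10))
	return delay + jitter
}
```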

@michel-laterman
Contributor

What I'm understanding from this issue is that we should change the authAgent method to check the agent's enrollment status before calling authAPIKey, is that correct?
(and also log errors if a deactivated key is used for an active agent)

The biggest issue I see with this is that the id value passed to authAgent can be nil, as is the case in the handleArtifacts endpoint.
For these cases we use the agent ID stored with the key to retrieve the agent record and ensure it's active/valid (and compare it with the passed id if specified); this happens after the key has been checked and authentication with the key has been verified.

We can change the behaviour of authAgent to reduce extra calls in cases where id is specified (checkin and ack requests), but it won't reduce calls for the artifacts, file upload, or file delivery endpoints.
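
To make the ID-less case concrete (placeholder names and types, not the real authAgent code), the artifacts/file flow is necessarily key-first:

```go
package sketch

import (
	"context"
	"errors"
)

// Stand-in types; placeholder names only, not the real fleet-server signatures.
type Agent struct {
	ID     string
	Active bool
}

type keyInfo struct {
	AgentID string // the owning agent ID is retrieved via the key itself
}

var errUnauthorized = errors.New("unauthorized") // handler maps this to HTTP 401

// authAgentByKeyOnly shows the ID-less path (artifacts, file upload/delivery):
// the expensive key hash has to run first, because the key is the only way to
// find out which agent record to check for enrollment.
func authAgentByKeyOnly(
	ctx context.Context,
	rawKey string,
	authAPIKey func(context.Context, string) (*keyInfo, error),
	findAgentByID func(context.Context, string) (*Agent, error),
) (*Agent, error) {
	info, err := authAPIKey(ctx, rawKey) // hash cost is paid regardless
	if err != nil {
		return nil, errUnauthorized
	}
	agent, err := findAgentByID(ctx, info.AgentID)
	if err != nil || agent == nil || !agent.Active {
		return nil, errUnauthorized // the unenrolled agent is only detected after key auth
	}
	return agent, nil
}
```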

@joshdover
Contributor Author

I had overlooked that the Agent doesn't send its id in artifact requests. These are the requests where I've seen this problem happen the most.

In that case, I think it's not worth changing the authentication logic; we should instead just update the response code to 401 to make reporting on this issue simpler. The main fix is on the agent/endpoint side: throttle requests on 401s.
