Invalid policy changes can be accepted because of ephemeral variable state #2887

faec · 2023-06-15T17:50:09Z

Agent's Coordinator regenerates its component model whenever it receives a change to its policy or variables. Whether this generation succeeds depends on both the variables and the policy: some policy updates may succeed with one set of variables but fail with another. This becomes a serious problem with an input config like the following (abbreviated):

...
host: "${kubernetes.pod.ip}:1234"
condition: env_input_enabled = "true"

This always expands to a bad policy because the condition has an EQL syntax error: to check the input flag, the user needs to write ${env_input_enabled} = "true" instead.

Now suppose this policy is sent to an Agent that doesn't yet know its value for kubernetes.pod.ip (or whatever other context variable the config depends on). Agent silently skips any input with missing variables, and it stops checking the rest of the policy as soon as it finds a missing variable, so the condition field is never validated. The policy change therefore generates a valid component model that omits this input, and it is reported to Fleet as successful.

If the Kubernetes metadata is then refreshed, producing new variables, Agent will try again to generate its component model, and will fail when it reaches the condition field. It will then enter an unhealthy state no matter what the values of the previously missing variables are.

The core problem here is that the AST processing that generates the component model depends on the current values of the variables. This error could be detected and reported when we first receive the policy change, but we only verify the parts of the policy that are in active use. Instead, we should validate/preprocess the whole policy regardless of what the variables are, leaving variable substitution for last, so we know that we can still produce a well-formed component model for any variables we are given. (This doesn't guarantee that the resulting components will always be healthy, but it guarantees that we at least have an unambiguous configuration to give them.)
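To make the ordering problem concrete, here is a minimal Go sketch. It is not the Agent's actual code: `Input`, `checkCondition`, `renderLazy`, `validateAll`, and `renderStrict` are hypothetical names, and the regular expression is only a toy stand-in for real EQL parsing. `renderLazy` mimics the behavior described above (skip an input on a missing variable before its condition is ever checked), while `renderStrict` validates every condition first and substitutes variables last.

```go
package main

import (
	"fmt"
	"regexp"
)

// Input is a hypothetical, heavily simplified stand-in for one policy input.
type Input struct {
	Host      string
	Condition string
}

// varRef matches ${name} references in a string.
var varRef = regexp.MustCompile(`\$\{[^}]+\}`)

// condShape is a toy stand-in for EQL parsing (an assumption, not the real
// grammar): it accepts only `${name} = "literal"` or `${name} == "literal"`.
var condShape = regexp.MustCompile(`^\$\{[\w.]+\}\s*==?\s*"[^"]*"$`)

// checkCondition reports a syntax error for conditions the toy parser rejects.
func checkCondition(cond string) error {
	if cond != "" && !condShape.MatchString(cond) {
		return fmt.Errorf("invalid condition %q", cond)
	}
	return nil
}

// substitute expands ${name} references; ok is false if any variable is missing.
func substitute(s string, vars map[string]string) (string, bool) {
	ok := true
	out := varRef.ReplaceAllStringFunc(s, func(ref string) string {
		v, found := vars[ref[2:len(ref)-1]]
		if !found {
			ok = false
			return ref
		}
		return v
	})
	return out, ok
}

// renderLazy mimics the problematic ordering: an input with a missing
// variable is skipped before its condition is ever validated.
func renderLazy(inputs []Input, vars map[string]string) ([]Input, error) {
	var out []Input
	for _, in := range inputs {
		host, ok := substitute(in.Host, vars)
		if !ok {
			continue // input silently omitted; Condition is never checked
		}
		if err := checkCondition(in.Condition); err != nil {
			return nil, err
		}
		out = append(out, Input{Host: host, Condition: in.Condition})
	}
	return out, nil
}

// validateAll checks every condition without looking at any variables.
func validateAll(inputs []Input) error {
	for i, in := range inputs {
		if err := checkCondition(in.Condition); err != nil {
			return fmt.Errorf("input %d: %w", i, err)
		}
	}
	return nil
}

// renderStrict validates the whole policy first and substitutes last.
func renderStrict(inputs []Input, vars map[string]string) ([]Input, error) {
	if err := validateAll(inputs); err != nil {
		return nil, err
	}
	return renderLazy(inputs, vars)
}

func main() {
	policy := []Input{{
		Host:      "${kubernetes.pod.ip}:1234",
		Condition: `env_input_enabled = "true"`, // bare identifier: invalid
	}}
	noVars := map[string]string{}
	withIP := map[string]string{"kubernetes.pod.ip": "10.0.0.7"}

	_, err := renderLazy(policy, noVars)
	fmt.Println("lazy, variable missing:", err)
	_, err = renderLazy(policy, withIP)
	fmt.Println("lazy, variable present:", err)
	_, err = renderStrict(policy, noVars)
	fmt.Println("strict, variable missing:", err)
}
```

In this sketch the lazy path accepts the policy while the variable is missing and rejects it only after the variable appears, whereas the strict path rejects it up front, which is when the failure could still be reported back against the policy change.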

Note: this issue had different symptoms prior to 8.8. In older versions, invalid EQL syntax wasn't reported as an error, but instead silently evaluated to false (changed in this PR). In that case, this policy wouldn't report an explicit error, but would instead silently skip the configured input no matter what variables were set.


elasticmachine · 2023-06-15T17:50:12Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

faec added the bug and Team:Elastic-Agent labels Jun 15, 2023

faec mentioned this issue Jul 7, 2023: [Meta] Audit concurrency handling / increase unit test coverage in Agent #3040 (Open)

faec mentioned this issue May 22, 2024: Kubernetes metadata overwhelms memory limits in the Agent process #4729 (Closed, 3 tasks)

faec mentioned this issue Oct 23, 2024: Elastic agent uses too much memory per Pod in k8s #5835 (Open)
