Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assume installed agent on upgrade #5879

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

michalpristas
Copy link
Contributor

What does this PR do?

This PR assumes installed agent on upgrade.
Detecting installed based on marker file does not work prior to 8.8 as marker file is not there leading to incorrect control path.
During upgrade we resolve paths to proper value assuming upgrading agent is installed (as this is requirement anyways).

Other way would be to keep runningInstalled evaluation based on paths as before OR have two ways of detecting installed agent during upgrade based on previous agent version.

Why is it important?

To make ugprade from <8.8 work

Fixes #5872

@michalpristas michalpristas added bug Something isn't working backport-8.x Automated backport to the 8.x branch with mergify labels Oct 30, 2024
@michalpristas michalpristas self-assigned this Oct 30, 2024
@michalpristas michalpristas requested a review from a team as a code owner October 30, 2024 10:21
Copy link
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix looks reasonable.

I wish we would have more tests that document behavior of some functions (unit tests should be enough for this)

I also would like to find out why this problem slipped through integration tests:
Are we missing a test?
The current tests are not run correctly ?
Are we not waiting long enough to see a rollback as @cmacknz was describing in #5872 (comment) ?

@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Oct 30, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz
Copy link
Member

cmacknz commented Oct 30, 2024

Why doesn't our existing 7.17.x -> 8.16 integration test fail? We need to identify that and fix it as part of this work, it's the root cause for why we are finding this bug so late.

@@ -92,3 +94,123 @@ func TestHasPrefixWindows(t *testing.T) {
})
}
}

func TestResolveControlSocketWithInstalledState(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I appreciate that you added a unit test.

So, we only have one case where ResolveControlSocketWithInstalledState returns a different result than ControlSocketFromPath and it's for Windows when the agent is detected as installed...

I was more thinking about having a small UT for ResolveControlSocket() where we can prepare a temp directory to be the top path with and without the .installed marker and having explicit expected results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i understand what you're trying to achieve.
i'd rather address this in a followup as we may hit the window for next release with actual state of the PR and tests which should be good enough to get us covered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ResolveControlSocket() is only called from a single place, and that code already calls paths.RunningInstalled() right before ResolveControlSocket().

https://github.com/michalpristas/elastic-agent/blob/be8773ff18061ec0a7570fa43db902ff86e709fe/internal/pkg/agent/cmd/run.go#L713-L736

So IMO ResolveControlSocket just should always take whether we know we are installed as an argument and we can avoid having two functions. Then having the installed detection is never a part of this functions job, so we don't need to test that part here at all.

Just reading the test cases here it is really hard to get a picture of what is supposed to happen, fundamentally:

  1. When on Windows, whether you are installed or not changes the result.
  2. On Darwin and Linux, being installed makes no difference.

It's not clear why we have 3 test cases for each platform to me, when there are naturally only 2.

It feels like the tests could be simpler.

P.S. 8.16 is likely going to be delayed and there will be another BC sometime next week. If there isn't, this PR has already missed the release. So let's not worry about landing this and instead focus on whether we are happy with it as that is already out of our control until we know the next BC date.

@michalpristas
Copy link
Contributor Author

michalpristas commented Nov 1, 2024

Updated the grace period configuration for the fast watcher to 100 seconds.

The reason we previously missed detecting failures was due to our failure threshold of 5 connection errors. For the fast watcher, each retry takes 15 seconds, so 5 retries would take 75 seconds. Since the original grace period was set to 60 seconds, the retries would not complete within that time, meaning we didn’t encounter a failure within the grace period and assumed the connection was stable.

With the new 100-second grace period, we now allow enough time for these retries to occur and detect failures if they happen.

I verified that with this updated configuration (and without the fix introduced in this PR), the test fails, confirming the configuration change allows us to catch connection issues as intended.

@michalpristas
Copy link
Contributor Author

integration tests failing on missing template fixed in this PR created by @ycombinator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-8.x Automated backport to the 8.x branch with mergify bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Windows agent fall backs to 7.17.25 on upgrade to 8.16.0 BC2.
5 participants