-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Assume installed agent on upgrade #5879
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix looks reasonable.
I wish we would have more tests that document behavior of some functions (unit tests should be enough for this)
I also would like to find out why this problem slipped through integration tests:
Are we missing a test?
The current tests are not run correctly ?
Are we not waiting long enough to see a rollback as @cmacknz was describing in #5872 (comment) ?
…agent into bug/windows-paths
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
Why doesn't our existing 7.17.x -> 8.16 integration test fail? We need to identify that and fix it as part of this work, it's the root cause for why we are finding this bug so late. |
@@ -92,3 +94,123 @@ func TestHasPrefixWindows(t *testing.T) { | |||
}) | |||
} | |||
} | |||
|
|||
func TestResolveControlSocketWithInstalledState(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, I appreciate that you added a unit test.
So, we only have one case where ResolveControlSocketWithInstalledState
returns a different result than ControlSocketFromPath
and it's for Windows when the agent is detected as installed...
I was more thinking about having a small UT for ResolveControlSocket() where we can prepare a temp directory to be the top path with and without the .installed
marker and having explicit expected results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i understand what you're trying to achieve.
i'd rather address this in a followup as we may hit the window for next release with actual state of the PR and tests which should be good enough to get us covered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ResolveControlSocket()
is only called from a single place, and that code already calls paths.RunningInstalled()
right before ResolveControlSocket()
.
So IMO ResolveControlSocket
just should always take whether we know we are installed as an argument and we can avoid having two functions. Then having the installed detection is never a part of this functions job, so we don't need to test that part here at all.
Just reading the test cases here it is really hard to get a picture of what is supposed to happen, fundamentally:
- When on Windows, whether you are installed or not changes the result.
- On Darwin and Linux, being installed makes no difference.
It's not clear why we have 3 test cases for each platform to me, when there are naturally only 2.
It feels like the tests could be simpler.
P.S. 8.16 is likely going to be delayed and there will be another BC sometime next week. If there isn't, this PR has already missed the release. So let's not worry about landing this and instead focus on whether we are happy with it as that is already out of our control until we know the next BC date.
Updated the grace period configuration for the fast watcher to 100 seconds. The reason we previously missed detecting failures was due to our failure threshold of 5 connection errors. For the fast watcher, each retry takes 15 seconds, so 5 retries would take 75 seconds. Since the original grace period was set to 60 seconds, the retries would not complete within that time, meaning we didn’t encounter a failure within the grace period and assumed the connection was stable. With the new 100-second grace period, we now allow enough time for these retries to occur and detect failures if they happen. I verified that with this updated configuration (and without the fix introduced in this PR), the test fails, confirming the configuration change allows us to catch connection issues as intended. |
integration tests failing on missing template fixed in this PR created by @ycombinator |
Quality Gate passedIssues Measures |
What does this PR do?
This PR assumes installed agent on upgrade.
Detecting installed based on marker file does not work prior to 8.8 as marker file is not there leading to incorrect control path.
During upgrade we resolve paths to proper value assuming upgrading agent is installed (as this is requirement anyways).
Other way would be to keep
runningInstalled
evaluation based on paths as before OR have two ways of detecting installed agent during upgrade based on previous agent version.Why is it important?
To make ugprade from <8.8 work
Fixes #5872