Agent upgraded_at field keeps updating to current time #3263

Closed
juliaElastic opened this issue Feb 9, 2024 · 19 comments · Fixed by #3264
Assignees
Labels
bug Something isn't working QA:Validated Validated by the QA Team

Comments

@juliaElastic
Contributor

juliaElastic commented Feb 9, 2024

Stack version 8.12.1 and possibly others.

There seems to be an issue where an agent's upgraded_at field keeps being updated to the current time. As a result, Fleet UI does not show Upgrade available when it should, and the Upgrade agent action is disabled, because Fleet UI doesn't consider an agent upgradeable if it was upgraded in the last 10 minutes.

It's not clear yet if the issue is on fleet-server or agent side.

Reproduced on a fresh 8.12.1 cluster by enrolling an 8.11.4 agent, upgrading it to 8.12.0, and waiting 10 minutes.
The agent is still not allowed to be upgraded again to 8.12.1, and the upgraded_at field looks recent, even though the last upgrade happened more than 10 minutes ago.
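To make the symptom concrete, here is a minimal Go sketch of a 10-minute cooldown check. It is not the Fleet UI implementation: `upgradeCooldown`, `recentlyUpgraded`, and `demo` are hypothetical names, and the strict "less than 10 minutes" comparison is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// upgradeCooldown is the 10-minute window mentioned in this issue.
const upgradeCooldown = 10 * time.Minute

// recentlyUpgraded reports whether upgradedAt falls inside the cooldown
// window that ends at now. Hypothetical helper, not Fleet UI code.
func recentlyUpgraded(upgradedAt, now time.Time) bool {
	return now.Sub(upgradedAt) < upgradeCooldown
}

// demo evaluates the check for two values of upgraded_at: the real upgrade
// time (20 minutes old) and a value rewritten by a check-in 30 seconds ago.
func demo() (afterRealUpgrade, afterCheckinRewrite bool) {
	upgraded := time.Date(2024, 2, 9, 10, 0, 0, 0, time.UTC)
	now := upgraded.Add(20 * time.Minute)
	afterRealUpgrade = recentlyUpgraded(upgraded, now)
	afterCheckinRewrite = recentlyUpgraded(now.Add(-30*time.Second), now)
	return
}

func main() {
	realUpgrade, rewritten := demo()
	fmt.Println(realUpgrade) // false: the agent should be upgradeable again
	fmt.Println(rewritten)   // true: the agent stays blocked
}
```

If upgraded_at is refreshed on every check-in, the second case applies indefinitely, which matches the behavior reported above.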

Workaround:

  • force upgrade agents with the API "force": true flag, or
  • manually delete the upgrade_details:null value from agent docs
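For the first workaround, a hedged Go sketch of building the upgrade request with the force flag might look like this. The endpoint path and the `version` and `force` fields come from this thread; the Kibana URL and agent ID are placeholders, `upgradeRequest`, `forceUpgradeBody`, and `newForceUpgradeRequest` are hypothetical names, and authentication is left out.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// upgradeRequest mirrors the request body used in this thread, plus the
// "force" flag from the workaround.
type upgradeRequest struct {
	Version string `json:"version"`
	Force   bool   `json:"force,omitempty"`
}

// forceUpgradeBody returns the JSON body for a forced upgrade.
func forceUpgradeBody(version string) ([]byte, error) {
	return json.Marshal(upgradeRequest{Version: version, Force: true})
}

// newForceUpgradeRequest builds, but does not send, the upgrade call.
// kibanaURL and agentID are placeholders supplied by the caller.
func newForceUpgradeRequest(kibanaURL, agentID, version string) (*http.Request, error) {
	body, err := forceUpgradeBody(version)
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("%s/api/fleet/agents/%s/upgrade", kibanaURL, agentID)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	// Kibana rejects non-GET API calls that lack this header.
	req.Header.Set("kbn-xsrf", "true")
	return req, nil
}

func main() {
	req, err := newForceUpgradeRequest("https://kibana.example:5601", "agent-123", "8.12.1")
	if err != nil {
		panic(err)
	}
	body, _ := forceUpgradeBody("8.12.1")
	fmt.Println(req.Method, req.URL.String())
	fmt.Println(string(body))
}
```

Sending the request with credentials is left to the caller, for example through `http.DefaultClient.Do(req)` after adding an Authorization header.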
@juliaElastic juliaElastic added the bug Something isn't working label Feb 9, 2024
@juliaElastic
Contributor Author

I found the bug: upgrade_details is set to null when the upgrade is complete, and the logic looks at len(agent.UpgradeDetails) to decide whether the previous agent doc had upgrade_details. That check evaluates to true for null (len is 4), so upgraded_at is set to now at every checkin.
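The length problem can be reproduced in isolation. The sketch below assumes the field is held as a `json.RawMessage`; `agentDoc`, `parseAgent`, `hasDetailsByLength`, and `hasDetails` are hypothetical names for illustration, and the null-aware check is one possible approach, not necessarily the fix that was merged.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// agentDoc is a hypothetical, trimmed-down agent document.
type agentDoc struct {
	UpgradeDetails json.RawMessage `json:"upgrade_details"`
}

// parseAgent decodes a JSON agent document and panics on malformed input.
func parseAgent(doc string) agentDoc {
	var a agentDoc
	if err := json.Unmarshal([]byte(doc), &a); err != nil {
		panic(err)
	}
	return a
}

// hasDetailsByLength mirrors the length-based check described above.
func hasDetailsByLength(a agentDoc) bool {
	return len(a.UpgradeDetails) > 0
}

// hasDetails treats a missing field and a JSON null the same way.
func hasDetails(a agentDoc) bool {
	d := bytes.TrimSpace(a.UpgradeDetails)
	return len(d) > 0 && !bytes.Equal(d, []byte("null"))
}

func main() {
	a := parseAgent(`{"upgrade_details": null}`)
	// A RawMessage keeps the literal bytes "null", so its length is 4.
	fmt.Println(len(a.UpgradeDetails), hasDetailsByLength(a), hasDetails(a)) // 4 true false
}
```

A document with no upgrade_details field at all yields a zero-length value, so both checks agree there; they only diverge for an explicit null.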

@jlind23
Contributor

jlind23 commented Feb 9, 2024

@juliaElastic does it mean a newly installed agent on 8.12.0 would have been successfully upgraded to 8.12.1?
@amolnater-qasource do you have a scenario where an agent of the previous minor 8.11 is upgrading to the next minor and then all the patches? 8.12.0 then 8.12.1?
@kpollich @juliaElastic is there any way for us to automatically test this?

@kpollich
Member

kpollich commented Feb 9, 2024

does it mean a newly installed agent on 8.12.0 would have been successfully upgraded to 8.12.1?

An agent on 8.12.0 cannot be upgraded to 8.12.1 via Fleet UI currently without the workaround Julia drafted here: #3264 (comment).

is there any way for us to automatically test this?

We need an automated test on every release branch that upgrades an agent from the latest available patch release for that branch to the build at the current HEAD of the branch. For example, on the 8.12 branch, we'd upgrade an agent running the released 8.12.0 agent binary to the current 8.12.0-SNAPSHOT build built off the release branch.

Additionally, we could have a daily run that does the same using the daily snapshot build instead of a PR build.

@jlind23
Contributor

jlind23 commented Feb 9, 2024

@pierrehilbert @cmacknz regarding Kyle's comment above isn't this something we test already in the elastic agent testing framework?

@amolnater-qasource
Collaborator

Hi @jlind23

@amolnater-qasource do you have a scenario where an agent of the previous minor 8.11 is upgrading to the next minor and then all the patches? 8.12.0 then 8.12.1?

We don't have a documented test case for this scenario; we cover it as part of exploratory testing.
Please let us know if we should create a test case for this.

Testing details:
While testing on a BC build (for example, 8.12.1), we had to trigger the upgrade of an 8.12.0 agent through the API from Dev Tools (as of now), because Fleet UI doesn't show an upgrade as available until that version (8.12.1) is released.

We weren't able to upgrade directly through Fleet UI to 8.12.1 BC1 from the immediately preceding version (8.12.0).

Thanks!

@jlind23
Contributor

jlind23 commented Feb 12, 2024

We weren't able to directly upgrade using Fleet UI till 8.12.1 BC1 from just previous version(8.12.0).

@amolnater-qasource but once you were able to test 8.12.0 to 8.12.1 it worked right?

@amolnater-qasource
Collaborator

@jlind23 We have revalidated on the released 8.12.1 and observed that we are not able to upgrade from the UI.

  • 8.11.4>8.12.0 upgraded successfully.
  • We waited for more than 20 minutes, but the Agent upgrade option did not become enabled.
  • We are not able to upgrade from 8.12.0>8.12.1 from bulk actions too.

Screenshots/Recordings:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2024-02-12.14-45-10.mp4

Please let us know if anything else is required from our end.
Thanks!!

@kpollich
Member

While testing on BC build, example: 8.12.1 if we have to upgrade 8.12.0 agent we had to trigger the same using the API(as of now) from Dev tools as it doesn't show upgrade available till the time it(8.12.1) is not released.

Thanks @amolnater-qasource this makes sense as the next patch release isn't published during the BC phase and thus won't be shown in Fleet UI (maybe something we can file an enhancement for). To clarify: was upgrading from 8.12.0 -> 8.12.1 via the API successful during the BC test? If so can you share a summary of the test steps used as well? My Testrail access has lapsed as I don't log in frequently 🙃, otherwise I would check myself. Many thanks.

@cmacknz
Member

cmacknz commented Feb 12, 2024

@pierrehilbert @cmacknz regarding Kyle's comment above isn't this something we test already in the elastic agent testing framework?

We test a single upgrade, that is we install an agent build from the head of the current branch and upgrade it to the latest snapshot in that branch. This would be 8.13.0-SNAPSHOT or 8.12.0-SNAPSHOT for main and 8.12 respectively.

We don't test two consecutive upgrades because from the agent's perspective there is no reason to, once the agent completes the upgrade state machine reported in the upgrade details it can upgrade again. There is no other state in the agent that can prevent this.

@jlind23
Contributor

jlind23 commented Feb 12, 2024

@amolnater-qasource We looked at testrail with @kpollich and it looks like the test case below does not exist:

We are not able to upgrade from 8.12.0>8.12.1 from bulk actions too.

Can we make sure this is added please?

@kpollich
Member

We test a single upgrade, that is we install an agent build from the head of the current branch and upgrade it to the latest snapshot in that branch. This would be 8.13.0-SNAPSHOT or 8.12.0-SNAPSHOT for main and 8.12 respectively.

We don't test two consecutive upgrades because from the agent's perspective there is no reason to, once the agent completes the upgrade state machine reported in the upgrade details it can upgrade again. There is no other state in the agent that can prevent this.

Thanks, Craig. I think the agent test coverage is sufficient here and consecutive updates aren't something we should pursue adding. The coverage gaps lie elsewhere in Fleet.

@amolnater-qasource
Collaborator

To clarify: was upgrading from 8.12.0 -> 8.12.1 via the API successful during the BC test?

@kpollich yes, the direct 8.12.0>8.12.1 BC1 upgrade was successful. It is part of our test cases and uses the API call below from Dev Tools:

POST kbn:/api/fleet/agents/<agent-id>/upgrade
{
  "version": "8.12.1"
}
  • We haven't upgraded this 8.12.0 agent from 8.11.4.

Further, even on the released 8.12.1 Kibana environment, we are able to upgrade 8.12.0>8.12.1 successfully from Fleet UI.

Screen Recording:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2024-02-12.20-25-50.mp4

If so can you share a summary of the test steps used as well?

We do not have any testcase for upgrading the agents twice like from 8.11.4> 8.12.0> 8.12.1

However, we have testcases from one lower version from all OS's:

Please let us know if anything else is required from our end.

cc: @jlind23
Thanks!

@jlind23
Contributor

jlind23 commented Feb 12, 2024

@kpollich @juliaElastic according to #3263 (comment) it means that fresh install on 8.12.0 can be upgraded to 8.12.1, is that expected?

@kpollich
Member

Thanks @amolnater-qasource - this is extremely helpful in understanding our existing test coverage here.

according to #3263 (comment) it means that fresh install on 8.12.0 can be upgraded to 8.12.1, is that expected?

I can confirm this is working as expected. I created a fresh 8.12.1 cloud instance (which naturally deploys a Fleet Server on 8.12.1 as well), then enrolled an agent running on 8.12.0. My observations are below:

  1. Upgrade available filter behaves as expected immediately following enrollment:

image

  2. "Upgrade" action is available, and triggering an upgrade immediately applies to the agent as expected

image

  3. Granular upgrade state is reported as expected

image

  4. Upgrade completes as expected

image


So, if I'm understanding the smoke tests properly, we wouldn't have caught this issue in smoke tests. In our smoke tests, we create a cloud instance on the latest release, then enroll an agent on the previous release, then attempt to upgrade it. To confirm this, I performed the same steps as above, but initially enrolled an agent on 8.11.4, then upgraded that agent to 8.12.0 instead of starting with a fresh 8.12.0 agent. Observations below:

  1. Agent enrolls successfully, shows Upgrade available badge as expected when on 8.11.4

image

  2. Agent is upgradeable to 8.12.0 and 8.12.1, updates to 8.12.0 as expected

image

image

  3. Agent is not upgradeable to 8.12.1, even after waiting the requisite 10 minutes for the upgrade rate limit - this is the bug described in this issue

image


So, in order to catch this bug in the QAS smoke tests, we would've needed to test a sequential upgrade from 8.11.4 -> 8.12.0 -> 8.12.1, or in generic terms Previous minor's latest patch -> Current minor -> Current minor's latest patch. Codifying this into a regression test seems like a good idea, but it's hard to decide what the test case should be. When we go to test 8.12.2, should the sequential upgrade test case be 8.11.4 -> 8.12.0 -> 8.12.2 or 8.11.4 -> 8.12.1 -> 8.12.2?

@cmacknz
Member

cmacknz commented Feb 12, 2024

When we go to test 8.12.2 should the sequential upgrade test case be 8.11.4 -> 8.12.0 -> 8.12.2 or 8.11.4 -> 8.12.1 -> 8.12.2?

I don't think it matters; either sequence would reproduce the bug, wouldn't it? Probably best to always use the latest versions with the most bug fixes to minimize other issues.

This regression test using real agents is a good idea, but it also feels like you could write an automated test for the upgrade state directly in Fleet server. The simplest version of this would use mock agents (similar to horde) and you could query the resulting changes out of Elasticsearch directly, although a better test would probably use the Fleet API in Kibana.

The agent test framework we use can provide the guarantee that the agent half of the upgrade works as expected, so you don't need to reverify that.

Using mock agents would also allow you to have them do adversarial things like make requests with incorrect and out-of-order upgrade details. While Fleet shouldn't have to verify the agent part of the contract, it also shouldn't assume the agent will never have a bug in how it talks to Fleet, and it should defend itself against that.
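A rough idea of what such a mock-agent regression test could assert, as a Go sketch. Everything here is hypothetical: `storedAgent`, `checkin`, and `countUpgradeStamps` are an in-memory stand-in, not Fleet Server's handler, and the state strings are illustrative payloads only. The property under test is that upgraded_at is stamped once per completed upgrade, no matter how many empty or null check-ins follow.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"time"
)

// storedAgent is a hypothetical in-memory stand-in for the agent document.
type storedAgent struct {
	UpgradeDetails json.RawMessage
	UpgradedAt     time.Time
}

// hasDetails treats empty input and a JSON null as "no upgrade in progress".
func hasDetails(raw json.RawMessage) bool {
	d := bytes.TrimSpace(raw)
	return len(d) > 0 && !bytes.Equal(d, []byte("null"))
}

// checkin applies one mock check-in. A check-in without details that follows
// one with details is treated as a completed upgrade.
func checkin(a *storedAgent, details json.RawMessage, now time.Time) {
	if hasDetails(details) {
		a.UpgradeDetails = details
		return
	}
	if !hasDetails(a.UpgradeDetails) {
		return // nothing in progress, so nothing to record
	}
	a.UpgradeDetails = json.RawMessage("null")
	a.UpgradedAt = now
}

// countUpgradeStamps replays check-ins one minute apart and counts how many
// of them changed UpgradedAt.
func countUpgradeStamps(checkins []string) int {
	a := &storedAgent{}
	now := time.Date(2024, 2, 9, 10, 0, 0, 0, time.UTC)
	stamps := 0
	for _, c := range checkins {
		before := a.UpgradedAt
		checkin(a, json.RawMessage(c), now)
		if !a.UpgradedAt.Equal(before) {
			stamps++
		}
		now = now.Add(time.Minute)
	}
	return stamps
}

func main() {
	// One upgrade followed by idle check-ins with empty and null details.
	seq := []string{`{"state":"UPG_DOWNLOADING"}`, `{"state":"UPG_RESTARTING"}`, "", "null", "", "null"}
	fmt.Println(countUpgradeStamps(seq)) // 1
}
```

Swapping `hasDetails` for a plain length check in this sketch makes the count grow with every idle check-in, which is the regression a test like this would be meant to catch.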

@kpollich
Member

I don't think it matters, either sequence would reproduce the bug wouldn't it? Probably best to always use the latest versions with the most bug fixes to minimize other issues.

Yes, I should clarify: either scenario would reproduce this issue, but I meant to codify this process for future test runs. Using the latest versions sounds good to me. We'd codify this in TestRail as follows, to be run on patch releases:

Previous minor's latest release -> Current release - 1 patch -> Current release

e.g. 8.11.4 -> 8.12.1 -> 8.12.2

For minors, we'd stick with

Previous minor's latest release -> Current release

e.g. 8.11.4 -> 8.12.0
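The two rules above can be expressed as a small helper. This is only a sketch of the proposed convention: `upgradePath` and `pathString` are hypothetical names, and the previous minor's latest release is passed in by the caller rather than looked up anywhere.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// upgradePath returns the versions to step through when testing target.
// For a patch release: previous minor's latest -> target minus one patch -> target.
// For a minor release (patch 0): previous minor's latest -> target.
func upgradePath(prevMinorLatest, target string) ([]string, error) {
	parts := strings.Split(target, ".")
	if len(parts) != 3 {
		return nil, fmt.Errorf("expected major.minor.patch, got %q", target)
	}
	patch, err := strconv.Atoi(parts[2])
	if err != nil || patch < 0 {
		return nil, fmt.Errorf("invalid patch number in %q", target)
	}
	if patch == 0 {
		return []string{prevMinorLatest, target}, nil
	}
	previousPatch := fmt.Sprintf("%s.%s.%d", parts[0], parts[1], patch-1)
	return []string{prevMinorLatest, previousPatch, target}, nil
}

// pathString renders the path in the arrow notation used in this thread.
func pathString(prevMinorLatest, target string) string {
	path, err := upgradePath(prevMinorLatest, target)
	if err != nil {
		return "error: " + err.Error()
	}
	return strings.Join(path, " -> ")
}

func main() {
	fmt.Println(pathString("8.11.4", "8.12.2")) // 8.11.4 -> 8.12.1 -> 8.12.2
	fmt.Println(pathString("8.11.4", "8.12.0")) // 8.11.4 -> 8.12.0
}
```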

This regression test using real agents is a good idea, but it also feels like you could write an automated test for the upgrade state directly in Fleet server. The simplest version of this would use mock agents (similar to horde) and you could query the resulting changes out of Elasticsearch directly, although a better test would probably use the Fleet API in Kibana.

I agree that ultimately this case should be covered in Fleet Server tests. There are substantial barriers to handling this in Kibana CI (we need to spawn "real" agents off of snapshot builds, for example) that don't exist in Fleet Server.

Spawning a live Kibana server in Fleet Server CI is a good idea, but I don't know that we do that today. I know that's how the agent tests we're talking about work, so we could also do this in Fleet Server for better test fidelity.

I'm working on capturing all of this in a RCA doc that I'll send out later today, then we'll meet tomorrow as a group to make sure we're aligned on next steps.

@amolnater-qasource
Collaborator

Hi Team,

We have created one test case for this scenario under our fleet test suite at link:

Please let us know if anything else is required from our end.
Thanks!

@juliaElastic
Contributor Author

Tested locally with fleet-server (8.12.0) and agent (8.11.4) enrolled (using multipass VMs).

  • Used agent binary download source https://snapshots.elastic.co/8.12.2-5f8ffc93/downloads
  • First upgraded the agent to 8.12.2-SNAPSHOT by manually typing in the version.
  • Upgraded fleet-server to 8.12.2-SNAPSHOT.
  • Waited until the upgrade finished plus 10 minutes; the Upgrade agent action was then enabled for both fleet-server and agent.
image image image

@amolnater-qasource
Collaborator

Hi Team,

We have revalidated this issue on the 8.12.2 BC1 Kibana cloud environment and made the observations below:

Observations:

  • 8.11.4 agent upgraded successfully to 8.12.1
  • Using API we are successfully able to upgrade 8.12.1>8.12.2.
    POST kbn:/api/fleet/agents/#agent-id#/upgrade
    {
    "version": "8.12.2"
    }

Logs:
elastic-agent-diagnostics-2024-02-21T09-38-41Z-00.zip

Build details:
VERSION: 8.12.2
BUILD: 70281
COMMIT: f5bd489c5ff9c676c4f861c42da6ea99ae350832

Hence, we are marking this issue as QA:Validated.

Please let us know if we are missing anything here.
Thanks!

@amolnater-qasource amolnater-qasource added the QA:Validated Validated by the QA Team label Feb 21, 2024