Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always resume automatic sessions on remote (BugFix) #859

Merged
merged 33 commits into from
Dec 13, 2023

Conversation

kissiel
Copy link
Contributor

@kissiel kissiel commented Dec 1, 2023

Description

This PR changes how Checkbox handles automatic session crashes when using Checkbox Agent.

Previously if a job crashed or hang, there was no way to continue the session - when the Controller reconnected to the respawned Agent, it started the session afresh.

With the change I'm proposing here, the Checkbox Agent always resumes the automatic (silent) sessions. The previous job's outcome is decided depending on the existence of the noreturn flag. But the session always moves forward. It means that the sesion should never just disappear or keep rerunning the same job.

Resolved issues

Fixes: #22
Fixes: #722
Fixes: CHECKBOX-796
Fixes: CHECKBOX-867

Documentation

I'm providing description of the agent bootup together with a flow chart.

Tests

I've covered the code with unit tests and I'm proposing a Metabox scenario here.

@kissiel kissiel changed the title Solve resume on remote Solve resume on remote (BugFix) Dec 1, 2023
Copy link

codecov bot commented Dec 1, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (32448d5) 36.11% compared to head (0561d60) 36.47%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #859      +/-   ##
==========================================
+ Coverage   36.11%   36.47%   +0.35%     
==========================================
  Files         310      310              
  Lines       34631    34621      -10     
  Branches     5968     5963       -5     
==========================================
+ Hits        12506    12627     +121     
+ Misses      21561    21426     -135     
- Partials      564      568       +4     
Flag Coverage Δ
checkbox-ng 62.58% <100.00%> (+0.87%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@Hook25 Hook25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few preliminary comments here and there

checkbox-ng/checkbox_ng/launcher/agent.py Show resolved Hide resolved
checkbox-ng/checkbox_ng/launcher/agent.py Outdated Show resolved Hide resolved
checkbox-ng/checkbox_ng/launcher/agent.py Show resolved Hide resolved
checkbox-ng/checkbox_ng/launcher/agent.py Outdated Show resolved Hide resolved
checkbox-ng/plainbox/impl/session/agent_bootup.md Outdated Show resolved Hide resolved
checkbox-ng/plainbox/impl/session/agent_bootup.md Outdated Show resolved Hide resolved
checkbox-ng/plainbox/impl/session/agent_bootup.md Outdated Show resolved Hide resolved
checkbox-ng/plainbox/impl/session/agent_bootup.md Outdated Show resolved Hide resolved
checkbox-ng/plainbox/impl/session/agent_bootup.md Outdated Show resolved Hide resolved
@kissiel kissiel requested a review from Hook25 December 11, 2023 14:09
@kissiel kissiel marked this pull request as ready for review December 11, 2023 14:09
Copy link
Collaborator

@Hook25 Hook25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments, beside them this is pretty much ready to land. Thanks for adding all those tests

checkbox-ng/checkbox_ng/launcher/test_agent.py Outdated Show resolved Hide resolved
checkbox-ng/plainbox/impl/session/remote_assistant.py Outdated Show resolved Hide resolved
checkbox-ng/plainbox/impl/session/test_remote_assistant.py Outdated Show resolved Hide resolved
docs/explanation/remote.rst Outdated Show resolved Hide resolved
metabox/metabox/lxd_profiles/checkbox.profile Outdated Show resolved Hide resolved
Copy link
Collaborator

@pieqq pieqq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty exciting PR, especially if it fixes this long-standing issue of session resuming "instability"!

Here is how I tested your branch. This is a long explanation for my future self (or anyone who wish to test this branch, really), but you can jump to the "Tests" section below.

TL;DR: I found a problem when reconnecting to the agent after the ongoing job is finished.

Setup

Checkbox controller

On my laptop, I already have a virtual environment setup for Checkbox. I just point to your branch:

(venv) $ git switch solve-resume-on-remote

I use this venv for the Checkbox controller.

Checkbox agent

For the Checkbox agent, I create an LXC container running 22.04:

$ lxc launch images:ubuntu/22.04 jammy
$ lxc shell jammy

The rest of the commands are run in the container:

# apt install python3.10-venv python3-virtualenv git
# git clone https://github.com/canonical/checkbox.git
# cd checkbox/
# git switch solve-resume-on-remote

I follow the Contrib guide to get Checkbox installed in a venv. In the end, checkbox-cli lives in /root/checkbox/checkbox-ng/venv/bin/checkbox-cli and the providers are in described in /root/checkbox/checkbox-ng/venv/share/plainbox-providers-1.

I put the following in /etc/systemd/system/checkbox-ng.service:

[Unit]
Description=Checkbox Remote Service
Wants=network.target

[Service]
ExecStart=/root/checkbox/checkbox-ng/venv/bin/checkbox-cli run-agent
SyslogIdentifier=checkbox-ng.service
Environment="XDG_CACHE_HOME=/var/cache/"
Environment="PROVIDERPATH=/root/checkbox/checkbox-ng/venv/share/plainbox-providers-1"
Restart=always
RestartSec=1
TimeoutStopSec=30
Type=simple

[Install]
WantedBy=multi-user.target

and I install the checkbox-ng service and start it:

# systemctl daemon-reload
# systemctl enable checkbox-ng.service

Now, everything is in place. I can start a remote session from the controller by running:

(venv) $ checkbox-cli control <IP of my lxc container>

Sample jobs and test plan

In the 22.04 container, I create a new pieq.pxu file in /root/checkbox/providers/base/units/ and put the following in it:

unit: job
id: pieq/test
command:
 for i in $(seq 1 30);
 do
     echo "Iteration $i/30..."
     sleep 1
 done
flags: simple noreturn

unit: job
id: pieq/wrapup
command:
 echo "Wrapping up..."
flags: simple

unit: test plan
id: pieq
_name: pieq
include:
    pieq/test
    pieq/wrapup

the pieq/test job will run for 30 seconds and will show the current status of the job, so it's handy to see what's going on. It has the noreturn flag, but of course you can remove this flag if you want to test other use cases.

I need to restart the systemd service, otherwise this test plan will not be visible to Checkbox:

# systemctl restart checkbox-ng.service

Launcher

In order to simulate a non-interactive test run, I create the following launcher file (pieq.launcher):

[launcher]
launcher_version = 1
app_id = com.canonical.certification:PR859
stock_reports = text

[test plan]
unit = com.canonical.certification::pieq
forced = yes

[test selection]
forced = yes

[ui]
type = silent

[transport:outfile]
type = stream
stream = stdout

[exporter:text]
unit = com.canonical.plainbox::text

[report:screen]
transport = outfile
exporter = text

To run it from the controller side with:

(venv) $ checkbox-cli control <IP of my lxc container> pieq.launcher

Tests

Resuming a non-interactive session after simulating a DUT crash ✔️

  1. Run Checkbox remote using the launcher, which starts pieq/test (which runs for 30 seconds):
(venv) $ checkbox-cli control <IP of my lxc container> pieq.launcher

→ The test starts running

  1. Simulate a DUT crash by stopping and starting the LXC container:
lxc stop jammy && lxc start jammy

→ the session resumes, the test is marked as passed and the test run proceeds to the next test before finalizing the session:

Connecting to 10.146.223.75:18871. Timeout: 600s
-----------------------------[ Running job 1 / 2 ]------------------------------
---------------------------------[ pieq/test ]----------------------------------
ID: com.canonical.certification::pieq/test
Category: Uncategorised
--------------------------------------------------------------------------------
Iteration 1/30...
Iteration 2/30...
Iteration 3/30...
Iteration 4/30...
Iteration 5/30...
Iteration 6/30...
Iteration 7/30...
Iteration 8/30...
Connection lost!
connection closed by peer
Reconnecting ...
Reconnected (took: 0s)
---------------------------------[ pieq/test ]----------------------------------
ID: com.canonical.certification::pieq/test
Category: Uncategorised
--------------------------------------------------------------------------------
Outcome: job passed
-----------------------------[ Running job 2 / 2 ]------------------------------
--------------------------------[ pieq/wrapup ]---------------------------------
ID: com.canonical.certification::pieq/wrapup
Category: Uncategorised
--------------------------------------------------------------------------------
Wrapping up...
--------------------------------------------------------------------------------
Outcome: job passed
==================================[ Results ]===================================
  job passed   : pieq/test
  job passed   : pieq/wrapup

If the job pieq/test does not have the noreturn flag, the only difference is that the job will be marked as "crashed" when resuming the session.

→ All good!

Reconnecting to agent after the controller stopped/crashed ❌

One of the issue this should fix is #22 , which mentions

While testing is ongoing, restart your host computer.

So:

  1. Run Checkbox remote using the launcher, which starts pieq/test (which runs for 30 seconds):
(venv) $ checkbox-cli control <IP of my lxc container> pieq.launcher

→ The test starts running

  1. Close the terminal where the controller is running. Wait for 30 seconds, then try reconnecting to the agent:
(venv) $ checkbox-cli control 10.146.223.75
$PROVIDERPATH is defined, so following provider sources are ignored ['/usr/local/share/plainbox-providers-1', '/usr/share/plainbox-providers-1', '/home/pieq/.local/share/plainbox-providers-1', '/var/tmp/checkbox-providers-develop'] 
Connecting to 10.146.223.75:18871. Timeout: 600s
Rejoined session.
In progress: com.canonical.certification::pieq/test (1/2)
Iteration 17/30...
Iteration 18/30...
Iteration 19/30...
Iteration 20/30...
Iteration 21/30...
Iteration 22/30...
Iteration 23/30...
Iteration 24/30...
Iteration 25/30...
Iteration 26/30...
Iteration 27/30...
Iteration 28/30...
Iteration 29/30...
Iteration 30/30...

aaaaaaaaand nothing happens. The session never goes on to the next job (pieq/wrapup), and never finishes. This is because the job has finished running by the time we reconnect to the agent.

Running the previous tests using an interactive session ❌

If I try to do the same as before, but with an interactive session (that is, without providing the launcher and instead running checkbox-cli control <IP of my lxc container> and selecting the test plan and the tests to run, as commonly done by QA team), the session is never resumed as expected, and instead I am back to the "Select test plan" screen.

This is probably what the documentation explains:

Otherwise, it waits for a Controller to connect and chose what to do.

except the "what to do" just means "forget the on-going session and start a new one". As I understand, there is still some work by @Hook25 to get this working.

@kissiel
Copy link
Contributor Author

kissiel commented Dec 12, 2023

@pieqq
from the 3 experiments you've done, the middle one is about controller bugs more than what this PR is about.

For the third one, there's another set of stories that ought to land this week that solve that.

This PR is about reliable automated runs.

@kissiel kissiel changed the title Solve resume on remote (BugFix) Always resume automatic sessions on remote (BugFix) Dec 12, 2023
Hook25
Hook25 previously approved these changes Dec 13, 2023
Copy link
Collaborator

@Hook25 Hook25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, well done!

pieqq
pieqq previously approved these changes Dec 13, 2023
Copy link
Collaborator

@pieqq pieqq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks!

Hook25
Hook25 previously approved these changes Dec 13, 2023
Copy link
Collaborator

@Hook25 Hook25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Collaborator

@Hook25 Hook25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@kissiel kissiel merged commit 48fc045 into main Dec 13, 2023
20 checks passed
@kissiel kissiel deleted the solve-resume-on-remote branch December 13, 2023 14:09
binli pushed a commit to binli/checkbox that referenced this pull request Mar 22, 2024
* add mermaid flowchart of agent boot

* always resume automated sessions via remote

* fix restart logic requiring SA_RESTARTABLE

* update the url to the stable PPA

* improve coverage of remote assistant code

* add more coverage to the remote assistant

* py35 friendly assertion

* add coverage for sa.config property

* outline checking for open port

* add tests for registering agents args

* more agent unit tests

* more agent unit tests

* add coverage for the job type when resuming

* add tests for SessionAssistantAgent

* tweaks after Max's suggestions

* enable mermaid in sphinx

* translate the mermaid doc from md to rst

* remove the md version of the agent bootup diagram

* add paragraph describing agent resume

* smaller diamond

* fix typo and quote code better

* exclude mermaid blocks from spellcheck; sort list

* surrender to pyspelling

* use better words for system hangs

* correct the exclusion of mermaid content in spelling check

* properly handle no launcher in app_blob + UT

* bring back gettext call and mock it in UT

* smaller diamond redux

* move to function-level mock

* remove "padme" from the lxd_profile once again

* remove unnecessary debs from MB profile

* remove bionic from the special conditions

* use simpler comparison
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants