-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Always resume automatic sessions on remote (BugFix) #859
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #859 +/- ##
==========================================
+ Coverage 36.11% 36.47% +0.35%
==========================================
Files 310 310
Lines 34631 34621 -10
Branches 5968 5963 -5
==========================================
+ Hits 12506 12627 +121
+ Misses 21561 21426 -135
- Partials 564 568 +4
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few preliminary comments here and there
a3646d3
to
e6110c5
Compare
2149d3d
to
0178e8c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor comments, beside them this is pretty much ready to land. Thanks for adding all those tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pretty exciting PR, especially if it fixes this long-standing issue of session resuming "instability"!
Here is how I tested your branch. This is a long explanation for my future self (or anyone who wish to test this branch, really), but you can jump to the "Tests" section below.
TL;DR: I found a problem when reconnecting to the agent after the ongoing job is finished.
Setup
Checkbox controller
On my laptop, I already have a virtual environment setup for Checkbox. I just point to your branch:
(venv) $ git switch solve-resume-on-remote
I use this venv for the Checkbox controller.
Checkbox agent
For the Checkbox agent, I create an LXC container running 22.04:
$ lxc launch images:ubuntu/22.04 jammy
$ lxc shell jammy
The rest of the commands are run in the container:
# apt install python3.10-venv python3-virtualenv git
# git clone https://github.com/canonical/checkbox.git
# cd checkbox/
# git switch solve-resume-on-remote
I follow the Contrib guide to get Checkbox installed in a venv. In the end, checkbox-cli lives in /root/checkbox/checkbox-ng/venv/bin/checkbox-cli
and the providers are in described in /root/checkbox/checkbox-ng/venv/share/plainbox-providers-1
.
I put the following in /etc/systemd/system/checkbox-ng.service
:
[Unit]
Description=Checkbox Remote Service
Wants=network.target
[Service]
ExecStart=/root/checkbox/checkbox-ng/venv/bin/checkbox-cli run-agent
SyslogIdentifier=checkbox-ng.service
Environment="XDG_CACHE_HOME=/var/cache/"
Environment="PROVIDERPATH=/root/checkbox/checkbox-ng/venv/share/plainbox-providers-1"
Restart=always
RestartSec=1
TimeoutStopSec=30
Type=simple
[Install]
WantedBy=multi-user.target
and I install the checkbox-ng service and start it:
# systemctl daemon-reload
# systemctl enable checkbox-ng.service
Now, everything is in place. I can start a remote session from the controller by running:
(venv) $ checkbox-cli control <IP of my lxc container>
Sample jobs and test plan
In the 22.04 container, I create a new pieq.pxu
file in /root/checkbox/providers/base/units/
and put the following in it:
unit: job
id: pieq/test
command:
for i in $(seq 1 30);
do
echo "Iteration $i/30..."
sleep 1
done
flags: simple noreturn
unit: job
id: pieq/wrapup
command:
echo "Wrapping up..."
flags: simple
unit: test plan
id: pieq
_name: pieq
include:
pieq/test
pieq/wrapup
the pieq/test
job will run for 30 seconds and will show the current status of the job, so it's handy to see what's going on. It has the noreturn
flag, but of course you can remove this flag if you want to test other use cases.
I need to restart the systemd service, otherwise this test plan will not be visible to Checkbox:
# systemctl restart checkbox-ng.service
Launcher
In order to simulate a non-interactive test run, I create the following launcher file (pieq.launcher
):
[launcher]
launcher_version = 1
app_id = com.canonical.certification:PR859
stock_reports = text
[test plan]
unit = com.canonical.certification::pieq
forced = yes
[test selection]
forced = yes
[ui]
type = silent
[transport:outfile]
type = stream
stream = stdout
[exporter:text]
unit = com.canonical.plainbox::text
[report:screen]
transport = outfile
exporter = text
To run it from the controller side with:
(venv) $ checkbox-cli control <IP of my lxc container> pieq.launcher
Tests
Resuming a non-interactive session after simulating a DUT crash ✔️
- Run Checkbox remote using the launcher, which starts
pieq/test
(which runs for 30 seconds):
(venv) $ checkbox-cli control <IP of my lxc container> pieq.launcher
→ The test starts running
- Simulate a DUT crash by stopping and starting the LXC container:
lxc stop jammy && lxc start jammy
→ the session resumes, the test is marked as passed and the test run proceeds to the next test before finalizing the session:
Connecting to 10.146.223.75:18871. Timeout: 600s
-----------------------------[ Running job 1 / 2 ]------------------------------
---------------------------------[ pieq/test ]----------------------------------
ID: com.canonical.certification::pieq/test
Category: Uncategorised
--------------------------------------------------------------------------------
Iteration 1/30...
Iteration 2/30...
Iteration 3/30...
Iteration 4/30...
Iteration 5/30...
Iteration 6/30...
Iteration 7/30...
Iteration 8/30...
Connection lost!
connection closed by peer
Reconnecting ...
Reconnected (took: 0s)
---------------------------------[ pieq/test ]----------------------------------
ID: com.canonical.certification::pieq/test
Category: Uncategorised
--------------------------------------------------------------------------------
Outcome: job passed
-----------------------------[ Running job 2 / 2 ]------------------------------
--------------------------------[ pieq/wrapup ]---------------------------------
ID: com.canonical.certification::pieq/wrapup
Category: Uncategorised
--------------------------------------------------------------------------------
Wrapping up...
--------------------------------------------------------------------------------
Outcome: job passed
==================================[ Results ]===================================
job passed : pieq/test
job passed : pieq/wrapup
If the job pieq/test
does not have the noreturn
flag, the only difference is that the job will be marked as "crashed" when resuming the session.
→ All good!
Reconnecting to agent after the controller stopped/crashed ❌
One of the issue this should fix is #22 , which mentions
While testing is ongoing, restart your host computer.
So:
- Run Checkbox remote using the launcher, which starts
pieq/test
(which runs for 30 seconds):
(venv) $ checkbox-cli control <IP of my lxc container> pieq.launcher
→ The test starts running
- Close the terminal where the controller is running. Wait for 30 seconds, then try reconnecting to the agent:
(venv) $ checkbox-cli control 10.146.223.75
$PROVIDERPATH is defined, so following provider sources are ignored ['/usr/local/share/plainbox-providers-1', '/usr/share/plainbox-providers-1', '/home/pieq/.local/share/plainbox-providers-1', '/var/tmp/checkbox-providers-develop']
Connecting to 10.146.223.75:18871. Timeout: 600s
Rejoined session.
In progress: com.canonical.certification::pieq/test (1/2)
Iteration 17/30...
Iteration 18/30...
Iteration 19/30...
Iteration 20/30...
Iteration 21/30...
Iteration 22/30...
Iteration 23/30...
Iteration 24/30...
Iteration 25/30...
Iteration 26/30...
Iteration 27/30...
Iteration 28/30...
Iteration 29/30...
Iteration 30/30...
aaaaaaaaand nothing happens. The session never goes on to the next job (pieq/wrapup
), and never finishes. This is because the job has finished running by the time we reconnect to the agent.
Running the previous tests using an interactive session ❌
If I try to do the same as before, but with an interactive session (that is, without providing the launcher and instead running checkbox-cli control <IP of my lxc container>
and selecting the test plan and the tests to run, as commonly done by QA team), the session is never resumed as expected, and instead I am back to the "Select test plan" screen.
This is probably what the documentation explains:
Otherwise, it waits for a Controller to connect and chose what to do.
except the "what to do" just means "forget the on-going session and start a new one". As I understand, there is still some work by @Hook25 to get this working.
@pieqq For the third one, there's another set of stories that ought to land this week that solve that. This PR is about reliable automated runs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, well done!
9485ebb
to
4a18dab
Compare
4a18dab
to
7c99070
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
* add mermaid flowchart of agent boot * always resume automated sessions via remote * fix restart logic requiring SA_RESTARTABLE * update the url to the stable PPA * improve coverage of remote assistant code * add more coverage to the remote assistant * py35 friendly assertion * add coverage for sa.config property * outline checking for open port * add tests for registering agents args * more agent unit tests * more agent unit tests * add coverage for the job type when resuming * add tests for SessionAssistantAgent * tweaks after Max's suggestions * enable mermaid in sphinx * translate the mermaid doc from md to rst * remove the md version of the agent bootup diagram * add paragraph describing agent resume * smaller diamond * fix typo and quote code better * exclude mermaid blocks from spellcheck; sort list * surrender to pyspelling * use better words for system hangs * correct the exclusion of mermaid content in spelling check * properly handle no launcher in app_blob + UT * bring back gettext call and mock it in UT * smaller diamond redux * move to function-level mock * remove "padme" from the lxd_profile once again * remove unnecessary debs from MB profile * remove bionic from the special conditions * use simpler comparison
Description
This PR changes how Checkbox handles automatic session crashes when using Checkbox Agent.
Previously if a job crashed or hang, there was no way to continue the session - when the Controller reconnected to the respawned Agent, it started the session afresh.
With the change I'm proposing here, the Checkbox Agent always resumes the automatic (silent) sessions. The previous job's outcome is decided depending on the existence of the
noreturn
flag. But the session always moves forward. It means that the sesion should never just disappear or keep rerunning the same job.Resolved issues
Fixes: #22
Fixes: #722
Fixes: CHECKBOX-796
Fixes: CHECKBOX-867
Documentation
I'm providing description of the agent bootup together with a flow chart.
Tests
I've covered the code with unit tests and I'm proposing a Metabox scenario here.