Always resume automatic sessions on remote (BugFix) #859

kissiel · 2023-12-01T14:37:15Z

Description

This PR changes how Checkbox handles automatic session crashes when using Checkbox Agent.

Previously, if a job crashed or hung, there was no way to continue the session: when the Controller reconnected to the respawned Agent, it started the session afresh.

With the change I'm proposing here, the Checkbox Agent always resumes automatic (silent) sessions. The previous job's outcome is decided by the presence of the noreturn flag, but the session always moves forward. This means that the session should never just disappear or keep rerunning the same job.
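The rule described above can be sketched in a few lines of Python. This is only an illustration of the decision, not the code from this PR: the names `Outcome`, `outcome_for_interrupted_job` and `should_auto_resume` are hypothetical, and the mapping (noreturn means passed, otherwise crashed) is taken from the test results reported later in this conversation.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical outcome labels, used only for this illustration."""
    PASS = "pass"
    CRASH = "crash"


def outcome_for_interrupted_job(flags):
    """Pick the outcome of the job that was running when the Agent went down.

    A job flagged `noreturn` is expected to take the system down, so the
    interruption counts as a pass; any other job is recorded as crashed.
    """
    return Outcome.PASS if "noreturn" in flags else Outcome.CRASH


def should_auto_resume(ui_type, has_incomplete_session):
    """Assumption: only silent (automatic) sessions are resumed unattended."""
    return has_incomplete_session and ui_type == "silent"
```

Either way the session advances to the next job, which is what prevents both the "session disappears" and the "same job reruns forever" failure modes.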

Resolved issues

Fixes: #22
Fixes: #722
Fixes: CHECKBOX-796
Fixes: CHECKBOX-867

Documentation

I'm providing a description of the agent bootup together with a flow chart.

Tests

I've covered the code with unit tests and I'm proposing a Metabox scenario here.

codecov · 2023-12-01T14:39:32Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (32448d5) 36.11% compared to head (0561d60) 36.47%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #859      +/-   ##
==========================================
+ Coverage   36.11%   36.47%   +0.35%     
==========================================
  Files         310      310              
  Lines       34631    34621      -10     
  Branches     5968     5963       -5     
==========================================
+ Hits        12506    12627     +121     
+ Misses      21561    21426     -135     
- Partials      564      568       +4

Flag	Coverage Δ
checkbox-ng	`62.58% <100.00%> (+0.87%)`	⬆️

Hook25

A few preliminary comments here and there

checkbox-ng/checkbox_ng/launcher/agent.py

checkbox-ng/plainbox/impl/session/agent_bootup.md

Hook25

A few minor comments; besides them, this is pretty much ready to land. Thanks for adding all those tests.

checkbox-ng/checkbox_ng/launcher/test_agent.py

checkbox-ng/plainbox/impl/session/remote_assistant.py

checkbox-ng/plainbox/impl/session/test_remote_assistant.py

docs/explanation/remote.rst

metabox/metabox/lxd_profiles/checkbox.profile

pieqq

This is a pretty exciting PR, especially if it fixes this long-standing issue of session resuming "instability"!

Here is how I tested your branch. This is a long explanation for my future self (or anyone who wishes to test this branch, really), but you can jump to the "Tests" section below.

TL;DR: I found a problem when reconnecting to the agent after the ongoing job is finished.

Setup

Checkbox controller

On my laptop, I already have a virtual environment set up for Checkbox. I just point to your branch:

(venv) $ git switch solve-resume-on-remote

I use this venv for the Checkbox controller.

Checkbox agent

For the Checkbox agent, I create an LXC container running 22.04:

$ lxc launch images:ubuntu/22.04 jammy
$ lxc shell jammy

The rest of the commands are run in the container:

# apt install python3.10-venv python3-virtualenv git
# git clone https://github.com/canonical/checkbox.git
# cd checkbox/
# git switch solve-resume-on-remote

I follow the Contrib guide to get Checkbox installed in a venv. In the end, checkbox-cli lives in /root/checkbox/checkbox-ng/venv/bin/checkbox-cli and the providers are described in /root/checkbox/checkbox-ng/venv/share/plainbox-providers-1.

I put the following in /etc/systemd/system/checkbox-ng.service:

[Unit]
Description=Checkbox Remote Service
Wants=network.target

[Service]
ExecStart=/root/checkbox/checkbox-ng/venv/bin/checkbox-cli run-agent
SyslogIdentifier=checkbox-ng.service
Environment="XDG_CACHE_HOME=/var/cache/"
Environment="PROVIDERPATH=/root/checkbox/checkbox-ng/venv/share/plainbox-providers-1"
Restart=always
RestartSec=1
TimeoutStopSec=30
Type=simple

[Install]
WantedBy=multi-user.target

and I install the checkbox-ng service and start it:

# systemctl daemon-reload
# systemctl enable checkbox-ng.service

Now, everything is in place. I can start a remote session from the controller by running:

(venv) $ checkbox-cli control <IP of my lxc container>
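Before launching the controller, it can help to confirm that the agent is actually listening. The sketch below is not part of Checkbox; `is_port_open` is a hypothetical helper built on the standard `socket` module, and the port 18871 is the one shown in the controller output further down in this comment.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example (replace the address with the IP of your LXC container):
#   is_port_open("10.146.223.75", 18871)
```

If this returns False right after enabling the service, the unit may not have been started yet or may be failing on startup.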

Sample jobs and test plan

In the 22.04 container, I create a new pieq.pxu file in /root/checkbox/providers/base/units/ and put the following in it:

unit: job
id: pieq/test
command:
 for i in $(seq 1 30);
 do
     echo "Iteration $i/30..."
     sleep 1
 done
flags: simple noreturn

unit: job
id: pieq/wrapup
command:
 echo "Wrapping up..."
flags: simple

unit: test plan
id: pieq
_name: pieq
include:
    pieq/test
    pieq/wrapup

The pieq/test job will run for 30 seconds and will show the current status of the job, so it's handy to see what's going on. It has the noreturn flag, but of course you can remove this flag if you want to test other use cases.
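To double-check which jobs in a unit file carry the noreturn flag, a very small parser is enough. The sketch below is a simplification written for this comment, not the parser Checkbox uses: `parse_units` is a hypothetical helper that assumes records are separated by blank lines and that indented lines continue the previous field, and it ignores everything else the real format supports.

```python
def parse_units(text):
    """Split unit-file text into a list of {field: value} dictionaries."""
    units, current, last_key = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record.
            if current:
                units.append(current)
            current, last_key = {}, None
        elif line[0].isspace() and last_key is not None:
            # An indented line continues the previous field.
            joined = current[last_key] + "\n" + line.strip()
            current[last_key] = joined.strip()
        else:
            key, _, value = line.partition(":")
            last_key = key.strip()
            current[last_key] = value.strip()
    if current:
        units.append(current)
    return units


SAMPLE = """\
unit: job
id: pieq/test
command:
 sleep 30
flags: simple noreturn

unit: job
id: pieq/wrapup
command:
 echo "Wrapping up..."
flags: simple
"""

noreturn_jobs = [
    unit["id"]
    for unit in parse_units(SAMPLE)
    if "noreturn" in unit.get("flags", "").split()
]
```

With the sample above, only pieq/test ends up in `noreturn_jobs`, which matches the job that is expected to be marked as passed after the simulated crash.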

I need to restart the systemd service, otherwise this test plan will not be visible to Checkbox:

# systemctl restart checkbox-ng.service

Launcher

In order to simulate a non-interactive test run, I create the following launcher file (pieq.launcher):

[launcher]
launcher_version = 1
app_id = com.canonical.certification:PR859
stock_reports = text

[test plan]
unit = com.canonical.certification::pieq
forced = yes

[test selection]
forced = yes

[ui]
type = silent

[transport:outfile]
type = stream
stream = stdout

[exporter:text]
unit = com.canonical.plainbox::text

[report:screen]
transport = outfile
exporter = text

I run it from the controller side with:

(venv) $ checkbox-cli control <IP of my lxc container> pieq.launcher
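Since the launcher is an INI-style file, it is easy to verify that it really requests a silent UI before starting a long run. The sketch below uses Python's standard `configparser` as a stand-in; Checkbox has its own launcher handling, so `is_silent_launcher` is a hypothetical helper and only an approximation of how the file is read.

```python
import configparser

# Abbreviated copy of the launcher above (report sections omitted).
LAUNCHER_TEXT = """\
[launcher]
launcher_version = 1
app_id = com.canonical.certification:PR859
stock_reports = text

[test plan]
unit = com.canonical.certification::pieq
forced = yes

[ui]
type = silent
"""


def is_silent_launcher(text):
    """Return True if the launcher text sets `type = silent` under [ui]."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    # The fallback covers launchers without a [ui] section or `type` key.
    return parser.get("ui", "type", fallback="") == "silent"
```

A launcher that fails this check would start an interactive session, which is the case covered in the last test of this comment.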

Tests

Resuming a non-interactive session after simulating a DUT crash ✔️

Run Checkbox remote using the launcher, which starts pieq/test (which runs for 30 seconds):

(venv) $ checkbox-cli control <IP of my lxc container> pieq.launcher

→ The test starts running

Simulate a DUT crash by stopping and starting the LXC container:

lxc stop jammy && lxc start jammy

→ the session resumes, the test is marked as passed and the test run proceeds to the next test before finalizing the session:

Connecting to 10.146.223.75:18871. Timeout: 600s
-----------------------------[ Running job 1 / 2 ]------------------------------
---------------------------------[ pieq/test ]----------------------------------
ID: com.canonical.certification::pieq/test
Category: Uncategorised
--------------------------------------------------------------------------------
Iteration 1/30...
Iteration 2/30...
Iteration 3/30...
Iteration 4/30...
Iteration 5/30...
Iteration 6/30...
Iteration 7/30...
Iteration 8/30...
Connection lost!
connection closed by peer
Reconnecting ...
Reconnected (took: 0s)
---------------------------------[ pieq/test ]----------------------------------
ID: com.canonical.certification::pieq/test
Category: Uncategorised
--------------------------------------------------------------------------------
Outcome: job passed
-----------------------------[ Running job 2 / 2 ]------------------------------
--------------------------------[ pieq/wrapup ]---------------------------------
ID: com.canonical.certification::pieq/wrapup
Category: Uncategorised
--------------------------------------------------------------------------------
Wrapping up...
--------------------------------------------------------------------------------
Outcome: job passed
==================================[ Results ]===================================
  job passed   : pieq/test
  job passed   : pieq/wrapup

If the job pieq/test does not have the noreturn flag, the only difference is that the job will be marked as "crashed" when resuming the session.

→ All good!

Reconnecting to agent after the controller stopped/crashed ❌

One of the issues this PR should fix is #22, which mentions:

While testing is ongoing, restart your host computer.

So:

Run Checkbox remote using the launcher, which starts pieq/test (which runs for 30 seconds):

(venv) $ checkbox-cli control <IP of my lxc container> pieq.launcher

→ The test starts running

Close the terminal where the controller is running. Wait for 30 seconds, then try reconnecting to the agent:

(venv) $ checkbox-cli control 10.146.223.75
$PROVIDERPATH is defined, so following provider sources are ignored ['/usr/local/share/plainbox-providers-1', '/usr/share/plainbox-providers-1', '/home/pieq/.local/share/plainbox-providers-1', '/var/tmp/checkbox-providers-develop'] 
Connecting to 10.146.223.75:18871. Timeout: 600s
Rejoined session.
In progress: com.canonical.certification::pieq/test (1/2)
Iteration 17/30...
Iteration 18/30...
Iteration 19/30...
Iteration 20/30...
Iteration 21/30...
Iteration 22/30...
Iteration 23/30...
Iteration 24/30...
Iteration 25/30...
Iteration 26/30...
Iteration 27/30...
Iteration 28/30...
Iteration 29/30...
Iteration 30/30...

aaaaaaaaand nothing happens. The session never goes on to the next job (pieq/wrapup), and never finishes. This is because the job has finished running by the time we reconnect to the agent.

Running the previous tests using an interactive session ❌

If I try to do the same as before, but with an interactive session (that is, without providing the launcher and instead running checkbox-cli control <IP of my lxc container> and selecting the test plan and the tests to run, as commonly done by the QA team), the session is never resumed as expected, and instead I am back at the "Select test plan" screen.

This is probably what the documentation explains:

Otherwise, it waits for a Controller to connect and chose what to do.

except the "what to do" just means "forget the on-going session and start a new one". As I understand, there is still some work by @Hook25 to get this working.

kissiel · 2023-12-12T09:33:58Z

@pieqq
from the 3 experiments you've done, the middle one is about controller bugs more than what this PR is about.

For the third one, there's another set of stories that ought to land this week that solve that.

This PR is about reliable automated runs.

checkbox-ng/checkbox_ng/launcher/agent.py

checkbox-ng/checkbox_ng/launcher/checkbox_cli.py

Hook25

+1, well done!

pieqq

Nice! Thanks!

Hook25

+1

Hook25

+1

* add mermaid flowchart of agent boot
* always resume automated sessions via remote
* fix restart logic requiring SA_RESTARTABLE
* update the url to the stable PPA
* improve coverage of remote assistant code
* add more coverage to the remote assistant
* py35 friendly assertion
* add coverage for sa.config property
* outline checking for open port
* add tests for registering agents args
* more agent unit tests
* more agent unit tests
* add coverage for the job type when resuming
* add tests for SessionAssistantAgent
* tweaks after Max's suggestions
* enable mermaid in sphinx
* translate the mermaid doc from md to rst
* remove the md version of the agent bootup diagram
* add paragraph describing agent resume
* smaller diamond
* fix typo and quote code better
* exclude mermaid blocks from spellcheck; sort list
* surrender to pyspelling
* use better words for system hangs
* correct the exclusion of mermaid content in spelling check
* properly handle no launcher in app_blob + UT
* bring back gettext call and mock it in UT
* smaller diamond redux
* move to function-level mock
* remove "padme" from the lxd_profile once again
* remove unnecessary debs from MB profile
* remove bionic from the special conditions
* use simpler comparison

kissiel changed the title ~~Solve resume on remote~~ Solve resume on remote (BugFix) Dec 1, 2023

Hook25 reviewed Dec 7, 2023


kissiel force-pushed the solve-resume-on-remote branch from a3646d3 to e6110c5 Compare December 7, 2023 16:20

kissiel requested a review from Hook25 December 11, 2023 14:09

kissiel marked this pull request as ready for review December 11, 2023 14:09

kissiel force-pushed the solve-resume-on-remote branch from 2149d3d to 0178e8c Compare December 11, 2023 14:21

Hook25 requested changes Dec 11, 2023


pieqq requested changes Dec 12, 2023


kissiel changed the title ~~Solve resume on remote (BugFix)~~ Always resume automatic sessions on remote (BugFix) Dec 12, 2023

pieqq mentioned this pull request Dec 13, 2023

Session does not continue if reconnecting to an agent after the controller stopped/crashed #888

Open

pieqq requested changes Dec 13, 2023


checkbox-ng/checkbox_ng/launcher/agent.py (outdated, resolved)

checkbox-ng/checkbox_ng/launcher/checkbox_cli.py (resolved)

Hook25 previously approved these changes Dec 13, 2023


kissiel dismissed Hook25’s stale review via 9485ebb December 13, 2023 11:58

kissiel force-pushed the solve-resume-on-remote branch 2 times, most recently from 9485ebb to 4a18dab Compare December 13, 2023 12:03

kissiel added 13 commits December 13, 2023 13:04

add mermaid flowchart of agent boot

030c69e

always resume automated sessions via remote

4e3e2a3

fix restart logic requiring SA_RESTARTABLE

ff142ae

update the url to the stable PPA

3a893e3

improve coverage of remote assistant code

3fb2682

add more coverage to the remote assistant

2097a28

py35 friendly assertion

c9a3626

add coverage for sa.config property

1e803c7

outline checking for open port

bb43415

add tests for registering agents args

0e57c83

more agent unit tests

23d6751

more agent unit tests

ad13be1

add coverage for the job type when resuming

f2af9e9

kissiel added 14 commits December 13, 2023 13:05

translate the mermaid doc from md to rst

50d650d

remove the md version of the agent bootup diagram

97b6a21

add paragraph describing agent resume

1199051

smaller diamond

3ae2ade

fix typo and quote code better

b8958fc

exclude mermaid blocks from spellcheck; sort list

d27283a

surrender to pyspelling

bc6ca8b

use better words for system hangs

ce581bb

correct the exclusion of mermaid content in spelling check

fd2dc8a

properly handle no launcher in app_blob + UT

6e1531f

bring back gettext call and mock it in UT

e4f770a

smaller diamond redux

1ae5954

move to function-level mock

b5f71d4

remove "padme" from the lxd_profile once again

7c99070

kissiel force-pushed the solve-resume-on-remote branch from 4a18dab to 7c99070 Compare December 13, 2023 12:05

pieqq previously approved these changes Dec 13, 2023


remove unnecessary debs from MB profile

2b0ed4d

kissiel dismissed pieqq’s stale review via 2b0ed4d December 13, 2023 12:18

Hook25 previously approved these changes Dec 13, 2023


remove bionic from the special conditions

e5559f0

kissiel dismissed Hook25’s stale review via e5559f0 December 13, 2023 13:49

use simpler comparison

0561d60

Hook25 approved these changes Dec 13, 2023


kissiel merged commit 48fc045 into main Dec 13, 2023
20 checks passed

kissiel deleted the solve-resume-on-remote branch December 13, 2023 14:09

This was referenced Jan 2, 2024

Checkbox remote-slave.service be killed by oom-killer during memory stress-ng testing #719

Closed

update the url to the stable PPA (Infra) #860

Closed

pieqq mentioned this pull request Jan 10, 2024

LP1927663: [Checkbox deb] the checkbox session won't resume automatically while running power-automated test plan #20

Closed

LiaoU3 mentioned this pull request Jan 15, 2024

Auto resume is not working in edge channel #935

Closed
