Issue #244: Initial draft of performance testing page, including summarized case study of DSPT use of APDEX (#245)

Merged: 31 commits on May 3, 2022

Commits:
* b8a1646 Update quality-checks.md (andyblundell, Feb 14, 2022)
* e366a92 Issue #244 - Initial notes on performance tests (chrisclarknhsnet, Feb 16, 2022)
* 40ced37 Update quality-checks.md (andyblundell, Feb 16, 2022)
* 1a61268 Update testing.md (andyblundell, Feb 16, 2022)
* 5a65c97 Update performancetesting.md (andyblundell, Feb 16, 2022)
* aa694ba Update performancetesting.md (andyblundell, Feb 16, 2022)
* 2a50771 Update performancetesting.md (andyblundell, Feb 16, 2022)
* 090fa53 Update performancetesting.md (andyblundell, Feb 16, 2022)
* edbd056 Update performancetesting.md (andyblundell, Feb 16, 2022)
* 0da6874 Update performancetesting.md (andyblundell, Feb 16, 2022)
* d76ac2e Added paragraph outlining the impact the approach had by reference to… (chrisclarknhsnet, Feb 17, 2022)
* 7f5e1b5 Merge branch 'Apdex' of https://github.com/NHSDigital/software-engine… (chrisclarknhsnet, Feb 17, 2022)
* 8fc9194 Update testing.md (andyblundell, Feb 17, 2022)
* 7226250 Update performancetesting.md (andyblundell, Feb 17, 2022)
* 7741405 Update performancetesting.md (andyblundell, Feb 17, 2022)
* 75f36bd Update performancetesting.md (andyblundell, Feb 17, 2022)
* 76ae606 Merge branch 'Apdex' of https://github.com/NHSDigital/software-engine… (chrisclarknhsnet, Feb 17, 2022)
* 2b41bab Renaming of files and directory structure as per Dan's comments (chrisclarknhsnet, Feb 17, 2022)
* 1406af5 Redo of Andy's changes into the renamed performance testing file (chrisclarknhsnet, Feb 17, 2022)
* 0e65d52 Update quality-checks.md (andyblundell, Feb 17, 2022)
* 6c6eb7b Fixed a link (andyblundell, Feb 17, 2022)
* 1733fc9 Update performance-testing.md (andyblundell, Feb 18, 2022)
* 61bd1e8 Added reference to engineering dashboards (andyblundell, Feb 18, 2022)
* 2cd9672 Draft of an "impact" section at the end (andyblundell, Feb 18, 2022)
* 405a8ca Update performance-testing.md (andyblundell, Feb 18, 2022)
* 0a62fa8 Removed link to an image (andyblundell, Feb 18, 2022)
* 67cf900 Removed an image (andyblundell, Feb 18, 2022)
* 4f9438c Redacted end point information (chrisclarknhsnet, May 3, 2022)
* 001c19c Merge from main (chrisclarknhsnet, May 3, 2022)
* ffcf4f8 Re-adding APDEX links to Quality-checks page after re-organisation of… (chrisclarknhsnet, May 3, 2022)
* 2db01df Removing errant spaces and pipe seperators from quality_checks.md for… (chrisclarknhsnet, May 3, 2022)

1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
.DS_Store
*.code-workspace
!project.code-workspace
.vs/*
Binary file added practices/dsptcasestudy-architecture.png
Binary file added practices/dsptcasestudy-degradationoutput.png
Binary file added practices/dsptcasestudy-jmeteroutput.png
Binary file added practices/dsptcasestudy-scenarios.png
Binary file added practices/jmeter-reportsample.png
112 changes: 112 additions & 0 deletions practices/performancetesting.md
@@ -0,0 +1,112 @@
# Performance Testing

## Context

* These notes are part of a broader set of [principles](../principles.md)
* This is related to [Engineering quality-checks](https://digital.nhs.uk/about-nhs-digital/our-work/nhs-digital-architecture/principles/adopt-appropriate-cyber-security-standards)
* Related community of practice: [Test Automation Working Group](../communities/pd-test-automation-working-group.md)
* See also:
* [Quality Metrics](../quality-checks.md)
* [Continuous integration](continuous-integration.md)
* [Governance as a side effect](../patterns/governance-side-effect.md)
* [Testing](testing.md)

## Introduction

Performance testing has a somewhat ambiguous meaning across the IT industry and is often used interchangeably with other testing terms such as load testing, stress testing, soak testing, etc.

For the sake of clarity this page will consider Performance Testing as per the definition on [Wikipedia](https://en.wikipedia.org/wiki/Software_performance_testing), namely:

> performance testing is in general a testing practice performed to determine how a system performs in terms of responsiveness and stability under a particular workload. It can also serve to investigate, measure, validate or verify other quality attributes of the system, such as scalability, reliability and resource usage.

## How to start?

### Know your audience

* Identify common user interactions or journeys with your system
* Identify how many users are typically accessing your system at any given moment
* Calculate or estimate what percentage of those users will be performing a given interaction or journey at any given moment
* This information can then be used to design your thread groups in JMeter, or a similar grouping of interactions in other testing tools (see the sketch below)
* The information can also be used to determine a "typical" load and to scale up load realistically as part of your tests
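
A minimal sketch of this sizing arithmetic, in Python, using purely hypothetical journey names and percentages (any real figures would come from your own analytics):

```python
# Hypothetical figures for illustration only: replace with data from your own
# analytics about concurrent users and how they split across journeys.
CONCURRENT_USERS = 100

JOURNEY_MIX = {  # journey name -> % of users performing it at any given moment
    "log_in_and_view_dashboard": 40,
    "complete_assertion": 35,
    "publish_assessment": 15,
    "download_report": 10,
}

def thread_counts(total_users, mix):
    """Split the total concurrent users across journeys by percentage."""
    return {journey: round(total_users * pct / 100) for journey, pct in mix.items()}

for journey, threads in thread_counts(CONCURRENT_USERS, JOURNEY_MIX).items():
    print(f"{journey}: {threads} threads")  # use as the thread count per thread group
```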

### What does good look like?

* Identify clear targets for performance: performance testing should be an **objective**, not subjective, exercise
* Examples of possible targets might be:
  * SLA based, e.g. all pages must respond within 4 seconds
  * Relative, e.g. any given release must not deteriorate performance by more than 5%
  * Weighted by interaction: if a user performs a particular interaction once every 3 months they are liable to be more accepting of an 8 second delay than for a task they perform many times a day
  * Weighted by load: in busy periods you may be willing to have a slightly longer response time
* Consider how your targets may be influenced by your architecture - for example if you are using a serverless "scale on demand" architecture your targets might be cost based

Ultimately, a missed target is a red flag telling you that you need to investigate further.

## Use of the APDEX index

[APDEX](https://en.wikipedia.org/wiki/Apdex) is a simple formula for calculating performance based on a target response time which would satisfy your users. It is useful because it gives a definite figure between 0 and 1.0, where 1.0 means all of your users are happy and satisfied, and 0 means they are all unhappy and dissatisfied.

APDEX acts as a "smoothing" function and helps ameliorate the effect of outliers by classifying response times purely in terms of whether the user is satisfied, tolerating or frustrated. Therefore, if you have a strict SLA around every page response time it may not be appropriate for you to use. It is also important to choose a realistic target response time: if it is overly lenient or overly strict, you will struggle to make much distinction between different performance test runs. Repeated results of 0 or 1.0 aren't very useful.
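
For reference, the calculation itself is straightforward: given a target response time T, samples at or below T count as satisfied, samples above T but at or below 4T count as tolerating (at half weight), and anything slower counts as frustrated. A minimal Python sketch of the formula:

```python
def apdex(response_times_ms, target_ms):
    """APDEX = (satisfied + tolerating / 2) / total samples.

    satisfied:  response time <= target
    tolerating: target < response time <= 4 * target
    frustrated: response time > 4 * target (contributes nothing)
    """
    satisfied = sum(1 for t in response_times_ms if t <= target_ms)
    tolerating = sum(1 for t in response_times_ms if target_ms < t <= 4 * target_ms)
    return (satisfied + tolerating / 2) / len(response_times_ms)

# e.g. apdex([300, 450, 900, 2100, 5000], target_ms=500) == (2 + 1 / 2) / 5 == 0.5
```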

APDEX is a useful index for pipelines as it gives a definite figure and is therefore a very objective measure as opposed to the more subjective, manual interpretation of standard load testing reports such as the JMeter example below:

![Alt](./jmeter-reportsample.png "Sample JMeter Report")

As such, APDEX can help us answer, and take action on (e.g. fail the pipeline), fundamental questions such as:

* Are our users happy?
* Have we made performance worse?
* Would our users become unhappy when there is an increased load?

## A case study

For the Data Security and Protection Toolkit (DSPT) we decided to use APDEX so that, prior to a fortnightly release, we could answer the question:

> Has this release made the performance of the system worse?

### A case study (know your audience)
<div style="float:right">
<img src="./dsptcasestudy-scenarios.png" alt="DSPT user scenarios" />
</div>
<div style="padding-right:30px;padding-bottom:30px;">
Previously we had defined a list of user scenarios for the typical actions undertaken on the system, which we named after Mr Men characters. We also defined, for every 100 users, how many (i.e. what percentage) would be likely to be performing a given Mr Man scenario.
</div>

We used these scenarios to define our thread groups within JMeter and decided we would run our performance tests for 250 users at a time, which would represent a heavy load for the system.

### A case study (what does good look like?)

We decided that we wanted to know if the performance of a particular scenario had degraded by more than 5% compared to the previous average performance. If it had, we wanted to fail the pipeline so we could investigate any new pieces of code further.

### A case study (approach)

Although you can apply APDEX figures in JMeter, it only calculates them per endpoint, whereas we wanted to aggregate our APDEX figures at the Mr Man scenario level.

We therefore wrote a Python program which would take the raw JMeter results file (a sample of which is shown below) and, using regular expressions to group results by thread names matching a Mr Man scenario, calculate the aggregate APDEX score per scenario.

![Alt](./dsptcasestudy-jmeteroutput.png "Sample of raw JMeter result file")

The Python program produces a file with the output below:

![Alt](./dsptcasestudy-aggregatedapdexscores.png "Aggregated APDEX results file")
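
The program itself is not reproduced on this page, but as a rough illustration of the approach (not the actual DSPT code), grouping a CSV-format JMeter results file by a thread-name pattern and scoring each group might look something like the sketch below. It assumes JMeter's default CSV columns (`threadName`, `elapsed`) and a hypothetical naming convention in which each thread group starts with its Mr Man scenario name:

```python
import csv
import re
from collections import defaultdict

TARGET_MS = 500  # illustrative "satisfied" threshold
SCENARIO_PATTERN = re.compile(r"^(Mr\w+)")  # hypothetical thread-name convention

def apdex(samples, target_ms):
    satisfied = sum(1 for t in samples if t <= target_ms)
    tolerating = sum(1 for t in samples if target_ms < t <= 4 * target_ms)
    return (satisfied + tolerating / 2) / len(samples)

def scores_by_scenario(jtl_path):
    """Aggregate an APDEX score per scenario from a raw JMeter results file."""
    samples = defaultdict(list)
    with open(jtl_path, newline="") as results:
        for row in csv.DictReader(results):
            match = SCENARIO_PATTERN.match(row["threadName"])
            if match:
                samples[match.group(1)].append(int(row["elapsed"]))
    return {scenario: apdex(times, TARGET_MS) for scenario, times in samples.items()}
```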

These figures were compared against the per-scenario average of previous results files, which had been stored in an S3 bucket, by using an Athena database over the S3 bucket and the following query:

> SELECT type, key, avg(apdex) AS average FROM "dspt"."performance_test_results" GROUP BY type, key

Using the results of this query we could calculate any deterioration and fail the pipeline if needed. If the results were within the 5% limit then the results file was simply added to the S3 bucket. Additionally, for information, the results of the calculation were written to the Jenkins log, as shown below:

![Alt](./dsptcasestudy-degradationoutput.png "Degradation result in Jenkins log")
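
A rough sketch of that comparison step (again, not the DSPT implementation; the scenario names and scores are illustrative, with the current scores coming from the aggregated results file and the baselines from the Athena query):

```python
MAX_DEGRADATION = 0.05  # fail the build if a scenario's APDEX drops by more than 5%

def degraded_scenarios(current, historical_avg):
    """Return the scenarios whose APDEX has dropped beyond the allowed limit."""
    failures = []
    for scenario, score in current.items():
        baseline = historical_avg.get(scenario)
        if baseline and (baseline - score) / baseline > MAX_DEGRADATION:
            failures.append(f"{scenario}: {score:.3f} vs average {baseline:.3f}")
    return failures

failing = degraded_scenarios(
    current={"MrBump": 0.91, "MrRush": 0.87},         # this run's aggregated scores
    historical_avg={"MrBump": 0.97, "MrRush": 0.88},  # averages returned by Athena
)
if failing:
    raise SystemExit("Performance degraded by more than 5%:\n" + "\n".join(failing))
```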

### A case study - some caveats

Whilst we have found this approach useful, there are certain caveats to it, for example:

* What if performance slowly degrades over time, but always by less than 5% per release?
* The aggregation/smoothing effect of APDEX can potentially hide an individual page or endpoint whose performance has degraded.
* We recognise that in the future we probably also want to apply an absolute target, e.g. an APDEX of >= 0.9

### A case study - architecture

The following diagram summarises the approach taken to using APDEX by DSPT:

![Alt](./dsptcasestudy-architecture.png "DSPT Performance test architecture")
3 changes: 2 additions & 1 deletion practices/testing.md
@@ -8,6 +8,7 @@
* [Continuous integration](continuous-integration.md)
* [Governance as a side effect](../patterns/governance-side-effect.md)
* [Quality Metrics](../quality-checks.md)
* [Performance Testing](performancetesting.md)

## General Testing Principles

@@ -78,7 +79,7 @@

* BDD tools to encode acceptance criteria in business terms as automated tests where appropriate.
* Chaos engineering / resilience testing e.g. using AWS Fault Injection Simulator (see [AWS FIS](../tools/aws-fis) for sample code)
* Performance tools to check load, volume, soak and stress limits
* Performance tools to check load, volume, soak and stress limits (see [Performance Testing practices](performancetesting.md) for further details)

## Further reading and resources

2 changes: 1 addition & 1 deletion quality-checks.md
@@ -50,7 +50,7 @@ We recommend tracking progress on an Engineering Quality dashboard, for example:
| Security code analysis | Security | Universal | Check for indications of possible security issues (for example injection weaknesses) | This gives fast feedback about security issues. <br/><br/> Code analysis is not as thorough as security testing in terms of finding complex weaknesses or issues that only manifest themselves at runtime, but it has much greater coverage. It's a better option for finding simple weaknesses and it's much quicker to execute. <br/><br/> Security code analysis and security testing are both important to achieve rapid and thorough security testing. | If using SonarQube, must use SonarQube's default [rules, profiles and gateways](tools/sonarqube.md#default-quality-gates) <br/><br/> Build pipeline must fail if gateway is not met | One option is [SonarQube](tools/sonarqube.md). For the purpose of security code analysis, Developer Edition or higher is required as it includes advanced OWASP scanning. | |
| Security testing | Security | Contextual | Check for security issues (for example injection weaknesses) | More thorough than security code scanning, but much slower to execute, so both are important to achieve both rapid and thorough security testing | | | |
| Dependency scanning | Security | Universal | Check for security issues and vulnerabilities in dependent areas of code that are outside of our direct control | Without this we have no way of knowing of any issues or security vulnerabilities of third party components that we are not responsible for | Must check against CVE database <br/><br/>Must check dependencies of dependencies <br/><br/>Must fail build if any [High](https://www.imperva.com/learn/application-security/cve-cvss-vulnerability/) severity vulnerabilities are found <br/><br/>It should be easy to determine why the build failed: which vulnerability it was, and in which top-level dependency <br/><br/>Tools must include ability to exclude accepted vulnerabilities. These should include a date at which the exclusion expires and the build fails again. These should include a description of why they are excluded | One option is (other options are being added): [dependency-check-maven](tools/dependency-check-maven/README.md) | |
| Performance tests | Resilience | Contextual | Check whether application performance is acceptable at different levels of load. This may include: <br/>* Baseline test (one-off) - to establish how the system interacts <br/>* Smoke test - to establish that the key functionality is working before performing longer tests <br/>* Regression test - run a suite of repeatable test cases to validate existing functionality <br/>* Load test - to understand the system behaviour under an expected load | Without these tests, we don't know how load will affect the performance of the application, or whether existing functionality has been broken. | | | |
| [Performance tests](./practices/performancetesting.md) | Resilience | Contextual | Check whether application performance is acceptable at different levels of load. This may include: <br/>* Baseline test (one-off) - to establish how the system interacts <br/>* Smoke test - to establish that the key functionality is working before performing longer tests <br/>* Regression test - run a suite of repeatable test cases to validate existing functionality <br/>* Load test - to understand the system behaviour under an expected load | Without these tests, we don't know how load will affect the performance of the application, or whether existing functionality has been broken. | The performance of the system must be scored at build time so that it can be tracked<br/><br/>Build pipeline must fail if performance does not meet the acceptable level | One option is to use [APDEX](https://en.wikipedia.org/wiki/Apdex) to quantify performance to a numeric value, and to use this value to pass/fail the build pipeline |[Performance test practices](./practices/performancetesting.md) |
| Capacity tests | Resilience | Contextual | Identify the application's breaking point in terms of an increasingly heavy load. Degradation may manifest itself as <br/>* throughput bottlenecks<br/>* increasing response times<br/>* error rates rising | Without this test, we don't know how much load the application can handle before the application breaks or degrades | | | |
| Stress tests | Resilience | Contextual | Check how the system performs under stress including <br/> * a level load near the maximum capacity for a prolonged period<br/> * sudden spikes in load with a lower baseline load | Without this test, we don't know if the application will begin to fail as a result of memory leaks, connection pool blocking etc. or will fail under a sharp increase in load triggered by adverts, news coverage or TV tea breaks | | | |
| Soak tests | Resilience | Contextual | Check whether sustained heavy load for a significantly extended period causes a problem such as memory leaks, loss of instances, database failovers etc. | Without this test, we don't know if application performance will suffer under prolonged heavy load, how stable the system is, how it performs without interventions. | | | |