
Harvester regression testing #91

Merged: 5 commits into main on Oct 4, 2024

Conversation

@RomanIakovlev (Collaborator) commented Sep 18, 2024

This PR is a work in progress. It aims to fix #90 by adding a way to automatically harvest and compare definitions between the prod and dev environments, summarizing and highlighting inconsistencies for a reviewer.

The overall approach I'm taking is to add more integration tests and modify the existing ones in tools/integration. I aim to achieve two improvements over the existing tests:

  1. Currently only a fixed, hardcoded list of coordinates is tested when harvesting and comparing definitions between the prod and dev environments. I want to add variation to the existing suites by also testing a dynamically obtained list of recently generated definitions from the prod environment.
  2. When comparing definitions between prod and dev, the generated output is not very human-friendly, especially when multiple definitions need to be compared. I want to add a way to summarize and classify the differences between definitions, so that a reviewer can quickly assess whether the current dev environment can be safely promoted to production.

You'll find more details about my implementation plan below.

Making the list of tested coordinates dynamic

To add variety to the test data, I plan to add support for querying recently modified definitions from production. Since there's currently no way to query recently modified data through the ClearlyDefined API, I've added an Azure Function (see tools/harvester-forwarding/src/functions/getRecentDefinitions.js in this PR) that queries CosmosDB and returns a number of definitions of each type (npm, maven, etc.) that were created or updated recently. The function is called over HTTP and takes two parameters: days, for how many days back to look, and limit, for how many records of each type to collect.
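
For illustration, here is a minimal sketch of what such a function could look like, assuming the Azure Functions v4 Node.js programming model and the @azure/cosmos client; the database and container names, the environment variable, the list of types, and the exact query are assumptions, not necessarily what getRecentDefinitions.js in this PR does:

const { app } = require('@azure/functions')
const { CosmosClient } = require('@azure/cosmos')

app.http('getRecentDefinitions', {
  methods: ['GET'],
  authLevel: 'function',
  handler: async request => {
    const days = Number(request.query.get('days') ?? 1)
    const limit = Number(request.query.get('limit') ?? 10)
    const since = new Date(Date.now() - days * 24 * 60 * 60 * 1000).toISOString()

    // Connection string, database, and container names are assumptions
    const client = new CosmosClient(process.env.COSMOS_CONNECTION_STRING)
    const container = client.database('clearlydefined').container('definitions')

    const types = ['npm', 'maven', 'pypi', 'nuget', 'git'] // illustrative subset
    const results = {}
    for (const type of types) {
      // Recency is judged by _meta.updated, i.e. when the definition was created or updated
      const { resources } = await container.items
        .query({
          query:
            'SELECT TOP @limit c.coordinates FROM c ' +
            'WHERE c.coordinates.type = @type AND c._meta.updated >= @since',
          parameters: [
            { name: '@limit', value: limit },
            { name: '@type', value: type },
            { name: '@since', value: since }
          ]
        })
        .fetchAll()
      results[type] = resources.map(r => r.coordinates)
    }
    return { jsonBody: results }
  }
})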

I plan to use the output of that function in addition to the fixed coordinates list. I think we should have both fixed and dynamic lists of coordinates to test, so I plan to parameterize the existing tests to pick either the static or the dynamic list in a given test run, and then add support for running the existing tests in a matrix fashion, so the same tests run against different coordinate lists.
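
For instance, a getComponents along these lines could back both modes (the environment variable name and the helper module here are hypothetical, just to illustrate the idea):

// Hypothetical sketch: pick the coordinate list per test run, so a CI matrix
// can run the same suite once per source (static and dynamic).
const staticComponents = require('./testConfig').components // assumed fixed list
const { fetchRecentDefinitions } = require('./lib/recentDefinitions') // assumed wrapper around the Azure Function

async function getComponents() {
  return process.env.COORDINATE_SOURCE === 'dynamic'
    ? await fetchRecentDefinitions({ days: 1, limit: 1 })
    : staticComponents
}

The matrix job would then set COORDINATE_SOURCE to static in one leg and dynamic in the other.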

Helping reviewers with definition comparison

There are existing tests that compare definitions between the prod and dev environments. When results are not identical, the existing tests produce a standard line-by-line diff. It is hard to pinpoint the nature of a change (whether it is an improvement or a regression) by looking at that diff, and if we want to do this for multiple definitions, we need a better way.

For that I've implemented comparison logic that classifies the differences between definitions into three categories: regression, improvement, and inconclusive. It then groups all the differences into these categories and presents the output in the following way:

{
  "overallResult": "inconclusive",
  "differences": {
    "inconclusive": [
      {
        "field": "described.tools",
        "diff": {
          "staging": [
            "clearlydefined/1.2.1",
            "reuse/3.2.1",
            "licensee/9.18.1",
            "scancode/32.3.0"
          ],
          "production": [
            "clearlydefined/1.2.0",
            "licensee/9.14.0",
            "scancode/30.3.0"
          ]
        }
      },
      {
        "field": "licensed.facets.core.discovered.unknown",
        "diff": {
          "staging": 1221,
          "production": 1224
        }
      },
      {
        "field": "licensed.facets.core.discovered.expressions",
        "diff": {
          "staging": [
            "Apache-2.0",
            "BSD-3-Clause",
            "LicenseRef-scancode-dco-1.1"
          ],
          "production": [
            "Apache-2.0",
            "BSD-3-Clause",
            "NOASSERTION"
          ]
        }
      }
    ],
    "improvement": [
      {
        "field": "licensed.facets.core.attribution.parties",
        "diff": {
          "addedElements": [
            "copyright 2013-2017 docker, inc.",
            "copyright (c) 2004, 2006 the linux foundation and its contributors"
          ]
        }
      }
    ]
  }
}

In the output above, you can see two categories of differences:

  • inconclusive for the paths described.tools, licensed.facets.core.discovered.unknown, and licensed.facets.core.discovered.expressions
  • improvement for the path licensed.facets.core.attribution.parties, since new attributions were found in dev compared to production.

The paths "_meta", "licensed.score", "licensed.toolScore", "described.score", and "described.toolScore" were ignored, since changes there are expected.
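
For reference, the per-field results then roll up into the overall verdict. A plausible sketch of that roll-up, consistent with the sample above (where an improvement alongside inconclusive differences still yields an overall "inconclusive"), though not necessarily the exact implementation:

function summarizeResults(results) {
  // Sketch: regression dominates, then inconclusive, then improvement
  if (results.includes('regression')) return 'regression'
  if (results.includes('inconclusive')) return 'inconclusive'
  if (results.includes('improvement')) return 'improvement'
  return 'identical' // assumed label for "no differences"
}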

I believe this output format is easier to read and understand. I plan to add a new test to the existing tools/integration/test/integration/e2e-test-service/definitionTest.js suite to produce such output for the definitions under test.

Please let me know what you think about these changes. If I get a green light, I'll proceed with converting this PR into its final version.

CC @qtomlinson @elrayle

@ljones140 (Contributor) left a comment

This looks great.

I would like to see some documentation of these tests when you come to productionise them.

};
}

function compareValues(val1, val2, ignoredKeys, path) {
Contributor:

This looks great.
I would get frustrated having to deal with JS type, null and undefined comparisons 😄


if (results.includes('improvement')) {
  return 'improvement';
}
Contributor:

I really like the way you've thought about this. Having these specific categories really helps the person who sees the results.

},
});

async function getData(context, days, limitPerType) {
Contributor:

Probably very unlikely it would happen. If the MongoDB schema were to change, I assume this test would fail, and whoever looked at it would have to update it.

But I doubt it will change this far into the project.

@@ -14,19 +14,16 @@ describe('Validation definitions between dev and prod', function () {
//Rest a bit to avoid overloading the servers
afterEach(() => new Promise(resolve => setTimeout(resolve, definition.timeout / 2)))

describe('Validation between dev and prod', function () {
before(() => {
loadFixtures().forEach(([url, definition]) =>
@qtomlinson (Collaborator) commented Sep 25, 2024

Could you please elaborate on why loadFixtures was removed? Some definitions differ between Dev and Prod for the following reasons:

  • Fixes are implemented in Dev, but the component has not been reharvested on Prod. When the component is eventually correct on Prod, the fixture can be removed.
  • A feature is implemented in Dev but not yet deployed to Prod, e.g. LicenseRef.

@RomanIakovlev (Collaborator, Author):

I've removed the fixtures because their purpose was not clear to me. It's not obvious what kind of assurance we expect to get from a test that compares the real (dev) definition to a modified (mocked prod) definition.

If we expect the definitions to differ, due to a bug fix or a new feature, let's assert that in a test, for each definition separately. Mocking and expecting equality, as is done now, seems to work in the opposite direction, glossing over the expected difference.

I admit I might be misunderstanding the purpose of this whole test, so maybe I should leave it as is and create a new one that does the comparison using the new comparison function? However, my initial approach was to modify this test, and I've removed the mocks to make sure I compare the real data.

Collaborator:

The goal of the integration tests is to detect any breaking changes. For example, after the recent ScanCode upgrade was completed, the integration tests showed differences: some were improvements, and some were regressions (see clearlydefined/service#1056 (comment)). For improvements that are not yet in the Production deployment, vetted improved definition files can be put into fixtures, allowing for a successful run before the Production deployment is updated. Ideally, after the Production deployment is updated, the components whose definitions are fixed by the new deployment can be re-harvested and the fixtures removed, since the definitions in dev and prod should then be in sync again. @elrayle, feel free to add anything I've missed.

@RomanIakovlev (Collaborator, Author):

Thanks for the information @qtomlinson. I've decided to keep this test unchanged and add my structured diff comparison as a separate step.

Collaborator:

It sounds like this is where you already landed. I'll add my thoughts for clarity. I like having both fixture-based tests and dynamic tests. Fixed tests catch regressions where we expect the same results every time; dynamic tests provide a broader sweep, with the goal of increasing confidence or identifying systemic problems with a proposed release.

    return JSON.parse(data);
  } catch (err) {
    // If the file doesn't exist, fetch the data and save it to disk
    const response = await fetch('https://cosmos-query-function-app.azurewebsites.net/api/getrecentdefinitions?days=1&limit=1');
@qtomlinson (Collaborator) commented Sep 25, 2024

Another source of recently harvested coordinates exists at the status endpoint: https://dev-api.clearlydefined.io/status/recentlycrawled. The response format is:

[
	{
		"coordinates": "go/golang/github.com%2Fazure/azure-sdk-for-go/v43.3.0+incompatible",
		"timestamp": "2024-09-25T22:36:22.017Z"
	},
	{
		"coordinates": "go/golang/github.com%2Fazure%2Fgo-autorest/autorest/v0.11.24",
		"timestamp": "2024-09-25T22:10:52.752Z"
	}
]

The internal logic is in service/statusService. It utilizes Application Insights, so it might be cheaper than a Cosmos query?

@RomanIakovlev (Collaborator, Author):

Thanks for the info, I wasn't aware of this endpoint. However, it doesn't give us enough data, as you pointed out, so we'll have to rely on some other mechanism.

I think querying CosmosDB should not increase the cost much, if at all, because it's done rarely and touches a limited amount of data.

Collaborator:

The approach looks good. Another idea just occurred to me: the search query in getRecentDefinitions is based on _meta.updated, which is also what the change publication is based on. The changed coordinates published hourly could potentially be used to provide the recent coordinates by day and by type (through sorting). Just thought to mention it as an idea.

@RomanIakovlev (Collaborator, Author):

Right, I also thought of using the change-notification mechanism for getting the recent definitions. For the relatively simple use case presented here, the data in the change notifications would be sufficient.

However, I have plans to add more data to these tests going forward. For example, one thing we'd be interested in is whether we're getting rid of OTHER and NOASSERTION license entries when migrating to ScanCode's LicenseRefs. For that we'd have to make a more elaborate query, and the data from the change-notification mechanism would no longer be sufficient. We'd have to query the database, using the same approach as presented in this PR (an Azure HTTP Function with CosmosDB access).
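
As a hypothetical illustration of such a query (the field path follows the definition schema shown in the sample output above; the container, lookback window, and variable names are assumed):

// Hypothetical Cosmos SQL sketch: recently updated definitions that still
// carry a NOASSERTION license expression in the core facet.
const since = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000).toISOString() // assumed 30-day lookback
const querySpec = {
  query:
    'SELECT c.coordinates FROM c ' +
    'WHERE c._meta.updated >= @since ' +
    'AND ARRAY_CONTAINS(c.licensed.facets.core.discovered.expressions, @expr)',
  parameters: [
    { name: '@since', value: since },
    { name: '@expr', value: 'NOASSERTION' }
  ]
}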

Do you have some concerns about the proposed database querying mechanism, @qtomlinson?

Collaborator:

Yes, the DB query is more extensible. Thanks for the explanation and clarification!

@qtomlinson (Collaborator) commented Sep 25, 2024

@RomanIakovlev The general approach looks good! It's great to see improvements and enhancements to the definition comparison!
Two things to consider:

@RomanIakovlev changed the title from "Initial version of harvester regression testing" to "Harvester regression testing" on Sep 27, 2024
@RomanIakovlev marked this pull request as ready for review on September 27, 2024 at 14:30
@RomanIakovlev (Collaborator, Author):

@qtomlinson Thanks for your feedback, Qing. Regarding the point about including scores in the structured diff, I'd prefer to keep the ignored keys as they are for now, mainly for the sake of readability of the output.

I agree looking into the scores might sometimes be necessary, but I don't think we need that all the time, only in those special cases when there are other, more significant changes (e.g. a copyright detection difference). For those times, we can manually run the diff with another set of keys. It might be worth taking the list of ignored keys as a workflow parameter for those occasions, but I'd prefer to add that as a separate change in the future.

@elrayle (Collaborator) left a comment

Thanks for the substantial contribution to testing and to raising confidence in release candidates. It appears to me that the one question that required resolution has been resolved.


@qtomlinson (Collaborator) left a comment

Awesome work adding more integration tests. Minor edits can be put in a separate PR.

-const status = await harvestTillCompletion(components)
+const recentDefinitions = await getComponents()
+console.info(`Recent definitions: ${recentDefinitions}`)
+const status = await harvestTillCompletion(recentDefinitions)
Collaborator:

nit: naming? recentDefinitions; these can also be static component coordinates.

}
}

function compareValues(val1, val2, ignoredKeys, path) {
Collaborator:

nit, naming: val2 seems to be the expected/baseline value, based on which regression or improvement is classified.

}

function isEmpty(value) {
if (value === null || value === undefined) return true
Collaborator:

empty string?

return false
}
}
return true
Collaborator:

nit: similar to return isSuperset(setB, setA)?

@elrayle (Collaborator) commented Oct 2, 2024

Are you suggesting merging the two methods since they are basically the same? I would definitely see that as a follow-on PR.

Collaborator:

Definitely, this can be a separate PR. The logic seems to be similar to:

function isSubset(setA, setB) {
  return isSuperset(setB, setA)
}
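
For reference, here is the conventional isSuperset this pairs with (a standard implementation sketch, not necessarily the exact code in this PR):

function isSuperset(setA, setB) {
  // true when every element of setB is present in setA
  for (const elem of setB) {
    if (!setA.has(elem)) return false
  }
  return true
}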

return handleLargeArrays(val1, val2, path, 'inconclusive')
}

function handleLargeArrays(val1, val2, path, result) {
Collaborator:

nit: naming? other types in addition to arrays are handled here

if (Array.isArray(val1) && Array.isArray(val2)) {
const set1 = new Set(val1.map(item => (typeof item === 'string' ? item.toLowerCase() : JSON.stringify(item))))
const set2 = new Set(val2.map(item => (typeof item === 'string' ? item.toLowerCase() : JSON.stringify(item))))

Collaborator:

nit: toLowerCase after stringify?

@@ -0,0 +1,224 @@
{
Collaborator:

Question: is this file for documentation purposes?

@elrayle (Collaborator) left a comment

I added some suggested changes for most of @qtomlinson's questions. @RomanIakovlev, what do you think about the suggested changes?

}
}

function compareValues(val1, val2, ignoredKeys, path) {
Collaborator:

Perhaps...

Suggested change:
-function compareValues(val1, val2, ignoredKeys, path) {
+function compareValues(devActual, expected, ignoredKeys, path) {

}

function isEmpty(value) {
if (value === null || value === undefined) return true
Collaborator:

Suggested change:
-if (value === null || value === undefined) return true
+if (value === null || value === undefined || value === '') return true

Comment on lines +117 to +119
if (Array.isArray(val1) && Array.isArray(val2)) {
const set1 = new Set(val1.map(item => (typeof item === 'string' ? item.toLowerCase() : JSON.stringify(item))))
const set2 = new Set(val2.map(item => (typeof item === 'string' ? item.toLowerCase() : JSON.stringify(item))))
Collaborator:

Suggested change:
-if (Array.isArray(val1) && Array.isArray(val2)) {
-  const set1 = new Set(val1.map(item => (typeof item === 'string' ? item.toLowerCase() : JSON.stringify(item))))
-  const set2 = new Set(val2.map(item => (typeof item === 'string' ? item.toLowerCase() : JSON.stringify(item))))
+if (Array.isArray(val1) && Array.isArray(val2)) {
+  const set1 = new Set(val1.map(item => JSON.stringify(item).toLowerCase()))
+  const set2 = new Set(val2.map(item => JSON.stringify(item).toLowerCase()))

Comment on lines +13 to +15
const recentDefinitions = await getComponents()
console.info(`Recent definitions: ${recentDefinitions}`)
const status = await harvestTillCompletion(recentDefinitions)
Collaborator:

Suggested change:
-const recentDefinitions = await getComponents()
-console.info(`Recent definitions: ${recentDefinitions}`)
-const status = await harvestTillCompletion(recentDefinitions)
+const targetDefinitions = await getComponents()
+console.info(`Recent definitions: ${targetDefinitions}`)
+const status = await harvestTillCompletion(targetDefinitions)

Didn't check to see if the constant is used beyond these lines.

const set1 = new Set(val1.map(item => (typeof item === 'string' ? item.toLowerCase() : JSON.stringify(item))))
const set2 = new Set(val2.map(item => (typeof item === 'string' ? item.toLowerCase() : JSON.stringify(item))))

if (isSuperset(set1, set2)) {
Collaborator:

When comparing arrays, in certain cases a superset might not be an improvement. For example, in the files section of the definitions, the expected outcome is that exactly the file paths listed in production are detected in the dev deployment.

@RomanIakovlev (Collaborator, Author):

Thanks @elrayle and @qtomlinson for the reviews and change suggestions. As the requested changes are not blocking, I'd rather merge this PR as it is now and address them in a separate one.

@RomanIakovlev merged commit bfb99d7 into main on Oct 4, 2024
2 checks passed
@RomanIakovlev deleted the roman/regression_testing branch on October 4, 2024 at 15:32