
Add performance regression testing workflow #332

Merged: 4 commits merged into amazon-ion:master on Oct 18, 2023

Conversation

@nirosys (Contributor) commented Oct 17, 2023

Issue #, if available: n/a

Description of changes:
This PR adds a GHA workflow that builds both the HEAD of the target branch and the PR's revision, then runs the ion-bench tool to generate benchmark results for a set of 200 KiB binary Ion datasets.

This PR adds only the read (deserialize_all) benchmark to the workflow. An update to add the write (serialize_all) benchmark will follow.

The workflow uses google-benchmark's compare.py, which performs a null-hypothesis check against the data to determine whether the results from the two benchmark runs are statistically similar. This test is not currently used for validation, primarily because I need to understand the tuning of the alpha parameter, as well as what to expect from the test in a noisy environment like GHA.
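For context, here is a hedged sketch of how compare.py can be invoked to diff two completed benchmark runs. The result file names are illustrative and the alpha flag is an assumption based on google-benchmark's tooling, so this is not necessarily the exact command the workflow runs:

# compare.py's "benchmarks" mode diffs a baseline run against a candidate run;
# both arguments may be JSON result files (illustrative names below).
python3 candidate/tools/ion-bench/deps/google-benchmark/tools/compare.py \
  -a 0.03 benchmarks baseline_results.json candidate_results.json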

Example Run: here


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@linlin-s

Would you mind adding the sample run of this workflow to the description?

Comment on lines +2 to +4
on:
pull_request:
branches: [ master ]


We might want to trigger the workflow only when the source code of ion-c has been changed.

on:
  pull_request:
    paths:
      - 'ionc/*'

nirosys (Contributor Author)

Good call, adding.

Comment on lines 39 to 45
java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/realWorldDataSchema01.isl testData/realWorldDataSchema01.10n
java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/realWorldDataSchema02.isl testData/realWorldDataSchema02.10n
java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/realWorldDataSchema03.isl testData/realWorldDataSchema03.10n

java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/nestedList.isl testData/nestedList.10n
java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/nestedStruct.isl testData/nestedStruct.10n
java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/sexp.isl testData/sexp.10n


Nitpick: We might prefer to use a for loop to avoid repeated logic.

for test in nestedStruct nestedList sexp realWorldDataSchema01 realWorldDataSchema02 realWorldDataSchema03
do
  java -jar $jar_file generate -S ${{env.data_size}} --input-ion-schema $schema_dir/${test}.isl testData/${test}.10n
done

nirosys (Contributor Author)

Yea.. I had originally broken it out because what I had taken from either ion-python's or ion-java's workflow had -S 100, which resulted in some wildly different dataset sizes. I think sexp.10n was 195 bytes, which definitely wasn't large enough to overcome the normal runtime noise (a 5% diff ended up being ~195ns on my system). Once I saw what -S does and realized what was going on, I didn't re-loop it. I'll tidy that up.

[[ -e /usr/bin/cmake ]] || ln -s `which cmake3` /usr/bin/cmake

- name: Get Data Generator
uses: actions/checkout@v2
@linlin-s commented Oct 17, 2023

Could we pin to a specific commit SHA instead of a tag? Tags can be updated and re-pointed, so the code referenced by a tag can change, whereas a commit SHA represents the exact snapshot of the code at a point in time.

The latest version of actions/checkout is recommended: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
The same suggestion applies to the other actions/checkout call.

env:
compare: candidate/tools/ion-bench/deps/google-benchmark/tools/compare.py
cli_path: candidate/build/profiling/tools/ion-bench/src/IonCBench
alpha: 0.03


I'd like to know whether the alpha value can be adjusted based on the accuracy of the regression-detection workflow.

nirosys (Contributor Author)

Absolutely, alpha can be adjusted to fine-tune the null-hypothesis test; that is why it is exposed, so we can adjust it as needed. However, the results of the null-hypothesis test aren't being used right now, because the p-value calculated between revisions almost always determines the benchmarks to have statistically significant differences.

I plan to spend a little more time looking into the test to see if it is appropriate for these benchmarks.
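(If the u-test turns out to be usable here, a significance gate could eventually look roughly like the sketch below. This is only a hedged illustration: the field names inside utest, cpu_pvalue in particular, are assumptions about compare.py's JSON output, and the surrounding shell is invented for the example.)

# Hypothetical gate: fail only when the u-test says the difference is significant
# AND the candidate is slower than the baseline (positive relative cpu diff).
REGRESSIONS=$(jq --argjson alpha "${alpha}" '
  .[] | select(.run_type == "aggregate" and (.name | endswith("_mean")))
      | select(((.utest.cpu_pvalue? // 1) < $alpha) and (.measurements[0].cpu > 0))
      | .name' results.json)
[ -z "$REGRESSIONS" ] || { echo "Statistically significant regressions:"; echo "$REGRESSIONS"; exit 1; }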


Thanks for the explanation.

@nirosys (Contributor Author) commented Oct 17, 2023

Would you mind adding the sample run of this workflow to the description?

@linlin-s, I'm not sure what you mean. The workflow ran as part of the checks, so all of the output is available there. The full output is a bit much for the PR description; do you mean just the output of compare.py?

@linlin-s

Would you mind adding the sample run of this workflow to the description?

@linlin-s, I'm not sure what you mean. The workflow ran as part of the checks, so all of the output is available there. The full output is a bit much for the PR description; do you mean just the output of compare.py?

Apologies for the confusion. What I meant by a sample run is already included in the checks: we can track the current workflow run there since it is triggered by pull_request to master. After we update the trigger event, we might have to trigger the workflow manually to see the results, since it will only run when the source code changes. In that case it would be easier to review if we could include a hyperlink to the updated workflow run.

nirosys marked this pull request as ready for review October 17, 2023 20:36
threshold_perc: 5
run: |
echo "Printing Results"
RESULTS=$(cat results.json | jq '.[] | select(.run_type == "aggregate" and (.name | endswith("_mean"))) | {name:.name,cpu_time_perc_diff:(.measurements[0].cpu*100)}|select(.cpu_time_perc_diff > '"${threshold_perc}"')')


Approving this PR with one question: when cpu_time_perc_diff > 0, it means the candidate's CPU run time is greater than the baseline's, right? If the value is greater than the threshold, the workflow fails with a regression detected.

nirosys (Contributor Author)

Yes, exactly. The way compare.py generates its output, we end up with data like this:

  {
    "name": "realWorldDataSchema03.10n_mean",
    "label": "",
    "measurements": [
      {
        "real_time": 12452149.371052643,
        "cpu_time": 12451057.894736838,
        "real_time_other": 12473477.750893278,
        "cpu_time_other": 12470339.46428573,
        "time": 0.0017128271758622316,
        "cpu": 0.0015485888598303227
      }
    ],
    "time_unit": "ns",
    "run_type": "aggregate",
    "aggregate_name": "mean",
    "utest": {}
  },

This represents the baseline's (cpu_time) and candidate's (cpu_time_other) mean (over 20 runs) wall-clock and CPU times for the realWorldDataSchema03 test. The cpu field is the relative difference ((candidate - baseline) / baseline) between the candidate and the baseline; likewise, the time field is the relative wall-clock difference. Since the relative difference is a fraction of the baseline, I multiply it by 100 so we can configure the threshold with whole percentage values. Any positive value is a result of the candidate's time being larger than the baseline's.
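As a sanity check against the sample above, the cpu field works out like this (values copied from the JSON; the 5% threshold is only for illustration):

# cpu = (cpu_time_other - cpu_time) / cpu_time
#     = (12470339.464 - 12451057.895) / 12451057.895
#     ≈ 0.00155
# multiplied by 100 -> ~0.155% slower than the baseline, well under a 5% threshold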

The check step passes the JSON output through that jq query. The JSON output from compare.py contains both the raw results from the candidate run and the comparisons to the baseline like I mentioned above. The jq query looks for the aggregate comparisons for the mean and gathers only the results that exceed the threshold. If there are none, the RESULTS variable will be an empty string; otherwise it will contain JSON that includes the name and CPU %diff for each failing run. The step then just checks whether RESULTS is empty when deciding if it should fail. If it's not empty, it also uses jq to print out the name and %diff for each of the tests that exceed the threshold.
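Put together, a hedged sketch of that check logic (the jq query mirrors the one in the workflow; the variable setup and echo messages are illustrative):

threshold_perc=5
RESULTS=$(jq '.[] | select(.run_type == "aggregate" and (.name | endswith("_mean")))
  | {name: .name, cpu_time_perc_diff: (.measurements[0].cpu * 100)}
  | select(.cpu_time_perc_diff > '"${threshold_perc}"')' results.json)
if [ -n "$RESULTS" ]; then
  echo "Benchmarks exceeding the ${threshold_perc}% threshold:"
  echo "$RESULTS" | jq -r '"\(.name): +\(.cpu_time_perc_diff)%"'
  exit 1
fi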

Hah.. typing this up made me realize I missed the aggregate_name field. I could make the jq query a little friendlier; I'll follow up with that in a later PR.


Thanks for the detailed explanation. Sounds good to me.

nirosys merged commit 4808138 into amazon-ion:master Oct 18, 2023
13 checks passed