
Add performance regression testing workflow #332

Merged: 4 commits merged into amazon-ion:master on Oct 18, 2023

Conversation

@nirosys (Contributor) commented Oct 17, 2023

Issue #, if available: n/a

Description of changes:
This PR adds a GHA workflow that builds both the HEAD of the target branch and the PR's revision, then runs the ion-bench tool to generate benchmark results for a set of 200 KiB binary Ion datasets.

This PR adds only the read (deserialize_all) benchmark to the workflow. An update to add the write (serialize_all) benchmark will follow.

The workflow uses google-benchmark's compare.py, which performs a null-hypothesis check against the data to determine whether the results from the two benchmark runs are statistically similar. This test is not currently used for validation, primarily because I need to understand the tuning of the alpha parameter, as well as what to expect from the test in a noisy environment like GHA.
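For context, here is a hedged sketch of how compare.py can be invoked to diff two completed benchmark runs. The result file names are illustrative and the alpha flag is an assumption based on google-benchmark's tooling, so this is not necessarily the exact command the workflow runs:

# compare.py's "benchmarks" mode diffs a baseline run against a candidate run;
# both arguments may be JSON result files (illustrative names below).
python3 candidate/tools/ion-bench/deps/google-benchmark/tools/compare.py \
  -a 0.03 benchmarks baseline_results.json candidate_results.json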

Example Run: here


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@linlin-s

Would you mind adding the sample run of this workflow to the description?

Comment on lines +2 to +4
on:
pull_request:
branches: [ master ]


We might want to trigger the workflow only when the source code of ion-c has been changed.

on:
  pull_request:
    paths:
      - 'ionc/*'

nirosys (Contributor Author)

Good call, adding.

Comment on lines 39 to 45
java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/realWorldDataSchema01.isl testData/realWorldDataSchema01.10n
java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/realWorldDataSchema02.isl testData/realWorldDataSchema02.10n
java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/realWorldDataSchema03.isl testData/realWorldDataSchema03.10n

java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/nestedList.isl testData/nestedList.10n
java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/nestedStruct.isl testData/nestedStruct.10n
java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/sexp.isl testData/sexp.10n


Nitpick: We might prefer to use a for loop to avoid repeated logic.

for test in nestedStruct nestedList sexp realWorldDataSchema01 realWorldDataSchema02 realWorldDataSchema03
do
  java -jar $jar_file generate -S ${{env.data_size}} --input-ion-schema $schema_dir/${test}.isl testData/${test}.10n
done

nirosys (Contributor Author)

Yea.. I had originally broken it out because what I had taken from either ion-python's or ion-java's workflow had -S 100, which resulted in some wildly different dataset sizes. I think sexp.10n was 195 bytes, which definitely wasn't large enough to overcome the normal runtime noise (a 5% diff ended up being ~195ns on my system). Once I saw what -S does and realized what was going on, I didn't re-loop it. I'll tidy that up.

[[ -e /usr/bin/cmake ]] || ln -s `which cmake3` /usr/bin/cmake

- name: Get Data Generator
uses: actions/checkout@v2
@linlin-s commented Oct 17, 2023

Could we pin to a specific commit SHA instead of a tag? Tags can be updated and re-pointed, so the code referenced by a tag can change, whereas a commit SHA represents the exact snapshot of the code at a point in time.

The latest version of actions/checkout is recommended: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
The same suggestion applies to the other actions/checkout call.

env:
compare: candidate/tools/ion-bench/deps/google-benchmark/tools/compare.py
cli_path: candidate/build/profiling/tools/ion-bench/src/IonCBench
alpha: 0.03


I'd like to know whether the alpha value can be adjusted based on the accuracy of the regression-detection workflow.

nirosys (Contributor Author)

Absolutely, alpha can be adjusted to fine-tune the null-hypothesis test; that is why it is exposed, so we can adjust it as needed. However, the results of the null-hypothesis test aren't being used right now, because the p-value calculated between revisions almost always determines the benchmarks to have statistically significant differences.

I plan to spend a little more time looking into the test to see if it is appropriate for these benchmarks.
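(If the u-test turns out to be usable here, a significance gate could eventually look roughly like the sketch below. This is only a hedged illustration: the field names inside utest, cpu_pvalue in particular, are assumptions about compare.py's JSON output, and the surrounding shell is invented for the example.)

# Hypothetical gate: fail only when the u-test says the difference is significant
# AND the candidate is slower than the baseline (positive relative cpu diff).
REGRESSIONS=$(jq --argjson alpha "${alpha}" '
  .[] | select(.run_type == "aggregate" and (.name | endswith("_mean")))
      | select(((.utest.cpu_pvalue? // 1) < $alpha) and (.measurements[0].cpu > 0))
      | .name' results.json)
[ -z "$REGRESSIONS" ] || { echo "Statistically significant regressions:"; echo "$REGRESSIONS"; exit 1; }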


Thanks for the explanation.

@nirosys (Contributor Author) commented Oct 17, 2023

Would you mind adding the sample run of this workflow to the description?

@linlin-s, I'm not sure what you mean. The workflow ran as part of the checks, so all of the output is available there. The full output is a bit much for the PR description; do you mean just the output of compare.py?

@linlin-s

Would you mind adding the sample run of this workflow to the description?

@linlin-s, I'm not sure what you mean. The workflow ran as part of the checks, so all of the output is available there. The full output is a bit much for the PR description; do you mean just the output of compare.py?

Apologies for the confusion. What I meant by a sample run is already included in the checks: we can track the current workflow run there since it is triggered by pull_request to master. After we update the trigger event, we might have to trigger the workflow manually to see the results, since it will only run when the source code changes. In that case it would be easier to review if we could include a hyperlink to the updated workflow run.

nirosys marked this pull request as ready for review October 17, 2023 20:36
threshold_perc: 5
run: |
echo "Printing Results"
RESULTS=$(cat results.json | jq '.[] | select(.run_type == "aggregate" and (.name | endswith("_mean"))) | {name:.name,cpu_time_perc_diff:(.measurements[0].cpu*100)}|select(.cpu_time_perc_diff > '"${threshold_perc}"')')


Approving this PR with one question: when cpu_time_perc_diff > 0, it means the candidate's CPU run time is greater than the baseline's, right? If the value is greater than the threshold, the workflow fails with a regression detected.

nirosys (Contributor Author)

Yes, exactly. The way compare.py generates its output, we end up with data like this:

  {
    "name": "realWorldDataSchema03.10n_mean",
    "label": "",
    "measurements": [
      {
        "real_time": 12452149.371052643,
        "cpu_time": 12451057.894736838,
        "real_time_other": 12473477.750893278,
        "cpu_time_other": 12470339.46428573,
        "time": 0.0017128271758622316,
        "cpu": 0.0015485888598303227
      }
    ],
    "time_unit": "ns",
    "run_type": "aggregate",
    "aggregate_name": "mean",
    "utest": {}
  },

This represents the baseline's (cpu_time) and candidate's (cpu_time_other) mean (over 20 runs) wall-clock and CPU times for the realWorldDataSchema03 test. The cpu field is the relative difference ((candidate - baseline) / baseline) between the candidate and the baseline; likewise, the time field is the relative wall-clock difference. Since the relative difference is a fraction of the baseline, I multiply it by 100 so we can configure the threshold with whole percentage values. Any positive value is a result of the candidate's time being larger than the baseline's.
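As a sanity check against the sample above, the cpu field works out like this (values copied from the JSON; the 5% threshold is only for illustration):

# cpu = (cpu_time_other - cpu_time) / cpu_time
#     = (12470339.464 - 12451057.895) / 12451057.895
#     ≈ 0.00155
# multiplied by 100 -> ~0.155% slower than the baseline, well under a 5% threshold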

The check step passes the JSON output through that jq query. The JSON output from compare.py contains both the raw results from the candidate run and the comparisons to the baseline like I mentioned above. The jq query looks for the aggregate comparisons for the mean and gathers only the results that exceed the threshold. If there are none, the RESULTS variable will be an empty string; otherwise it will contain JSON that includes the name and CPU %diff for each failing run. The step then just checks whether RESULTS is empty when deciding if it should fail. If it's not empty, it also uses jq to print out the name and %diff for each of the tests that exceed the threshold.
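Put together, a hedged sketch of that check logic (the jq query mirrors the one in the workflow; the variable setup and echo messages are illustrative):

threshold_perc=5
RESULTS=$(jq '.[] | select(.run_type == "aggregate" and (.name | endswith("_mean")))
  | {name: .name, cpu_time_perc_diff: (.measurements[0].cpu * 100)}
  | select(.cpu_time_perc_diff > '"${threshold_perc}"')' results.json)
if [ -n "$RESULTS" ]; then
  echo "Benchmarks exceeding the ${threshold_perc}% threshold:"
  echo "$RESULTS" | jq -r '"\(.name): +\(.cpu_time_perc_diff)%"'
  exit 1
fi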

Hah.. typing this up made me realize I missed the aggregate_name field. I could make the jq query a little friendlier; I'll follow up with that in a later PR.


Thanks for the detailed explanation. Sounds good to me.

nirosys merged commit 4808138 into amazon-ion:master Oct 18, 2023
13 checks passed