
Add performance regression testing workflow #332

Merged 4 commits on Oct 18, 2023
144 changes: 144 additions & 0 deletions .github/workflows/performance-regression.yml
@@ -0,0 +1,144 @@
name: Performance Regression Detection
on:
  pull_request:
    branches: [ master ]
Comment on lines +2 to +4


We might want to trigger the workflow only when the source code of ion-c has been changed.

on:
  pull_request:
    paths:
      - 'ionc/*'

Contributor Author

Good call, adding.
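
For reference, the combined trigger would look something like this (a sketch merging the existing branches filter with the suggested paths filter):

on:
  pull_request:
    branches: [ master ]
    paths:
      - 'ionc/*'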

jobs:
  detect-regression:
    name: Detect Regression
    runs-on: ubuntu-latest
    strategy:
      fail-fast: true
    env:
      CC: 'clang'
      CXX: 'clang++'
    steps:
      - name: Get Data Generator
        uses: actions/checkout@v2
@linlin-s linlin-s Oct 17, 2023

Could we pin to a specific commit SHA instead of a tag? Tags can be updated and re-pointed, so the code referenced by a tag can change, whereas a commit SHA represents the exact snapshot of the code at a point in time.

The latest version of actions/checkout is recommended: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
Same suggestion for the other actions/checkout calls.
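
Applied to this step, a SHA-pinned checkout might look like the following sketch (the SHA is the v4.1.1 commit quoted above):

- name: Get Data Generator
  uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
  with:
    repository: amazon-ion/ion-data-generator
    ref: main
    path: ion-data-generator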

        with:
          repository: amazon-ion/ion-data-generator
          ref: main
          path: ion-data-generator

      - name: Build Ion Data Generator
        run: cd ion-data-generator && mvn clean install

      - name: Generate Data
        env:
          jar_file: ion-data-generator/target/ion-data-generator-1.0-SNAPSHOT.jar
          schema_dir: ion-data-generator/tst/com/amazon/ion/workflow
        run: |
          mkdir -p testData
          # Generate approximately 200KB of data for each dataset, so that we can expect similar orders of
          # magnitude for our threshold.
          java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/realWorldDataSchema01.isl testData/realWorldDataSchema01.10n
          java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/realWorldDataSchema02.isl testData/realWorldDataSchema02.10n
          java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/realWorldDataSchema03.isl testData/realWorldDataSchema03.10n

          java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/nestedList.isl testData/nestedList.10n
          java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/nestedStruct.isl testData/nestedStruct.10n
          java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/sexp.isl testData/sexp.10n


Nitpick: We might prefer to use a for loop to avoid repeated logic.

for test in nestedStruct nestedList sexp realWorldDataSchema01 realWorldDataSchema02 realWorldDataSchema03
do
  java -jar $jar_file generate -S ${{env.data_size}} --input-ion-schema $schema_dir/${test}.isl testData/${test}.10n
done

Contributor Author
Yeah, I had originally broken it out because what I had taken from either ion-python's or ion-java's workflow used -S 100, which resulted in some wildly different dataset sizes. I think sexp.10n was 195 bytes, which definitely wasn't large enough to overcome normal runtime noise (a 5% diff ended up being ~195 ns on my system). Once I saw what -S does and realized what was going on, I didn't re-loop it. I'll tidy that up; a sketch of what that might look like is below.

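A sketch of the tidied-up step (assuming a data_size env entry set to 200000, matching the sizes used above):

      - name: Generate Data
        env:
          jar_file: ion-data-generator/target/ion-data-generator-1.0-SNAPSHOT.jar
          schema_dir: ion-data-generator/tst/com/amazon/ion/workflow
          data_size: 200000
        run: |
          mkdir -p testData
          # Generate approximately 200KB of data for each dataset.
          for test in nestedStruct nestedList sexp realWorldDataSchema01 realWorldDataSchema02 realWorldDataSchema03
          do
            java -jar $jar_file generate -S ${{ env.data_size }} --input-ion-schema $schema_dir/${test}.isl testData/${test}.10n
          done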

      - name: Fetch PR Candidate
        uses: actions/checkout@v3
        with:
          submodules: recursive
          path: candidate
          fetch-tags: true
          fetch-depth: 50

      - name: Build PR Candidate
        run: |
          mkdir -p candidate/build/profiling && cd candidate/build/profiling
          cmake -DCMAKE_BUILD_TYPE=Profiling -DIONC_BUILD_TESTS=OFF ../..
          make clean && make IonCBench

      - name: Fetch PR Baseline
        uses: actions/checkout@v3
        with:
          ref: ${{ github.base_ref }}
          submodules: recursive
          path: baseline
          fetch-tags: true
          fetch-depth: 50

      - name: Build PR Baseline
        run: |
          mkdir -p baseline/build/profiling && cd baseline/build/profiling
          cmake -DCMAKE_BUILD_TYPE=Profiling -DIONC_BUILD_TESTS=OFF ../..
          make clean && make IonCBench

      # This step runs benchmarks for the current ion-c repo.
      - name: 'Benchmark: Baseline'
        env:
          cli_path: baseline/build/profiling/tools/ion-bench/src/IonCBench
        run: |
          $cli_path -b deserialize_all -l ion-c-binary \
            --benchmark_context=uname="`uname -srm`" \
            --benchmark_context=proc="`cat /proc/cpuinfo | fgrep 'model name' | head -n 1 | cut -d: -f2 | cut -d' ' -f2-`" \
            --benchmark_repetitions=20 \
            --benchmark_out_format=json \
            --benchmark_out='./baseline.json' \
            --benchmark_min_warmup_time=5 \
            -d testData/nestedStruct.10n \
            -d testData/nestedList.10n \
            -d testData/sexp.10n \
            -d testData/realWorldDataSchema01.10n \
            -d testData/realWorldDataSchema02.10n \
            -d testData/realWorldDataSchema03.10n
      # This step runs benchmarks on each of the generated datasets for the new revision. It does this through
      # the 'compare.py' script provided by google-benchmark, which will compare the results of the benchmarks to
      # the results of the baseline benchmarks from the previous step.
      #
      # The compare script uses the defined 'alpha' environment variable to perform a null-hypothesis test,
      # which is used to determine whether the two sets of benchmark times come from the same distribution.
      - name: 'Benchmark: PR Candidate'
        env:
          compare: candidate/tools/ion-bench/deps/google-benchmark/tools/compare.py
          cli_path: candidate/build/profiling/tools/ion-bench/src/IonCBench
          alpha: 0.03


I'd like to know whether the alpha value can be adjusted based on the accuracy of the regression-detection workflow.

Contributor Author

Absolutely, alpha can be adjusted in order to fine-tune the null-hypothesis test; that is why it is exposed, so that we can adjust it as needed. However, right now the results of the null-hypothesis test aren't being used, because the p-value calculated between revisions almost always indicates that the benchmarks have statistically significant differences.

I plan to spend a little more time looking into the test to see if it is appropriate for these benchmarks.
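
If we want to experiment with different alpha values, one option is a local re-run against the two JSON files the workflow already writes out (a sketch, assuming baseline.json and candidate.json from a previous run are on hand):

pip install -r candidate/tools/ion-bench/deps/google-benchmark/tools/requirements.txt
python3 candidate/tools/ion-bench/deps/google-benchmark/tools/compare.py \
  --alpha 0.01 -d ./results.json benchmarks ./baseline.json ./candidate.json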

Thanks for the explanation.

        run: |
          pip install -r candidate/tools/ion-bench/deps/google-benchmark/tools/requirements.txt
          $compare -a -d ./results.json --alpha $alpha benchmarks \
            ./baseline.json \
            $cli_path -b deserialize_all -l ion-c-binary \
            --benchmark_context=uname="`uname -srm`" \
            --benchmark_context=proc="`cat /proc/cpuinfo | fgrep 'model name' | head -n 1 | cut -d: -f2 | cut -d' ' -f2-`" \
            --benchmark_repetitions=20 \
            --benchmark_out_format=json \
            --benchmark_out='./candidate.json' \
            --benchmark_min_warmup_time=5 \
            -d testData/nestedStruct.10n \
            -d testData/nestedList.10n \
            -d testData/sexp.10n \
            -d testData/realWorldDataSchema01.10n \
            -d testData/realWorldDataSchema02.10n \
            -d testData/realWorldDataSchema03.10n

      # Upload the results.json for further review.
      - name: 'Upload Results'
        uses: actions/upload-artifact@v2
        with:
          name: results.json
          path: ./results.json

      # This step compares the 2 benchmark runs and attempts to determine whether the runs are significantly
      # different enough to warrant a failure, to at least get someone to look at the results.
      #
      # Currently, this check looks at the generated comparison of the MEAN of each benchmark's CPU time. We
      # do this for now, rather than use the null-hypothesis results, until we get a better understanding of
      # how the timings will be in GHA.
      - name: 'Check Results'
        env:
          # Threshold Percentage, currently 5%.
          threshold_perc: 5
        run: |
          echo "Printing Results"
          RESULTS=$(cat results.json | jq '.[] | select(.run_type == "aggregate" and (.name | endswith("_mean"))) | {name:.name,cpu_time_perc_diff:(.measurements[0].cpu*100)} | select(.cpu_time_perc_diff > '"${threshold_perc}"')')


Approving this PR with one question: when cpu_time_perc_diff > 0, it means the candidate's CPU run time is bigger than the baseline's CPU run time, right? If the value is bigger than the threshold, we will fail the workflow with a regression detected.

Contributor Author

Yes, exactly. The way compare.py generates its output, we end up with data like this:

  {
    "name": "realWorldDataSchema03.10n_mean",
    "label": "",
    "measurements": [
      {
        "real_time": 12452149.371052643,
        "cpu_time": 12451057.894736838,
        "real_time_other": 12473477.750893278,
        "cpu_time_other": 12470339.46428573,
        "time": 0.0017128271758622316,
        "cpu": 0.0015485888598303227
      }
    ],
    "time_unit": "ns",
    "run_type": "aggregate",
    "aggregate_name": "mean",
    "utest": {}
  },

This represents the baseline's (cpu_time) and candidate's (cpu_time_other) mean (over 20 runs) wall-clock and CPU time for the realWorldDataSchema03 test. The cpu field is the relative difference ((candidate - baseline) / baseline) between the candidate and the baseline, and likewise the time field is the relative wall-clock difference. Since the relative difference is expressed as a fraction of the baseline, I multiply it by 100 so we can configure the threshold with whole percentage values. Any positive value is a result of the candidate's time being larger than the baseline's.
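
Working through the sample above: (12470339.46 - 12451057.89) / 12451057.89 ≈ 0.00155, which matches the cpu field; multiplied by 100 that is about 0.155%, well under the 5% threshold.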

The check step passes the JSON output through that jq query. The JSON output from compare.py contains both the raw results from the candidate run and the comparisons to the baseline mentioned above. The jq query looks for the aggregate comparisons of the mean and gathers only the results that exceed the threshold. If there are none, the RESULTS variable will be an empty string; otherwise it will contain JSON that includes the name and CPU %diff for each failing run. The step then checks whether RESULTS is empty when deciding whether it should fail. If it's not empty, it also uses jq to print the name and %diff for each of the tests that exceeded the threshold.

Hah, typing this up made me realize I missed the aggregate_name field. I could make the jq query a little friendlier. I'll follow up with that in a later PR.
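
Roughly, the friendlier query could filter on aggregate_name instead of the _mean name suffix, something like this sketch (not part of this PR):

jq '.[]
    | select(.run_type == "aggregate" and .aggregate_name == "mean")
    | {name: .name, cpu_time_perc_diff: (.measurements[0].cpu * 100)}
    | select(.cpu_time_perc_diff > 5)' results.json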


Thanks for the detailed explanation. Sounds good to me.

if [[ -z "$RESULTS" ]]; then
echo "No sizeable difference identified"
else
echo "CPU Time differences greater than ${threshold_perc}%"
echo "$RESULTS" | jq -r '"\(.name) = \(.cpu_time_perc_diff)"'
exit 1
fi
