This is the artifact for our paper "Rewrite Rule Inference Using Equality Saturation", which is available on the ACM DL or arXiv. In our paper, we presented a framework, Ruler, that uses equality saturation to automatically infer small, expressive rulesets for a domain, given an interpreter.
- Available: The artifact is available on Zenodo.
- Functional: Below, we first provide instructions for "Getting Started", which should take less than 10 minutes. The next part, "Step-by-Step", first lists the claims we made in the paper and then provides instructions on how to validate them.
- Reusable: Finally, we provide instructions in "Further Use / Extending Ruler", which describe how to install Ruler on a different machine, modify the code, and perform further experiments.
If you use this work, please cite it using the following BibTeX.
```bibtex
@article{ruler,
  author = {Nandi, Chandrakana and Willsey, Max and Zhu, Amy and Wang, Yisu Remy and Saiki, Brett and Anderson, Adam and Schulz, Adriana and Grossman, Dan and Tatlock, Zachary},
  title = {Rewrite Rule Inference Using Equality Saturation},
  year = {2021},
  issue_date = {October 2021},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {5},
  number = {OOPSLA},
  url = {https://doi.org/10.1145/3485496},
  doi = {10.1145/3485496},
  journal = {Proc. ACM Program. Lang.},
  month = {oct},
  articleno = {119},
  numpages = {28},
  keywords = {Program Synthesis, Rewrite Rules, Equality Saturation}
}
```
We recommend running the VM on a machine with more than 32 GB RAM to reproduce all the results from scratch. If you are only running `eval.sh` to produce results from previous runs (see below), a machine with 16 GB RAM should be fine.
- Download the `.ova` file here and open it with VirtualBox by going to `File -> Import Appliance`, giving the path to the `.ova` file, and clicking `Continue`. In the next window that pops up, click `Import`. It should take a few minutes to import.
- If you plan on running the full eval from scratch, change the RAM to 32 GB (or more) by going to `Settings` under `System -> Motherboard`. Otherwise, skip this step.
- Next, open the virtual machine image in VirtualBox by clicking the green `Start` button.
- Login is automatic, but in case it is needed, the password is `ruler`.
- Open a terminal window. The project repository is already cloned. Navigate to the `ruler` directory. All the required packages are already installed and Ruler is already compiled for you, ready to be run.
- To allow a quick verification of our artifact, we have provided data from previous runs. You can therefore directly view the results (see below on how to do that).
- You can also run the tool yourself entirely from scratch, as shown below for each section of the evaluation. Note that in the paper, the evaluations were often done at a larger scale, which is not possible on the VM. Where possible, we have provided recommendations below on how to run them at a smaller scale (e.g., fewer iterations, fewer seeds).
To check that you are able to run Ruler, type the following in the command line:

```
cargo bool
```

This should take less than a second (since Ruler is pre-built in the VM) and should generate and print 5 rewrite rules to the console, along with the time it took to infer them.
Our paper has 4 quantitative evaluations:

- Comparing with CVC4 (Section 4): We show that Ruler can infer smaller yet powerful rulesets faster, by comparing the rules inferred for `bool`, `bv4`, and `bv32` with varying expression sizes (2, 3). The results are in Table 1.
- Integrating with Herbie (Section 5): We show that Ruler's rules can be used to replace human-written rules by comparing the Herbie tool's results in four different configurations: `None`, `Herbie`, `Ruler`, `Both`. The results are in Figure 7.
- Search Parameter Analysis (Section 6.1): We profiled Ruler's search algorithm to measure how much time is spent in each phase. Figure 8 shows the results for the `bv4`, `bv32`, and `rationals` domains. We also compared different variants of `choose_eqs` by varying `n` in Figure 5, Line 3, whose default value is infinity. The results are shown in Figure 9a for `bv4`, `bv32`, and `rationals`. Importantly, we measure both running time and the number of rules learned. We also measured running time, number of rules learned, and number of e-classes in the e-graph with and without invoking `run_rewrites` (Figure 4, Line 9) to study its effect. The results are shown in Figure 9b for `bv4`, `bv32`, and `rationals`.
- Validation Analysis (Section 6.2): We compared different rule validation methods for `bv4`, `bv32`, and `rationals`. The results are shown in Table 2.
Most of our experiments take several hours to run from scratch (see "More Detailed Experiments" below), and some are better run on bigger machines than on a VM, as they generate a lot of data. We instead recommend running `scripts/eval.sh`:

```
cd scripts
./eval.sh
```

to generate the results from the previous runs that we have included in the VM. We expect it to take less than 10 seconds. This script runs all four experiments; here is how you can view the results.
Table 1 will be printed as the first item to the terminal. Compare it with the paper. Note that in some cases the numbers will not match exactly what we reported in the paper, because the VM is less powerful than the machine we used for running the eval in the paper, and the heuristics may also have slightly different effects (see paper for details).
The three plots in Figure 7 will be generated and put in `herbie-rational/output/ruler-herbie-eval/results/pre-gen-2021-04-13-1331`. Open the following three PDFs to compare with the paper:

- Figure 7a: `by-config-all-tests-avg_bits_err_improve-boxplot.pdf`
- Figure 7b: `by-config-all-tests-output_parens-boxplot.pdf`
- Figure 7c: `by-config-all-tests-time-boxplot.pdf`
The plots in Figures 8 and 9 will be generated using data from previous runs in the `submitted-data` directory and can be found under `output`. Open the following PDFs to compare with the paper:

- Figure 8: `ablation/output/by-domain-phase-times.pdf`
- Figure 9a: `ablation/output/bv4-by-config-rules-learned.pdf`, `ablation/output/bv32-by-config-rules-learned.pdf`, `ablation/output/rational-by-config-rules-learned.pdf`
- Figure 9b: `ablation/output/rr-bv4-run-rewrites.pdf`, `ablation/output/rr-bv32-run-rewrites.pdf`, `ablation/output/rr-rational-timeout.pdf`
Table 2 will be printed as the last item to the terminal. Note that the first table is the one for `bv32`, the second for `bv4`, and the third for `rational`. You can compare these with the ones in the paper. Other than some variation in the timing numbers, the tables should be identical.
This section describes how you can run each experiment separately. It also has instructions on generating all the data from scratch. Overall, generating all results from scratch will take about 20 hours (see below for details). Most experiments also require allocating at least 32 GB of RAM to the VM.
The goal is to reproduce Table 1.

- Type `cd $HOME/ruler/scripts/cvc4-eval/` to go to the correct directory.
- To generate the table from the previous runs, run `make`; it will print the table to the terminal instantly.
- To regenerate the data, run `make clean` to remove all results, then run `make` again (see the combined snippet below). This works without allocating additional RAM to the VM and takes approximately 1.5 hours.
Note that in some cases the numbers will not match exactly what we reported in the paper, because the VM is less powerful than the machine we used for running the eval in the paper (see paper for details), and the heuristics may also have slightly different effects.
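For convenience, here are the commands above collected into one snippet:

```
cd $HOME/ruler/scripts/cvc4-eval/
make          # print Table 1 from the included results (instant)
make clean    # optional: remove the included results...
make          # ...and regenerate everything from scratch (~1.5 hours)
```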
- The `Makefile` is the main script for this part.
- `cvc4/` has the grammars from CVC4's rule inference tool that we used.
- The script runs both Ruler and CVC4 for `bool`, `bv4`, and `bv32` with 2 variables, 3 variables, 2 iterations, and 3 iterations. It then generates reports in json and uses the derivability test (`derive.rs`) to compare the proving power of the rulesets from Ruler and CVC4.
- The table is generated by the `compare.py` script, using pandoc and xsv.
The goal is to reproduce Figure 7.

Herbie is an external tool which we used for this evaluation. Herbie is already installed in the VM, together with the required Racket 7.9.
- Type `cd $HOME/ruler/scripts/herbie-rational` to go to the correct directory.
- To view pre-made plots, go directly to `herbie-rational/output/ruler-herbie-eval/results/submitted-plots` and look at the PDFs.
- To generate plots from previous runs, run `plots/plot-results.sh output/ruler-herbie-eval/results/pre-gen-2021-04-13-1331` and open the following three PDFs under `output/ruler-herbie-eval/results/pre-gen-2021-04-13-1331`:
  - Figure 7a: `by-config-all-tests-avg_bits_err_improve-boxplot.pdf`
  - Figure 7b: `by-config-all-tests-output_parens-boxplot.pdf`
  - Figure 7c: `by-config-all-tests-time-boxplot.pdf`
- To generate the data, run `./herbie-eval.sh 15`. This part requires increasing the RAM of the VM to 32 GB or more. You can run it with fewer or more seeds by typing `./herbie-eval.sh NSEEDS` (default is 1). We recommend trying 15 seeds to check the results (look for plots with the same names as above in the timestamped directory under `output/ruler-herbie-eval/results/`) -- they should show a similar trend to the ones presented in the paper. On the VM this should take approximately 10 hours. You may notice some timeouts / errors (especially for the `herbie-no-simpl` case) on some benchmarks, but those are due to Herbie, not Ruler.
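For convenience, the two main paths described above as one snippet:

```
cd $HOME/ruler/scripts/herbie-rational
# regenerate the Figure 7 plots from the included data
plots/plot-results.sh output/ruler-herbie-eval/results/pre-gen-2021-04-13-1331
# or rerun the full experiment with 15 seeds (~10 hours, needs >= 32 GB RAM)
./herbie-eval.sh 15
```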
In the paper we ran the experiment with 30 seeds (on a large machine), but that takes much longer and is better run on a real machine than on a VM, because it generates more data and will be much slower on the VM. You are of course welcome to run it on the VM if you like; we expect it will take 24 hours. Other Herbie-specific arguments to the `herbie-eval.sh` script are set to their defaults, but the script has documentation on how to change them. The script will print the configuration being used. To map the configurations to the figure, use the following guide:
- `herbie-no-simpl` is `None`
- `herbie-only` is `Herbie`
- `ruler-only` is `Ruler`
- `herbie-ruler` is `Both`
- `herbie-eval.sh` is the main script; it has comments indicating what it does.
- `seed-variance.sh` is the script that runs Herbie with different seeds and generates the reports.
- All results are saved in `output/ruler-herbie-eval/results/` under timestamped directories. The rules used are also saved in a txt file.
- Since we are still actively working on Ruler, some scripts may not be relevant for this part of the evaluation. Below are the scripts relevant for this eval and a brief description of what they do:
  - `filter.rkt` filters benchmarks from Herbie that contain only rational operators.
  - `preprocess.py` preprocesses Ruler's rewrites to make them match Herbie's syntax, and also removes expansive directions of rules.
  - Plotting scripts are in the `plots/` directory. `plots/plot-results.sh` calls these scripts to generate the plots; `plots/config-all-tests-box-plot.py` is the script that generates the plots in the paper.
The goal is to reproduce Figure 8 and Figure 9.

- Type `cd $HOME/ruler/scripts/ablation` to go to the correct directory.
- To only view the plots, go to `submitted-plots/`. Figure 8 in the paper corresponds to the `10-run/by-domain-phase-times.pdf` plot. The Figure 9a plots are the PDFs under `10-run/bv4`, `10-run/bv32`, and `10-run/rat`. The Figure 9b plots are the PDFs under `orat-rr/bv4`, `orat-rr/bv32`, and `orat-rr/rat` (`orat` means "One Rule At a Time", which corresponds to `n = 1` in the caption in the paper).
- To make plots from the previous runs, run `./ablation.sh -r use-existing`. This will make plots using the data provided in the `submitted-data` folder and put them into the `output` folder. It will also print some of the data to the terminal, which we used for debugging; feel free to ignore that. The `.tar` file in the `submitted-data` folder contains the log of each run. It is not used and is provided for interest only.
- To run your own evaluation and make new plots from scratch, run `./ablation.sh -r generate-new` (both modes are collected in the snippet below). This part requires 32 GB of RAM allocated to the VM and may still run out of memory on some runs; ideally, these should be run on a real machine, which is what we did for the paper. This runs Ruler with different configurations, saving each run to its own timestamped folder under `output/`, and then parses the statistics from the log outputs. These statistics are collected into json files and then plotted with matplotlib. The resulting PDF plots are available inside the timestamped folder for that experiment. As we presented in the paper, `orat` rationals in the `no-rr` setting does not terminate. We have set a timeout of 1 hour (3600 seconds) for this run so that the evaluation finishes in a timely manner, but in the paper, this run did not terminate even after 24 hours (see Figure 9b). If you want to wait less than 1 hour, or you want to test the non-termination of this run with a longer timeout, you can change the timeout setting (see the list of arguments below). This part will take approximately 3 hours.
In the published evaluation, we ran Ruler with 3 variables and 2 iterations over 10 runs, but we recommend doing only 1 run on the VM.
Note that timing results should only be compared between themselves and not as absolute values, since logging is enabled during the ablation runs.
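The two modes described above, collected into one snippet:

```
cd $HOME/ruler/scripts/ablation
./ablation.sh -r use-existing    # plots from the included submitted-data (fast)
./ablation.sh -r generate-new    # full rerun (~3 hours, needs >= 32 GB RAM)
```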
- `run.sh` controls the entire experiment. It calls `run-ruler.sh` and `run-ruler-rr.sh` for each domain, then calls the parsing and visualizing scripts.
- `run-ruler.sh` and `run-ruler-rr.sh` call Ruler for a particular domain with the parameters provided, for as many runs as required. In particular, `run-ruler.sh` handles generating the data for Figure 8 and Figure 9a, and `run-ruler-rr.sh` handles generating the data for Figure 9b.
- To change the parameters, simply modify the arguments passed to `run-ruler.sh` and `run-ruler-rr.sh` from inside `run.sh` (see the example invocation after this list):
  - `-v` is the number of variables,
  - `-i` is the number of iterations,
  - `-r` is the number of runs (i.e., how many independent data points we average over),
  - `-d` is the domain,
  - `-o` is the output folder,
  - `-t` is the amount of time each run will execute before timing out. `-t` only affects `run-ruler-rr.sh`, since there is a possibility of timeout only in `orat` for rational and `no-rr`.
Lastly, any succeeding parameters will be passed directly to the Ruler invocation.
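For example, a single invocation might look like the following. The flags are documented above, but the concrete values, the domain spelling, and the output path here are illustrative assumptions, not copied from `run.sh`:

```
# illustrative only: 3 variables, 2 iterations, 1 run, bv4 domain,
# results under output/my-test/ (adapt the domain and output names as needed)
./run-ruler.sh -v 3 -i 2 -r 1 -d bv4 -o output/my-test/
```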
NOTE: If a run of Ruler fails (e.g., due to out-of-memory issues on the VM), `parse.js` will log a "Failed to parse" line when parsing the corresponding `.log` file. Data from failed runs is ignored by the plotting scripts; only data from files that parse successfully is included in the resulting plots.
NOTE: You can also invoke the parsing and visualization scripts on the data separately, like:

```
node parse.js "output/$TIMESTAMP/compare/"
node parse.js "output/$TIMESTAMP/no-rr/" yes
python visualize.py "output/$TIMESTAMP/"
```
The goal is to reproduce Table 2.

This part of the eval requires Rosette 4.0 and Racket 8.0, which are already pre-installed in this directory.
- Type `cd $HOME/ruler/scripts/eqsat-sound` to go to the correct directory.
- To view Table 2 directly from previous runs, run `python3 tabulate.py output/pre-gen-2021-07-06-2242/all.json`. Note that the first table is for `bv32`, the second for `bv4`, and the third for `rational`. Compare these printed LaTeX tables with the ones in the paper. Other than a few timing numbers, which may vary slightly due to machine differences, the tables should be similar. The cells that are empty correspond to the ones with only a `-` in the paper's Table 2.
- To reproduce all the data, run `./eqsat-soundness.sh`. This part requires allocating at least 32 GB RAM to the VM and should take approximately 3 hours. The data will be generated and put in a timestamped directory under `output`. Each domain and configuration will have a directory, and `all.json` will contain all of them. In each directory, you will find a `failed-validation.txt` file that has the rules that failed post-pass validation, a `post_pass.json` which contains the result of post-pass validation with SMT (Rosette in this case), and a `.rkt` script that uses Rosette to validate the rules one at a time. You will notice that Rosette times out on some of the post-pass validations. For these, we manually checked the soundness of the rules using other solvers (not reported in this document). The script will also print the number of unsound (and unknown) rules, the domain (`rational`, `4`, `32`), the number of variables, the cvec length, and the sample size used for fuzzing (or `smt` for SMT-based verification).
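Collected into one snippet, the commands above are:

```
cd $HOME/ruler/scripts/eqsat-sound
# print Table 2 from the included data
python3 tabulate.py output/pre-gen-2021-07-06-2242/all.json
# or regenerate all the data (~3 hours, needs >= 32 GB RAM)
./eqsat-soundness.sh
```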
- `eqsat-soundness.sh` is the main script; it runs all three domains (`bv4`, `bv32`, `rational`) for various cvec lengths, ways of generating cvecs, and validation approaches.
- `postpass.sh`, called in `eqsat-soundness.sh`, verifies the rules as a post-pass after Ruler's execution is complete. `postpass.sh` uses Rosette to verify the rules.
- `aggregate.sh` gathers data into a json file.
- `tabulate.py` generates a LaTeX table.
This section describes how to install Ruler on a different machine, the required dependencies, and how to extend our tool for other domains.
We tested our artifact setup on an Ubuntu 20.04 VM. To install and run the entire evaluation on a fresh machine with the same OS, the following dependencies must be installed:
- Basic tools
  - git
  - python3
  - moreutils
  - cmake
  - curl
  - rust
- More specialized tools
  - racket 7.9 (for Herbie)
  - racket 8.0 (for Rosette)
  - rosette 4.0
  - cvc4 version 1.8
  - herbie (see the `herbie-eval.sh` script for version info)
- Tools for generating tables and plots
  - pandoc
  - xsv
  - python3-distutils
  - jq
  - matplotlib
  - node (version > 15)
Ruler is implemented in Rust. You can install Rust here. You can then get the code from Zenodo and run the tool as described below.

To build Ruler, type `cargo build --release`. This should take ~40 min. You can generate rules for a domain as follows:

```
cargo run --bin domain --release -- synth --iters i --variables v
```

Type `cargo domain --help` (replacing `domain` with, e.g., `bool`) to see all available flags and parameters.
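For instance, instantiating the template above for the `bool` domain (the parameter values here are illustrative, not prescribed):

```
# one-time build (~40 min)
cargo build --release
# synthesize boolean rules: 2 iterations over 3 variables (illustrative values)
cargo run --bin bool --release -- synth --iters 2 --variables 3
```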
- The source code resides in the `src` directory.
  - The main algorithm of Ruler is implemented in `lib.rs`.
  - `equality.rs` defines a rewrite rule.
  - `derive.rs` has the code for running the derivability checks (e.g., see Sections 4, 5).
  - `util.rs` has some small helper functions.
  - `convert_sexp.rs` has code that converts the rewrites from CVC4's format to Ruler's format.
  - `bv.rs` has a generic implementation of bitvectors, which are specialized to various sizes in `src/bin/`.
  - `src/bin/` also contains the implementations of the other domains, including rationals and bools. There are some prototype implementations (floats, strings, bigints) that are not evaluated in the paper --- these are work in progress, but should give an idea of how to add support for other domains. See below for more information on supporting other domains.
- `scripts` has all the scripts used for evaluating Ruler --- each is in a designated subdirectory.
Ruler's goal is to support rewrite inference for new domains, given a grammar, an interpreter, and a validation technique.

We have already generated documentation for you. Open `target/doc/ruler/index.html` in your preferred browser to navigate the documentation. You can generate the documentation yourself on a new machine by running:

```
cargo doc --no-deps
```

To run Ruler with different flags (documented in `SynthParams.html`), see the various example usages in `.cargo/config`, try replacing them with other values, and look at the results! For example, you can try

```
cargo run --release --bin rational -- synth --num-fuzz 10 --iters 2
```

to synthesize rewrite rules for rationals for 2 iterations, using fuzzing (with 10 values) for rule validation.
To understand how to add support for a new domain, you can look at the documentation of the various supported domains like `rational` (`target/doc/rational/index.html`), `rational_new_div` (`target/doc/rational_new_div/index.html`, relevant for Section 6.3 in the paper), `bool` (`target/doc/bool/index.html`), etc. Note that some domains (e.g., floats, strings, bigints) are experimental and not reported in the paper, but they all provide examples of how you can add support for new domains.
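One low-effort way to start a new domain is to copy an existing one and adapt it. The sketch below is our suggested workflow, not an official script; `mydomain` is a hypothetical name, and we rely on Cargo's standard auto-discovery of binaries placed in `src/bin/`:

```
cd $HOME/ruler
# start from the closest existing domain as a template (hypothetical new name)
cp src/bin/bool.rs src/bin/mydomain.rs
# edit src/bin/mydomain.rs: adapt the grammar, the interpreter, and the
# rule validation, following the patterns in the existing domains
cargo run --bin mydomain --release -- synth --iters 2 --variables 3
```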