# refactor canonicalization scheme out of `SddBuilder` into trait; create generic `SddBuilder` (#56)
The main goal of this *very hefty* PR is to allow two **canonicalization schemes**:

- one based on the canonicity property, which requires compression so that equality checks reduce to pointer equality
- one based on "semantic hashing", where we instead uniquify an SDD based on a weighted model count with some randomly-generated weights

After talking to Steven, we chose to use a mixin approach: the `SddBuilder` becomes generic, taking in a "canonicalization scheme" trait (see the sketch below). Each implementer of the trait needs to be able to do any operation related to checking whether two SDDs are equal. This includes SDD equality itself, but also a host of related backend work:

- managing a `BackedRobinhoodTable` for BinarySDDs and SDDs
- implementing a hasher for the above table, as well as utility get and insert functions
- caching the apply operation

To properly implement the above, I also had to apply the same genericizing approach to:

- the apply cache, which now conforms to an `ApplyCacheMethod`
- the hasher for the `BackedRobinhoodTable`, which now allows an arbitrary hash function rather than `FxHasher` - this becomes the semantic hash when necessary. To avoid polluting the struct, I instead require a hash function to be passed to get/insert.

I pulled all of these functions out of the `SddBuilder`; the defaults move to the `CompressionCanonicalizer`, which should behave **exactly the same** as the previous `SddBuilder`. I've soft-verified this with tests.

Since I'm making a complex struct generic, this PR also touches many files - mostly replacing instances of a naked `SddBuilder` with `SddBuilder<CompressionCanonicalizer>`. I've localized "new implementation code" to `canonicalize.rs` and the files for the apply cache and bump table. I've also manually added some WMC code for `SddAnd`, `SddOr`, and `BinarySdd` that I'd like to remove - see below. And I've temporarily added a `sub` operator to the finite field (also sketched below).

I'm merging this PR in because:

1. I'm confident that this doesn't change the default behaviour, and
2. I'm worried about branch divergence.

## next steps

There are a handful of things that still should be done:

- undoing my temporary hacks:
  - ideally, I should deduplicate the semantic hashing work from the hard-coded implementations for `SddAnd`, `SddOr`, and `BinarySdd`. I'd like to somehow make this a WMC, but I'm not sure how to do that without recreating a new `SddPtr`
  - if we plan on keeping the finite field as a semiring, we should remove my `-` before it gets into the rest of the codebase
- running larger benchmarks on number of nodes and correctness metrics. (I've ... bricked my instance, I think? Or at least, I can't log in.) The plan was to:
  - try a handful of different primes, likely within a few orders of magnitude of each other
  - try different CNFs of varying complexity
  - measure:
    - overall correctness, and how it scales with the choice of prime
    - the number of nodes for the uncompressed, compression-canonicalized, and semantically-canonicalized versions; I may have to exclude the uncompressed versions for large CNFs
- optimizing references over cloning: right now, there's a *ton* of cloning going on, since I was more concerned about correctness than efficiency. It's mostly localized to creating the `Canonicalizer`, so it can be amortized across SDD operations; but I think it warrants a serious look. In particular, most of the `Canonicalizer`'s operations aren't mutable anyway, so with some shuffling around I should be able to make most of them immutable read references!
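To make the mixin approach concrete, here's a minimal sketch of the genericized builder. The trait and method names are simplified and hypothetical; the real trait also owns the backing tables and the apply cache:

```rust
/// Hypothetical stand-in for the real SDD pointer type.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct SddPtr(usize);

/// Hypothetical canonicalization-scheme trait: each implementer decides
/// how to test whether two SDDs represent the same function.
trait CanonicalizationScheme {
    fn sdd_eq(&self, a: SddPtr, b: SddPtr) -> bool;
}

/// Compression-based scheme: compressed SDDs are canonical, so
/// equality reduces to pointer equality.
struct CompressionCanonicalizer;

impl CanonicalizationScheme for CompressionCanonicalizer {
    fn sdd_eq(&self, a: SddPtr, b: SddPtr) -> bool {
        a == b
    }
}

/// The builder takes the scheme as a type parameter: this is the mixin.
struct SddBuilder<C: CanonicalizationScheme> {
    canonicalizer: C,
}

impl<C: CanonicalizationScheme> SddBuilder<C> {
    fn sdd_eq(&self, a: SddPtr, b: SddPtr) -> bool {
        self.canonicalizer.sdd_eq(a, b)
    }
}

fn main() {
    let builder = SddBuilder { canonicalizer: CompressionCanonicalizer };
    assert!(builder.sdd_eq(SddPtr(3), SddPtr(3)));
}
```

The point of the type parameter is that `CompressionCanonicalizer` keeps its cheap pointer equality while a semantic-hash scheme plugs in its own equality test, without duplicating the rest of the builder.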
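And for context on the temporary `sub` operator: a minimal sketch of a finite-field weight type, with hypothetical names and a `u64` representation that the real type need not share:

```rust
/// Hypothetical element of GF(p), reduced modulo a prime.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct FiniteField {
    value: u64,
    prime: u64,
}

impl FiniteField {
    fn new(value: u64, prime: u64) -> Self {
        FiniteField { value: value % prime, prime }
    }
    /// Semiring addition mod p.
    fn add(self, rhs: Self) -> Self {
        Self::new(self.value + rhs.value, self.prime)
    }
    /// Semiring multiplication mod p (assumes p^2 fits in u64).
    fn mul(self, rhs: Self) -> Self {
        Self::new(self.value * rhs.value, self.prime)
    }
    /// The temporary `sub`: subtraction mod p, handy for computing a
    /// negated literal's weight as 1 - w. It isn't a semiring operation,
    /// which is why it should eventually be removed.
    fn sub(self, rhs: Self) -> Self {
        Self::new(self.prime + self.value - rhs.value, self.prime)
    }
}

fn main() {
    let p = 7;
    let one = FiniteField::new(1, p);
    let w = FiniteField::new(5, p);
    assert_eq!(one.sub(w), FiniteField::new(3, p)); // 1 - 5 ≡ 3 (mod 7)
    assert_eq!(w.add(w).mul(w), FiniteField::new(1, p)); // (5 + 5) * 5 ≡ 1 (mod 7)
}
```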
## testing

### unit tests, quickcheck

I've written a few more unit tests and quickcheck tests to verify that this works for small cases. Even with a relatively large quickcheck sample size, the semantic canonicalizer still maintains its uniqueness benefits most of the time, which is great! (I planned on doing some more testing on the server, but I'm now curiously locked out of my account after running a large test - concerning.)

Some useful quickcheck tests to run (a sketch of the probabilistic-equality idea they exercise follows at the end of this section):

```
$ QUICKCHECK_TESTS=20000 cargo test prob_equiv_sdd_eq_vs_prob_eq
$ QUICKCHECK_TESTS=20000 cargo test prob_equiv_sdd_inequality
$ QUICKCHECK_TESTS=20000 cargo test qc_sdd_canonicity
```

### script

I've rewritten the compare-canonicalize script to better test this difference. Here's an example of usage:

```sh
$ cargo run --bin semantic_hash_experiment -- --file cnf/rand-3-25-75-1.cnf
```

Interestingly, I haven't been able to get substantial differences in the number of nodes for small CNFs. I think this is a bug in my implementation or my benchmark. However, given that the default canonicalizer still works as intended, I'm thinking that we merge this in now, before it diverges further from `main`.
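For intuition about what the `prob_equiv_*` properties are exercising, here is a self-contained sketch of the probabilistic-equality idea. Truth tables stand in for SDDs, and the prime and weights are made-up constants; this is not the actual test code:

```rust
// Hypothetical prime; the real experiments would try several.
const PRIME: u64 = 100_003;

/// WMC over GF(p): sum, over satisfying assignments, of the product of
/// per-literal weights, where a negated variable gets weight 1 - w.
/// Enumerates all 2^n assignments, so this only works for tiny n.
fn semantic_hash(f: impl Fn(&[bool]) -> bool, weights: &[u64]) -> u64 {
    let n = weights.len();
    let mut total = 0u64;
    for bits in 0..(1u32 << n) {
        let assignment: Vec<bool> = (0..n).map(|i| (bits >> i) & 1 == 1).collect();
        if f(&assignment) {
            let mut prod = 1u64;
            for (i, &b) in assignment.iter().enumerate() {
                // positive literal: weight w; negative literal: 1 - w (mod p)
                let w = if b { weights[i] } else { (PRIME + 1 - weights[i]) % PRIME };
                prod = prod * w % PRIME;
            }
            total = (total + prod) % PRIME;
        }
    }
    total
}

fn main() {
    // x ∨ y and ¬(¬x ∧ ¬y) denote the same function, so their hashes agree
    // for *any* weights; distinct functions collide only with low,
    // Schwartz-Zippel-style probability per random weight vector.
    let weights = [12_345u64, 67_890]; // stand-ins for randomly drawn weights
    let f = |a: &[bool]| a[0] || a[1];
    let g = |a: &[bool]| !(!a[0] && !a[1]);
    assert_eq!(semantic_hash(f, &weights), semantic_hash(g, &weights));
}
```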