Skip to content

jeshi96/parallel-correlation-clustering

Repository files navigation

Shared-memory Parallel Clustering

Organization

This repository contains the implementations for our shared-memory parallel and sequential algorithms for correlation and modularity clustering. Note that the repository uses the Graph-Based Benchmark Suite (GBBS) for parallel primitives and benchmarks.

clustering contains the main executable, cluster-in-memory_main. It serves as a central runner for all of our implementations.

Installation

Compiler:

  • g++ >= 7.4.0 with support for Cilk Plus
  • g++ >= 7.4.0 with pthread support for the ParlayLib scheduler

Build system:

  • Bazel >= 3.5.0 and < 4.0.0

To build:

$ bazel build //clustering:cluster-in-memory_main

Most optionality from the Graph Based Benchmark Suite (GBBS) applies. In particular, to compile benchmarks for graphs with more than 2^32 edges, the LONG command-line parameter should be set.

Also, the parallel scheduler can be chosen using comand-line parameters. The default compilation is the ParlayLib scheduler (Homegrown), and the parameters HOMEGROWN, CILK, OPENMP, and SERIAL switch between the Homegrown scheduler, Cilk Plus, OpenMP, and serial respectively.

Graph Format

The applications take as input the adjacency graph format used by Graph Based Benchmark Suite (GBBS).

Configurations

All clusterers take configs given in a protobuf, as detailed in config.proto, which can be passed in by setting the input clusterer_config. Note that the proto should be passed in text format. For both correlation and modularity clustering, a CorrelationClustererConfig should be passed in.

The options available in CorrelationClustererConfig are:

  • resolution: double indicating resolution parameter
  • louvain_config: LouvainConfig object, with
    • num_iterations: int specifying the number of iterations of best moves and compression steps
    • num_inner_iterations: int specifying the number of iterations of local best move steps
  • refine: bool indicating whether to perform multi-level refinement
  • async: bool indicating whether to use the asynchronous setting
  • move_method: MoveMethod enum, that sets the subset of vertices to consider for best moves to be:
    • NBHR_CLUSTER_MOVE: neighbors of clusters that have been modified
    • ALL_MOVE: all vertices
    • NBHR_MOVE: neighbors of vertices that have moved
  • all_iter: bool solely for the sequential implementations; indicates if the restriction on the number of iterations should be ignored and if the program should run to convergence
  • permute: bool solely for the parallel implementations; indicates whether to randomly permute vertices before performing best moves, which may improve performance when there is greater contention from the asynchronous setting

Example Usage

The main executable is cluster-in-memory_main in the clustering directory. It will run the clusterer given by the input clusterer_name: "ParallelCorrelationClusterer", "ParallelModularityClusterer", "CorrelationClusterer", or "ModularityClusterer".

A template command is:

bazel run -c opt  //clustering:cluster-in-memory_main -- --input_graph=</path/to/graph> --output_clustering=</path/to/output> --clusterer_name=<clusterer name> --clusterer_config='<clusterer proto>'

An example <clusterer proto> is correlation_clusterer_config {resolution: 0.01, refine: true, async: true, move_method: NBHR_MOVE}.

Paper

The full version of the paper can be found here.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published