Group members // Francesco Pio Barone, Gianmarco Nagaro Quiroz, Daniele Ninni
This is our final project for Advanced Statistics for Physics Analysis. In this work, we implement the K2 algorithm and test its performance in learning the topology of a Bayesian network starting from a database of cases. Moreover, we code the algorithm inside the bnstruct R package.
A Bayesian network (also known as a belief network) is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG).
Such DAGs consist of:
-
$\color{red}{\text{nodes}}$ $\leftrightarrow$ variables in the Bayesian sense (observable quantities, latent variables, unknown parameters or hypotheses)- unconnected nodes
$\leftrightarrow$ conditionally independent variables - each node
$\leftrightarrow$ probability function (INPUT: a particular set of values for the node's parent variables, OUTPUT: probability of the variable represented by the node)
- unconnected nodes
-
$\color{blue}{\text{edges}}$ $\leftrightarrow$ conditional dependencies
Given a database of records, it is interesting to construct a Bayesian network which can provide insights into the probabilistic dependencies existing among the variables in the database itself. Such network can be further used to predict the future behaviour of the system thus modeled.
In the simplest case, a Bayesian network has a small number of nodes and therefore can be specified manually by an expert. Even in the case of a large number of nodes, an expert could be able to assess the network structure, if his knowledge of the system is sufficient. In general, it is crucial to learn the network topology following a data-driven approach.
Automatically learning the graph structure of a Bayesian network is a challenge pursued within machine learning.
For instance,
- a scoring function
- a search strategy
Such score-based algorithms focus on the DAG as a whole and approach the topology learning task as an optimization problem:
- each candidate DAG is assigned a score which represents a statistical measurement of how well the DAG itself mirrors the dependence structure of the data
- the output DAG is the one corresponding to the highest score
The K2 algorithm is a score-based algorithm introduced by G. F. Cooper and E. Herskovits in 1992. It allows to learn the topology of a Bayesian network, i.e. to find the most probable belief network structure, given a database of cases.
The algorithm holds on the following assumptions:
- there is an ordering on the domain variables
- a priori, all structures are considered equally likely
The algorithm is based on the use of a greedy-search method. In particular, for each node:
- it begins by making the assumption that the current node has no parents
- it adds incrementally that parent whose addition most increases the probability of the resulting structure
- when the addition of no single parent can increase the probability, it stops adding parents to the current node and moves on to the next node
- M. Scutari and J. B. Denis, Bayesian Networks, CRC Press, 2022, Taylor and Francis Group
- G. F. Cooper and E. Herskovits, A Bayesian Method for the Induction of Probabilistic Networks from Data, Machine Learning 9, (1992) 309
- C. Ruiz, Illustration of the K2 Algorithm for learning Bayes Net Structures web.cs.wpi.edu
- A. Franzin et al., bnstruct: an R package for Bayesian Network structure learning in the presence of missing data, Bioinformatics 33(8) (2017) 1250
- F. Sambo and A. Franzin, bnstruct: an R package for Bayesian Network Structure Learning with missing data, December 12, 2016