A Python package for working with the Information Bottleneck [Tishby, Pereira, Bialek 2001] and the Deterministic (and Generalized) Information Bottleneck [Strouse and Schwab 2016]. Embo is especially geared towards the analysis of concrete, finite-size data sets.
embo
requires Python 3, numpy>=1.7
and scipy
.
To install the latest release, run:
pip install embo
(depending on your system, you may need to use pip3
instead of pip
in the command above).
(requires setuptools
). If embo
is already installed on your
system, look for the copy of the test_embo.py
script installed
alongside the rest of the embo
files and execute it. For example:
python /usr/lib/python3.X/site-packages/embo/test/test_embo.py
We refer to [Tishby, Pereira, Bialek 2001] and [Strouse and Schwab 2016] for a general introduction to the Information Bottleneck. Briefly, if X and Y are two random variables, we are interested in finding another random variable M (called the "bottleneck" variable) that solves the following optimisation problem:
min_{p(m|x)} = H(M) - α H(M|X) - β I(M:Y)
for any β>0 and 0≤α≤1, and where M is constrained to be independent on Y conditional on X:
p(x,m,y) = p(x)p(m|x)p(y|x)
Intuitively, we want to find the stochastic mapping p(M|X) that extracts from X as much information about Y as possible while forgetting all irrelevant information. β is a free parameter that sets the relative importance of forgetting irrelevant information versus remembering useful information. α determines what notion of "forgetting" we use: α=1 ("vanilla" bottleneck or IB) implies that we want to minimise the mutual information I(M:X), α=0 (deterministic bottleneck or DIB) that we want to make M a good compression of X by minimising its entropy H(M), and intermediate values interpolate between these two conditions.
Typically, one is interested in the curve described by I(M:Y) as a function of I(M:X) or H(M) at the solution of the bottleneck problem for a range of values of β. This curve gives the optimal tradeoff of compression and prediction, telling us what is the minimum amount of information one needs to know about X (or minimum amount of entropy one needs to retain) to be able to predict Y to a certain accuracy, or vice versa, what is the maximum accuracy one can have in predicting Y given a certain amount of information about X.
Embo can solve the information bottleneck problem for discrete random
variables starting from a set of joint empirical observations. The
main point of entry to the package is the InformationBottleneck
class. In its constructor, InformationBottleneck
takes as arguments an
array of observations for X and an (equally long) array of
observations for Y, together with other optional parameters (see the
docstring for details). In the most basic use case, users can call the
get_bottleneck
method of an InformationBottleneck
object, which will
assume α=1 and return a set of β values and the optimal values of
I(M:X), I(M:Y) and H(M) corresponding to those β. The optimal tradeoff
can then be visualised by plotting I(M:Y) vs I(M:Y).
For instance:
import numpy as np
from matplotlib import pyplot as plt
from embo import InformationBottleneck
# data sequences
x = np.array([0,0,0,1,0,1,0,1,0,1])
y = np.array([0,1,0,1,0,1,0,1,0,1])
# compute the IB bound from the data (vanilla IB; Tishby et al 2001)
I_x,I_y,H_m,β = InformationBottleneck(x,y).get_bottleneck()
# plot the IB bound
plt.plot(I_x,I_y)
Embo can also operate starting from a joint (X,Y) probability distribution, encoded as a 2D array containing the probability of each combination of states for X and Y.
# define joint probability mass function for a 2x2 joint pmf
pxy = np.array([[0.1, 0.4],[0.35, 0.15]]),
# compute IB
I_x,I_y,H_m,β = InformationBottleneck(pxy=pxy).get_bottleneck()
# plot I(M:Y) vs I(M:X)
plt.plot(I_x,I_y)
The deterministic and generalised bottleneck can be computed by
setting appropriately the parameter alpha
:
# compute Deterministic Information Bottleneck (Strouse 2016)
I_x,I_y,H_m,β = InformationBottleneck(pxy=pxy, alpha=0).get_bottleneck()
# plot I(M:Y) vs H(M)
plt.plot(H_m,I_y)
The embo/examples
directory contains some Jupyter notebook that
should exemplify most of the package's functionality.
- Basic-Example.ipynb: basics; how to compute and plot an IB bound.
- Markov-Chains.ipynb: using embo for past-future bottleneck type analyses on data from Markov chains.
- Deterministic-Bottleneck.ipynb: Deterministic and Generalized Information Bottleneck. Here we reproduce a key figure from the Deterministic Bottleneck paper, and we explore the algorithm's behaviour as α changes from 0 to 1.
- Compare-embo-dit.ipynb: here we compare embo with dit [James et al 2018]. We compare the solutions found by the two packages on a set of simple IB problems (including a problem taken from dit's documentation), and we show that embo is orders of magnitude faster than dit.
For more details, please consult the docstrings in
InformationBottleneck
.
See the CHANGELOG.md file for a list of changes from older versions.
embo
is maintained by Eugenio Piasini, Alexandre Filipowicz and
Jonathan Levine.
Piasini, E., Filipowicz, A.L.S., Levine, J. and Gold, J.I., 2021. Embo: a Python package for empirical data analysis using the Information Bottleneck. Journal of Open Research Software, 9(1), p.10. DOI: http://doi.org/10.5334/jors.322