Dimensionality reduction (DR) algorithms typically assume the data they are given is uniformly sampled from some underlying manifold. When this is not the case, and there are observation-gaps along the manifold, these algorithms may fail to detect a single connected entity. This repository presents two manifold approximation approaches based on minimum spanning trees (MST) for non-uniform sampled data.
The noisy minimum spanning tree union (
We add Gaussian noise (
import matplotlib.pyplot as plt
import matplotlib.collections as mc
from sklearn.datasets import make_swiss_roll
from multi_mst.noisy_mst import NoisyMST
X, t = make_swiss_roll(n_samples=2000, noise=0.5, hole=True)
projector = NoisyMST(num_trees=10, noise_fraction=1.0).fit(X)
# Drawing the network
xs = projector.embedding_[:, 0]
ys = projector.embedding_[:, 1]
coo_matrix = projector.graph_.tocoo()
sources = coo_matrix.row
targets = coo_matrix.col
plt.figure(figsize=(4, 3))
plt.scatter(xs, ys, c=t, s=1, edgecolors="none", linewidth=0, cmap="viridis")
lc = mc.LineCollection(
list(zip(zip(xs[sources], ys[sources]), zip(xs[targets], ys[targets]))),
linewidth=0.2,
zorder=-1,
alpha=0.5,
color="k",
)
ax = plt.gca()
ax.add_collection(lc)
ax.set_aspect("equal")
plt.subplots_adjust(0, 0, 1, 1)
plt.axis("off")
plt.show()
The k-nearest Minimum Spanning Tree (
To avoid creating shortcuts in the manifold, a distance threshold
import matplotlib.pyplot as plt
import matplotlib.collections as mc
from sklearn.datasets import make_swiss_roll
from multi_mst.k_mst import KMST
X, t = make_swiss_roll(n_samples=2000, noise=0.5, hole=True)
projector = KMST(num_neighbors=3, epsilon=2.0).fit(X)
# Drawing the network
xs = projector.embedding_[:, 0]
ys = projector.embedding_[:, 1]
coo_matrix = projector.graph_.tocoo()
sources = coo_matrix.row
targets = coo_matrix.col
plt.figure(figsize=(4, 3))
plt.scatter(xs, ys, c=t, s=1, edgecolors="none", linewidth=0, cmap="viridis")
lc = mc.LineCollection(
list(zip(zip(xs[sources], ys[sources]), zip(xs[targets], ys[targets]))),
linewidth=0.2,
zorder=-1,
alpha=0.5,
color="k",
)
ax = plt.gca()
ax.add_collection(lc)
ax.set_aspect("equal")
plt.subplots_adjust(0, 0, 1, 1)
plt.axis("off")
plt.show()
Computing
import matplotlib.pyplot as plt
import matplotlib.collections as mc
from sklearn.datasets import make_swiss_roll
from multi_mst.k_mst_descent import KMSTDescent
X, t = make_swiss_roll(n_samples=2000, noise=0.5, hole=True)
projector = KMSTDescent(num_neighbors=3, epsilon=2.0).fit(X)
# Draw the network
xs = projector.embedding_[:, 0]
ys = projector.embedding_[:, 1]
coo_matrix = projector.graph_.tocoo()
sources = coo_matrix.row
targets = coo_matrix.col
plt.figure(figsize=(4, 3))
plt.scatter(xs, ys, c=t, s=1, edgecolors="none", linewidth=0, cmap="viridis")
lc = mc.LineCollection(
list(zip(zip(xs[sources], ys[sources]), zip(xs[targets], ys[targets]))),
linewidth=0.2,
zorder=-1,
alpha=0.5,
color="k",
)
ax = plt.gca()
ax.add_collection(lc)
ax.set_aspect("equal")
plt.subplots_adjust(0, 0, 1, 1)
plt.axis("off")
plt.show()
The multi_mst
package can be installed from pypi:
pip install multi_mst
Most code---including the numba KDTree, disjoint set and Boruvka MST construction implementation---is adapted from fast_hdbscan.
multi_mst
uses the same license as fast_hdbscan
: BSD (2-clause). See the
LICENSE file for details.