Question #55

nnn-niu · 2022-02-23T01:22:46Z

Sorry! I try again, bu the results are also different in differently running. May be some wrong in my procedure. Could you help and check for me.

#!/usr/bin/env python

coding: utf-8

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from gap_statistic import OptimalK
try:
from sklearn.datasets.samples_generator import make_blobs
except ImportError:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=int(1e5), n_features=2, centers=3, random_state=25)
print('Data shape: ', X.shape)

test the first exmaple

The 1st run about gap statistic

optimalK = OptimalK(parallel_backend='rust')
n_clusters = optimalK(X, cluster_array=np.arange(1, 15))
print('Optimal clusters: ', n_clusters)

plt.plot(optimalK.gap_df.n_clusters, optimalK.gap_df.gap_value, linewidth=3)
plt.scatter(optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].n_clusters,
optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].gap_value, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count')
plt.show()

The 2nd run about gap statistic

optimalK = OptimalK(parallel_backend='rust')
n_clusters = optimalK(X, cluster_array=np.arange(1, 15))
print('Optimal clusters: ', n_clusters)

plt.plot(optimalK.gap_df.n_clusters, optimalK.gap_df.gap_value, linewidth=3)
plt.scatter(optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].n_clusters,
optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].gap_value, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count')
plt.show()

The 3rd run about gap statistic

optimalK = OptimalK(parallel_backend='rust')
n_clusters = optimalK(X, cluster_array=np.arange(1, 15))
print('Optimal clusters: ', n_clusters)

plt.plot(optimalK.gap_df.n_clusters, optimalK.gap_df.gap_value, linewidth=3)
plt.scatter(optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].n_clusters,
optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].gap_value, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count')
plt.show()

test the second exmaple, I will test another exmple by fixing random_state and defining the OptimalK instance and passing in our own clustering function

def special_clustering_func(X, k):
m = KMeans(random_state=0)
m.fit(X)
return m.cluster_centers_, m.predict(X)

The first run about gap statistic

optimalk = OptimalK(clusterer=special_clustering_func)
n_clusters = optimalk(X, n_refs=3, cluster_array=range(1, 15))

plt.plot(optimalk.gap_df.n_clusters, optimalk.gap_df.gap_value, linewidth=3)
plt.scatter(optimalk.gap_df[optimalk.gap_df.n_clusters == n_clusters].n_clusters,
optimalk.gap_df[optimalk.gap_df.n_clusters == n_clusters].gap_value, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count')
plt.show()

The second run about gap statistic

optimalk = OptimalK(clusterer=special_clustering_func)
n_clusters = optimalk(X, n_refs=3, cluster_array=range(1, 15))

plt.plot(optimalk.gap_df.n_clusters, optimalk.gap_df.gap_value, linewidth=3)
plt.scatter(optimalk.gap_df[optimalk.gap_df.n_clusters == n_clusters].n_clusters,
optimalk.gap_df[optimalk.gap_df.n_clusters == n_clusters].gap_value, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count')
plt.show()

The third run about gap statistic

optimalk = OptimalK(clusterer=special_clustering_func)
n_clusters = optimalk(X, n_refs=3, cluster_array=range(1, 15))

plt.plot(optimalk.gap_df.n_clusters, optimalk.gap_df.gap_value, linewidth=3)
plt.scatter(optimalk.gap_df[optimalk.gap_df.n_clusters == n_clusters].n_clusters,
optimalk.gap_df[optimalk.gap_df.n_clusters == n_clusters].gap_value, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count')
plt.show()

milesgranger · 2022-02-23T04:22:19Z

I don't know, the 'rust' backend isn't the best at convergence. Although everything seems correct with your provided cluster function in the second iteration.

I've lost interest/time in maintaining this project, so I'd be delighted if you want to poke around yourself to see why this happens.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question #55

Question #55

nnn-niu commented Feb 23, 2022

milesgranger commented Feb 23, 2022

Question #55

Question #55

Comments

nnn-niu commented Feb 23, 2022

coding: utf-8

test the first exmaple

The 1st run about gap statistic

The 2nd run about gap statistic

The 3rd run about gap statistic

test the second exmaple, I will test another exmple by fixing random_state and defining the OptimalK instance and passing in our own clustering function

The first run about gap statistic

The second run about gap statistic

The third run about gap statistic

milesgranger commented Feb 23, 2022