-------[Cookbook 274page]----------------------------------------------------------------------------------

＜データ準備＞

Yeastデータは、読み込みエラーが出ない様にCSVファイルに変換しました。
このページの添付データ「yeast.csv」をダウンロードして使って下さい。

http://archive.ics.uci.edu/ml/machine-learning-databases/yeast/
ダウンロードページ（yeast.csvは、yeast.dataから変換しました）
> mydata <- read.csv("yeast.csv",header=F)   データ読込み 1行目もデータならF
> dat1 <- mydata[,2:9]   2～9列目のみ抽出

> head(dat1) 
    V2   V3   V4   V5  V6  V7   V8   V9
1 0.58 0.61 0.47 0.13 0.5 0.0 0.48 0.22
2 0.43 0.67 0.48 0.27 0.5 0.0 0.53 0.22
3 0.64 0.62 0.49 0.15 0.5 0.0 0.53 0.22
4 0.58 0.44 0.57 0.13 0.5 0.0 0.54 0.22
5 0.42 0.44 0.48 0.54 0.5 0.0 0.48 0.22
6 0.51 0.40 0.56 0.17 0.5 0.5 0.49 0.22
＜k-means clusteringの実行＞

> k <- 4　　クラスタ数設定
> kmeans_res <- kmeans(dat1,k)　　　　k-means clustering実行


> summary(kmeans_res)　　結果の変数確認
             Length Class  Mode   
cluster      1484   -none- numeric
centers        32   -none- numeric
totss           1   -none- numeric
withinss        4   -none- numeric
tot.withinss    1   -none- numeric
betweenss       1   -none- numeric
size            4   -none- numeric
iter            1   -none- numeric
ifault          1   -none- numeric


> table(kmeans_res$cluster)　分類結果のクラスタ数が変数に

  1   2   3   4 
242 489 579 174 


＜結果の視覚化＞

yeast.csvのmit(5列目) x gvh(3列目)のデータを視覚化

yeast.namesに列情報記載あり

＊gvh=3列目（gvh:von Heijne's method for signal sequence recognition.）
＊mit=5列目(mit: Score of discriminant analysis of the amino acid content of
   the N-terminal region (20 residues long) of mitochondrial and 
           non-mitochondrial proteins.)


> par(mfrow=c(1,2))
> plot(dat1[,4],dat1[,2],col=kmeans_res$cluster,main="(A) k-means")
> plot(dat1[,4],dat1[,2],col=mydata[,10],main="(B) actual classes")


> kmeans_res$centers　　クラスタ中心データ表示
         V2        V3        V4        V5        V6          V7
1 0.5021074 0.4971074 0.5294628 0.5023554 0.5041322 0.005495868
2 0.3822904 0.3998978 0.5032924 0.2020859 0.5030675 0.005092025
3 0.5226079 0.5260967 0.5067185 0.1989292 0.5034542 0.011191710
4 0.7536782 0.6979310 0.4277011 0.2990230 0.5143678 0.004770115
         V8        V9
1 0.4876033 0.2595868
2 0.4963395 0.3194479
3 0.5049914 0.2550259
4 0.5099425 0.2482184

> points(kmeans_res$centers[,4],kmeans_res$centers[,2],col=1:4,pch=8,cex=2)
4つのクラスタ中心を＊マークで画像に追加表示


■最適クラスタ係数の決定

(Akaike Information Criterion: AIC)　赤池情報量規準


AIC(M)= 2MLL(M) + 2k

    M:モデル

    k:未知パラメータ数

    MLL(M):最大対数尤度

↓

AIC= res$tot.withinss + 2*(nrow)*(ncol)

    「全変数の群内平方和（tot.withinss）を最小化」
＜最適クラスタ数をAIC・BICで計算＞

source("stepkmeansAIC.R")　外部ファイル実行

STEP1: yeast.csv読込
STEP2: →kmeansクラスタリング計算
STEP3: →AIC,BIC計算、ファイル出力「tmpAICBIC.txt」
STEP4: →AICからクラスタ数計算、ファイル出力「tmpAICnum.txt」
STEP4: →図描画+ファイル出力「tmpAIC.png」
まで実行します。

同じディレクトリに「yeast.csv」と「kmeansAICBIC.R」を置いておく必要があります。





kmeansAICBIC.Rの元ファイル↓
http://stackoverflow.com/questions/15839774/how-to-calculate-bic-for-k-means-clustering-in-r
Provide feedback

Saved searches

Use saved searches to filter your results more quickly

150623.md

150623.md

Files

150623.md

Latest commit

History

150623.md

File metadata and controls