Over the past two weeks, we wrote one article explaining why ensembling unsupervised machine learning algorithms is difficult, and another introducing a majority-consensus approach to the problem. Quite a few researchers have proposed different ways to ensemble clustering results; today we introduce the method proposed by Nam Nguyen and Rich Caruana (Nguyen and Caruana, 2007).
1. Reviewing the Clustering Results of All Base Learners
The table below shows the clustering results of 3 base learners: each column corresponds to one base learner, and each row corresponds to one data point.
+=====+===========+===========+===========+
| No. | Learner 1 | Learner 2 | Learner 3 |
+=====+===========+===========+===========+
| 0 | 0 | 1 | 2 |
+-----+-----------+-----------+-----------+
| 1 | 0 | 1 | 2 |
+-----+-----------+-----------+-----------+
| 2 | 1 | 2 | 0 |
+-----+-----------+-----------+-----------+
| 3 | 1 | 2 | 0 |
+-----+-----------+-----------+-----------+
| 4 | 2 | 0 | 1 |
+-----+-----------+-----------+-----------+
| 5 | 2 | 0 | 1 |
+-----+-----------+-----------+-----------+
We already know that the clustering results of these 3 learners are consistent: points 0 and 1 belong to one cluster, points 2 and 3 to another, and points 4 and 5 to a third; only the label assigned to each cluster differs.
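Although the raw labels differ from column to column, the three columns describe the same partition. One quick way to verify this is to compare pairwise co-membership: for every pair of points, check whether each learner places them in the same cluster. The following is a minimal sketch of that check (the labels array simply re-types the table above):

import numpy as np

# Each row is one data point, each column one base learner's label (the table above)
labels = np.array([
    [0, 1, 2],
    [0, 1, 2],
    [1, 2, 0],
    [1, 2, 0],
    [2, 0, 1],
    [2, 0, 1],
])

def co_membership(col):
    # True where two points share a cluster under this learner
    return col[:, None] == col[None, :]

# All learners induce exactly the same partition, regardless of label names
same = all(np.array_equal(co_membership(labels[:, 0]), co_membership(labels[:, k]))
           for k in range(labels.shape[1]))
print(same)  # True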
Now, please follow along with me.
Step 1: Take a deep breath.
Step 2: Stand up and stretch.
Step 3: Drink a glass of water...
Step 4: Forget how we obtained these clustering results. Now, let us look at the table again.
+=====+===========+===========+===========+
| No. | Feature 1 | Feature 2 | Feature 3 |
+=====+===========+===========+===========+
| 0 | 0 | 1 | 2 |
+-----+-----------+-----------+-----------+
| 1 | 0 | 1 | 2 |
+-----+-----------+-----------+-----------+
| 2 | 1 | 2 | 0 |
+-----+-----------+-----------+-----------+
| 3 | 1 | 2 | 0 |
+-----+-----------+-----------+-----------+
| 4 | 2 | 0 | 1 |
+-----+-----------+-----------+-----------+
| 5 | 2 | 0 | 1 |
+-----+-----------+-----------+-----------+
What do you see? The table contains 6 data points, each with 3 features and no labels. This is now a brand-new clustering problem!
In other words, ensembling a collection of clustering base learners is itself a clustering problem.
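In code, the new dataset is literally just the base learners' label vectors stacked column by column. The sketch below is a hypothetical illustration (not the article's own code): it uses scikit-learn's KMeans as a stand-in base learner on placeholder data, only to show how the meta-level feature matrix is formed:

import numpy as np
from sklearn.cluster import KMeans  # stand-in base learner for this illustration

rng = np.random.default_rng(0)
data = rng.normal(size=(6, 4))      # placeholder data: 6 points with 4 features

# Run a few base clusterers and stack their label vectors column-wise
base_outputs = [KMeans(n_clusters=3, n_init=10, random_state=s).fit_predict(data)
                for s in range(3)]
meta_X = np.column_stack(base_outputs)
print(meta_X.shape)                 # (6, 3): 6 points, 3 label "features"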
2. Iterative Voting Consensus
One of the methods proposed by Nam Nguyen and Rich Caruana, Iterative Voting Consensus (Nguyen and Caruana, 2007), treats the outputs of the clustering base learners as another clustering problem and solves it with a K-Means-like algorithm.
However, the feature values are now integer cluster labels rather than real numbers, so the authors modify the usual K-Means procedure in two ways (a compact sketch of the full loop follows these two steps):
1. The distance between a data point and each cluster center is measured with the Hamming distance.
# Assign each point to the center with the smallest Hamming distance
for i in range(len(X)):
    for j in range(n_clusters):
        dist[j] = sum(np.array(X[i]) != np.array(center[j]))
    result[i] = np.argmin(dist)
2. Each cluster center is updated as the mode of the points assigned to that cluster, computed feature by feature.
# Update each center with the column-wise mode of its assigned points
for j in range(n_clusters):
    data = np.array([X[i] for i in range(len(X)) if result[i] == j])
    if len(data) > 0:
        center[j] = frequent(data, n_ensemble)
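Putting the two steps together, one run of Iterative Voting Consensus looks roughly like the sketch below. It assumes X is the stacked label matrix, center holds n_clusters integer label vectors drawn at random from X, and frequent computes the column-wise mode, as in the complete code at the end of the article:

# Iterate: assign by Hamming distance, then re-estimate each center as the
# per-learner mode of its members (fixed iteration budget, as in the full code)
for iteration in range(20):
    for i in range(len(X)):
        for j in range(n_clusters):
            dist[j] = sum(np.array(X[i]) != np.array(center[j]))
        result[i] = np.argmin(dist)
    for j in range(n_clusters):
        data = np.array([X[i] for i in range(len(X)) if result[i] == j])
        if len(data) > 0:
            center[j] = frequent(data, n_ensemble)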
3. Benefits of Iterative Voting Consensus
The original paper (Nguyen and Caruana, 2007) points out that ensembling clustering results by simple majority consensus is not robust when there are too few base learners. Moreover, majority consensus produces only a single answer; if that answer does not match the true structure, there is no way to revise it.
The advantage of Iterative Voting Consensus is that every run of the K-Means-like procedure involves some randomness (the initial cluster centers are drawn at random), so repeated runs of the ensemble give slightly different results. We can therefore use an evaluation metric to pick the best run, and even with only a few base learners we can still obtain a reasonably good ensemble.
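The article does not commit to a particular evaluation metric. As one possible illustration, the sketch below runs the consensus step several times and keeps the run with the highest silhouette score on the original features; the choice of silhouette score, and the names KMeans_ensemble, p, X, n_cluster, and n_ensemble (taken from the complete code at the end of the article), are assumptions of this sketch:

from sklearn.metrics import silhouette_score

best_score, best_result = -1.0, None
for run in range(10):                            # several random restarts
    candidate = KMeans_ensemble(p.T, n_cluster, n_ensemble)
    if len(set(candidate)) > 1:                  # silhouette needs at least 2 clusters
        score = silhouette_score(X, candidate)   # evaluate on the original features
        if score > best_score:
            best_score, best_result = score, candidate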
References
Nguyen, Nam, and Rich Caruana (2007). Consensus Clusterings. Proceedings of the IEEE International Conference on Data Mining (ICDM), 607-612.
About the Author
Chia-Hao Li received the M.S. degree in computer science from Durham University, United Kingdom. He works on computer algorithms, machine learning, and hardware/software co-design. He was formerly a senior engineer at MediaTek, Taiwan. His current research topic is the application of machine learning techniques to fault detection in high-performance computing systems.
Complete Code
import numpy as np
import copy
from sklearn import datasets
from sklearn.metrics import accuracy_score


def frequent(x, n_ensemble):
    # Column-wise mode: for every base learner, take the most common label
    # among the points currently assigned to the cluster.
    new_center = []
    for i in range(n_ensemble):
        l = x[:, i].tolist()
        new_center.append(max(set(l), key=l.count))
    return new_center


def KMeans(X, n_clusters):
    # Plain K-Means on real-valued features, used to build the base learners.
    dist = np.zeros(n_clusters)
    result = np.zeros(len(X))
    center = X[np.random.randint(0, len(X), size=n_clusters)]
    for iteration in range(20):
        for i in range(len(X)):
            for j in range(n_clusters):
                dist[j] = sum((np.array(X[i]) - np.array(center[j])) ** 2)
            result[i] = np.argmin(dist)
        for j in range(n_clusters):
            center[j] = np.mean([X[i] for i in range(len(X)) if result[i] == j], axis=0)
    return result.astype(int)


def KMeans_ensemble(X, n_clusters, n_ensemble):
    # Iterative Voting Consensus: K-Means over the label matrix, with
    # Hamming distance for assignment and the column-wise mode for centers.
    dist = np.zeros(n_clusters)
    result = np.zeros(len(X))
    center = X[np.random.randint(0, len(X), size=n_clusters)]
    for iteration in range(20):
        for i in range(len(X)):
            for j in range(n_clusters):
                dist[j] = sum(np.array(X[i]) != np.array(center[j]))
            result[i] = np.argmin(dist)
        for j in range(n_clusters):
            data = np.array([X[i] for i in range(len(X)) if result[i] == j])
            if len(data) > 0:
                center[j] = frequent(data, n_ensemble)
    return result.astype(int)


def get_similarity(X, Y):
    # Number of positions where two label sequences agree.
    z = [X[i] == Y[i] for i in range(len(X))]
    return sum(z)


def get_optimal_label(X):
    # Align every base learner's labels to the first learner by trying all
    # permutations of the 3 cluster labels and keeping the best match.
    for index in range(1, len(X)):
        best = -1
        label = [-1, -1, -1]
        for label_0 in range(3):
            for label_1 in range(3):
                for label_2 in range(3):
                    p = copy.deepcopy(X[index])
                    p = ["A" if a == 0 else a for a in p]
                    p = ["B" if a == 1 else a for a in p]
                    p = ["C" if a == 2 else a for a in p]
                    p = [label_0 if a == "A" else a for a in p]
                    p = [label_1 if a == "B" else a for a in p]
                    p = [label_2 if a == "C" else a for a in p]
                    score = get_similarity(p, X[0])
                    if score > best:
                        best = score
                        label = [label_0, label_1, label_2]
        X[index] = ["A" if a == 0 else a for a in X[index]]
        X[index] = ["B" if a == 1 else a for a in X[index]]
        X[index] = ["C" if a == 2 else a for a in X[index]]
        X[index] = [label[0] if a == "A" else a for a in X[index]]
        X[index] = [label[1] if a == "B" else a for a in X[index]]
        X[index] = [label[2] if a == "C" else a for a in X[index]]
    return np.array(X)


iris = datasets.load_iris()
X = iris.data
Y = iris.target

n_cluster = 3
n_ensemble = 99

# Build 99 base learners, align their labels, then run the consensus step
# on the transposed label matrix (one row per data point).
p = []
for i in range(n_ensemble):
    p.append(KMeans(X, n_cluster))
p = get_optimal_label(p)
ensemble = KMeans_ensemble(p.T, n_cluster, n_ensemble)
print(ensemble)
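The script only prints the raw consensus labels. Because the Iris targets are loaded as Y anyway, one optional check (not part of the original article) is to compare the consensus partition with the true species using a permutation-invariant index such as the adjusted Rand score:

from sklearn.metrics import adjusted_rand_score

# Agreement between the consensus partition and the true Iris species,
# independent of how the cluster labels happen to be numbered
print(adjusted_rand_score(Y, ensemble))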