Over the past month, we have introduced how to ensemble a collection of clustering algorithms, covering Majority Vote, Iterative Voting Consensus, and Iterative Probabilistic Voting Consensus. Today, we introduce another method proposed by Nguyen and Caruana: Iterative Pairwise Consensus.
1. Similarity Matrix
The similarity matrix is an N × N matrix, where N is the number of samples. The entry at row i, column j records the fraction of base learners that place sample i and sample j in the same cluster; for example, if 3 out of 4 base learners group samples 2 and 5 together, then matrix[2][5] = 0.75. Once the similarity matrix is built, we can search for the best consensus clustering based on its contents.
Below is an example program that builds the similarity matrix. X is a 2-dimensional list recording the clustering results of all base learners: row i holds the output of the i-th base learner, and column j holds the cluster label that learner assigns to the j-th sample.
For each sample i, the program compares it with every sample j and counts the fraction of base learners that place the two samples in the same cluster.
import numpy as np

def get_matrix(X):
    n_sample = len(X[0])       # number of samples
    n_ensemble = len(X)        # number of base learners
    matrix = np.zeros((n_sample, n_sample))
    for i in range(n_sample):
        for j in range(n_sample):
            for k in range(n_ensemble):
                # count the base learners that co-cluster samples i and j
                if X[k][i] == X[k][j]:
                    matrix[i][j] += 1
    return matrix / n_ensemble  # convert counts to fractions
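As a quick sanity check, here is the function applied to a toy input (the data below is our own illustration): three base learners clustering four samples.

X = [[0, 0, 1, 1],    # learner 0 groups {0, 1} and {2, 3}
     [0, 0, 0, 1],    # learner 1 groups {0, 1, 2} and {3}
     [1, 1, 0, 0]]    # learner 2 groups {0, 1} and {2, 3}

print(get_matrix(X))
# matrix[0][1] == 1.0  (all three learners co-cluster samples 0 and 1)
# matrix[1][2] == 1/3  (only learner 1 co-clusters samples 1 and 2)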
2. Iterative Pairwise Consensus
Once we have the similarity matrix, we can start building the consensus. The Iterative Pairwise Consensus update is:

S(x, Pi) = ( Σ_{y ∈ Pi} M(x, y) ) / |Pi|,  cluster(x) = argmax_i S(x, Pi)

where M is the similarity matrix and Pi is the i-th cluster.
First, look at how the similarity S is computed. The similarity between a sample x and cluster Pi equals the sum of the similarities between x and every sample in Pi (these pairwise similarities are already stored in the similarity matrix), divided by the number of samples in Pi.
Then x is assigned to whichever cluster it is most similar to. An example program for this algorithm is shown below.
def KMeans_ensemble(matrix, n_clusters):
    dist = np.zeros(n_clusters)
    # start from a random cluster assignment
    result = np.random.randint(0, n_clusters, size=len(matrix))
    for iteration in range(20):
        for i in range(len(matrix)):
            for j in range(n_clusters):
                # samples currently assigned to cluster j
                data_index = [k for k in range(len(matrix))
                              if result[k] == j]
                if len(data_index) != 0:
                    # average similarity between sample i and cluster j
                    dist[j] = np.sum(matrix[i][data_index]) / len(data_index)
                else:
                    dist[j] = -np.inf  # never pick an empty cluster
            # assign sample i to the most similar cluster
            result[i] = np.argmax(dist)
    return result.astype(int)
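Continuing the toy example from Section 1 (still our own illustration), the consensus typically recovers the grouping {0, 1} versus {2, 3}, up to a permutation of the labels:

matrix = get_matrix(X)             # toy similarity matrix from above
print(KMeans_ensemble(matrix, 2))  # e.g. [0 0 1 1] or [1 1 0 0]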
3. Comparison
Over the past three weeks, we introduced Iterative Voting Consensus, Iterative Probabilistic Voting Consensus, and Iterative Pairwise Consensus. Let us briefly compare them.
Iterative Voting Consensus is the closest to K-Means: it takes the K-Means algorithm and replaces the distance computation with the Hamming distance, and the cluster-center computation with the per-feature mode.
The main idea of Iterative Probabilistic Voting Consensus is to compare a sample's features against the features of every sample in a cluster, counting how many of them agree.
The main idea of Iterative Pairwise Consensus is to compare a sample's similarity to every sample in a cluster, where the similarity records the fraction of base learners that place the two samples in the same cluster. The three scoring rules are sketched below.
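Here is a minimal sketch of the three scoring rules; the function names are hypothetical. x is a sample's base-learner label vector, members holds the label vectors of the cluster's current members, and matrix is the similarity matrix from get_matrix.

import numpy as np

def score_ivc(x, members):
    # Iterative Voting Consensus: Hamming distance between x's labels
    # and the cluster center, i.e. the per-learner mode of the members.
    members = np.asarray(members)
    center = np.array([np.bincount(col).argmax() for col in members.T])
    return np.sum(np.asarray(x) != center)              # lower is better

def score_ipvc(x, members):
    # Iterative Probabilistic Voting Consensus: total number of label
    # disagreements between x and every member of the cluster.
    return sum(np.sum(np.asarray(x) != np.asarray(m)) for m in members)

def score_ipc(i, member_indices, matrix):
    # Iterative Pairwise Consensus: average similarity between sample i
    # and the cluster members, read straight from the similarity matrix.
    return np.mean(matrix[i][member_indices])           # higher is better

In all three methods the sample is reassigned to the best-scoring cluster, and the loop repeats until the assignments stabilize.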
The biggest difference between these three methods and Majority Vote is that all three are EM-style algorithms: the procedure involves some randomness, so each run may produce a different result, and we can therefore pick the best ensemble according to the application or an evaluation metric. Majority Vote, by contrast, produces a single, unique output, and even when that output is a poor consensus there is little room for adjustment.
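As an example of picking the best run, the sketch below re-runs the consensus several times and keeps the run with the highest silhouette score; using the silhouette score as the evaluation metric is our assumption for illustration, and matrix and n_cluster refer to the variables in the complete program below.

from sklearn.metrics import silhouette_score

# similarity -> distance; the diagonal of matrix is 1, so distances there are 0
distance = 1.0 - matrix

best_score, best_result = -1.0, None
for run in range(5):
    candidate = KMeans_ensemble(matrix, n_cluster)
    # score this run on the precomputed pairwise distances
    score = silhouette_score(distance, candidate, metric="precomputed")
    if score > best_score:
        best_score, best_result = score, candidate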
References
Nguyen, N., & Caruana, R. (2007). Consensus Clusterings. In Proceedings of the IEEE International Conference on Data Mining (ICDM) (pp. 607–612).
About the Author
Chia-Hao Li received the M.S. degree in computer science from Durham University, United Kingdom. His interests include computer algorithms, machine learning, and hardware/software co-design. He was formerly a senior engineer at MediaTek, Taiwan. His current research topic is the application of machine learning techniques to fault detection in high-performance computing systems.
Complete Program
import numpy as np
import copy
from sklearn import datasets


def get_matrix(X):
    # Fraction of base learners that co-cluster each pair of samples.
    n_sample = len(X[0])
    n_ensemble = len(X)
    matrix = np.zeros((n_sample, n_sample))
    for i in range(n_sample):
        for j in range(n_sample):
            for k in range(n_ensemble):
                if X[k][i] == X[k][j]:
                    matrix[i][j] += 1
    return matrix / n_ensemble


def KMeans(X, n_clusters):
    # Plain K-Means, used to generate the base clusterings.
    dist = np.zeros(n_clusters)
    result = np.zeros(len(X))
    center = X[np.random.randint(0, len(X), size=n_clusters)]
    for iteration in range(20):
        for i in range(len(X)):
            for j in range(n_clusters):
                dist[j] = sum((np.array(X[i]) - np.array(center[j])) ** 2)
            result[i] = np.argmin(dist)
        for j in range(n_clusters):
            members = [X[i] for i in range(len(X)) if result[i] == j]
            if members:                      # skip empty clusters
                center[j] = np.mean(members, axis=0)
    return result.astype(int)


def KMeans_ensemble(matrix, n_clusters):
    # Iterative Pairwise Consensus: repeatedly reassign each sample to
    # the cluster with the highest average pairwise similarity.
    dist = np.zeros(n_clusters)
    result = np.random.randint(0, n_clusters, size=len(matrix))
    for iteration in range(20):
        for i in range(len(matrix)):
            for j in range(n_clusters):
                data_index = [k for k in range(len(matrix))
                              if result[k] == j]
                if len(data_index) != 0:
                    dist[j] = np.sum(matrix[i][data_index]) / len(data_index)
                else:
                    dist[j] = -np.inf        # never pick an empty cluster
            result[i] = np.argmax(dist)
    return result.astype(int)


def get_similarity(X, Y):
    # Number of positions where two label vectors agree.
    z = [X[i] == Y[i] for i in range(len(X))]
    return sum(z)


def get_optimal_label(X):
    # Align every base clustering's labels with the first clustering by
    # brute-forcing all relabelings of the three cluster ids.
    for index in range(1, len(X)):
        best = -1
        label = [-1, -1, -1]
        for label_0 in range(3):
            for label_1 in range(3):
                for label_2 in range(3):
                    p = copy.deepcopy(X[index])
                    # rename 0/1/2 to placeholders, then to candidate labels
                    p = ["A" if a == 0 else a for a in p]
                    p = ["B" if a == 1 else a for a in p]
                    p = ["C" if a == 2 else a for a in p]
                    p = [label_0 if a == "A" else a for a in p]
                    p = [label_1 if a == "B" else a for a in p]
                    p = [label_2 if a == "C" else a for a in p]
                    score = get_similarity(p, X[0])
                    if score > best:
                        best = score
                        label = [label_0, label_1, label_2]
        # apply the best relabeling to this base clustering
        X[index] = ["A" if a == 0 else a for a in X[index]]
        X[index] = ["B" if a == 1 else a for a in X[index]]
        X[index] = ["C" if a == 2 else a for a in X[index]]
        X[index] = [label[0] if a == "A" else a for a in X[index]]
        X[index] = [label[1] if a == "B" else a for a in X[index]]
        X[index] = [label[2] if a == "C" else a for a in X[index]]
    return np.array(X)


iris = datasets.load_iris()
X = iris.data
Y = iris.target

n_cluster = 3
n_ensemble = 99

# generate the base clusterings
p = []
for i in range(n_ensemble):
    p.append(KMeans(X, n_cluster))

# align labels, build the similarity matrix, then run the consensus step
p = get_optimal_label(p)
matrix = get_matrix(p)
ensemble = KMeans_ensemble(matrix, n_cluster)
print(ensemble)