Hands-On Machine Learning, Lesson 21: Ensemble Clustering with Iterative Pairwise Consensus

施威銘研究室
Mar 14, 2022


Over the past month, we have looked at how to ensemble a collection of clustering results, covering Majority Vote, Iterative Voting Consensus, and Iterative Probabilistic Voting Consensus. Today we introduce Iterative Pairwise Consensus, another method proposed by Nguyen and Caruana.

1. Similarity Matrix

The similarity matrix is an N × N matrix, where N is the number of data points. The entry at row i, column j records the proportion of base learners that put data point i and data point j in the same cluster. For example, if 3 out of 4 base learners put data points 2 and 5 in the same cluster, the entry at row 2, column 5 is 0.75. Once the similarity matrix is built, we can search for the best ensemble result based on its contents.

Below is sample code for building the similarity matrix. X is a 2-dimensional list that stores the clustering results of all base learners: row i holds the output of the i-th base learner, and column j holds the cluster index that learner assigned to the j-th data point.

In the code, we take data point i, compare it with data point j, and compute the proportion of base learners that put these two points in the same cluster.

import numpy as np

def get_matrix(X):

    n_sample = len(X[0])      # number of data points
    n_ensemble = len(X)       # number of base learners

    matrix = np.zeros((n_sample, n_sample))

    # Count how many base learners put points i and j in the same cluster.
    for i in range(n_sample):
        for j in range(n_sample):
            for k in range(n_ensemble):
                if X[k][i] == X[k][j]:
                    matrix[i][j] = matrix[i][j] + 1

    # Turn the counts into proportions.
    return matrix / n_ensemble

2. Iterative Pairwise Consensus

Once we have the similarity matrix, we can start the ensembling. The Iterative Pairwise Consensus update rule is as follows:
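
In LaTeX notation, with M denoting the similarity matrix, P_i the i-th cluster, and pi(x) the cluster assigned to data point x (the symbols here are chosen for readability):

    S(x, P_i) = \frac{1}{|P_i|} \sum_{y \in P_i} M(x, y)

    \pi(x) = \arg\max_i \, S(x, P_i)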

First, let's look at how the similarity S in the formula is computed. The similarity between a data point x and cluster P_i is the sum of the similarities between x and every data point in P_i (these pairwise similarities are already stored in the similarity matrix), divided by the number of data points in P_i.

Next, we assign x to the cluster with which it has the highest similarity. Sample code for this algorithm is shown below.

def KMeans_ensemble(matrix, n_clusters):

    dist = np.zeros(n_clusters)
    # Start from a random cluster assignment.
    result = np.random.randint(0, n_clusters, size = len(matrix))

    for iteration in range(20):
        for i in range(len(matrix)):
            for j in range(n_clusters):
                # Data points currently assigned to cluster j.
                data_index = [k for k in range(len(matrix))
                              if result[k] == j]
                if len(data_index) != 0:
                    # Average similarity between point i and cluster j.
                    dist[j] = (np.sum(matrix[i][data_index])
                               / len(data_index))
            # Move point i to the most similar cluster.
            result[i] = np.argmax(dist)

    return result.astype(int)

3. Comparison

Over the past three weeks we have introduced Iterative Voting Consensus, Iterative Probabilistic Voting Consensus, and Iterative Pairwise Consensus. Let's now compare them briefly.

Iterative Voting Consensus is the closest to the K-Means algorithm: it replaces K-Means's distance computation with the Hamming distance and replaces the cluster-center computation with taking the mode, as sketched below.
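
As a quick reminder, here is a minimal sketch of those two substitutions, operating on base-learner label vectors (the helper names hamming_distance and cluster_center are our own):

import numpy as np

def hamming_distance(a, b):
    # Number of positions where the two label vectors disagree.
    return int(np.sum(np.array(a) != np.array(b)))

def cluster_center(members):
    # Feature-wise mode: the most common label in each column.
    members = np.array(members)
    return np.array([np.bincount(col).argmax() for col in members.T])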

The main idea of Iterative Probabilistic Voting Consensus is to compare the features of a data point against those of every data point in a cluster and count how many of them agree.
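
A minimal sketch of that per-cluster score might look like this (feature_match_score is our own illustrative helper; IPVC itself minimizes mismatches, which is equivalent to maximizing matches, since each comparison either matches or not):

def feature_match_score(x_labels, cluster_members):
    # x_labels: the base-learner labels of one data point.
    # cluster_members: the label vectors of every point in the cluster.
    cluster_members = np.array(cluster_members)
    # Average number of matching labels between x and the cluster's members.
    return np.sum(cluster_members == np.array(x_labels)) / len(cluster_members)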

The main idea of Iterative Pairwise Consensus is to compare the similarity between a data point and every data point in a cluster, where the similarity records the proportion of base learners that put the two points in the same cluster.

The biggest difference between these three methods and Majority Vote is that all three belong to the family of EM algorithms. Their procedure involves some randomness, so different runs may produce different results, which lets us pick the ensemble that scores best for a given application or evaluation metric. Majority Vote, by contrast, produces one unique output, and even when that output is a poor ensemble, there is little room to adjust it.
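
For example, a minimal sketch of this pick-the-best idea (reusing the matrix, X, and n_cluster variables from the complete program below, and using scikit-learn's silhouette_score as one possible evaluation metric):

from sklearn.metrics import silhouette_score

# Run the EM-style ensemble several times and keep the best-scoring run.
best_labels, best_score = None, -1.0
for run in range(10):
    labels = KMeans_ensemble(matrix, n_cluster)
    if len(set(labels)) > 1:                   # silhouette needs >= 2 clusters
        score = silhouette_score(X, labels)    # higher is better
        if score > best_score:
            best_labels, best_score = labels, score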

References

Nguyen, Nam & Caruana, Rich. (2007). Consensus Clusterings. Proceedings of the IEEE International Conference on Data Mining (ICDM), pp. 607–612.

About the Author

Chia-Hao Li received the M.S. degree in computer science from Durham University, United Kingdom. His interests include computer algorithms, machine learning, and hardware/software co-design. He is a former senior engineer at MediaTek, Taiwan. His current research topic is the application of machine learning techniques to fault detection in high-performance computing systems.

Complete Program

import numpy as np
import copy
from sklearn import datasets
from sklearn.metrics import accuracy_score

def get_matrix(X):

    n_sample = len(X[0])      # number of data points
    n_ensemble = len(X)       # number of base learners

    matrix = np.zeros((n_sample, n_sample))

    # Count how many base learners put points i and j in the same cluster.
    for i in range(n_sample):
        for j in range(n_sample):
            for k in range(n_ensemble):
                if X[k][i] == X[k][j]:
                    matrix[i][j] = matrix[i][j] + 1

    # Turn the counts into proportions.
    return matrix / n_ensemble

def KMeans(X, n_clusters):

    dist = np.zeros(n_clusters)
    result = np.zeros(len(X))
    # Pick random data points as the initial cluster centers.
    center = X[np.random.randint(0, len(X), size = n_clusters)]

    for iteration in range(20):
        # Assignment step: move each point to its nearest center.
        for i in range(len(X)):
            for j in range(n_clusters):
                dist[j] = sum((np.array(X[i]) - np.array(center[j])) ** 2)
            result[i] = np.argmin(dist)

        # Update step: recompute each center as the mean of its members.
        for j in range(n_clusters):
            members = [X[i] for i in range(len(X)) if result[i] == j]
            if len(members) != 0:   # keep the old center if the cluster is empty
                center[j] = np.mean(members, axis = 0)

    return result.astype(int)

def KMeans_ensemble(matrix, n_clusters):

    dist = np.zeros(n_clusters)
    # Start from a random cluster assignment.
    result = np.random.randint(0, n_clusters, size = len(matrix))

    for iteration in range(20):
        for i in range(len(matrix)):
            for j in range(n_clusters):
                # Data points currently assigned to cluster j.
                data_index = [k for k in range(len(matrix))
                              if result[k] == j]
                if len(data_index) != 0:
                    # Average similarity between point i and cluster j.
                    dist[j] = (np.sum(matrix[i][data_index])
                               / len(data_index))
            # Move point i to the most similar cluster.
            result[i] = np.argmax(dist)

    return result.astype(int)

def get_similarity(X, Y):
    # Number of positions where the two label vectors agree.
    z = [X[i] == Y[i] for i in range(len(X))]
    return sum(z)

def get_optimal_label(X):

    # Relabel every clustering so that its labels best match X[0].
    for index in range(1, len(X)):

        best = -1
        label = [-1, -1, -1]

        # Try all 3 x 3 x 3 relabelings and keep the one that agrees
        # most with X[0].
        for label_0 in range(3):
            for label_1 in range(3):
                for label_2 in range(3):
                    p = copy.deepcopy(X[index])
                    # Map 0/1/2 to placeholders first so the renaming
                    # steps do not overwrite each other.
                    p = ["A" if a == 0 else a for a in p]
                    p = ["B" if a == 1 else a for a in p]
                    p = ["C" if a == 2 else a for a in p]
                    p = [label_0 if a == "A" else a for a in p]
                    p = [label_1 if a == "B" else a for a in p]
                    p = [label_2 if a == "C" else a for a in p]
                    score = get_similarity(p, X[0])

                    if score > best:
                        best = score
                        label = [label_0, label_1, label_2]

        # Apply the best relabeling to X[index].
        X[index] = ["A" if a == 0 else a for a in X[index]]
        X[index] = ["B" if a == 1 else a for a in X[index]]
        X[index] = ["C" if a == 2 else a for a in X[index]]
        X[index] = [label[0] if a == "A" else a for a in X[index]]
        X[index] = [label[1] if a == "B" else a for a in X[index]]
        X[index] = [label[2] if a == "C" else a for a in X[index]]

    return np.array(X)

iris = datasets.load_iris()
X = iris.data
Y = iris.target
n_cluster = 3
n_ensemble = 99

# Collect the clustering result of every base learner.
p = []
for i in range(n_ensemble):
    p.append(KMeans(X, n_cluster))

p = get_optimal_label(p)
matrix = get_matrix(p)
ensemble = KMeans_ensemble(matrix, n_cluster)
print(ensemble)
