Over the past month, we have introduced how to ensemble a collection of clustering algorithms, covering Majority Vote, Iterative Voting Consensus, and Iterative Probabilistic Voting Consensus. Today, we introduce another method proposed by Nguyen and Caruana: Iterative Pairwise Consensus.
1. Similarity Matrix
The similarity matrix is an N × N matrix, where N is the number of samples. The entry at row i, column j records the fraction of base learners that place sample i and sample j in the same cluster; for example, if 3 out of 4 base learners group samples 2 and 5 together, then matrix[2][5] = 0.75. Once the similarity matrix is built, we can search for the best consensus clustering based on its contents.
Below is an example program that builds the similarity matrix. X is a 2-dimensional list recording the clustering results of all base learners: row i holds the output of the i-th base learner, and column j holds the cluster label that learner assigns to the j-th sample.
For each sample i, the program compares it with every sample j and counts the fraction of base learners that place the two samples in the same cluster.
import numpy as np

def get_matrix(X):
    n_sample = len(X[0])       # number of samples
    n_ensemble = len(X)        # number of base learners
    matrix = np.zeros((n_sample, n_sample))
    for i in range(n_sample):
        for j in range(n_sample):
            for k in range(n_ensemble):
                # count the base learners that co-cluster samples i and j
                if X[k][i] == X[k][j]:
                    matrix[i][j] += 1
    return matrix / n_ensemble  # convert counts to fractions
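As a quick sanity check, here is the function applied to a toy input (the data below is our own illustration): three base learners clustering four samples.

X = [[0, 0, 1, 1],    # learner 0 groups {0, 1} and {2, 3}
     [0, 0, 0, 1],    # learner 1 groups {0, 1, 2} and {3}
     [1, 1, 0, 0]]    # learner 2 groups {0, 1} and {2, 3}

print(get_matrix(X))
# matrix[0][1] == 1.0  (all three learners co-cluster samples 0 and 1)
# matrix[1][2] == 1/3  (only learner 1 co-clusters samples 1 and 2)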
2. Iterative Pairwise Consensus
Once we have the similarity matrix, we can start building the consensus. The Iterative Pairwise Consensus update is:

S(x, Pi) = ( Σ_{y ∈ Pi} M(x, y) ) / |Pi|,  cluster(x) = argmax_i S(x, Pi)

where M is the similarity matrix and Pi is the i-th cluster.
First, look at how the similarity S is computed. The similarity between a sample x and cluster Pi equals the sum of the similarities between x and every sample in Pi (these pairwise similarities are already stored in the similarity matrix), divided by the number of samples in Pi.
Then x is assigned to whichever cluster it is most similar to. An example program for this algorithm is shown below.
def KMeans_ensemble(matrix, n_clusters):
    dist = np.zeros(n_clusters)
    # start from a random cluster assignment
    result = np.random.randint(0, n_clusters, size=len(matrix))
    for iteration in range(20):
        for i in range(len(matrix)):
            for j in range(n_clusters):
                # samples currently assigned to cluster j
                data_index = [k for k in range(len(matrix))
                              if result[k] == j]
                if len(data_index) != 0:
                    # average similarity between sample i and cluster j
                    dist[j] = np.sum(matrix[i][data_index]) / len(data_index)
                else:
                    dist[j] = -np.inf  # never pick an empty cluster
            # assign sample i to the most similar cluster
            result[i] = np.argmax(dist)
    return result.astype(int)
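Continuing the toy example from Section 1 (still our own illustration), the consensus typically recovers the grouping {0, 1} versus {2, 3}, up to a permutation of the labels:

matrix = get_matrix(X)             # toy similarity matrix from above
print(KMeans_ensemble(matrix, 2))  # e.g. [0 0 1 1] or [1 1 0 0]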
3. Comparison
Over the past three weeks, we introduced Iterative Voting Consensus, Iterative Probabilistic Voting Consensus, and Iterative Pairwise Consensus. Let us briefly compare them.
Iterative Voting Consensus is the closest to K-Means: it takes the K-Means algorithm and replaces the distance computation with the Hamming distance, and the cluster-center computation with the per-feature mode.
The main idea of Iterative Probabilistic Voting Consensus is to compare a sample's features against the features of every sample in a cluster, counting how many of them agree.
The main idea of Iterative Pairwise Consensus is to compare a sample's similarity to every sample in a cluster, where the similarity records the fraction of base learners that place the two samples in the same cluster. The three scoring rules are sketched below.
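Here is a minimal sketch of the three scoring rules; the function names are hypothetical. x is a sample's base-learner label vector, members holds the label vectors of the cluster's current members, and matrix is the similarity matrix from get_matrix.

import numpy as np

def score_ivc(x, members):
    # Iterative Voting Consensus: Hamming distance between x's labels
    # and the cluster center, i.e. the per-learner mode of the members.
    members = np.asarray(members)
    center = np.array([np.bincount(col).argmax() for col in members.T])
    return np.sum(np.asarray(x) != center)              # lower is better

def score_ipvc(x, members):
    # Iterative Probabilistic Voting Consensus: total number of label
    # disagreements between x and every member of the cluster.
    return sum(np.sum(np.asarray(x) != np.asarray(m)) for m in members)

def score_ipc(i, member_indices, matrix):
    # Iterative Pairwise Consensus: average similarity between sample i
    # and the cluster members, read straight from the similarity matrix.
    return np.mean(matrix[i][member_indices])           # higher is better

In all three methods the sample is reassigned to the best-scoring cluster, and the loop repeats until the assignments stabilize.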
The biggest difference between these three methods and Majority Vote is that all three are EM-style algorithms: the procedure involves some randomness, so each run may produce a different result, and we can therefore pick the best ensemble according to the application or an evaluation metric. Majority Vote, by contrast, produces a single, unique output, and even when that output is a poor consensus there is little room for adjustment.
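As an example of picking the best run, the sketch below re-runs the consensus several times and keeps the run with the highest silhouette score; using the silhouette score as the evaluation metric is our assumption for illustration, and matrix and n_cluster refer to the variables in the complete program below.

from sklearn.metrics import silhouette_score

# similarity -> distance; the diagonal of matrix is 1, so distances there are 0
distance = 1.0 - matrix

best_score, best_result = -1.0, None
for run in range(5):
    candidate = KMeans_ensemble(matrix, n_cluster)
    # score this run on the precomputed pairwise distances
    score = silhouette_score(distance, candidate, metric="precomputed")
    if score > best_score:
        best_score, best_result = score, candidate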
References
Nguyen, N., & Caruana, R. (2007). Consensus Clusterings. In Proceedings of the IEEE International Conference on Data Mining (ICDM) (pp. 607–612).
About the Author
Chia-Hao Li received the M.S. degree in computer science from Durham University, United Kingdom. His interests include computer algorithms, machine learning, and hardware/software co-design. He was formerly a senior engineer at MediaTek, Taiwan. His current research topic is the application of machine learning techniques to fault detection in high-performance computing systems.
Complete Program
import numpy as np
import copy
from sklearn import datasets


def get_matrix(X):
    # Fraction of base learners that co-cluster each pair of samples.
    n_sample = len(X[0])
    n_ensemble = len(X)
    matrix = np.zeros((n_sample, n_sample))
    for i in range(n_sample):
        for j in range(n_sample):
            for k in range(n_ensemble):
                if X[k][i] == X[k][j]:
                    matrix[i][j] += 1
    return matrix / n_ensemble


def KMeans(X, n_clusters):
    # Plain K-Means, used to generate the base clusterings.
    dist = np.zeros(n_clusters)
    result = np.zeros(len(X))
    center = X[np.random.randint(0, len(X), size=n_clusters)]
    for iteration in range(20):
        for i in range(len(X)):
            for j in range(n_clusters):
                dist[j] = sum((np.array(X[i]) - np.array(center[j])) ** 2)
            result[i] = np.argmin(dist)
        for j in range(n_clusters):
            members = [X[i] for i in range(len(X)) if result[i] == j]
            if members:                      # skip empty clusters
                center[j] = np.mean(members, axis=0)
    return result.astype(int)


def KMeans_ensemble(matrix, n_clusters):
    # Iterative Pairwise Consensus: repeatedly reassign each sample to
    # the cluster with the highest average pairwise similarity.
    dist = np.zeros(n_clusters)
    result = np.random.randint(0, n_clusters, size=len(matrix))
    for iteration in range(20):
        for i in range(len(matrix)):
            for j in range(n_clusters):
                data_index = [k for k in range(len(matrix))
                              if result[k] == j]
                if len(data_index) != 0:
                    dist[j] = np.sum(matrix[i][data_index]) / len(data_index)
                else:
                    dist[j] = -np.inf        # never pick an empty cluster
            result[i] = np.argmax(dist)
    return result.astype(int)


def get_similarity(X, Y):
    # Number of positions where two label vectors agree.
    z = [X[i] == Y[i] for i in range(len(X))]
    return sum(z)


def get_optimal_label(X):
    # Align every base clustering's labels with the first clustering by
    # brute-forcing all relabelings of the three cluster ids.
    for index in range(1, len(X)):
        best = -1
        label = [-1, -1, -1]
        for label_0 in range(3):
            for label_1 in range(3):
                for label_2 in range(3):
                    p = copy.deepcopy(X[index])
                    # rename 0/1/2 to placeholders, then to candidate labels
                    p = ["A" if a == 0 else a for a in p]
                    p = ["B" if a == 1 else a for a in p]
                    p = ["C" if a == 2 else a for a in p]
                    p = [label_0 if a == "A" else a for a in p]
                    p = [label_1 if a == "B" else a for a in p]
                    p = [label_2 if a == "C" else a for a in p]
                    score = get_similarity(p, X[0])
                    if score > best:
                        best = score
                        label = [label_0, label_1, label_2]
        # apply the best relabeling to this base clustering
        X[index] = ["A" if a == 0 else a for a in X[index]]
        X[index] = ["B" if a == 1 else a for a in X[index]]
        X[index] = ["C" if a == 2 else a for a in X[index]]
        X[index] = [label[0] if a == "A" else a for a in X[index]]
        X[index] = [label[1] if a == "B" else a for a in X[index]]
        X[index] = [label[2] if a == "C" else a for a in X[index]]
    return np.array(X)


iris = datasets.load_iris()
X = iris.data
Y = iris.target

n_cluster = 3
n_ensemble = 99

# generate the base clusterings
p = []
for i in range(n_ensemble):
    p.append(KMeans(X, n_cluster))

# align labels, build the similarity matrix, then run the consensus step
p = get_optimal_label(p)
matrix = get_matrix(p)
ensemble = KMeans_ensemble(matrix, n_cluster)
print(ensemble)