Hands-On Machine Learning Lesson 20 — Ensembling Clustering Algorithms with Iterative Probabilistic Voting Consensus

施威銘研究室
9 min read · Mar 7, 2022


Over the past few weeks, we covered the difficulties of ensembling unsupervised machine learning algorithms, explained how consensus voting solves the problem, and introduced the Iterative Voting Consensus algorithm. Today, we look at how the research team behind Iterative Voting Consensus modified that algorithm into Iterative Probabilistic Voting Consensus (Nguyen and Caruana, 2007).


1. A Review of the Iterative Voting Consensus Algorithm

In last week's article, we saw that ensembling a large collection of clustering base learners is itself a clustering problem, and Iterative Voting Consensus is precisely a method for ensembling clustering (Clustering) base learners (Base Learner). It differs from K-Means in two ways:

1. The distance between each data point and each cluster center is the Hamming distance.

for i in range(len(X)):
    for j in range(n_clusters):
        # Hamming distance: number of base-learner labels that differ
        dist[j] = sum(np.array(X[i]) != np.array(center[j]))
    result[i] = np.argmin(dist)

2. The center of each cluster is the element-wise mode (Mode) of the data assigned to that cluster.

# Recompute each center as the column-wise mode of its members
for j in range(n_clusters):
    data = np.array([X[i] for i in range(len(X))
                     if result[i] == j])
    if len(data) > 0:
        center[j] = frequent(data, n_ensemble)
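The frequent helper, which returns the column-wise mode, comes from the previous lesson and is not shown above. A minimal sketch could look like the following (breaking ties toward the smaller label is my assumption; any consistent rule works):

import numpy as np

def frequent(data, n_ensemble):
    # For each of the n_ensemble label columns, pick the most common
    # value; np.bincount + np.argmax resolve ties toward the smaller label
    return np.array([np.argmax(np.bincount(data[:, j]))
                     for j in range(n_ensemble)])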

2. The Iterative Probabilistic Voting Consensus Algorithm

Both K-Means and Iterative Voting Consensus need to compute cluster centers. Moreover, since Iterative Voting Consensus uses the mode as the center, it also has to handle ties, for example when a cluster's labels from some base learner are [1, 1, 2, 2] and both values appear equally often, which is a bit of a nuisance. Iterative Probabilistic Voting Consensus has no such problem: to decide which cluster data point i belongs to, it compares point i directly against the data in each cluster from the previous iteration.

How is the distance actually computed? Let's look at the formula:

d(y, P_i) = \sum_{f=1}^{C} \frac{1}{n_i} \sum_{x \in P_i} \mathbb{1}[\, y_f \neq x_f \,]

It looks a bit involved, so let's unpack it. The formula gives the distance between a data point y and cluster P_i. For each feature (Feature) f (every point has C features, one label per base learner), we take every point x in P_i, compare it with y, count how many disagree on that feature, and divide by the cluster size n_i. Summing these per-feature averages over all C features gives the distance.
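To make the formula concrete, here is a tiny made-up example computed with NumPy (the numbers are purely illustrative):

import numpy as np

# Cluster P_i holds n_i = 2 points; each point has C = 3 labels
P = np.array([[0, 1, 2],
              [0, 2, 2]])
y = np.array([0, 1, 1])

# Mismatches per feature: 0, 1, and 2, so d(y, P_i) = (0 + 1 + 2) / 2
dist = np.sum(y != P) / len(P)
print(dist)  # 1.5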

Finally, we assign y to the cluster with the smallest distance, and the clustering step is done. Sample code:

for i in range(len(X)):
    for j in range(n_clusters):
        # Members of cluster j from the previous iteration
        data = [X[k] for k in range(len(X)) if result[k] == j]
        if len(data) != 0:
            # Total feature mismatches against all members, averaged
            dist[j] = np.sum(np.array(X[i]) != np.array(data))
            dist[j] = dist[j] / len(data)
        else:
            dist[j] = np.inf  # don't reuse a stale distance for empty clusters
    result[i] = np.argmin(dist)

As mentioned last week, a method like Iterative Probabilistic Voting Consensus does not produce one unique ensemble result the way consensus voting does: it contains some randomness, so repeated runs of the ensemble give slightly different results. We can use an evaluation metric to pick the best of several runs, and even when the number of base learners is small, the method can still produce a fairly good ensemble.
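For instance, we could run the consensus step several times and keep the run that scores best on an internal metric. The sketch below uses scikit-learn's silhouette_score, which is one possible choice of metric rather than the one used in this series; KMeans_ensemble, p, X, n_cluster, and n_ensemble are as in the complete program at the end of this article:

from sklearn.metrics import silhouette_score

best_score, best_labels = -2.0, None
for run in range(5):
    labels = KMeans_ensemble(p.T, n_cluster, n_ensemble)
    score = silhouette_score(X, labels)  # internal clustering quality
    if score > best_score:
        best_score, best_labels = score, labels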

References

Nguyen, N., & Caruana, R. (2007). Consensus Clusterings. In Proceedings of the IEEE International Conference on Data Mining (ICDM), pp. 607–612.

About the Author

Chia-Hao Li received the M.S. degree in computer science from Durham University, United Kingdom. His interests include computer algorithms, machine learning, and hardware/software co-design. He is a former senior engineer at MediaTek, Taiwan. His current research topic is the application of machine learning techniques to fault detection in high-performance computing systems.

Complete Program

import numpy as np
import copy
from sklearn import datasets
from sklearn.metrics import accuracy_score


def KMeans(X, n_clusters):

    dist = np.zeros(n_clusters)
    result = np.zeros(len(X))
    # Initialize centers with randomly chosen data points
    center = X[np.random.randint(0, len(X), size=n_clusters)]

    for iteration in range(20):
        # Assignment step: nearest center by squared Euclidean distance
        for i in range(len(X)):
            for j in range(n_clusters):
                dist[j] = sum((np.array(X[i]) -
                               np.array(center[j])) ** 2)
            result[i] = np.argmin(dist)

        # Update step: move each center to the mean of its members
        for j in range(n_clusters):
            members = [X[i] for i in range(len(X)) if result[i] == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                center[j] = np.mean(members, axis=0)

    return result.astype(int)


def KMeans_ensemble(X, n_clusters, n_ensemble):

    dist = np.zeros(n_clusters)
    # Start from a random partition instead of random centers
    result = np.random.randint(0, n_clusters, size=len(X))

    for iteration in range(20):
        for i in range(len(X)):
            for j in range(n_clusters):
                data = [X[k] for k in range(len(X)) if result[k] == j]
                if len(data) != 0:
                    # Average number of base-learner label mismatches
                    # between point i and the members of cluster j
                    dist[j] = np.sum(np.array(X[i]) !=
                                     np.array(data)) / len(data)
                else:
                    dist[j] = np.inf  # empty cluster can never be nearest
            result[i] = np.argmin(dist)

    return result.astype(int)


def get_similarity(a, b):
    # Helper from the earlier lessons in this series, restored here so
    # the program runs: the fraction of positions where a and b agree
    return np.sum(np.array(a) == np.array(b)) / len(a)


def get_optimal_label(X):
    # Relabel every partition so its cluster IDs best match those of X[0]
    for index in range(1, len(X)):

        best = -1
        label = [-1, -1, -1]

        # Try all 27 ways of mapping the cluster IDs 0, 1, 2
        for label_0 in range(3):
            for label_1 in range(3):
                for label_2 in range(3):
                    p = copy.deepcopy(X[index])
                    # Go through placeholders so renames don't collide
                    p = ["A" if a == 0 else a for a in p]
                    p = ["B" if a == 1 else a for a in p]
                    p = ["C" if a == 2 else a for a in p]
                    p = [label_0 if a == "A" else a for a in p]
                    p = [label_1 if a == "B" else a for a in p]
                    p = [label_2 if a == "C" else a for a in p]

                    score = get_similarity(p, X[0])

                    if score > best:
                        best = score
                        label = [label_0, label_1, label_2]

        # Apply the best mapping to the stored partition
        X[index] = ["A" if a == 0 else a for a in X[index]]
        X[index] = ["B" if a == 1 else a for a in X[index]]
        X[index] = ["C" if a == 2 else a for a in X[index]]
        X[index] = [label[0] if a == "A" else a for a in X[index]]
        X[index] = [label[1] if a == "B" else a for a in X[index]]
        X[index] = [label[2] if a == "C" else a for a in X[index]]

    return np.array(X)


iris = datasets.load_iris()
X = iris.data
Y = iris.target
n_cluster = 3
n_ensemble = 99

# Train 99 K-Means base learners, then align their cluster labels
p = []
for i in range(n_ensemble):
    p.append(KMeans(X, n_cluster))
p = get_optimal_label(p)

# Each row of p.T holds one data point's labels across all learners
ensemble = KMeans_ensemble(p.T, n_cluster, n_ensemble)
print(ensemble)

result = get_optimal_label([Y, ensemble])
print("Ensemble", accuracy_score(result[0], result[1]))

Written by 施威銘研究室

We are dedicated to developing books, maker kits, and teaching aids in the AI field, in the hope of cultivating more AI talent. By bringing together people from different disciplines to develop maker products, we promote hands-on learning and put the ideal of applying what one learns into practice.
