Hands-On Machine Learning, Lesson 17: Can Unsupervised Learners Be Ensembled by Resetting Indices?

10 min read · Feb 17, 2022

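Articles on ensemble learning usually describe how to combine regression or classification base learners. But what if we want to combine unsupervised learning algorithms?

In fact, combining unsupervised learning algorithms is rather troublesome, which may be why so few articles cover it. This article first explains where the trouble comes from, then proposes a possible solution and discusses the problems that solution can run into.

1. Why ensembling unsupervised learning algorithms is troublesome

The biggest difference between the output of unsupervised learning and the output of a supervised classification problem is that clustering indices cannot be operated on directly. What does that mean? Let's first look at an example of ensembling for a classification problem.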

+=====+===========+===========+===========+==========+=======+
| No. | Learner 1 | Learner 2 | Learner 3 | Ensemble | Label |
+=====+===========+===========+===========+==========+=======+
|   0 |         0 |         1 |         0 |        0 |     0 |
+-----+-----------+-----------+-----------+----------+-------+
|   1 |         1 |         1 |         2 |        1 |     1 |
+-----+-----------+-----------+-----------+----------+-------+
|   2 |         0 |         2 |         2 |        2 |     2 |
+-----+-----------+-----------+-----------+----------+-------+
|   3 |         1 |         1 |         1 |        1 |     1 |
+-----+-----------+-----------+-----------+----------+-------+

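In this example, each base learner has an accuracy of 75%, but after majority voting the ensemble reaches 100%. This kind of improvement is exactly what we are after when we use ensemble learning.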
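The voting in the table above can be reproduced with a short sketch. The helper names `accuracy` and `majority_vote` are assumptions made for illustration; the prediction and label values are copied from the table.

```python
from collections import Counter

# Predictions copied from the table above, one list per base learner.
learners = [
    [0, 1, 0, 1],  # Learner 1
    [1, 1, 2, 1],  # Learner 2
    [0, 2, 2, 1],  # Learner 3
]
labels = [0, 1, 2, 1]


def accuracy(pred, truth):
    """Fraction of positions where pred equals truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def majority_vote(predictions):
    """Most frequent prediction for each sample, taken across learners."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*predictions)]


ensemble = majority_vote(learners)
for k, pred in enumerate(learners, start=1):
    print(f"Learner {k}: {accuracy(pred, labels):.2f}")
print("Ensemble:", ensemble, f"{accuracy(ensemble, labels):.2f}")
```

Voting works here only because the class labels mean the same thing for every learner: a 1 from Learner 1 and a 1 from Learner 2 refer to the same class.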

Next, let's look at an unsupervised example. Below are the outputs of 3 clustering algorithms.

+=====+===========+===========+===========+
| No. | Learner 1 | Learner 2 | Learner 3 |
+=====+===========+===========+===========+
|   0 |         0 |         1 |         2 |
+-----+-----------+-----------+-----------+
|   1 |         0 |         1 |         2 |
+-----+-----------+-----------+-----------+
|   2 |         1 |         2 |         0 |
+-----+-----------+-----------+-----------+
|   3 |         1 |         2 |         0 |
+-----+-----------+-----------+-----------+
|   4 |         2 |         0 |         1 |
+-----+-----------+-----------+-----------+
|   5 |         2 |         0 |         1 |
+-----+-----------+-----------+-----------+

Notice that the 3 algorithms actually produce the same clustering: samples 0 and 1 are in one cluster, samples 2 and 3 are in another, and samples 4 and 5 are in a third.

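However, looking only at the output cluster indices, we cannot ensemble them by arithmetic or by taking the majority. A cluster index is just a code: however you assign it (you could even use the letters A, B, C instead of the numbers 0, 1, 2), the clustering itself does not change.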
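To make this concrete, here is a minimal sketch that compares the three outputs from the table in two ways: by which samples are grouped together, and by the raw indices. The helper name `same_cluster_pairs` is an assumption made for illustration.

```python
from itertools import combinations

# Raw cluster indices copied from the table above.
learner_1 = [0, 0, 1, 1, 2, 2]
learner_2 = [1, 1, 2, 2, 0, 0]
learner_3 = [2, 2, 0, 0, 1, 1]


def same_cluster_pairs(assignment):
    """Set of sample pairs (i, j), i < j, that share a cluster index."""
    return {
        (i, j)
        for i, j in combinations(range(len(assignment)), 2)
        if assignment[i] == assignment[j]
    }


pairs = [same_cluster_pairs(a) for a in (learner_1, learner_2, learner_3)]
identical = pairs[0] == pairs[1] == pairs[2]
print(sorted(pairs[0]), identical)

# Number of distinct raw indices each sample receives across the learners.
distinct = [len(set(col)) for col in zip(learner_1, learner_2, learner_3)]
print(distinct)
```

The pair sets are identical, so the three learners agree completely on the grouping. Yet every sample receives three different raw indices, so a vote over the raw indices has no majority to find.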

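So how do we ensemble unsupervised algorithms?

2. Resetting indices

Since indices can be assigned arbitrarily, there is a simple way to handle ensembling for unsupervised learning: reset the indices.

The procedure is simple: for each base learner, scan the data from start to finish and relabel the indices in order of first appearance, starting from 0.

Take the output of the second clustering algorithm (Learner 2). The original index of sample 0 is 1, so we change it to 0. Sample 1 also has original index 1; since we have already mapped original index 1 to 0, its new index must be 0 as well. By the same logic, every original index 2 becomes 1 and every original index 0 becomes 2.

We handle the output of the third clustering algorithm (Learner 3) the same way: every original index 2 becomes 0, every original index 0 becomes 1, and every original index 1 becomes 2.

After all the changes, the result is as follows.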

+=====+===========+===========+===========+
| No. | Learner 1 | Learner 2 | Learner 3 |
+=====+===========+===========+===========+
|   0 |         0 |         0 |         0 |
+-----+-----------+-----------+-----------+
|   1 |         0 |         0 |         0 |
+-----+-----------+-----------+-----------+
|   2 |         1 |         1 |         1 |
+-----+-----------+-----------+-----------+
|   3 |         1 |         1 |         1 |
+-----+-----------+-----------+-----------+
|   4 |         2 |         2 |         2 |
+-----+-----------+-----------+-----------+
|   5 |         2 |         2 |         2 |
+-----+-----------+-----------+-----------+

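Now we can easily do arithmetic on the indices to build the ensemble! Below is a Python implementation. We create a list called mapping to record the relationship between old and new indices: the position in mapping is the old index, and the value stored there is the new index.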

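# This excerpt relies on KMeans, most_common, X, Y, n_clusters, n_ensemble,
# candidate, and result as defined in the full program at the end.
for i in range(n_ensemble):

    relabel = 0                          # next new index to hand out
    mapping = [n_clusters] * n_clusters  # n_clusters marks "not assigned yet"

    p = KMeans(X, n_clusters)

    for j in range(len(Y)):
        # First time this old index appears: give it the next new index.
        if mapping[p[j]] == n_clusters:
            mapping[p[j]] = relabel
            relabel = relabel + 1

        p[j] = mapping[p[j]]

    candidate.append(p)

# Majority vote over the relabeled outputs, one sample at a time.
for j in range(len(Y)):
    result[j] = most_common([row[j] for row in candidate])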
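The relabeling step can also be tried in isolation, without running any clustering. The sketch below is a standalone variant of the loop above, not the code used in the article: the function name `relabel_by_first_appearance` is hypothetical, and it uses a dict in place of the sentinel-filled list. The inputs are the Learner 2 and Learner 3 columns from the first clustering table.

```python
def relabel_by_first_appearance(assignment):
    """Relabel cluster indices as 0, 1, 2, ... in order of first appearance.

    Returns the relabeled list and the old-to-new mapping.
    """
    mapping = {}
    relabeled = []
    for old in assignment:
        if old not in mapping:
            # The next unused new index equals the number of indices seen so far.
            mapping[old] = len(mapping)
        relabeled.append(mapping[old])
    return relabeled, mapping


new_2, map_2 = relabel_by_first_appearance([1, 1, 2, 2, 0, 0])  # Learner 2
new_3, map_3 = relabel_by_first_appearance([2, 2, 0, 0, 1, 1])  # Learner 3
print(new_2, map_2)
print(new_3, map_3)
```

Both learners end up with the indices 0, 0, 1, 1, 2, 2, matching the relabeled table. A dict also handles cluster indices that are not consecutive integers starting at 0, which the list-based mapping would not.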

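3. The problem with resetting indices

Resetting indices is simple, but it does not always solve the problem cleanly. Consider the following example.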

+=====+===========+===========+===========+
| No. | Learner 1 | Learner 2 | Learner 3 |
+=====+===========+===========+===========+
|   0 |         0 |         0 |         0 |
+-----+-----------+-----------+-----------+
|   1 |         0 |         0 |         0 |
+-----+-----------+-----------+-----------+
|   2 |         1 |         1 |         2 |
+-----+-----------+-----------+-----------+
|   3 |         1 |         1 |         1 |
+-----+-----------+-----------+-----------+
|   4 |         1 |         1 |         1 |
+-----+-----------+-----------+-----------+
|   5 |         1 |         1 |         1 |
+-----+-----------+-----------+-----------+
|   6 |         1 |         1 |         1 |
+-----+-----------+-----------+-----------+
|   7 |         2 |         2 |         2 |
+-----+-----------+-----------+-----------+
|   8 |         2 |         2 |         2 |
+-----+-----------+-----------+-----------+

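Following the reset procedure, we would change every index 2 of Learner 3 to 1 and every index 1 to 2. But that actually makes things worse: before the reset, Learner 3 disagrees with the other two learners only on sample 2; after it, Learner 3 agrees with them only on samples 0 to 2.

The reset procedure used above decides each new index based only on the data seen so far, without looking at the data that follows. That is why one early mistake propagates through every later sample.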
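The effect can be checked numerically. This sketch counts, before and after the reset, how many samples receive the same index from Learner 1 and Learner 3 in the table above. The helper names `relabel_by_first_appearance` and `agreement` are assumptions made for illustration.

```python
def relabel_by_first_appearance(assignment):
    """Relabel cluster indices as 0, 1, 2, ... in order of first appearance."""
    mapping = {}
    relabeled = []
    for old in assignment:
        if old not in mapping:
            mapping[old] = len(mapping)
        relabeled.append(mapping[old])
    return relabeled


def agreement(a, b):
    """Number of samples that get the same index in both assignments."""
    return sum(x == y for x, y in zip(a, b))


# Columns copied from the table above (Learner 2 is identical to Learner 1).
learner_1 = [0, 0, 1, 1, 1, 1, 1, 2, 2]
learner_3 = [0, 0, 2, 1, 1, 1, 1, 2, 2]

before = agreement(learner_1, learner_3)
reset_1 = relabel_by_first_appearance(learner_1)
reset_3 = relabel_by_first_appearance(learner_3)
after = agreement(reset_1, reset_3)

print(reset_3)        # [0, 0, 1, 2, 2, 2, 2, 1, 1]
print(before, after)  # 8 3
```

A single differing sample early in the scan (sample 2) decides which old index becomes 1, and every later sample inherits that choice.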

So ensembling unsupervised learning algorithms really is not easy. We will discuss better approaches next week; readers can also refer to Chapter 8 of「集成式學習：Python 實踐！整合全部技術，打造最強模型」.

About the author

Chia-Hao Li received the M.S. degree in computer science from Durham University, United Kingdom. He works on computer algorithms, machine learning, and hardware/software co-design. He was formerly a senior engineer at MediaTek, Taiwan. His current research topic is the application of machine learning techniques to fault detection in high-performance computing systems.

Full program

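import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score


def KMeans(X, n_clusters):
    # Minimal k-means with a fixed 20 iterations.
    dist = np.zeros(n_clusters)
    result = np.zeros(len(X), dtype=int)
    # Use n_clusters distinct samples as the initial centers.
    center = X[np.random.choice(len(X), size=n_clusters, replace=False)]

    for iteration in range(20):
        # Assign each sample to its nearest center.
        for i in range(len(X)):
            for j in range(n_clusters):
                dist[j] = np.sum((X[i] - center[j]) ** 2)
            result[i] = np.argmin(dist)

        # Move each center to the mean of its members (skip empty clusters).
        for j in range(n_clusters):
            members = X[result == j]
            if len(members) > 0:
                center[j] = members.mean(axis=0)

    return result


def most_common(x):
    # Most frequent element of a list.
    return max(set(x), key=x.count)


iris = datasets.load_iris()
X = iris.data
Y = iris.target

n_clusters = 3
n_ensemble = 11

candidate = []
result = np.zeros(len(Y), dtype=int)

for i in range(n_ensemble):

    relabel = 0                          # next new index to hand out
    mapping = [n_clusters] * n_clusters  # n_clusters marks "not assigned yet"

    p = KMeans(X, n_clusters)

    for j in range(len(Y)):
        # First time this old index appears: give it the next new index.
        if mapping[p[j]] == n_clusters:
            mapping[p[j]] = relabel
            relabel = relabel + 1

        p[j] = mapping[p[j]]

    candidate.append(p)

# Majority vote over the relabeled outputs, one sample at a time.
for j in range(len(Y)):
    result[j] = most_common([row[j] for row in candidate])

# Cluster indices are compared with class labels directly; this relies on
# the iris samples being ordered by class.
print("Ensemble", accuracy_score(Y, result))
print(result)
print(Y)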