KNN is a common supervised learning algorithm. The idea is to find the few training samples closest to a test sample and tally their labels to produce the final prediction. In the left half of Figure 1, the correct answer for the red test sample is blue, and KNN does predict it correctly.
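For reference, plain KNN takes only a few lines of NumPy. The sketch below is just an illustration (the helper name knn_predict is not from any paper), and the same logic reappears in the complete program at the end of this post.

import numpy as np
from scipy import stats

def knn_predict(train_x, train_y, test_x, k):
    # squared Euclidean distance from the test sample to every training sample
    dist = np.sum((train_x - test_x) ** 2, axis=1)
    # indices of the k nearest training samples
    nearest = np.argsort(dist)[:k]
    # majority vote over the neighbours' labels
    return stats.mode(train_y[nearest])[0]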
But can this approach really handle every kind of data? Look at the right half of Figure 1: on that data, standard KNN actually gives the wrong answer.
So what can we do? This is where the Extended Neighbourhood rule and Bootstrap, from ensemble learning (Ensemble Learning), come in!
1. Extended Neighbourhood
The idea is straightforward. For a test sample A, we first find the training sample B closest to A, then find the training sample C closest to B, then the sample D closest to C, and so on, repeating K times. This chain of neighbours is the Extended Neighbourhood rule.
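In NumPy, that chain search might look like the sketch below (the helper name extended_neighbourhood is just for illustration; the ensemble version appears in the complete program later in this post). Each found neighbour is removed and becomes the next search target, and a majority vote over the collected labels gives the prediction.

import numpy as np

def extended_neighbourhood(train_x, train_y, test_x, k):
    labels = []
    target = test_x                      # start from the test sample A
    for _ in range(k):
        # squared distance from every remaining training sample to the current target
        dist = np.sum((train_x - target) ** 2, axis=1)
        nearest = np.argmin(dist)        # closest remaining sample (B, then C, ...)
        labels.append(train_y[nearest])  # record its label
        target = train_x[nearest]        # it becomes the next target
        train_x = np.delete(train_x, nearest, axis=0)
        train_y = np.delete(train_y, nearest, axis=0)
    return labels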
2. Bootstrap
Bootstrap is one of the most common techniques in ensemble learning, and its idea is just as simple. Suppose the training set has N samples. We draw N samples from it with replacement (Sample with Replacement, so the same sample may be drawn more than once) to form a new dataset, then run KNN on that dataset to obtain a prediction.
We repeat this step several times. Because each round uses a fresh resample, the datasets differ slightly, and so do the resulting predictions.
Finally, we combine the predictions from all bootstrap rounds by majority vote to obtain the ensemble prediction.
Besides resampling the data, we can also sample the features (Feature) without replacement (Sample without Replacement). That is, if the data originally has P features, we keep only a subset of them, which further increases the diversity of the resampled datasets. One such round is sketched below.
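The sketch below shows one bootstrap round (the helper name bootstrap_round is illustrative; it mirrors the bootstrap helper in the complete program at the end of this post). Running KNN on each resampled dataset and voting over the b rounds gives the bagged prediction.

import numpy as np

def bootstrap_round(train_x, train_y, test_x, n, p, f):
    # draw n row indices with replacement (the same sample may appear more than once)
    rows = np.random.choice(n, size=n, replace=True)
    # draw f of the p feature indices without replacement
    cols = np.random.choice(p, size=f, replace=False)
    return train_x[rows][:, cols], train_y[rows], test_x[cols]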
3. KNN Ensemble
A research team combined the Extended Neighbourhood rule and Bootstrap described above into an ensemble KNN algorithm!
A Python implementation of the core routine is shown below. The parameters n, p, f, b, and k are the total number of samples, the total number of features, the number of sampled features, the number of bootstrap rounds, and the number of neighbours consulted for each prediction. For every bootstrap round we build a resampled dataset (samples drawn with replacement plus features drawn without replacement); we then compute the distances from the data to the current target, find the minimum distance, record the prediction, update the target, and repeat this K times. Finally, majority voting gives the overall prediction.
def exnrule(train_x, train_y, test_x, n, p, f, b, k):
    pred = []
    for _ in range(b):
        # resample the data: rows with replacement, features without replacement
        sub_train_x, sub_train_y, sub_test_x = bootstrap(train_x, train_y,
                                                         test_x, n, p, f)
        sub_pred = []
        golden = sub_test_x                        # current target, starts at the test sample
        for _ in range(k):
            # squared distance from every remaining training sample to the current target
            dist = np.sum((sub_train_x - golden) * (sub_train_x - golden), axis=1)
            nearest = np.argmin(dist)              # closest remaining training sample
            sub_pred.append(sub_train_y[nearest])  # record its label
            golden = sub_train_x[nearest]          # it becomes the next target
            sub_train_x = np.delete(sub_train_x, nearest, axis=0)
            sub_train_y = np.delete(sub_train_y, nearest, axis=0)
        pred.append(stats.mode(sub_pred)[0])       # vote within this bootstrap round
    return stats.mode(pred)[0]                     # vote across all bootstrap rounds
Because the algorithm involves random sampling, the results vary from run to run, sometimes better and sometimes worse, which is the somewhat inconvenient part of this method.
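If you want reproducible results while experimenting, one option is to fix NumPy's global random seed before running the evaluation; this is a standard NumPy call, not something the algorithm itself requires.

import numpy as np

np.random.seed(0)  # fix the seed so the sampling, and therefore the results, are repeatable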
References
Ali, A., Hamraz, M., Gul, N., Khan, D., Khan, Z., & Aldahmani, S. (2022). A k nearest neighbours classifiers ensemble based on extended neighbourhood rule and features subsets. arXiv:2205.15111. doi:10.48550/arXiv.2205.15111
About the Author
Chia-Hao Li received the M.S. degree in computer science from Durham University, United Kingdom. His work covers computer algorithms, machine learning, and hardware/software co-design. He was formerly a senior engineer at Mediatek, Taiwan. His current research topic is the application of machine learning techniques to fault detection in high-performance computing systems.
Complete Program
import numpy as np
import pandas as pd
from scipy import stats
import warnings

warnings.filterwarnings("ignore")

# Red wine quality dataset: 11 physicochemical features, quality label in the last column
df = pd.read_csv("winequality-red.csv")
X = df.iloc[:, 0:11].to_numpy()
Y = df.iloc[:, 11].to_numpy()

n = len(X) - 1                            # training-set size (one sample is held out)
p = len(X[0])                             # total number of features
f = int(np.random.choice(range(1, p)))    # number of features kept in each bootstrap round
k = 10                                    # neighbours consulted per prediction
b = 10                                    # number of bootstrap rounds

def bootstrap(train_x, train_y, test_x, n, p, f):
    # sample n rows with replacement and f feature columns without replacement
    sub_n = np.random.choice(n, size=n)
    sub_p = np.random.choice(p, size=f, replace=False)
    sub_train_x = train_x[sub_n]
    sub_train_y = train_y[sub_n]
    return sub_train_x[:, sub_p], sub_train_y, test_x[sub_p]

def exnrule(train_x, train_y, test_x, n, p, f, b, k):
    pred = []
    for _ in range(b):
        sub_train_x, sub_train_y, sub_test_x = bootstrap(train_x, train_y,
                                                         test_x, n, p, f)
        sub_pred = []
        golden = sub_test_x                        # current target, starts at the test sample
        for _ in range(k):
            # squared distance from every remaining training sample to the current target
            dist = np.sum((sub_train_x - golden) * (sub_train_x - golden), axis=1)
            nearest = np.argmin(dist)              # closest remaining training sample
            sub_pred.append(sub_train_y[nearest])  # record its label
            golden = sub_train_x[nearest]          # it becomes the next target
            sub_train_x = np.delete(sub_train_x, nearest, axis=0)
            sub_train_y = np.delete(sub_train_y, nearest, axis=0)
        pred.append(stats.mode(sub_pred)[0])       # vote within this bootstrap round
    return stats.mode(pred)[0]                     # vote across all bootstrap rounds

def knn(train_x, train_y, test_x, k):
    # standard KNN: majority vote over the k nearest training samples
    dist = np.sum((train_x - test_x) * (train_x - test_x), axis=1)
    pred_index = np.argsort(dist)[:k]
    return stats.mode(train_y[pred_index])[0]

# Leave-one-out evaluation: hold out the last sample, then rotate the arrays by one
hit = 0
target = 0
for _ in range(len(X)):
    train_x, train_y = X[0:-1], Y[0:-1]
    test_x, test_y = X[-1], Y[-1]
    pred = exnrule(train_x, train_y, test_x, n, p, f, b, k)
    hit = hit + (pred == test_y)
    pred = knn(train_x, train_y, test_x, k)
    target = target + (pred == test_y)
    X = np.roll(X, 1, axis=0)             # np.roll returns a new array, so assign it back
    Y = np.roll(Y, 1, axis=0)

print(hit / len(X))                       # accuracy of the ensemble KNN
print(target / len(X))                    # accuracy of standard KNN