KNN is a common supervised learning algorithm. The idea is to find the few training samples closest to a test sample and tally their labels to produce the final prediction. In the left half of Figure 1, the correct answer for the red test sample is blue, and KNN does predict it correctly.
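For reference, plain KNN takes only a few lines of NumPy. The sketch below is just an illustration (the helper name knn_predict is not from any paper), and the same logic reappears in the complete program at the end of this post.

import numpy as np
from scipy import stats

def knn_predict(train_x, train_y, test_x, k):
    # squared Euclidean distance from the test sample to every training sample
    dist = np.sum((train_x - test_x) ** 2, axis=1)
    # indices of the k nearest training samples
    nearest = np.argsort(dist)[:k]
    # majority vote over the neighbours' labels
    return stats.mode(train_y[nearest])[0]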
But can this approach really handle every kind of data? Look at the right half of Figure 1: on that data, standard KNN actually gives the wrong answer.
So what can we do? This is where the Extended Neighbourhood rule and Bootstrap, from ensemble learning (Ensemble Learning), come in!
1. Extended Neighbourhood
The idea is straightforward. For a test sample A, we first find the training sample B closest to A, then find the training sample C closest to B, then the sample D closest to C, and so on, repeating K times. This chain of neighbours is the Extended Neighbourhood rule.
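In NumPy, that chain search might look like the sketch below (the helper name extended_neighbourhood is just for illustration; the ensemble version appears in the complete program later in this post). Each found neighbour is removed and becomes the next search target, and a majority vote over the collected labels gives the prediction.

import numpy as np

def extended_neighbourhood(train_x, train_y, test_x, k):
    labels = []
    target = test_x                      # start from the test sample A
    for _ in range(k):
        # squared distance from every remaining training sample to the current target
        dist = np.sum((train_x - target) ** 2, axis=1)
        nearest = np.argmin(dist)        # closest remaining sample (B, then C, ...)
        labels.append(train_y[nearest])  # record its label
        target = train_x[nearest]        # it becomes the next target
        train_x = np.delete(train_x, nearest, axis=0)
        train_y = np.delete(train_y, nearest, axis=0)
    return labels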
2. Bootstrap
Bootstrap is one of the most common techniques in ensemble learning, and its idea is just as simple. Suppose the training set has N samples. We draw N samples from it with replacement (Sample with Replacement, so the same sample may be drawn more than once) to form a new dataset, then run KNN on that dataset to obtain a prediction.
We repeat this step several times. Because each round uses a fresh resample, the datasets differ slightly, and so do the resulting predictions.
Finally, we combine the predictions from all bootstrap rounds by majority vote to obtain the ensemble prediction.
Besides resampling the data, we can also sample the features (Feature) without replacement (Sample without Replacement). That is, if the data originally has P features, we keep only a subset of them, which further increases the diversity of the resampled datasets. One such round is sketched below.
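The sketch below shows one bootstrap round (the helper name bootstrap_round is illustrative; it mirrors the bootstrap helper in the complete program at the end of this post). Running KNN on each resampled dataset and voting over the b rounds gives the bagged prediction.

import numpy as np

def bootstrap_round(train_x, train_y, test_x, n, p, f):
    # draw n row indices with replacement (the same sample may appear more than once)
    rows = np.random.choice(n, size=n, replace=True)
    # draw f of the p feature indices without replacement
    cols = np.random.choice(p, size=f, replace=False)
    return train_x[rows][:, cols], train_y[rows], test_x[cols]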
3. KNN Ensemble
A research team combined the Extended Neighbourhood rule and Bootstrap described above into an ensemble KNN algorithm!
A Python implementation of the core routine is shown below. The parameters n, p, f, b, and k are the total number of samples, the total number of features, the number of sampled features, the number of bootstrap rounds, and the number of neighbours consulted for each prediction. For every bootstrap round we build a resampled dataset (samples drawn with replacement plus features drawn without replacement); we then compute the distances from the data to the current target, find the minimum distance, record the prediction, update the target, and repeat this K times. Finally, majority voting gives the overall prediction.
def exnrule(train_x, train_y, test_x, n, p, f, b, k):
    pred = []
    for _ in range(b):
        # resample the data: rows with replacement, features without replacement
        sub_train_x, sub_train_y, sub_test_x = bootstrap(train_x, train_y,
                                                         test_x, n, p, f)
        sub_pred = []
        golden = sub_test_x                        # current target, starts at the test sample
        for _ in range(k):
            # squared distance from every remaining training sample to the current target
            dist = np.sum((sub_train_x - golden) * (sub_train_x - golden), axis=1)
            nearest = np.argmin(dist)              # closest remaining training sample
            sub_pred.append(sub_train_y[nearest])  # record its label
            golden = sub_train_x[nearest]          # it becomes the next target
            sub_train_x = np.delete(sub_train_x, nearest, axis=0)
            sub_train_y = np.delete(sub_train_y, nearest, axis=0)
        pred.append(stats.mode(sub_pred)[0])       # vote within this bootstrap round
    return stats.mode(pred)[0]                     # vote across all bootstrap rounds
Because the algorithm involves random sampling, the results vary from run to run, sometimes better and sometimes worse, which is the somewhat inconvenient part of this method.
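If you want reproducible results while experimenting, one option is to fix NumPy's global random seed before running the evaluation; this is a standard NumPy call, not something the algorithm itself requires.

import numpy as np

np.random.seed(0)  # fix the seed so the sampling, and therefore the results, are repeatable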
References
Ali, A., Hamraz, M., Gul, N., Khan, D., Khan, Z., & Aldahmani, S. (2022). A k nearest neighbours classifiers ensemble based on extended neighbourhood rule and features subsets. arXiv:2205.15111. doi:10.48550/arXiv.2205.15111
About the Author
Chia-Hao Li received the M.S. degree in computer science from Durham University, United Kingdom. His work covers computer algorithms, machine learning, and hardware/software co-design. He was formerly a senior engineer at Mediatek, Taiwan. His current research topic is the application of machine learning techniques to fault detection in high-performance computing systems.
Complete Program
import numpy as np
import pandas as pd
from scipy import stats
import warnings

warnings.filterwarnings("ignore")

# Red wine quality dataset: 11 physicochemical features, quality label in the last column
df = pd.read_csv("winequality-red.csv")
X = df.iloc[:, 0:11].to_numpy()
Y = df.iloc[:, 11].to_numpy()

n = len(X) - 1                            # training-set size (one sample is held out)
p = len(X[0])                             # total number of features
f = int(np.random.choice(range(1, p)))    # number of features kept in each bootstrap round
k = 10                                    # neighbours consulted per prediction
b = 10                                    # number of bootstrap rounds

def bootstrap(train_x, train_y, test_x, n, p, f):
    # sample n rows with replacement and f feature columns without replacement
    sub_n = np.random.choice(n, size=n)
    sub_p = np.random.choice(p, size=f, replace=False)
    sub_train_x = train_x[sub_n]
    sub_train_y = train_y[sub_n]
    return sub_train_x[:, sub_p], sub_train_y, test_x[sub_p]

def exnrule(train_x, train_y, test_x, n, p, f, b, k):
    pred = []
    for _ in range(b):
        sub_train_x, sub_train_y, sub_test_x = bootstrap(train_x, train_y,
                                                         test_x, n, p, f)
        sub_pred = []
        golden = sub_test_x                        # current target, starts at the test sample
        for _ in range(k):
            # squared distance from every remaining training sample to the current target
            dist = np.sum((sub_train_x - golden) * (sub_train_x - golden), axis=1)
            nearest = np.argmin(dist)              # closest remaining training sample
            sub_pred.append(sub_train_y[nearest])  # record its label
            golden = sub_train_x[nearest]          # it becomes the next target
            sub_train_x = np.delete(sub_train_x, nearest, axis=0)
            sub_train_y = np.delete(sub_train_y, nearest, axis=0)
        pred.append(stats.mode(sub_pred)[0])       # vote within this bootstrap round
    return stats.mode(pred)[0]                     # vote across all bootstrap rounds

def knn(train_x, train_y, test_x, k):
    # standard KNN: majority vote over the k nearest training samples
    dist = np.sum((train_x - test_x) * (train_x - test_x), axis=1)
    pred_index = np.argsort(dist)[:k]
    return stats.mode(train_y[pred_index])[0]

# Leave-one-out evaluation: hold out the last sample, then rotate the arrays by one
hit = 0
target = 0
for _ in range(len(X)):
    train_x, train_y = X[0:-1], Y[0:-1]
    test_x, test_y = X[-1], Y[-1]
    pred = exnrule(train_x, train_y, test_x, n, p, f, b, k)
    hit = hit + (pred == test_y)
    pred = knn(train_x, train_y, test_x, k)
    target = target + (pred == test_y)
    X = np.roll(X, 1, axis=0)             # np.roll returns a new array, so assign it back
    Y = np.roll(Y, 1, axis=0)

print(hit / len(X))                       # accuracy of the ensemble KNN
print(target / len(X))                    # accuracy of standard KNN