Hands-On Machine Learning Lesson 31: Using McNemar's Test to Decide How Many Base Learners to Train

施威銘研究室
14 min read · Jun 13, 2022


Ensemble learning builds a powerful model by combining a large number of base learners. But how many base learners do we actually need to train?

Besides judging by prediction accuracy, today we introduce a paper that uses hypothesis testing to decide the number of base learners [1].

1. Basic Concepts

We want to know how many base learners are enough. To answer this question with hypothesis testing, we first have to state a null hypothesis and an alternative hypothesis. Ours are as follows.

Null hypothesis: the difference in predictive ability between model a, built from n base learners, and model b, built from m base learners, is not statistically significant.

Alternative hypothesis: the difference in predictive ability between model a, built from n base learners, and model b, built from m base learners, is statistically significant.

Next, we define what we mean by a difference in predictive ability.

Mab: the total number of samples that model a predicts incorrectly but model b predicts correctly.

Mba: the total number of samples that model a predicts correctly but model b predicts incorrectly.

With these definitions, and assuming the null hypothesis is true, we can use Mab and Mba to compute the following statistic (this is McNemar's test statistic with a continuity correction, which is where the test in the title gets its name):

χ² = (|Mab - Mba| - 1)² / (Mab + Mba)

This statistic follows a chi-squared distribution whose parameter, known as the degrees of freedom, is 1. If the significance level is set to 0.05, then whenever the statistic exceeds 3.84 we should consider rejecting the null hypothesis; in other words, model b and model a differ in predictive ability.
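As a quick check (not part of the original article's code), the 3.84 threshold is simply the 95th percentile of the chi-squared distribution with one degree of freedom, which SciPy can compute directly:

from scipy.stats import chi2

# 95th percentile of the chi-squared distribution with 1 degree of freedom
critical_value = chi2.ppf(0.95, df=1)
print(critical_value)   # about 3.841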

If we have rejected the null hypothesis and m > n, we usually expect a model with more base learners to perform better, so we conclude that model b outperforms model a and it is worth continuing to add base learners.

2. Python Implementation

We implement the procedure above with a bootstrap aggregation (bagging) example on the MNIST handwritten digit recognition problem. For a detailed explanation of bootstrap aggregation, see reference [2]. First we build a function that draws a bootstrap sample.

def bootstrap(train_x, train_y):
    # Draw n indices with replacement to form one bootstrap sample
    n = len(train_x)
    sub_n = np.random.choice(np.arange(n), size=n)
    sub_train_x = train_x[sub_n]
    sub_train_y = train_y[sub_n]
    return sub_train_x, sub_train_y
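A minimal usage sketch with a made-up toy array (not part of the original code, and assuming numpy is imported as np as in the complete program), just to show that the bootstrap sample keeps the original size while drawing rows with replacement:

toy_x = np.arange(10).reshape(5, 2)   # 5 samples with 2 features (hypothetical data)
toy_y = np.array([0, 1, 0, 1, 1])
bx, by = bootstrap(toy_x, toy_y)
print(bx.shape, by.shape)   # (5, 2) (5,): same size as the input, some rows repeated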

Next we build a function that trains one base learner. The base learner here is a very small neural network with only one hidden layer, and that hidden layer has only 20 neurons.

def build_learner(train_x, train_y):
    # A tiny neural network: one hidden layer with 20 neurons, softmax output for 10 digits
    model = Sequential()
    model.add(Dense(20,
                    input_dim=train_x.shape[1],
                    activation='sigmoid'))
    model.add(Dense(10,
                    activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    model.fit(train_x,
              train_y,
              epochs=5,
              batch_size=100)
    return model

The following function uses all the base learners to make predictions and combines them with hard voting to obtain the final answer. For a detailed explanation of hard voting, see reference [2].

def ensemble_predict(model, valid_x):
    # Collect each base learner's predicted class for every sample
    pred = []
    for learner in model:
        prob = learner.predict(valid_x)
        pred.append(np.argmax(prob, axis=1))
    pred = np.array(pred)
    # Hard voting: take the most common class across learners for each sample
    return stats.mode(pred)[0][0]
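As a small illustration with made-up class labels (not part of the original code), stats.mode picks the most common prediction per sample across learners, which is exactly hard voting:

import numpy as np
from scipy import stats

# Each row holds one learner's predicted classes for three samples
toy_pred = np.array([[3, 1, 7],
                     [3, 2, 7],
                     [5, 2, 7]])
print(stats.mode(toy_pred)[0][0])   # [3 2 7]; newer SciPy versions may need keepdims=True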

We use a single function to carry out the hypothesis test: comparing each ensemble's predictions against the true labels gives Mab and Mba, and substituting them into the test formula gives the statistic.

def test(p_a, p_b, y_true):
    # Mab: model a wrong but model b right; Mba: model a right but model b wrong
    Mab = sum((p_a != y_true) & (p_b == y_true))
    Mba = sum((p_b != y_true) & (p_a == y_true))
    # McNemar statistic with continuity correction
    statistics = (np.abs(Mab - Mba) - 1)**2 / (Mab + Mba)
    print("Mab :", Mab, "Mba :", Mba, "statistics :", statistics)
    return statistics
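For example, plugging the first pair of counts reported in the run shown below (Mab = 542, Mba = 611) into the formula reproduces the printed statistic:

(|542 - 611| - 1)² / (542 + 611) = 68² / 1153 ≈ 4.01 > 3.84

so that first comparison rejects the null hypothesis and the loop keeps adding base learners.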

The main program is shown below. Every time we train one more base learner, we run a hypothesis test to check whether the difference in predictive ability between model_b and model_a (model_b always has one more base learner than model_a) is statistically significant.

model_a = []
model_b = []

# Train the first base learner; both ensembles start from the same learner
sub_train_x, sub_train_y = bootstrap(train_x, train_y)
learner = build_learner(sub_train_x, sub_train_y)
model_a.append(learner)
model_b.append(learner)

for _ in range(100):
    # model_b always holds one more base learner than model_a
    sub_train_x, sub_train_y = bootstrap(train_x, train_y)
    learner = build_learner(sub_train_x, sub_train_y)
    model_b.append(learner)

    pred_a = ensemble_predict(model_a, valid_x)
    pred_b = ensemble_predict(model_b, valid_x)

    statistics = test(pred_a, pred_b, valid_y)
    print("ensemble accuracy: ", accuracy_score(valid_y, pred_a))

    if statistics >= 3.841459:
        # Significant difference: keep the new learner and continue
        model_a.append(learner)
    else:
        # No significant improvement: stop adding base learners
        break

print("recommend ensemble size :", len(model_a))

From the output we can see that an ensemble of 6 base learners is enough!

Epoch 1/5
210/210 [==============================] - 1s 2ms/step - loss: 1.6233 - accuracy: 0.5400
Epoch 2/5
210/210 [==============================] - 0s 2ms/step - loss: 1.0108 - accuracy: 0.7587
Epoch 3/5
210/210 [==============================] - 0s 2ms/step - loss: 0.7503 - accuracy: 0.8265
Epoch 4/5
210/210 [==============================] - 0s 2ms/step - loss: 0.6096 - accuracy: 0.8515
Epoch 5/5
210/210 [==============================] - 0s 2ms/step - loss: 0.5426 - accuracy: 0.8611
Epoch 1/5
210/210 [==============================] - 1s 2ms/step - loss: 1.6554 - accuracy: 0.5478
Epoch 2/5
210/210 [==============================] - 0s 2ms/step - loss: 0.9684 - accuracy: 0.7941
Epoch 3/5
210/210 [==============================] - 0s 2ms/step - loss: 0.7043 - accuracy: 0.8453
Epoch 4/5
210/210 [==============================] - 0s 2ms/step - loss: 0.5899 - accuracy: 0.8561
Epoch 5/5
210/210 [==============================] - 0s 2ms/step - loss: 0.5147 - accuracy: 0.8701
Mab : 542 Mba : 611 statistics : 4.01040763226366
ensemble accuracy: 0.8676190476190476
(... intermediate output omitted ...)
Epoch 1/5
210/210 [==============================] - 1s 2ms/step - loss: 1.6179 - accuracy: 0.5456
Epoch 2/5
210/210 [==============================] - 0s 2ms/step - loss: 0.9508 - accuracy: 0.8027
Epoch 3/5
210/210 [==============================] - 0s 2ms/step - loss: 0.7050 - accuracy: 0.8398
Epoch 4/5
210/210 [==============================] - 0s 2ms/step - loss: 0.5834 - accuracy: 0.8596
Epoch 5/5
210/210 [==============================] - 0s 2ms/step - loss: 0.5256 - accuracy: 0.8663
Mab : 174 Mba : 158 statistics : 0.677710843373494
ensemble accuracy: 0.8958571428571429
recommend ensemble size : 6

References

[1] Latinne, P., Debeir, O., & Decaestecker, C. (2001). Limiting the Number of Trees in Random Forests. Lecture Notes in Computer Science, 2096, 178–187. doi:10.1007/3-540-48219-9_18

[2] 張康寶 (trans.) (2022). 集成式學習 — Python實踐!整合全部技術,打造最強模型 (original authors: George Kyriakides & Konstantinos G. Margaritis). Taipei: 旗標科技. (Original work published 2019)

About the Author

Chia-Hao Li received the M.S. degree in computer science from Durham University, United Kingdom. His interests include computer algorithms, machine learning, and hardware/software co-design. He was formerly a senior engineer at MediaTek, Taiwan. His current research topic is the application of machine learning techniques to fault detection in high-performance computing systems.

Complete Program

import numpy as np
import pandas as pd
from scipy import stats
from sklearn.metrics import accuracy_score
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

train = pd.read_csv('train.csv')
X = train.drop(['label'], axis=1).to_numpy()
Y = train['label'].to_numpy()
train_x = X[:21000]
train_y = Y[:21000]
valid_x = X[21000:]
valid_y = Y[21000:]
train_y = to_categorical(train_y, 10)

def bootstrap(train_x, train_y):
    n = len(train_x)
    sub_n = np.random.choice(np.arange(n), size=n)
    sub_train_x = train_x[sub_n]
    sub_train_y = train_y[sub_n]
    return sub_train_x, sub_train_y

def build_learner(train_x, train_y):
    model = Sequential()
    model.add(Dense(20,
                    input_dim=train_x.shape[1],
                    activation='sigmoid'))
    model.add(Dense(10,
                    activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    model.fit(train_x,
              train_y,
              epochs=5,
              batch_size=100)
    return model

def ensemble_predict(model, valid_x):
    pred = []
    for learner in model:
        prob = learner.predict(valid_x)
        pred.append(np.argmax(prob, axis=1))
    pred = np.array(pred)
    return stats.mode(pred)[0][0]

def test(p_a, p_b, y_true):
    Mab = sum((p_a != y_true) & (p_b == y_true))
    Mba = sum((p_b != y_true) & (p_a == y_true))
    statistics = (np.abs(Mab - Mba) - 1)**2 / (Mab + Mba)
    print("Mab :", Mab, "Mba :", Mba, "statistics :", statistics)
    return statistics

model_a = []
model_b = []
sub_train_x, sub_train_y = bootstrap(train_x, train_y)
learner = build_learner(sub_train_x, sub_train_y)
model_a.append(learner)
model_b.append(learner)

for _ in range(100):
    sub_train_x, sub_train_y = bootstrap(train_x, train_y)
    learner = build_learner(sub_train_x, sub_train_y)
    model_b.append(learner)

    pred_a = ensemble_predict(model_a, valid_x)
    pred_b = ensemble_predict(model_b, valid_x)

    statistics = test(pred_a, pred_b, valid_y)
    print("ensemble accuracy: ", accuracy_score(valid_y, pred_a))

    if statistics >= 3.841459:
        model_a.append(learner)
    else:
        break

print("recommend ensemble size :", len(model_a))
