Hands-On Machine Learning, Lesson 32: Snapshot Ensemble with Learning Rate Scheduling and Adaptive Boosting

施威銘研究室
13 min read · Jun 27, 2022


Ensemble learning applied to neural networks is often criticized for long training times and heavy computational cost. Snapshot Ensemble was proposed to address exactly this problem [1]. The main idea is to take a base learner that has already been trained, raise the learning rate so that the learner jumps to a different region of the loss surface, and then gradually lower the learning rate again until it converges into another base learner. This way we do not need to train every base learner from randomly initialized parameters, which effectively reduces the training time.
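To make the "raise the learning rate, then let it decay again" idea more concrete, here is a minimal sketch (not taken from the code in this article) of a cosine-annealed schedule with warm restarts, the kind of schedule commonly used for snapshot ensembles; the function name snapshot_lr and the cycle settings are illustrative assumptions only.

import math

# Cosine-annealed learning rate that restarts at lr_max at the start of
# every cycle; one snapshot would be saved at the end of each cycle.
# (Illustrative sketch; lr_max and epochs_per_cycle are assumed values.)
def snapshot_lr(epoch, lr_max=0.01, epochs_per_cycle=20):
    t = epoch % epochs_per_cycle          # position inside the current cycle
    return lr_max / 2 * (math.cos(math.pi * t / epochs_per_cycle) + 1)

# With tf.keras this schedule could be attached through a callback, e.g.:
# from tensorflow.keras.callbacks import LearningRateScheduler
# model.fit(..., callbacks=[LearningRateScheduler(snapshot_lr)])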

Beyond that, the Snapshot method also integrates the adaptive boosting (AdaBoost) algorithm: the weight of every sample that is predicted incorrectly is increased, so that when we sample data and train the next base learner, it can focus more on those misclassified samples.
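As a toy illustration of this weight update (a hedged sketch only; the three-sample setup and the value of beta below are made up, and the version used in this article appears in section 4):

import numpy as np

w = np.ones(3) / 3                   # three samples, uniform weights to start with
error = np.array([0, 1, 0])          # only the second sample is misclassified
beta = 0.5                           # confidence coefficient of the current base learner

w = w * np.exp(beta * error)         # misclassified samples get exp(beta) times more weight
w = w / np.sum(w)                    # renormalize so the weights sum to 1
print(w)                             # roughly [0.27, 0.45, 0.27]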

The code below follows the examples in chapters 6, 7, and 8 of [2] and implements part of the Snapshot method, namely the dynamic learning-rate adjustment. For the adaptive boosting part, we only show how the sample weights are computed; for the complete implementation, including the actual data resampling, please refer to [3].

1. Dataset and Base Learner Architecture

The example dataset is CIFAR-10, and the base learner is a convolutional neural network. The functions for loading the data and building the network are shown below.

import numpy as np
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras import optimizers
from tensorflow.keras.callbacks import Callback
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.callbacks import EarlyStopping

def prepare_data():
    # load CIFAR-10, scale pixels to [0, 1], and one-hot encode the labels
    (x_train, y_train), (x_test, y_test) = cifar10.load_data()
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train = x_train / 255.0
    x_test = x_test / 255.0
    y_train = to_categorical(y_train)
    y_test = to_categorical(y_test)
    return x_train, x_test, y_train, y_test

def make_convlayer(lr):
    # three Conv2D + MaxPooling2D blocks followed by a dense classifier
    model = Sequential()
    model.add(Conv2D(filters=64,
                     kernel_size=3,
                     padding='same',
                     activation='relu',
                     input_shape=(32, 32, 3)))
    model.add(MaxPooling2D(pool_size=2))
    model.add(Conv2D(filters=128,
                     kernel_size=3,
                     padding='same',
                     activation='relu'))
    model.add(MaxPooling2D(pool_size=2))
    model.add(Conv2D(filters=256,
                     kernel_size=3,
                     padding='same',
                     activation='relu'))
    model.add(MaxPooling2D(pool_size=2))
    model.add(Flatten())
    model.add(Dropout(0.4))
    model.add(Dense(512, activation='relu'))
    model.add(Dense(10, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizers.Adam(learning_rate=lr),
                  metrics=['accuracy'])
    return model

2. Callback Functions

We need to dynamically lower the learning rate so that the base learner can converge to better parameters. In addition, the base learner that actually joins the ensemble uses the parameters that performed best on the validation data during training, so we need a callback function to track which set of parameters is the best. To keep the training time short, training of a base learner is stopped if the validation performance does not improve for 10 consecutive epochs.

class Checkpoint(Callback):
    # save the model weights whenever the validation accuracy improves
    def __init__(self, model, filepath):
        super().__init__()
        self.model = model
        self.filepath = filepath
        self.best_val_acc = 0.0

    def on_epoch_end(self, epoch, logs=None):
        if self.best_val_acc < logs['val_accuracy']:
            self.model.save_weights(self.filepath)
            self.best_val_acc = logs['val_accuracy']
            print('Weights saved.', self.best_val_acc)

# halve the learning rate when the validation accuracy stops improving
reduce_lr = ReduceLROnPlateau(monitor='val_accuracy',
                              factor=0.5,
                              patience=5,
                              verbose=1,
                              mode='max',
                              min_lr=0.0001)

# stop training if the validation loss has not improved for 10 epochs
earstop = EarlyStopping(monitor='val_loss',
                        min_delta=0,
                        patience=10)

3. Training the Base Learners

At the very beginning, training starts from randomly initialized parameters. For every new base learner afterwards, we start from the best parameters of the previous base learner and raise the learning rate. Data augmentation is used here, and each batch contains 128 samples.

def train(x_train, x_test, y_train, y_test, lr, initial):
    model = make_convlayer(lr)
    cpont = Checkpoint(model, 'weights.h5')

    # except for the very first run, start from the best weights
    # of the previous base learner
    if initial == 0:
        model.load_weights('weights.h5')

    datagen = ImageDataGenerator(width_shift_range=0.1,
                                 height_shift_range=0.1,
                                 rotation_range=10,
                                 zoom_range=0.1,
                                 horizontal_flip=True)
    batch_size = 128
    model.fit(datagen.flow(x_train,
                           y_train,
                           batch_size=batch_size),
              steps_per_epoch=x_train.shape[0] // batch_size,
              epochs=100,
              verbose=1,
              validation_data=(x_test, y_test),
              callbacks=[reduce_lr, cpont, earstop])
    # reload the best validation weights so that the returned model
    # is the one that joins the ensemble
    model.load_weights('weights.h5')
    return model

4. Ensemble Training

As mentioned at the beginning, this example does not demonstrate the full adaptive boosting procedure, so the same training set is used throughout. Inside the loop you can see that after each base learner is trained, we store its predictions and use them to compute a weight for every training sample.

x_train, x_test, y_train, y_test = prepare_data()

n_ensemble = 2
n_train = len(x_train)
n_test = len(x_test)
lr_initial = 0.001
lr_ensemble = 0.01

# start with uniform sample weights
w = np.ones(n_train) / n_train
pred_train = []
pred_test = []

# first base learner: random initialization, small learning rate
model = train(x_train, x_test, y_train, y_test, lr_initial, 1)

for _ in range(n_ensemble):
    # later base learners: restart from the previous best weights
    # with a larger learning rate
    model = train(x_train, x_test, y_train, y_test, lr_ensemble, 0)

    predict = model.predict(x_test)
    error = (np.argmax(predict, axis=1) !=
             np.argmax(y_test, axis=1)).astype(int)
    error_rate = np.sum(error) / n_test
    beta = (0.5 * np.log((1 - error_rate) / error_rate)
            + 0.1 * np.log(9))

    pred_test.append(predict)

    predict = model.predict(x_train)
    error = (np.argmax(predict, axis=1) !=
             np.argmax(y_train, axis=1)).astype(int)

    # raise the weight of the misclassified samples, then renormalize
    w = w * np.exp(beta * error)
    w = w / np.sum(w)

    pred_train.append(predict)
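The loop above only collects each snapshot's test-set predictions in pred_test. As a minimal sketch of how they could be combined, simple unweighted averaging of the predicted class probabilities looks like this (the averaging rule is my assumption, not the exact combination rule used in [1] or [3]):

# average the class probabilities predicted by every snapshot,
# then measure the accuracy of the combined prediction
ensemble_prob = np.mean(np.array(pred_test), axis=0)
ensemble_acc = np.mean(np.argmax(ensemble_prob, axis=1) ==
                       np.argmax(y_test, axis=1))
print('Ensemble test accuracy:', ensemble_acc)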

In the training output we can see that whenever the validation accuracy improves, the corresponding parameters are saved. If the validation accuracy keeps failing to improve, the learning rate is reduced, and eventually training of the base learner is stopped altogether.

Epoch 1/100
390/390 [==============================] - 41s 84ms/step - loss: 1.5775 - accuracy: 0.4259 - val_loss: 1.2307 - val_accuracy: 0.5610
Weights saved. 0.5609999895095825
Epoch 2/100
390/390 [==============================] - 32s 82ms/step - loss: 1.2073 - accuracy: 0.5697 - val_loss: 0.9795 - val_accuracy: 0.6521
Weights saved. 0.6521000266075134
(... intermediate epochs omitted ...)
Epoch 39/100
390/390 [==============================] - 31s 80ms/step - loss: 0.3996 - accuracy: 0.8589 - val_loss: 0.5396 - val_accuracy: 0.8251

Epoch 00048: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.
Epoch 49/100
390/390 [==============================] - 32s 82ms/step - loss: 0.2835 - accuracy: 0.8993 - val_loss: 0.5099 - val_accuracy: 0.8430
Epoch 50/100
390/390 [==============================] - 32s 81ms/step - loss: 0.2801 - accuracy: 0.9011 - val_loss: 0.4742 - val_accuracy: 0.8501
Weights saved. 0.8500999808311462
Epoch 1/100
390/390 [==============================] - 34s 84ms/step - loss: 1.5477 - accuracy: 0.4780 - val_loss: 1.0796 - val_accuracy: 0.6167
Weights saved. 0.6166999936103821
Epoch 2/100
390/390 [==============================] - 32s 81ms/step - loss: 1.2413 - accuracy: 0.5632 - val_loss: 1.1001 - val_accuracy: 0.6233
Weights saved. 0.6233000159263611
(... remaining output omitted ...)

References

[1] Zhang, Wentao, Jiang, Jiawei, Shao, Yingxia, & Cui, Bin (2020). Snapshot boosting: a fast ensemble framework for deep neural networks. Science China Information Sciences, 63. doi:10.1007/s11432-018-9944-x

[2] 温政堯 (Trans.) (2021). 自學機器學習 — 上 Kaggle 接軌世界,成為資料科學家 (original author: チーム・カルポ). Taipei: 旗標科技. (Original work published 2020)

[3] 張康寶 (Trans.) (2022). 集成式學習 — Python實踐!整合全部技術,打造最強模型 (original authors: George Kyriakides and Konstantinos G. Margaritis). Taipei: 旗標科技. (Original work published 2019)

About the Author

Chia-Hao Li received the M.S. degree in computer science from Durham University, United Kingdom. His interests include computer algorithms, machine learning, and hardware/software co-design. He was formerly a senior engineer at MediaTek, Taiwan. His current research topic is the application of machine learning techniques to fault detection in high-performance computing systems.
