Hands-On Machine Learning, Lesson 4: Selecting Important Features with Permutation Importance (Part 1)

10 min read · Jul 2, 2021

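When building models, many people focus on accuracy alone. Higher accuracy is better, of course, but is it really acceptable to build an overly complex model, feed it a pile of arbitrary features, and end up not understanding what your own model is doing?

Consider a model that predicts whether someone is infected. A person takes the test and the model reports "this person has a 91% probability of being infected." The person then asks, "How did it decide that I am probably infected?" If you know nothing about how the model works, how do you answer? This is why models sometimes cannot be built carelessly.

Over the next two weeks, we will introduce a simple and practical method, Permutation Importance, which gives a rough picture of which features a model relies on when it makes decisions.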

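1. The basic idea of Permutation Importance

The method is simple to describe. First, train a model with the standard workflow. Next, pick one feature (say, feature A) and randomly swap its values across all samples, in other words shuffle that column. If the model relies heavily on feature A, then predicting with the same model on the shuffled data should give noticeably lower accuracy. Conversely, if the accuracy barely changes, the feature is probably not very useful, and you can consider dropping it to make training and prediction more efficient.

However, feature A may actually be important and yet a single shuffle may leave the accuracy almost unchanged, because a random shuffle can happen to produce a dataset that is not very different from the original. To guard against this, repeat the "shuffle the feature, predict again" step several times, average the resulting scores, and take the difference between that average and the score of the model on the unshuffled data. That difference is the feature's Permutation Importance. (With an error metric such as RMSE, this is the mean error after shuffling minus the original error, so a larger value means a more important feature.)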
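
The repeated-shuffle idea can be sketched as follows. This is a minimal illustration, not the implementation used later in this post: `shuffle_importance`, `toy_predict`, the data sizes, and `n_repeats=10` are all assumptions made for the example, and the toy "model" is just a fixed formula so that the sketch runs without training anything.

```python
import numpy as np


def shuffle_importance(predict, x, y, n_repeats=10, seed=0):
    """Mean RMSE increase when each column is shuffled n_repeats times."""
    rng = np.random.default_rng(seed)

    def rmse(features):
        return float(np.sqrt(np.mean((y - predict(features)) ** 2)))

    baseline = rmse(x)
    importances = []
    for col in range(x.shape[1]):
        scores = []
        for _ in range(n_repeats):
            shuffled = x.copy()
            # Shuffle only this column; every other column stays aligned with y.
            shuffled[:, col] = rng.permutation(shuffled[:, col])
            scores.append(rmse(shuffled))
        importances.append(float(np.mean(scores)) - baseline)
    return importances


# Toy "model" (hypothetical): uses columns 0 and 1, ignores column 2.
def toy_predict(features):
    return 2.0 * features[:, 0] + features[:, 1]


rng = np.random.default_rng(42)
x_demo = rng.normal(size=(200, 3))
y_demo = toy_predict(x_demo)  # the toy model is exact, so the baseline RMSE is 0
demo_importances = shuffle_importance(toy_predict, x_demo, y_demo)
print(demo_importances)
```

Because the toy model never reads column 2, shuffling that column changes no prediction and its importance is zero, while column 0 (which has the larger coefficient) should come out ahead of column 1.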

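What does this look like when taken to the extreme? Suppose there are 10 samples. Replace feature A of sample 1 with feature A of sample 2 and predict again, then with feature A of sample 3, and so on up to sample 10. Feature A of sample 1 is therefore replaced 9 times, with a prediction each time. Doing the same for every sample yields 9 * 10 scores. Their average, compared with the score on the unmodified data, gives the feature's Permutation Importance.

With N samples and M features, this takes (N-1) * N * M predictions in total to obtain the Permutation Importance of every feature. That is computationally expensive, so in practice it is often enough to shuffle the data randomly a few times.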
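
To get a feel for the cost, the count above can be written out directly. The helper names and the example sizes (1000 samples, 5 features, 10 shuffles) are hypothetical and chosen only for illustration.

```python
def exhaustive_predictions(n_samples, n_features):
    """Single-row predictions needed by the exhaustive swap scheme."""
    return (n_samples - 1) * n_samples * n_features


def shuffled_predictions(n_features, n_repeats):
    """Whole-dataset predict() calls when each column is shuffled n_repeats times."""
    return n_features * n_repeats


print(exhaustive_predictions(10, 1))    # 90, the 9 * 10 case from the text
print(exhaustive_predictions(1000, 5))  # 4995000
print(shuffled_predictions(5, 10))      # 50
```

The two counts are not directly comparable, since one counts single-row predictions and the other counts batch predictions, but the quadratic growth in N is what makes the exhaustive scheme expensive.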

2. Generating synthetic data

Now let's look at a Python implementation of Permutation Importance. First, we need a dataset.

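import numpy as np

def func_get_data(N_data, N_feature):
    # Every feature is drawn independently from a standard normal distribution.
    x = np.random.normal(loc = 0,
                         scale = 1.0,
                         size = (N_data, N_feature))
    # The target depends only on feature 0 and feature 3.
    y = 2 * x.T[0] + np.exp(x.T[3])
    return x, y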

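In the code that follows, N_feature is set to 5. Since y depends only on features 0 and 3 (counting from zero), features 1, 2, and 4 are pure noise. We will use Permutation Importance to confirm this.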
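
Before training anything, a quick sanity check on data of this shape is to look at how each feature correlates with the target. The sketch below regenerates data with the same formula, but with a fixed seed and 5000 samples, both of which are assumptions made for this example. Pearson correlation only captures linear association, so treat it as a rough check rather than a measure of importance.

```python
import numpy as np

rng = np.random.default_rng(0)
x_check = rng.normal(loc=0.0, scale=1.0, size=(5000, 5))
y_check = 2 * x_check[:, 0] + np.exp(x_check[:, 3])  # same formula as func_get_data

# Pearson correlation between each feature column and the target.
correlations = [float(np.corrcoef(x_check[:, i], y_check)[0, 1]) for i in range(5)]
for i, corr in enumerate(correlations):
    print(f"feature {i}: correlation with y = {corr:+.3f}")
```

Features 0 and 3 should show clearly nonzero correlations, while features 1, 2, and 4 should sit close to zero.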

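3. Writing the Permutation Importance function

As described in the previous section, we take one sample and replace one of its features with the value from each of the other samples, then repeat this for every sample and every feature.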

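import copy
from sklearn.metrics import mean_squared_error

def func_permutation(model, x, y, err):
    # err is the model's RMSE on the unmodified (x, y).
    for index_feature in range(len(x[0])):
        list_score = []
        for index_sample in range(len(x)):
            # After each roll, x[0] is the sample currently being scored.
            dat = copy.deepcopy(x[0])
            ans = y[0]
            for index_permute in range(1, len(x)):
                # Borrow this feature's value from another sample.
                dat[index_feature] = x[index_permute][index_feature]
                pred = model.predict(dat.reshape(1, -1))
                # np.sqrt(MSE) gives the RMSE on any scikit-learn version.
                score = np.sqrt(mean_squared_error([ans], pred))
                list_score.append(score)
            
            # Move the next sample into position 0.
            x = np.roll(x, 1, axis = 0)
            y = np.roll(y, 1, axis = 0)
        
        print("Permute feature:",
              index_feature,
              ", get importance:",
              np.mean(list_score) - err)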

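The implementation uses a roll trick: the dataset is rolled by one position after each pass, so we always take the first sample and swap in values from the other samples.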

        for index_sample in range(len(x)):
            dat = copy.deepcopy(x[0])
            ans = y[0]
            for index_permute in range(1, len(x)):
                
                #...other code...
            
            x = np.roll(x, 1, axis = 0)
            y = np.roll(y, 1, axis = 0)

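The swap itself is simple: take the feature value from each of samples 2 through N and substitute it into sample 1. Then predict and record the score. Here we use the root-mean-square error (RMSE).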

            for index_permute in range(1, len(x)):
                dat[index_feature] = x[index_permute][index_feature]
                pred = model.predict(dat.reshape(1, -1))
                # np.sqrt(MSE) gives the RMSE on any scikit-learn version.
                score = np.sqrt(mean_squared_error([ans], pred))
                list_score.append(score)
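
Calling predict once per swapped row is slow. One possible speed-up, sketched below under stated assumptions, is to roll only the feature column by k positions and score the whole dataset in one call, for k = 1 to N-1. This visits exactly the same (sample, donor) pairs. The function names and `toy_predict` are hypothetical stand-ins, not part of the original code; with a real model you would pass `model.predict` instead.

```python
import numpy as np


def exhaustive_scores_loop(predict, x, y, col):
    """Per-sample absolute errors for every (sample, donor) swap, one row at a time."""
    scores = []
    for i in range(len(x)):
        for j in range(len(x)):
            if i == j:
                continue
            row = x[i].copy()
            row[col] = x[j, col]  # borrow the feature value from sample j
            scores.append(abs(y[i] - predict(row.reshape(1, -1))[0]))
    return scores


def exhaustive_scores_batched(predict, x, y, col):
    """Same swaps, but each shift k is scored with one whole-dataset predict call."""
    scores = []
    for k in range(1, len(x)):
        swapped = x.copy()
        # After rolling by k, row i holds the value that belonged to row i - k.
        swapped[:, col] = np.roll(x[:, col], k)
        scores.extend(np.abs(y - predict(swapped)))
    return scores


# Hypothetical stand-in for model.predict, so the sketch runs without training.
def toy_predict(features):
    return 2.0 * features[:, 0] + np.exp(features[:, 1])


rng = np.random.default_rng(1)
x_small = rng.normal(size=(8, 2))
y_small = toy_predict(x_small) + rng.normal(scale=0.1, size=8)

loop_scores = exhaustive_scores_loop(toy_predict, x_small, y_small, 0)
batched_scores = exhaustive_scores_batched(toy_predict, x_small, y_small, 0)
loop_mean = float(np.mean(loop_scores))
batched_mean = float(np.mean(batched_scores))
print(loop_mean, batched_mean)
```

Both versions produce N * (N-1) scores with the same mean. The batched one needs only N-1 predict calls per feature instead of N * (N-1).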

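4. Building the model

In this example we use a Random Forest model. For an explanation of the model, see the book 「資料科學的建模基礎 - 別急著coding！你知道模型的陷阱嗎？」 published by Flag (旗標).

To avoid overfitting, we set max_depth to 2. How overfitting affects Permutation Importance will be covered in detail next week.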

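from sklearn.ensemble import RandomForestRegressor

# Assumption: N_data is not shown in the post; kept small because cost grows as N^2.
N_data = 100
N_feature = 5    # as stated above
x, y = func_get_data(N_data, N_feature)
model = RandomForestRegressor(max_depth = 2).fit(x, y)
pred = model.predict(x)
# np.sqrt(MSE) gives the RMSE on any scikit-learn version.
err = np.sqrt(mean_squared_error(y, pred))
print("Model Training Error:", err)
func_permutation(model, x, y, err)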

Of course, we also need to test the model's performance, and check whether the Permutation Importance on the test set is similar to that on the training set.

x_test, y_test = func_get_data(N_data, N_feature)
pred = model.predict(x_test)
err = np.sqrt(mean_squared_error(y_test, pred))  # RMSE
print("Model Testing Error:", err)
func_permutation(model, x_test, y_test, err)
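
As a cross-check, scikit-learn ships its own `permutation_importance` in `sklearn.inspection`. It uses the repeated random-shuffle variant rather than the exhaustive swap, so its numbers will not match the output of `func_permutation`. The sketch below is self-contained; the sample count, seeds, and `n_repeats=10` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(500, 5))
y_demo = 2 * x_demo[:, 0] + np.exp(x_demo[:, 3])  # same target as func_get_data

forest = RandomForestRegressor(max_depth=2, random_state=0).fit(x_demo, y_demo)

# Each column is shuffled n_repeats times; the score here is negative RMSE,
# so importances_mean is the average increase in RMSE after shuffling.
result = permutation_importance(
    forest,
    x_demo,
    y_demo,
    scoring="neg_root_mean_squared_error",
    n_repeats=10,
    random_state=0,
)
for i, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature {i}: {mean:.3f} +/- {std:.3f}")
```

The standard deviation across repeats gives a rough sense of how stable each importance estimate is, which the single averaged number from the exhaustive scheme does not show.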

5. Judging feature importance and retraining the model

Running the program gives the following output.

Model Training Error: 1.0616756796895293
Permute feature: 0 , get importance: 1.0274945595035518
Permute feature: 1 , get importance: -0.2743515824983511
Permute feature: 2 , get importance: -0.2847040473884165
Permute feature: 3 , get importance: 1.358404906392464
Permute feature: 4 , get importance: -0.26483976772753015
Model Testing Error: 1.088438212667154
Permute feature: 0 , get importance: 0.7655624162316859
Permute feature: 1 , get importance: -0.2589471325596818
Permute feature: 2 , get importance: -0.23019780518265698
Permute feature: 3 , get importance: 0.3467580641188943
Permute feature: 4 , get importance: -0.2293535834880599

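First, compare the errors on the training and test data: the two RMSE values are close, so there is no sign of overfitting. Next, the Permutation Importance results on training and test data agree on which features matter: both pick out features 0 and 3 as important, although the magnitudes differ (feature 3 scores 1.36 on the training data but 0.35 on the test data). Features 1, 2, and 4 are essentially noise, so they can be dropped.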
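
A likely reason the noise features score around -0.25 rather than 0: `func_permutation` scores one sample at a time, and the RMSE of a single sample is just its absolute error. Averaging those gives a mean absolute error (MAE), which is never larger than the dataset-level RMSE that is subtracted from it. The sketch below illustrates the gap with hypothetical Gaussian errors. They are not the model's actual errors.

```python
import math
import random

random.seed(0)
# Hypothetical prediction errors, standing in for (y - model.predict(x)).
errors = [random.gauss(0.0, 1.0) for _ in range(1000)]

rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
mae = sum(abs(e) for e in errors) / len(errors)

# For an ignored feature, swapping changes no prediction, so the mean of the
# per-sample scores is the MAE, and the reported importance is MAE - RMSE.
print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}, MAE - RMSE = {mae - rmse:.3f}")
```

Under this reading, a value near the negative floor means "no effect", and features 0 and 3 stand out because they rise well above it.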

We drop the noise features, retrain the model, and observe how the accuracy changes.

x = np.delete(x, [1, 2, 4], axis = 1)
x_test = np.delete(x_test, [1, 2, 4], axis = 1)
model = RandomForestRegressor(max_depth = 2).fit(x, y)
pred = model.predict(x)
err = np.sqrt(mean_squared_error(y, pred))  # RMSE
print("Re-fit Model Training Error:", err)
pred = model.predict(x_test)
err = np.sqrt(mean_squared_error(y_test, pred))  # RMSE
print("Re-fit Model Testing Error:", err)

The output is as follows.

Re-fit Model Training Error: 0.9990739249989735
Re-fit Model Testing Error: 1.0871169906832585

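The error barely changes (test RMSE 1.0871 versus 1.0884 before), which shows that Permutation Importance can help us judge which features matter and perform feature selection.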

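Key takeaways

1. Model interpretability is a very important issue for some applications: we may need to answer questions such as "why did the model make this decision?"

2. Permutation Importance helps us judge which features are useful and which are probably ineffective.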

About the author

Chia-Hao Li received the M.S. degree in computer science from Durham University, United Kingdom. He works on computer algorithms, machine learning, and hardware/software codesign. He was formerly a senior engineer at MediaTek, Taiwan. His current research topic is the application of machine learning techniques to fault detection in high-performance computing systems.