Hands-On Machine Learning Lesson 5: Selecting Important Features with Permutation Importance (Part 2)

Jul 9, 2021

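Last week we introduced Permutation Importance, a simple method for judging which features are most useful. But is it really a cure-all? Today, let's take a closer look.

1. Review

This post reuses some of the code from Part 1, so let's review it first. The following function generates the dataset: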

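import copy
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def func_get_data(N_data, N_feature):
    x = np.random.normal(loc = 0,
                         scale = 1.0,
                         size = (N_data, N_feature))
    # Only features 0 and 3 influence the label
    y = 2 * x.T[0] + np.exp(x.T[3])
    return x, y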

The following function computes the Permutation Importance:

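def func_permutation(model, x, y, err):
    # For each feature, replace its value in one sample with the value taken
    # from every other sample, and record the resulting prediction error.
    for index_feature in range(len(x[0])):
        list_score = []
        for index_sample in range(len(x)):
            dat = copy.deepcopy(x[0])
            ans = y[0]
            for index_permute in range(1, len(x)):
                dat[index_feature] = x[index_permute][index_feature]
                pred = model.predict(dat.reshape(1, -1))
                # RMSE without the `squared` argument
                # (deprecated in recent scikit-learn releases)
                score = np.sqrt(mean_squared_error([ans], pred))
                list_score.append(score)

            # Rotate the rows so that a different sample sits at index 0
            x = np.roll(x, 1, axis = 0)
            y = np.roll(y, 1, axis = 0)

        # Importance = mean error after permutation minus the baseline error
        print("Permute feature:",
              index_feature,
              ", get importance:",
              np.mean(list_score) - err)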

Next, we build a model and compute the Permutation Importance on the training data:

np.random.seed(0)
N_data = 50
N_feature = 5

# Training Permutation Importance
x, y = func_get_data(N_data, N_feature)
model = RandomForestRegressor(max_depth = 2).fit(x, y)
pred = model.predict(x)
err = np.sqrt(mean_squared_error(y, pred))  # RMSE
print("Model Training Error:", err)
func_permutation(model, x, y, err)

The output is as follows:

Model Training Error: 1.0616756796895293
Permute feature: 0 , get importance: 1.0274945595035518
Permute feature: 1 , get importance: -0.2743515824983511
Permute feature: 2 , get importance: -0.2847040473884165
Permute feature: 3 , get importance: 1.358404906392464
Permute feature: 4 , get importance: -0.26483976772753015

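2. Highly Correlated Features Cause Permutation Importance to Underestimate Feature Importance

Let's think this through. When we train a model, we are asking it to find the relationship between the features and the label. Permutation Importance works by breaking that relationship. If the model has found that a feature is strongly related to the label, then shuffling that feature makes the model's accuracy drop sharply.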
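The idea in the paragraph above can be sketched in a few lines. This is a simplified illustration, not the author's `func_permutation`: it shuffles a whole column at once instead of swapping values sample by sample, and the names `permutation_importance_sketch`, `true_function`, `n_repeats`, and `seed` are assumptions introduced here. To keep the example self-contained, the "model" is simply the true data-generating function from `func_get_data`.

```python
import numpy as np

def permutation_importance_sketch(predict, x, y, n_repeats=10, seed=0):
    """Return the mean RMSE increase when each column of x is shuffled."""
    rng = np.random.default_rng(seed)

    def rmse(features):
        return float(np.sqrt(np.mean((predict(features) - y) ** 2)))

    baseline = rmse(x)
    importances = np.zeros(x.shape[1])
    for j in range(x.shape[1]):
        increases = []
        for _ in range(n_repeats):
            shuffled = x.copy()
            # Shuffling one column breaks its link with the label
            shuffled[:, j] = rng.permutation(x[:, j])
            increases.append(rmse(shuffled) - baseline)
        importances[j] = np.mean(increases)
    return importances

# Hypothetical stand-in for a trained model: the true function itself.
def true_function(features):
    return 2 * features[:, 0] + np.exp(features[:, 3])

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 5))
y_demo = true_function(x_demo)

importances = permutation_importance_sketch(true_function, x_demo, y_demo)
print(importances)
```

Because this stand-in model ignores features 1, 2, and 4, shuffling them leaves the error unchanged and their importance is exactly zero, while features 0 and 3 show a clear error increase.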

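However, suppose two features are very similar. For example, say we use temperature in both Fahrenheit and Celsius to predict ice cream sales. Even if we shuffle one of them, the model can still rely on the other to make its predictions. Because the two features carry nearly the same information, shuffling either one has little effect on the predictions, so it looks unimportant even though both features are in fact equally important. In this situation Permutation Importance underestimates the feature's importance.

Let's verify this. First, in the program above, we make two of the features identical: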

x.T[4] = x.T[3]

Next, we rebuild the model and run Permutation Importance again.

model = RandomForestRegressor(max_depth = 2).fit(x, y)
pred = model.predict(x)
err = np.sqrt(mean_squared_error(y, pred))  # RMSE
print("Model Training Error:", err)
func_permutation(model, x, y, err)

The output is as follows:

Model Training Error: 1.0466908346231434
Permute feature: 0 , get importance: 1.0659735349063149
Permute feature: 1 , get importance: -0.2905082899964956
Permute feature: 2 , get importance: -0.28291262399017536
Permute feature: 3 , get importance: 0.33717581843907984
Permute feature: 4 , get importance: 0.5982680395610811

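Compared with the first run, the importance of feature 3 clearly drops (from 1.358 to 0.337) while the importance of feature 4 rises (from -0.265 to 0.598). As expected, highly similar features cause Permutation Importance to underestimate feature importance. Before using Permutation Importance, it is therefore a good idea to check the correlation between features, for example with the Pearson Correlation Coefficient or the Spearman Correlation Coefficient. For a detailed explanation, see the book 「資料科學的建模基礎 — 別急著coding！你知道模型的陷阱嗎？」 published by 旗標 (Flag Publishing).

3. Overfitting Causes Permutation Importance to Overestimate Feature Importance

Put simply, overfitting means the model picks up noise in the features and treats it as meaningful, instead of capturing the real relationship between the features and the label. For more on overfitting, see this article.

Put another way, overfitting can be viewed as "the model places too much weight on a feature that is not actually that strongly related to the label; the model has merely latched onto noise." So if we apply Permutation Importance to an overfitted model, we may overestimate feature importance.

Let's look at an example. First, we need a dataset that a model can easily overfit.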
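The correlation check suggested above could look like the following sketch. It uses only NumPy: Pearson via `np.corrcoef`, and Spearman computed as the Pearson correlation of the ranks, which assumes there are no tied values within a column. The helper name `highly_correlated_pairs` and the 0.9 threshold are assumptions chosen for illustration, not values from the article.

```python
import numpy as np

def highly_correlated_pairs(x, threshold=0.9):
    """List feature pairs whose |Pearson| or |Spearman| reaches the threshold."""
    pearson = np.corrcoef(x, rowvar=False)
    # Rank each column (valid when a column has no tied values),
    # then take the Pearson correlation of the ranks to get Spearman.
    ranks = x.argsort(axis=0).argsort(axis=0)
    spearman = np.corrcoef(ranks, rowvar=False)

    pairs = []
    n_feature = x.shape[1]
    for i in range(n_feature):
        for j in range(i + 1, n_feature):
            if max(abs(pearson[i, j]), abs(spearman[i, j])) >= threshold:
                pairs.append((i, j, float(pearson[i, j]), float(spearman[i, j])))
    return pairs

# Demo data shaped like the article's: five features, feature 4 copied from feature 3.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(50, 5))
x_demo[:, 4] = x_demo[:, 3]

flagged = highly_correlated_pairs(x_demo)
for i, j, p, s in flagged:
    print(f"features {i} and {j}: Pearson={p:.3f}, Spearman={s:.3f}")
```

On this demo data only the duplicated pair (3, 4) is flagged. A flagged pair is a hint that the Permutation Importance of either feature may be understated.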

def func_overfitting_data(N_data):
    e = np.random.normal(loc = 0, scale = 1.0, size = N_data)
    x = np.linspace(start = -2, stop = 2, num = N_data)
    y = x + x**3 + 2 * e
    x = np.concatenate(([x],
                        [x**2],
                        [x**3],
                        [x**4],
                        [x**5],
                        [x**6],
                        [x**7]), axis = 0)
    return x.T, y

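As the code shows, the label depends on the feature only up to the third power (y = x + x**3 plus noise). Here we give the model many higher-power features as well, which makes it more likely to mistake noise for a relationship with the label.

Next, we build the model and make predictions, first on the training data and then on a freshly generated test set.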

x, y = func_overfitting_data(N_data)
model = RandomForestRegressor().fit(x, y)
pred = model.predict(x)
err = np.sqrt(mean_squared_error(y, pred))  # RMSE
print("Model Training Error:", err)
func_permutation(model, x, y, err)

# Generate a fresh dataset to use as test data
x, y = func_overfitting_data(N_data)
pred = model.predict(x)
err = np.sqrt(mean_squared_error(y, pred))  # RMSE
print("Model Testing Error:", err)
func_permutation(model, x, y, err)

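To show the effect of overfitting, we removed the max_depth = 2 hyperparameter used earlier, so the model is free to fit the noise as closely as it can. Here are the results: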

Model Training Error: 0.781349434595992
Permute feature: 0 , get importance: 0.552101830817091
Permute feature: 1 , get importance: 0.09724067334119402
Permute feature: 2 , get importance: 0.6438505436715355
Permute feature: 3 , get importance: 0.04818855424802326
Permute feature: 4 , get importance: 0.5612009974851658
Permute feature: 5 , get importance: 0.040477137312374456
Permute feature: 6 , get importance: 0.639716318236954
Model Testing Error: 2.1539242695811427
Permute feature: 0 , get importance: -0.025626816319340495
Permute feature: 1 , get importance: -0.31557802746050334
Permute feature: 2 , get importance: 0.03802460538390662
Permute feature: 3 , get importance: -0.3349522137661738
Permute feature: 4 , get importance: -0.010636557256972878
Permute feature: 5 , get importance: -0.3582242901145105
Permute feature: 6 , get importance: 0.061714464789204726

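In the Permutation Importance computed on the training data, the 7th-power feature (feature 6) scores as high as 0.6397, even though the label has no 7th-power relationship with the feature at all. On the test data, the same feature scores only 0.0617. This confirms that overfitting causes Permutation Importance to overestimate feature importance.

How can we address this? We can use regularisation methods to keep the model from overfitting; for more on regularisation, see this article. As in the example, we can also compute Permutation Importance on both the training data and the test data. If the two give similar results, the feature importance estimates are more trustworthy.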
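One possible way to run this train-versus-test comparison is scikit-learn's built-in `sklearn.inspection.permutation_importance`. The sketch below is an illustration under assumptions, not the author's code: `make_data` is a hypothetical helper that mirrors `func_overfitting_data` with an explicit random generator, and the seeds and `n_repeats=20` are arbitrary choices. The library shuffles whole columns rather than swapping values sample by sample, so its numbers will not match the outputs shown earlier.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error

def make_data(n_data, rng):
    # Same recipe as func_overfitting_data: y depends on x and x**3 plus noise,
    # while the model is offered powers 1 through 7 as features.
    e = rng.normal(size=n_data)
    x = np.linspace(-2, 2, n_data)
    y = x + x**3 + 2 * e
    return np.column_stack([x**p for p in range(1, 8)]), y

rng = np.random.default_rng(0)
x_train, y_train = make_data(50, rng)
x_test, y_test = make_data(50, rng)

# No max_depth limit, so the forest is free to overfit
model = RandomForestRegressor(random_state=0).fit(x_train, y_train)

datasets = {"train": (x_train, y_train), "test": (x_test, y_test)}
results = {}
for name, (x_part, y_part) in datasets.items():
    rmse = float(np.sqrt(mean_squared_error(y_part, model.predict(x_part))))
    imp = permutation_importance(model, x_part, y_part,
                                 scoring="neg_root_mean_squared_error",
                                 n_repeats=20,
                                 random_state=0)
    # With this scorer, importances_mean is the average RMSE increase per feature
    results[name] = (rmse, imp.importances_mean)
    print(name, "RMSE:", round(rmse, 3),
          "importances:", np.round(imp.importances_mean, 3))
```

If a feature scores much higher on the training data than on the test data, that gap is a sign its importance may be inflated by overfitting.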

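Key Takeaways

1. Highly correlated features cause Permutation Importance to underestimate feature importance. It is a good idea to check feature correlation with a correlation coefficient first.

2. An overfitted model causes Permutation Importance to overestimate feature importance. It is a good idea to use regularisation, or to compute Permutation Importance on both the training data and the test data.

About the Author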

Chia-Hao Li received the M.S. degree in computer science from Durham University, United Kingdom. He works on computer algorithms, machine learning, and hardware/software codesign. He was formerly a senior engineer at MediaTek, Taiwan. His current research topic is the application of machine learning techniques to fault detection in high-performance computing systems.