機器學習看得見Lesson 8— 用矩形特徵空間讓決策樹一目了然

15 min readOct 22, 2021

很多決策樹視覺化的方式，都是將模型決策的過程畫成樹狀圖。然後，我們就得從頭到尾慢慢查詢，到底某一筆資料最後會落在哪一片葉子。難道沒有更方便的視覺化方法嗎？

有的，就是矩形特徵空間（Rectangular Feature Space），這個方法可以讓我們一眼就看出模型是如何根據某 2 個變數做出決策。圖一是小編使用梯度提升決策樹來處理 Kaggle 平台House Prices — Advanced Regression Techniques資料集的結果，可以看出模型將 OverallQual 分為 4 個部分。其中 OverallQual 小於 7 以及 OverallQual 介於 7 到 8 之間會再判斷 GrLivArea，模型的決策方式一目了然。

現在，我們就來示範如何畫出圖一吧。

一、準備函式庫、資料集

此範例使用的梯度提升決策樹函式庫為xgboost，我們將使用xgboost裡的XGBRegressor函式，來處理一個迴歸問題。

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
from scipy.stats import skew
import copytrain = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
test = pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")

House Prices — Advanced Regression Techniques是一個預測房價的問題，訓練資料有1460筆，測試資料有1459筆，特徵有80個。由於矩形特徵空間是平面圖形，因此我們 1 次只能分析 2 個變數，像是圖一中我們只看 OverallQual 跟 GrLivArea，如果要同時觀察很多個變數，就不太適合使用矩形特徵空間。

二、特徵工程（Feature Engineering）

要建模之前，必須先將特徵轉換成適合梯度提升決策樹使用的型態。我們這裡的處理包含將偏態（skew）較大的特徵分布轉成近似常態分布（normal distribution）、用平均值填補缺失值、對類別變數使用One-hot encoding、最後將資料轉成xgboost可以使用的格式。對於操作的詳細說明，請看參考旗標出版的「自學機器學習 — 上Kaggle接軌世界，成為資料科學家」第3章。

train["SalePrice"] = np.log1p(train["SalePrice"])
all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
                      test.loc[:,'MSSubClass':'SaleCondition']))
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = train[numeric_feats].apply(lambda x:skew(x.dropna()))
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
all_data = pd.get_dummies(all_data)
all_data = all_data.fillna(all_data[:train.shape[0]].mean())
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice
dtrain = xgb.DMatrix(X_train, label = y)

三、建立梯度提升決策樹

為了方便說明，我們只用一個簡單的範例。以下程式建立的梯度提升決策樹模型含有 1 棵決策樹，此決策樹最多只會有 3 次分支。學習率（Learning Rate）為0.1，損失函數（Loss Function）為均方誤差（Mean Squared Error）。此外，我們關掉常規化（Regularisation）的功能（min_child_weight = 0、reg_lambda = 0）。

model_xgb = xgb.XGBRegressor(n_estimators = 1,
                             max_depth = 3,
                             learning_rate = 0.1,
                             objective = 'reg:squarederror',
                             min_child_weight = 0,
                             reg_lambda = 0)
model_xgb.fit(X_train, y)
model_xgb.get_booster().dump_model('xgb_model.txt', with_stats=True)
print(model_xgb)

訓練完後，我們存下模型的決策方式，並且印出模型的基本資訊，此時可以檢查超參數的設定是否有問題。

我們可以應用 xgboost 提供的 plot_tree，來將決策樹的決策過程畫出來，結果如圖三。關於怎麼看這張圖，可以參考這篇文章。

fig, ax = plt.subplots(figsize=(20,20))
xgb.plot_tree(model_xgb, 
              ax = ax, 
              num_trees = 0)

四、將模型決策過程存成 List，方便之後使用

剛剛有將模型的決策過程存在 xgb_model.text，先在我們讀取到程式，並且印出來看看，結果如圖四。

with open('xgb_model.txt', 'r') as f:
    text_model = f.readlines()
    f.close()for each_line in text_model:
    print(each_line)

閱讀方式為：從第2行開始，冒號前的數字是決策樹的節點 ID，冒號後面如果是接中括號，裡面就會寫決策方式、決策結果要到哪個節點等等資訊。冒號後面如果是接 leaf，則代表最終的預測值。

眼睛的讀者會發現，圖四中的 OverallQual 判斷閾值，跟圖三不一樣。其實是因為此資料集的 OverallQual 都是正整數值，所以判斷「小於 6.5」跟判斷「小於 7」，結果會一樣。

現在，我們要把文字檔轉換成方便使用的格式。透過 Python 提供的字串處理函式 split，我們可以判斷文字中是否有「：」、「yes=」、「no=」等等關鍵字，擷取想要用的資訊放在一個 list 裡頭。關於 Python 字串處理的詳細使用說明，請參考旗標出版的「一步到位！Python 程式設計 — 從基礎到資料科學應用，學習大數據分析的關鍵能力」。

def func_extract_feature(text):
    sub = text.split("<")
    if(len(sub) > 1):
        variable = sub[0].split("[")[1]
        condition = sub[1].split("]")[0]
        node = text.split(":")[0]
        left = text.split("yes=")[1].split(",")[0]
        right = text.split("no=")[1].split(",")[0]
        return [int(node), 
                variable, 
                float(condition), 
                int(left), 
                int(right)]
    else:
        return -1def func_extract_leaf(text):
    result = text.find("leaf")
    if(result != -1):
        node = text.split(":")[0]
        sub = text.split(",")
        leaf = sub[0].split("=")[1]
        return [int(node), float(leaf)]
    else:
        return -1condition = []
leaf = []
for sub_text in text_model:
    if(func_extract_feature(sub_text) != -1):
        condition.append(func_extract_feature(sub_text))
    if(func_extract_leaf(sub_text) != -1):
        leaf.append(func_extract_leaf(sub_text))
        
for each in condition:
    print(each)
for each in leaf:
    print(each)

我們的 list 呈現出來如圖五，看起來乾淨清楚多了。

五、畫出矩形特徵空間

繼續講解之前，我們再看一眼矩形特徵空間。

想要畫出這張圖，我們必須抓出決策點的變數以及閾值。現在我們關心的變數只有 OverallQual 跟 GrLivArea，至於閾值就要從上一節存好的 condition list 抓出來。OverallQual 比較簡單，只要 condition list 當中的變數名字有符合，就可以抓出閾值。比如 condition list 第 1 行代表 OverallQual < 7，所以直接在圖六的 x = 7 位置畫一條垂直線。

但是，要處理 GrLivArea 就比較麻煩，雖然我們一樣可以知道 GrLivArea 的決策閾值，然而圖六可以看出，我們必須知道水平線的左邊界跟右邊界是多少！！！該怎麼做呢？以找右邊界為例，演算法如下：

1、針對一個用 GrLivArea 做判斷的節點 A，如果有一個用 OverallQual 做判斷的節點 B，此節點 B 的左邊分支接到節點 A，那我們就找到右邊界。

2、針對一個用 GrLivArea 做判斷的節點 A，如果完全沒有一個節點，使用 OverallQual 做判斷且左邊分支接到節點 A，那我們就找節點 A 的母節點（parent node），看看是否有使用 OverallQual 做判斷且左邊分支接到該母節點，直到搜尋至決策樹的根結點（root node）。

3、如果發現都沒有找到右邊界，就直接帶入資料集中此特徵的最大值。

def func_find_bound(current_node, 
                    condition, 
                    target, 
                    data_min, 
                    data_max):
    right_bound = -1
    left_bound = -1
    node = copy.deepcopy(current_node)
    while(node != 0):
        for each_condition in condition:
            if(each_condition[3] 
               == node)and(each_condition[1] 
                           == target):
                right_bound = each_condition[2]
            elif((each_condition[3] 
                  == node)or(each_condition[4] 
                             == node)):
                parent = each_condition[0]
        if(right_bound == -1):
            node = parent
        else:
            node = 0
    
    node = copy.deepcopy(current_node)
    while(node != 0):
        for each_condition in condition:
            if(each_condition[4] 
               == node)and(each_condition[1] 
                           == target):
                left_bound = each_condition[2]
            elif((each_condition[3] 
                  == node)or(each_condition[4] 
                             == node)):
                parent = each_condition[0]
        if(left_bound == -1):
            node = parent
        else:
            node = 0
    
    if(left_bound == -1):
        left_bound = data_min
    if(right_bound == -1):
        right_bound = data_max
    
    return [left_bound, right_bound]

可以找到左右邊界後，我們就可以把圖畫出來了。讀者可能會疑惑怎麼知道哪個預測值要放在哪裡。其實很簡單，我們把每個 leaf list 的節點編號，代入 func_find_bound 函式，找出上下左右的邊界，就可以知道該預測值要放在圖中的哪裡囉。

xmin = X_train['OverallQual'].min()
xmax = X_train['OverallQual'].max()
ymin = X_train['GrLivArea'].min()
ymax = X_train['GrLivArea'].max()have_text = []plt.figure(figsize=(15,5))for each_condition in condition:
    if(each_condition[1] == 'OverallQual'):
        plt.plot([each_condition[2], each_condition[2]], 
                 [ymin, ymax], 
                 color = "black")
    if(each_condition[1] == 'GrLivArea'):
        bound = func_find_bound(each_condition[0], 
                                condition, 
                                'OverallQual', 
                                xmin, 
                                xmax)
        plt.plot(bound, 
                 [each_condition[2], each_condition[2]], 
                 color = "black")for each_leaf in leaf:
    x = np.mean(func_find_bound(each_leaf[0], 
                                condition, 
                                'OverallQual', 
                                xmin, 
                                xmax))
    y = np.mean(func_find_bound(each_leaf[0], 
                                condition, 
                                'GrLivArea', 
                                ymin, 
                                ymax))
    
    # This is only for plotting beautifully - start
    x = x - 0.3
    if([x, y] in have_text):
        y = y - 0.1
    else:
        have_text.append([x, y])
    # This is only for plotting beautifully - end
    plt.text(x, y, each_leaf[1])
    
plt.xlim([xmin, xmax])
plt.ylim([ymin, ymax])
plt.xlabel('OverallQual')
plt.ylabel('GrLivArea')
plt.show()

小編發現 Github 上有人實作更完整版的矩形特徵空間，大家可以試試看喔。

關於作者

Chia-Hao Li received the M.S. degree in computer science from Durham University, United Kingdom. He engages in computer algorithm, machine learning, and hardware/software codesign. He was former senior engineer in Mediatek, Taiwan. His currently research topic is the application of machine learning techniques for fault detection in the high-performance computing systems.