Hands-On Machine Learning Lesson 3: Using Quantile Transform to Bring a Data Distribution Closer to Normal

Jun 25, 2021

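Sometimes the data we receive has an odd-looking distribution, and that can cause problems. Some models, such as ordinary least squares linear regression, rely on a normality assumption (strictly speaking, on the residuals rather than on the raw data). With a heavy-tailed distribution, a model can also be strongly influenced by outliers.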

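To deal with the problems an odd distribution brings, we can transform the data first so that its distribution moves closer to normal. A common choice is the log transform. However, the log transform breaks down as soon as the data contains negative values (or zeros).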
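As a quick illustration of that limitation, the sketch below applies `np.log` to a small made-up array. The array `values` is purely illustrative and is not part of this lesson's dataset.

```python
import numpy as np

# Hypothetical sample values (not from the lesson's dataset).
values = np.array([-1.0, 0.0, 0.5, 10.0])

# np.log is undefined for negative inputs and diverges at zero;
# silence the runtime warnings so the result is easy to inspect.
with np.errstate(invalid="ignore", divide="ignore"):
    logged = np.log(values)

print(logged)  # the first entry is nan, the second is -inf
```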

Today we introduce Quantile Transform, which can map data to a normal distribution and can also handle negative values.

I. Basic Concepts of Quantile Transform

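Quantile Transform can be broken into two steps: first map the data to a uniform distribution between 0 and 1, then map that uniformly distributed data to a normal distribution.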

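Mapping to a uniform distribution between 0 and 1 is simple: it is just interpolation. Take the figure below as an example and suppose there are 31 data points, distributed like the black dots on the red line in Figure 1. First we decide how many groups to split the data into; this example uses 10. We therefore need to pick 11 values from the original data as group boundaries: the 1st, 4th, 7th, ..., and 31st data points.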

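With the number of groups and their boundaries fixed, the remaining points that were not chosen as boundaries are placed proportionally within the group they belong to. In the original data in Figure 2, for example, the distance between the 1st and 2nd points is clearly larger than the distance between the 2nd and 3rd points. After the transform, the same ratio of distances is carried over to the transformed data.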

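At this point we can see how the number of groups affects the transform: with more groups, fewer points have to be interpolated, so the result looks more like a uniform distribution.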
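The 31-point example can be sketched in a few lines. This is only a minimal illustration: the data is made up, and it assumes `np.interp` as the piecewise-linear interpolation routine; the names `boundary_idx`, `boundary_values`, and `boundary_targets` are introduced here for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# 31 made-up, sorted data points (illustrative only).
data = np.sort(rng.exponential(scale=1.0, size=31))

n_groups = 10
# Positions of the 11 group boundaries: the 1st, 4th, 7th, ..., 31st point.
boundary_idx = np.arange(0, len(data), 3)
boundary_values = data[boundary_idx]
# Each boundary is assigned an evenly spaced target in [0, 1].
boundary_targets = np.linspace(0.0, 1.0, n_groups + 1)

# Points between two boundaries are placed proportionally between
# the corresponding targets (piecewise-linear interpolation).
uniform = np.interp(data, boundary_values, boundary_targets)

print(np.round(uniform, 3))
```

Every boundary lands exactly on a multiple of 0.1, while the two points inside each group keep their relative spacing.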

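Once we have uniformly distributed data, we feed it into the Percent Point Function (PPF) of the normal distribution. The name may look unfamiliar, but it is simply the inverse of the cumulative distribution function (CDF). It is plotted in Figure 3. If we pass in a uniformly distributed array, the output contains more values between 0 and 2 than between 2 and 4. After the transform we therefore obtain a normal distribution centered at 0.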
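Both claims are easy to check numerically. The sketch below uses `scipy.stats.norm`; the sample points in `x` and the evenly spaced array `u` are made-up inputs chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

x = np.array([-2.0, -0.5, 0.0, 1.0, 2.5])

# Applying the CDF and then the PPF should return the original values.
round_trip = norm.ppf(norm.cdf(x))

# Feed an evenly spaced "uniform" array on (0, 1) through the PPF.
u = np.linspace(0.001, 0.999, 999)
z = norm.ppf(u)

# Count how many outputs land in [0, 2) versus [2, 4).
count_0_to_2 = np.sum((z >= 0) & (z < 2))
count_2_to_4 = np.sum((z >= 2) & (z < 4))

print(round_trip)
print(count_0_to_2, count_2_to_4)
```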

II. Generating Synthetic Data

First, we build a dataset that is not normally distributed. As this lesson's example, we mix two normal distributions with different means.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import QuantileTransformer
from scipy.stats import norm

np.random.seed(1)

N_quantile = 10
N_data = 1000

# Two normal distributions with different means, mixed into one bimodal dataset.
data1 = np.random.normal(loc = 0.25,
                         scale = 0.1,
                         size = (N_data // 2))
data2 = np.random.normal(loc = 0.75,
                         scale = 0.1,
                         size = (N_data // 2) + N_quantile + 1)

data = np.concatenate((data1, data2), axis = 0)
# The interpolation below assumes the data is sorted in ascending order.
data = np.sort(data)

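In the code, N_quantile is the number of groups mentioned above. This hyperparameter affects how close the transformed data gets to a normal distribution, and it needs to be tuned to the original distribution of the dataset. Try several values and plot the results to see whether the transform looks satisfactory. In this example we start with 10. The extra N_quantile + 1 samples in data2 mean the sorted array holds N_data ordinary points plus the N_quantile + 1 group boundaries, 1011 values in total.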

III. Mapping the Data to a Uniform Distribution First

Let's look at the code first, then walk through it piece by piece.

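step_quantile = 1 / N_quantile       # width of each group on the [0, 1] axis
step_index = N_data // N_quantile    # number of non-boundary points per group

index = 1               # point being processed (index 0 is the first boundary)
current_data = 0        # non-boundary points handled so far in the current group
current_quantile = 0    # index of the current group
transform = np.zeros(len(data))
while(index != len(data)):

    # Transformed value at the lower boundary of the current group.
    lower_value = current_quantile * step_quantile

    # Positions of the group's lower and upper boundaries in the sorted data.
    lower_index = current_quantile * (step_index + 1)
    upper_index = (current_quantile + 1) * (step_index + 1)

    max_diff = data[upper_index] - data[lower_index]
    dat_diff = data[index] - data[lower_index]

    ratio = dat_diff * step_quantile / max_diff

    transform[index] = lower_value + ratio
    current_data = current_data + 1
    index = index + 1

    # After the last interior point of a group, write the boundary value and move on.
    if(current_data == step_index):
        current_data = 0
        current_quantile = current_quantile + 1
        transform[index] = current_quantile * step_quantile
        index = index + 1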

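As mentioned, mapping to a uniform distribution uses interpolation. step_quantile is the spacing between two adjacent green points on the blue line in Figure 1, that is, the width of each group.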

step_quantile = 1 / N_quantile

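max_diff is the spacing between the two green points on the red line in Figure 1 that enclose the data point currently being processed. In other words, it is the difference between the upper and lower boundary values of the group that point belongs to.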

    lower_index = current_quantile * (step_index + 1)
    upper_index = (current_quantile + 1) * (step_index + 1)
    
    max_diff = data[upper_index] - data[lower_index]

dat_diff is the difference between the data point currently being processed and the lower boundary value of its group.

    dat_diff = data[index] - data[lower_index]

With step_quantile, max_diff, and dat_diff in hand, we can compute where the current data point should land on the blue line, namely at lower_value + ratio.

    ratio = dat_diff * step_quantile / max_diff
    
    transform[index] = lower_value + ratio

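Each time we cross a boundary, we update the state: reset the per-group counter, advance to the next group, and write the boundary's own transformed value.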

    if(current_data == step_index):
        current_data = 0
        current_quantile = current_quantile + 1
        transform[index] = current_quantile * step_quantile
        index = index + 1
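The same interpolation can also be written without explicit index bookkeeping. The sketch below is one possible vectorized equivalent, assuming `np.interp` for the piecewise-linear step; it rebuilds the same synthetic dataset so that it runs on its own, and the names `boundary_idx`, `boundary_values`, `boundary_targets`, and `transform_vec` are introduced here for illustration. Up to floating-point rounding it should agree with the loop above.

```python
import numpy as np

np.random.seed(1)

N_quantile = 10
N_data = 1000
# Same synthetic dataset as in this lesson: 1000 interior points plus
# N_quantile + 1 boundary points, sorted in ascending order.
data1 = np.random.normal(loc=0.25, scale=0.1, size=N_data // 2)
data2 = np.random.normal(loc=0.75, scale=0.1,
                         size=N_data // 2 + N_quantile + 1)
data = np.sort(np.concatenate((data1, data2)))

step_index = N_data // N_quantile
# Boundary positions: every (step_index + 1)-th element of the sorted array.
boundary_idx = np.arange(N_quantile + 1) * (step_index + 1)
boundary_values = data[boundary_idx]
# Transformed value assigned to each boundary: 0.0, 0.1, ..., 1.0.
boundary_targets = np.arange(N_quantile + 1) / N_quantile

# Piecewise-linear interpolation between consecutive boundaries.
transform_vec = np.interp(data, boundary_values, boundary_targets)

print(boundary_idx)
print(transform_vec[:5])
```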

IV. Mapping the Data to a Normal Distribution

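This step requires computing the PPF, which does not look easy to compute by hand if you look it up online. SciPy already provides a function for it, though, so we can simply call it.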

norm.ppf(transform)
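One detail to watch: the PPF diverges at the endpoints of the unit interval, so inputs of exactly 0 or 1 produce infinite outputs. The sketch below shows this on a made-up array `u` and one possible fix with `np.clip`; the margin `eps = 1e-7` is an assumption chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

u = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

# The PPF diverges at the endpoints of [0, 1]:
# the first entry is -inf and the last entry is +inf.
raw = norm.ppf(u)

# Nudge the endpoints inward by a small margin before applying the PPF.
eps = 1e-7
clipped = norm.ppf(np.clip(u, eps, 1.0 - eps))

print(raw)
print(clipped)
```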

V. Final Transform Results

Let's plot the original data, the data mapped to a uniform distribution, and the data mapped to a normal distribution with the following code.

# Figure 1: original data vs. data mapped to a uniform distribution.
plt.figure(1)
plt.subplot(1, 2, 1)
plt.title("Before Transform")
plt.hist(data)
plt.subplot(1, 2, 2)
plt.hist(transform)
plt.title("After Transform, Quantile = "+str(N_quantile))
plt.show()

# norm.ppf returns -inf at 0 and +inf at 1, so nudge the endpoints inward.
transform[transform == 0] = 1e-7
transform[transform == 1] = 1.0-1e-7

# Figure 2: original data vs. data mapped to a normal distribution.
plt.figure(2)
plt.subplot(1, 2, 1)
plt.title("Before Transform")
plt.hist(data)
plt.subplot(1, 2, 2)
plt.hist(norm.ppf(transform), bins = 50)
plt.title("After Transform, Quantile = "+str(N_quantile))
plt.show()

As the plots show, the data that originally had two peaks has become uniformly distributed.

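After the PPF, it becomes approximately normal. You may notice that the transformed data seems to spike suddenly at around 2. To get a cleaner result, adjust the N_quantile hyperparameter.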
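Besides eyeballing histograms, one way to compare group counts is to measure the distance to a standard normal numerically. The sketch below is only an illustration under several assumptions: it uses the Kolmogorov-Smirnov statistic from `scipy.stats.kstest` as the distance (not something this lesson prescribes), the helper `to_normal` is hypothetical, the data is made up, and the group boundaries are taken with `np.quantile` rather than by picking every 101st point as in the loop above.

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(1)

# Made-up bimodal data in the spirit of this lesson's example.
data = np.sort(np.concatenate((rng.normal(0.25, 0.1, 500),
                               rng.normal(0.75, 0.1, 500))))


def to_normal(sorted_data, n_groups):
    """Hypothetical helper: interpolate to [0, 1], then apply the normal PPF."""
    targets = np.linspace(0.0, 1.0, n_groups + 1)
    # Group boundaries taken as empirical quantiles of the data.
    boundaries = np.quantile(sorted_data, targets)
    uniform = np.interp(sorted_data, boundaries, targets)
    # Keep values strictly inside (0, 1) so the PPF stays finite.
    uniform = np.clip(uniform, 1e-7, 1.0 - 1e-7)
    return norm.ppf(uniform)


# Kolmogorov-Smirnov distance to the standard normal for several group counts.
ks_by_groups = {n: kstest(to_normal(data, n), "norm").statistic
                for n in (10, 100, 500)}

for n, ks in ks_by_groups.items():
    print(f"n_groups = {n:4d}  KS statistic = {ks:.4f}")
```

A smaller statistic means the transformed sample sits closer to a standard normal, so the printed values give a rough, plot-free way to rank candidate group counts.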

In fact, scikit-learn already provides a class that performs the whole Quantile Transform, so give it a try.

QuantileTransformer(n_quantiles = (N_quantile + 1),
                    random_state = 0,
                    output_distribution = 'normal'
                    ).fit_transform(data.reshape(-1,1))
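For a slightly fuller picture of the same class, the sketch below runs it on made-up skewed data that contains negative values and then maps the result back with `inverse_transform`. The data, the choice of `n_quantiles=100`, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)

# Made-up skewed data, shifted so that many values are negative.
raw = rng.exponential(scale=2.0, size=1000) - 3.0

qt = QuantileTransformer(n_quantiles=100,
                         output_distribution="normal",
                         random_state=0)

# scikit-learn expects a 2-D array of shape (n_samples, n_features).
transformed = qt.fit_transform(raw.reshape(-1, 1)).ravel()

# inverse_transform maps the values back to (approximately) the original scale.
recovered = qt.inverse_transform(transformed.reshape(-1, 1)).ravel()

print(raw.min(), raw.max())
print(transformed.min(), transformed.max())
```

Keeping the fitted `qt` object around is what makes the inverse mapping possible, which is handy when a model's predictions need to be reported in the original units.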

Key Takeaways

1. Quantile Transform can map data to a normal distribution, and it can handle negative values.

2. The n_quantiles parameter of Quantile Transform affects how close the result gets to a normal distribution.

About the Author

Chia-Hao Li received the M.S. degree in computer science from Durham University, United Kingdom. He works on computer algorithms, machine learning, and hardware/software co-design. He was formerly a senior engineer at MediaTek, Taiwan. His current research topic is the application of machine learning techniques to fault detection in high-performance computing systems.