Hands-On Machine Learning Lesson 9: The Workplace-Relevant Pearson, Spearman, and Kendall Correlation Coefficients (Part 2)

15 min read · Aug 6, 2021

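Part 1 covered the Pearson and Spearman correlation coefficients; interested readers can go back and read it.

Hands-On Machine Learning Lesson 8: The Workplace-Relevant Pearson, Spearman, and Kendall Correlation Coefficients (Part 1): link to Part 1

This part introduces the Kendall correlation coefficient.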

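The Kendall rank correlation coefficient is a rank correlation coefficient. Like the Spearman coefficient, it measures a monotonic relationship, but it is computed differently: it compares the ordering direction of pairs of observations (Xi, Yi) and (Xj, Yj).

For i < j, if Xi < Xj and Yi < Yj (or Xi > Xj and Yi > Yj), the two observations are ordered in the same direction and form a concordant pair.

If Xi < Xj and Yi > Yj (or Xi > Xj and Yi < Yj), they are ordered in opposite directions and form a discordant pair.

If Xi = Xj or Yi = Yj, the ranks are tied. Such a pair counts as neither concordant nor discordant, so the calculation has to be adjusted.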
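The three cases above can be sketched as a small pair counter. This is a minimal pure-Python illustration, not the article's own code; the function name `count_pairs` and the sample data are hypothetical and are not part of pandas or SciPy.

```python
from itertools import combinations

def count_pairs(x, y):
    """Count concordant, discordant, and tied pairs over all i < j.

    Hypothetical helper for illustration only.
    """
    concordant = discordant = tied = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        # The sign of the product tells whether X and Y move the same way.
        direction = (xi - xj) * (yi - yj)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
        else:
            # Xi == Xj or Yi == Yj: a tie, neither concordant nor discordant.
            tied += 1
    return concordant, discordant, tied

# Made-up sample: one discordant pair and one pair tied in X.
print(count_pairs([1, 2, 3, 3], [1, 3, 2, 4]))  # prints (4, 1, 1)
```

Keeping ties in a separate counter makes it easy to see when the tau-a assumption of no ties does not hold.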

I. Kendall tau-a

When there are no ties, as in the figure below:

the tau-a formula is used. It is as follows:
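For reference, tau-a is commonly written as below, which is also what the Python code in this section evaluates. The symbols are notation assumed here: n_c and n_d are the numbers of concordant and discordant pairs, and n is the number of observations.

```latex
\tau_a = \frac{n_c - n_d}{\binom{n}{2}} = \frac{n_c - n_d}{n(n-1)/2}
```

With no ties, every pair is either concordant or discordant, so n_c + n_d = n(n-1)/2 and the denominator can equally be written as n_c + n_d.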

Total concordant pairs: 6+5+3+3+2+1+0 = 20

Total discordant pairs: 0+0+1+0+0+0+0 = 1

tau-a = (20 - 1)/(20 + 1) = 19/21 ≈ 0.905

Using Python

# -*- coding: UTF-8 -*-
# %matplotlib inline is a Jupyter notebook magic; run this in a notebook.
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

X = pd.Series([1, 3, 5, 7, 9])
Y = pd.Series([5, 25, 125, 625, 3125])
plt.plot(X, Y)
plt.show()

print("Kendall coefficient (pandas): " + str(round(X.corr(Y, method='kendall'), 2)))
print("==============================================")
print("Using the Kendall formula")
cn2 = (len(X) * (len(Y) - 1)) / 2  # total number of pairs, n(n-1)/2
print("Total pairs: " + str(round(cn2)))

concordant = 0
discordant = 0
for i in range(len(X)):
    for j in range(len(X)):
        if i < j:
            if (X[i] < X[j] and Y[i] < Y[j]) or (X[i] > X[j] and Y[i] > Y[j]):
                concordant += 1
            if (X[i] > X[j] and Y[i] < Y[j]) or (X[i] < X[j] and Y[i] > Y[j]):
                discordant += 1
print("Concordant pairs: " + str(concordant))
print("Discordant pairs: " + str(discordant))
print("Kendall coefficient (formula): " + str((concordant - discordant) / cn2))

Result

The result above shows that the pandas value matches the value computed by hand from the formula.

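II. Kendall tau-b

When duplicate values produce tied ranks, as in the figure below:

In this case the tau-b formula is used. It takes tied ranks into account and applies an adjustment so that the coefficient stays between -1 and 1.

Total concordant pairs = 9+8+7+5+5+3+3+0+0+0 = 40

Total discordant pairs = 0+0+0+0+0+1+0+0+0+0 = 1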
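For reference, the tau-b calculation for this example can be written out as below. The notation is assumed here: n_c and n_d are the concordant and discordant pair counts, n_0 = n(n-1)/2 is the total number of pairs, n_1 is the number of pairs tied in X, and n_2 is the number of pairs tied in Y. These correspond to the variables `cn2`, `n1`, and `n2` in the code that follows.

```latex
\tau_b = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}
       = \frac{40 - 1}{\sqrt{(45 - 3)(45 - 1)}}
       = \frac{39}{\sqrt{1848}} \approx 0.907
```

Here n = 10 gives n_0 = 45. X contains one group of three tied values (15, 15, 15), so n_1 = 3, and Y contains one tied pair (8, 8), so n_2 = 1.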

Using Python

(Upgrading to SciPy 1.6.0 or later is recommended)

# -*- coding: UTF-8 -*-
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import math
from scipy import stats

X = pd.Series([1, 3, 5, 7, 9, 11, 13, 15, 15, 15])
Y = pd.Series([2, 4, 6, 8, 8, 14, 12, 16, 18, 20])

tau_b, p_value = stats.kendalltau(X, Y, variant='b')
print("Kendall tau_b (scipy): " + str(round(tau_b, 3)))
print("==============================================")
cn2 = (len(X) * (len(Y) - 1)) / 2  # total number of pairs, n(n-1)/2
print("Total pairs: " + str(round(cn2)))
concordant = 0
discordant = 0
n1_size = []  # sizes of the tied groups in X
n2_size = []  # sizes of the tied groups in Y
n1 = 0        # number of pairs tied in X
n2 = 0        # number of pairs tied in Y
for i in range(len(X)):
    for j in range(len(X)):
        if i < j:
            if (X[i] < X[j] and Y[i] < Y[j]) or (X[i] > X[j] and Y[i] > Y[j]):
                concordant += 1
            if (X[i] > X[j] and Y[i] < Y[j]) or (X[i] < X[j] and Y[i] > Y[j]):
                discordant += 1
print("Concordant pairs: " + str(concordant))
print("Discordant pairs: " + str(discordant))

# Collect the size of every group of repeated values in X.
for i in X.value_counts():
    if i > 1:
        n1_size.append(i)
print("n1_size: " + str(n1_size))
# A tied group of size t contributes t(t-1)/2 tied pairs.
for i in n1_size:
    n1 += int(i * (i - 1) / 2)
print("n1: " + str(n1))

# Same for Y.
for j in Y.value_counts():
    if j > 1:
        n2_size.append(j)
print("n2_size: " + str(n2_size))
for j in n2_size:
    n2 += int(j * (j - 1) / 2)
print("n2: " + str(n2))

print("Kendall tau_b (formula): " + str(round((concordant - discordant) / (math.sqrt(cn2 - n1) * math.sqrt(cn2 - n2)), 3)))

Result

The scipy.stats result matches both the formula-based code and the hand calculation.

III. Kendall tau-c

tau-c, also known as Stuart-Kendall tau-c, is suited to rectangular contingency tables. The tau-c formula is:
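For reference, the tau-c formula can be written as below, in the same form that the Python code in this section evaluates. The symbols n_c and n_d (concordant and discordant pair counts) and n (number of observations) are notation assumed here; r and c are the numbers of rows and columns of the table.

```latex
\tau_c = \frac{2\,(n_c - n_d)}{n^2 \, \dfrac{m - 1}{m}}, \qquad m = \min(r, c)
```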

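When the data are not given as a contingency table, r = the number of distinct values of Yi and c = the number of distinct values of Xi.

Using the same example as for tau-b, we now compute tau-c.

Total concordant pairs = 9+8+7+5+5+3+3+0+0+0 = 40

Total discordant pairs = 0+0+0+0+0+1+0+0+0+0 = 1

n = 10, m = min(9, 8) = 8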
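Plugging these numbers into the tau-c formula can be checked with a few lines of Python. This is a minimal sketch; `tau_c_from_counts` is a hypothetical helper name, and the expression is the same one used in the formula-based code later in this section.

```python
def tau_c_from_counts(concordant, discordant, n, m):
    """Stuart-Kendall tau-c from pair counts (hypothetical helper).

    n is the number of observations and m = min(r, c), the smaller of
    the numbers of distinct values in the two variables.
    """
    return 2 * (concordant - discordant) / (n ** 2 * (m - 1) / m)

# Values from the hand calculation: 40 concordant, 1 discordant, n = 10, m = 8.
tau_c = tau_c_from_counts(40, 1, 10, 8)
print(round(tau_c, 3))  # prints 0.891
```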

Using Python

# -*- coding: UTF-8 -*-
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import math
from scipy import stats  # needed for stats.kendalltau below

X = pd.Series([1, 3, 5, 7, 9, 11, 13, 15, 15, 15])
Y = pd.Series([2, 4, 6, 8, 8, 14, 12, 16, 18, 20])

tau_c, p_value = stats.kendalltau(X, Y, variant='c')
print("Kendall tau_c (scipy): " + str(round(tau_c, 3)))
print("==============================================")
cn2 = (len(X) * (len(Y) - 1)) / 2  # total number of pairs, n(n-1)/2
print("Total pairs: " + str(round(cn2)))
concordant = 0
discordant = 0
for i in range(len(X)):
    for j in range(len(X)):
        if i < j:
            if (X[i] < X[j] and Y[i] < Y[j]) or (X[i] > X[j] and Y[i] > Y[j]):
                concordant += 1
            if (X[i] > X[j] and Y[i] < Y[j]) or (X[i] < X[j] and Y[i] > Y[j]):
                discordant += 1
print("Concordant pairs: " + str(concordant))
print("Discordant pairs: " + str(discordant))

n = len(X)
print("n: " + str(n))
# m is the smaller of the numbers of distinct values in X and Y.
m = min(X.nunique(), Y.nunique())
print("m: " + str(m))
print("Kendall tau_c (formula): " + str(round(2 * (concordant - discordant) / ((math.pow(n, 2)) * (m - 1) / m), 3)))

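Result

The scipy.stats result matches both the formula-based code and the hand calculation.

Having computed both tau-b and tau-c, we can see that in this example the tau-b coefficient is higher than the tau-c coefficient.

Same data, different results. Why? Let's dig deeper.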

Suppose the data table in Figure (A) is turned into the 3 x 3 contingency table in Figure (B).

Total concordant pairs = 2 x 1 + 2 x 1 + 1 x 1 = 5

Total discordant pairs = 0
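Counting pairs directly from a contingency table can be sketched as follows. This is an illustrative pure-Python helper (the name `table_pair_counts` is hypothetical). The 3 x 3 table is an assumption: it is built from the data X = [1, 1, 2, 3], Y = [1, 1, 2, 3] used in this example, with rows indexed by X and columns by Y.

```python
def table_pair_counts(table):
    """Return (concordant, discordant) pair counts for a contingency table.

    table[i][j] is the number of observations in row category i and
    column category j, with both sets of categories in increasing order.
    Hypothetical helper for illustration only.
    """
    rows, cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(rows):
        for j in range(cols):
            for i2 in range(i + 1, rows):      # only rows below row i
                for j2 in range(cols):
                    if j2 > j:                 # below and to the right
                        concordant += table[i][j] * table[i2][j2]
                    elif j2 < j:               # below and to the left
                        discordant += table[i][j] * table[i2][j2]
    return concordant, discordant

# Assumed layout: rows are X = 1, 2, 3 and columns are Y = 1, 2, 3.
table_3x3 = [
    [2, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
print(table_pair_counts(table_3x3))  # prints (5, 0)
```

Cells in the same row or the same column are skipped, because those pairs are tied in one of the two variables.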

Computing tau-b:

Computing tau-c:
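A sketch of the two calculations for this table, using notation assumed here: n_c and n_d are the concordant and discordant pair counts, n_0 = n(n-1)/2 is the total number of pairs, n_1 and n_2 are the numbers of pairs tied in X and in Y, and m is the smaller of the numbers of rows and columns. With n = 4 observations, n_0 = 6; each variable has one tied pair, so n_1 = n_2 = 1; and m = 3.

```latex
\tau_b = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}
       = \frac{5 - 0}{\sqrt{(6 - 1)(6 - 1)}} = 1

\tau_c = \frac{2\,(n_c - n_d)}{n^2 \, \dfrac{m - 1}{m}}
       = \frac{2 \times 5}{4^2 \times \dfrac{2}{3}}
       = \frac{15}{16} = 0.9375
```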

Using Python

from scipy import stats

X = [1, 1, 2, 3]
Y = [1, 1, 2, 3]

tau_b, p_value = stats.kendalltau(X, Y, variant='b')
tau_c, p_value = stats.kendalltau(X, Y, variant='c')
print(tau_b, tau_c)

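Result

The scipy.stats result matches the hand calculation. It also shows that for an n x n contingency table, tau-b correctly returns a coefficient of 1, while tau-c only comes close to 1. tau-b is more accurate than tau-c here and is the better fit for this case.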

========================================

Suppose the data table in Figure (C) is turned into the 3 x 6 contingency table in Figure (D).

Total concordant pairs =

1 x 1 + 1 x 1 + 1 x 1 + 1 x 1

+ 1 x 1 + 1 x 1 + 1 x 1 + 1 x 1

+ 1 x 1 + 1 x 1

+ 1 x 1 + 1 x 1 = 12

Total discordant pairs = 0

Computing tau-b:

Computing tau-c:
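A sketch of the two calculations for this table, using notation assumed here: n_c and n_d are the concordant and discordant pair counts, n_0 = n(n-1)/2 is the total number of pairs, n_1 and n_2 are the numbers of pairs tied in X and in Y, and m is the smaller of the numbers of rows and columns. With n = 6 observations, n_0 = 15; X has no ties, so n_1 = 0; Y has three tied pairs, so n_2 = 3; and m = min(3, 6) = 3.

```latex
\tau_b = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}
       = \frac{12 - 0}{\sqrt{(15 - 0)(15 - 3)}}
       = \frac{12}{\sqrt{180}} \approx 0.894

\tau_c = \frac{2\,(n_c - n_d)}{n^2 \, \dfrac{m - 1}{m}}
       = \frac{2 \times 12}{6^2 \times \dfrac{2}{3}}
       = \frac{24}{24} = 1
```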

Using Python

from scipy import stats

X = [1, 2, 3, 4, 5, 6]
Y = [1, 1, 2, 2, 3, 3]

tau_b, p_value = stats.kendalltau(X, Y, variant='b')
tau_c, p_value = stats.kendalltau(X, Y, variant='c')
print(tau_b, tau_c)

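Result

The scipy.stats result matches the hand calculation. It also shows that for an m x n contingency table the situation is the reverse of the n x n case: tau-c returns exactly 1, while tau-b only comes close to 1. tau-c is more accurate than tau-b here and is the better fit for this case.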

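Key points

(1) The Kendall tau-a formula applies when there are no duplicate values.

(2) The Kendall tau-b and tau-c formulas both apply when there are duplicate values.

(3) The choice between tau-b and tau-c depends on whether the two variables have the same number of rank categories: tau-b suits the case where the numbers are equal, and tau-c suits the case where they differ.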

References

1. Magiya, Joseph (2019). 'Kendall Rank Correlation Explained'. Towards Data Science. [Accessed: 12 July 2021]. Available from: https://towardsdatascience.com/kendall-rank-correlation-explained-dee01d99c535

2. 'Kendall rank correlation coefficient' (2021). Wikipedia. [Accessed: 12 July 2021]. Available from: https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient

3. 'Kendall等級相關係數' (2021). MBA智庫百科. [Accessed: 12 July 2021]. Available from: https://wiki.mbalib.com/zh-tw/Kendall%E7%AD%89%E7%BA%A7%E7%9B%B8%E5%85%B3%E7%B3%BB%E6%95%B0

4. 'Kendall's Rank Correlation' (2020). StatsDirect. [Accessed: 12 July 2021]. Available from: https://www.statsdirect.com/help/nonparametric_methods/kendall_correlation.htm

5. Stephanie (2016). 'Kendall's Tau (Kendall Rank Correlation Coefficient)'. Statistics How To. [Accessed: 12 July 2021]. Available from: https://www.statisticshowto.com/kendalls-tau/

6. Chen, Peter Y. and Popovich, Paula M. (2002). 'Correlation: Parametric and Nonparametric Measures'. Quantitative Applications in the Social Sciences. SAGE Publications. [Accessed: 15 July 2021]. Available from: https://rufiismada.files.wordpress.com/2012/02/correlation__parametric_and_nonparametric_measures__quantitative_applications_in_the_social_sciences_.pdf

7. 'SciPy 1.6.0 Release Notes' (2021). The SciPy community. [Accessed: 18 July 2021]. Available from: https://docs.scipy.org/doc/scipy/release.1.6.0.html

8. Mendoza, Jorge L. 'Contingency Tables'. The University of Oklahoma. [Accessed: 16 July 2021]. Available from: https://www.ou.edu/faculty/M/Jorge.L.Mendoza-1/psy5013/Contingency%20Tables.pdf

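About the author

施威銘研究室 is dedicated to developing books, maker kits, and teaching aids for the AI field, with the aim of cultivating more AI talent. It brings together people with a range of skills to develop maker products and promotes "learning by doing", in the hope of putting what is learned into practice.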

Written by 施威銘研究室
