機器學習動手做Lesson 25 — Variation of Information Distance 是檢測集成分群效果的好方法
過去幾週我們介紹了 4 個集成(Ensemble)分群(Clustering)的方法(Majority Vote、Iterative Voting Consensus、Iterative Probabilistic Voting Consensus、Iterative Pairwise Consensus),以及 2 個評估集成分群效果的指標(Symmetric Distance、Rand Distance)。今天,我們要來介紹由一組團隊提出的評估集成分取的方法:Variation of Information Distance
一、Entropy
說明方法之前,我們要先介紹 Entropy。在分群問題中,某一個分群演算法的 Entropy 為「子群的資料數占全部資料數的比例」乘上「子群的資料數占全部資料數的比例的對數」,再加總,最後取負號。
二、Mutual Information
在集成分群問題中,我們可以計算集成與基學習器(Base Learner)的 Mutual Information。計算方式如下:
其中藍色是「集成的某個子群跟基學習器的某個子群交集的個數佔資料總數的比例」。
三、Variation of Information Distance
有了 Entropy 跟 Mutual Information 之後,就可以計算 Variation of Information Distance,公式如下:
其實就是集成與基學習器的 Mutual Information,除以「集成的 Entropy 乘上基學習器的 Entropy 再開根號」。
四、Python 程式
以下建立計算 Entropy 的函式。
def get_entropy(x, n_clusters):
p = np.zeros(n_clusters)
for i in range(n_clusters):
p[i] = (-1)*(x.count(i)/len(x))*np.log(x.count(i)/len(x))
return np.sum(p)
以下建立計算 Mutual Information 的函式。
def get_intersect_num(x, x_cluster, y, y_cluster):
x_index = [i for i in range(len(x)) if x[i] == x_cluster]
y_index = [i for i in range(len(y)) if y[i] == y_cluster]
return len(set(x_index) & set(y_index))def get_mutual_info(x, y, n_clusters):
p = np.zeros((n_clusters, n_clusters)) m = sys.float_info.min for i in range(n_clusters):
for j in range(n_clusters):
p_x_y = get_intersect_num(x, i, y, j) / len(x)
p_x = x.count(i) / len(x)
p_y = y.count(j) / len(y)
p[i][j] = p_x_y * np.log(m + p_x_y / (p_x * p_y))
return np.sum(p)
有了上述的函式後,就可以計算集成跟每個基學習器的 Variation of Information Distance。
def get_info_var_distance(x, n_clusters):
n_sample = len(x[0])
n_ensemble = len(x)
distance = np.zeros(n_ensemble)
for i in range(1, n_ensemble):
mutual_info = get_mutual_info(x[0], x[i], n_clusters)
learner_entropy = get_entropy(x[i], n_clusters)
ensemble_entropy = get_entropy(x[0], n_clusters)
distance[i-1] = mutual_info/
np.sqrt(learner_entropy * ensemble_entropy)
return np.sum(distance) / n_ensemble
五、參考資料
Nguyen, Nam & Caruana, Rich. (2007). Consensus Clusterings. Proceedings — IEEE International Conference on Data Mining, ICDM. 607–612.
六、關於作者
Chia-Hao Li received the M.S. degree in computer science from Durham University, United Kingdom. He engages in computer algorithm, machine learning, and hardware/software codesign. He was former senior engineer in Mediatek, Taiwan. His currently research topic is the application of machine learning techniques for fault detection in the high-performance computing systems.