ML&PR_7: Cross Entropy, Kullback-Leibler divergence and Jensen-Shannon divergence
1. Cross entropy
Cross-entropy is a measure of the difference between two probability distributions $p$ and $q$, denoted as:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
Cross-entropy calculates the average number of bits required to represent or transmit an event from one distribution compared to another distribution. In the formula above, $p$ is the target distribution and $q$ is the approximation of the target distribution.
By the Lagrange multiplier method, we can prove that $H(p, q)$ reaches its minimum value when $q = p$, which means:

$$\min_{q} H(p, q) = H(p, p) = H(p)$$
Note that in this blog, we denote:
- the entropy of $p$ as $H(p)$
- the entropy of $q$ as $H(q)$
- the cross-entropy of $p$ and $q$ as $H(p, q)$.
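To make this concrete, here is a minimal NumPy sketch (the distributions `p` and `q` are made-up example values) that computes $H(p, q)$ for two discrete distributions and checks that it is smallest when $q = p$:

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

p = np.array([0.1, 0.4, 0.5])  # target distribution (example values)
q = np.array([0.3, 0.3, 0.4])  # approximation of the target

print(cross_entropy(p, q))  # H(p, q), larger than H(p)
print(cross_entropy(p, p))  # H(p, p) = H(p), the minimum over all q
```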
2. Kullback-Leibler divergence
KL divergence, or relative entropy, is a method to measure the dissimilarity of two probability distributions, denoted by $p$ and $q$. It is defined as:

$$D_{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$
In the discrete domain, KL divergence is written as:

$$D_{KL}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$
Different from cross-entropy, KL divergence is the average number of extra bits needed to encode the data when we use distribution $q$ instead of the true distribution $p$.
We can rewrite:

$$D_{KL}(p \,\|\, q) = \sum_{x} p(x) \log p(x) - \sum_{x} p(x) \log q(x) = H(p, q) - H(p)$$
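This identity is easy to verify numerically. Here is a small sketch (made-up example distributions again, assuming all probabilities are strictly positive so the logarithms are defined):

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) * log p(x)."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x)."""
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log(p(x) / q(x))."""
    return np.sum(p * np.log(p / q))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

# D_KL(p || q) = H(p, q) - H(p) holds numerically:
print(np.isclose(kl_divergence(p, q), cross_entropy(p, q) - entropy(p)))  # True
```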
Theorem 1: $D_{KL}(p \,\|\, q) \geq 0$, with equality iff $p = q$.
Proof:
To prove this, we use Jensen's inequality for a convex function $f$:

$$f\left(\sum_{i} \lambda_i x_i\right) \leq \sum_{i} \lambda_i f(x_i)$$

where $\lambda_i \geq 0$ and $\sum_{i} \lambda_i = 1$.

Applying this with the convex function $f(x) = -\log x$, weights $\lambda_x = p(x)$, and points $\frac{q(x)}{p(x)}$, we have:

$$D_{KL}(p \,\|\, q) = \sum_{x} p(x) \left(-\log \frac{q(x)}{p(x)}\right) \geq -\log\left(\sum_{x} p(x) \frac{q(x)}{p(x)}\right) = -\log \sum_{x} q(x) = -\log 1 = 0$$

Since $-\log$ is strictly convex, equality holds iff $\frac{q(x)}{p(x)}$ is constant for all $x$, i.e. iff $p = q$. $\blacksquare$
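As a sanity check (not a substitute for the proof), we can draw many random strictly positive distributions and confirm that the divergence is never negative and vanishes when $q = p$:

```python
import numpy as np

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.random(5) + 1e-9
    q = rng.random(5) + 1e-9
    p, q = p / p.sum(), q / q.sum()
    assert kl_divergence(p, q) >= 0  # Theorem 1

print(kl_divergence(p, p))  # 0.0, the equality case q = p
```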
2.1. Jensen-Shannon divergence
We can see that both cross-entropy and KL divergence are asymmetric, so neither of them can be used as a symmetric distance measure between two distributions. The JS divergence measure is therefore built on top of KL divergence:

$$D_{JS}(p \,\|\, q) = \frac{1}{2} D_{KL}\left(p \,\Big\|\, \frac{p + q}{2}\right) + \frac{1}{2} D_{KL}\left(q \,\Big\|\, \frac{p + q}{2}\right)$$
JS divergence calculates the dissimilarity of two distributions $p$ and $q$ through the dissimilarity of $p$ vs $m = \frac{p + q}{2}$ and of $q$ vs $m$. The more closely $p$ and $q$ match, the smaller the KL divergences of both $p$ and $q$ with $m$ become, each approaching 0.
From the definition of JS divergence and the identity $D_{KL}(p \,\|\, q) = H(p, q) - H(p)$, we can rewrite $D_{JS}(p \,\|\, q)$ as:

$$D_{JS}(p \,\|\, q) = \frac{1}{2} \Big[ H(p, m) + H(q, m) \Big] - \frac{1}{2} \Big[ H(p) + H(q) \Big], \qquad m = \frac{p + q}{2}$$
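Here is a sketch of JS divergence built directly from the definition (example distributions again); unlike KL divergence, it is symmetric in $p$ and $q$. For reference, SciPy's `scipy.spatial.distance.jensenshannon` returns the square root of this quantity.

```python
import numpy as np

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

def js_divergence(p, q):
    """D_JS(p || q) = 0.5 * D_KL(p || m) + 0.5 * D_KL(q || m), m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

print(np.isclose(js_divergence(p, q), js_divergence(q, p)))  # True: symmetric
```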
Next, we prove that JS divergence has an upper bound. We start with:

$$D_{KL}\left(p \,\Big\|\, \frac{p + q}{2}\right) = \sum_{x} p(x) \log \frac{2 p(x)}{p(x) + q(x)}$$
We have

$$\frac{2 p(x)}{p(x) + q(x)} \leq \frac{2 p(x)}{p(x)} = 2$$

so

$$D_{KL}\left(p \,\Big\|\, \frac{p + q}{2}\right) \leq \sum_{x} p(x) \log 2 = \log 2$$

Equality holds when $q(x) = 0$ for all values of $x$ with $p(x) > 0$.
Similarly, we have:

$$D_{KL}\left(q \,\Big\|\, \frac{p + q}{2}\right) \leq \log 2$$

Equality also holds when $p(x) = 0$ for all values of $x$ with $q(x) > 0$.
Then:

$$D_{JS}(p \,\|\, q) \leq \frac{1}{2} \log 2 + \frac{1}{2} \log 2 = \log 2$$
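A tiny sketch showing that the bound $\log 2$ is attained for distributions with disjoint supports (using the usual convention that $0 \log 0 = 0$):

```python
import numpy as np

def kl_divergence(p, q):
    mask = p > 0  # convention: terms with p(x) = 0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js_divergence(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = np.array([1.0, 0.0])  # disjoint supports
q = np.array([0.0, 1.0])

print(js_divergence(p, q), np.log(2))  # both equal log 2 ≈ 0.6931
```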
Reference:
- Section 2.8.2 | Machine Learning: A Probabilistic Perspective | K. P. Murphy.
- Jensen's inequality.
- Cross-entropy for machine learning | Machine Learning Mastery.
- Bài 44 - Model Wasserstein GAN (WGAN).