
ML&PR_7: Cross Entropy, Kullback-Leibler divergence and Jensen-Shannon divergence


1. Cross entropy
  • Cross-entropy is a measure of the difference between two probability distributions $P$ and $Q$, denoted as:

    $$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$

  • Cross-entropy calculates the average number of bits required to represent or transmit an event from one distribution when it is encoded using another distribution. In the formula above, $P$ is the target distribution and $Q$ is the approximation of the target distribution.

  • By the Lagrange multiplier method (minimizing $H(P, Q)$ over $Q$ subject to $\sum_{x} Q(x) = 1$), we can prove that $H(P, Q)$ reaches its minimum value when $Q = P$, which means:

    $$\min_{Q} H(P, Q) = H(P, P) = -\sum_{x} P(x) \log P(x) = H(P)$$

    A small numerical check of this is sketched after the notation note below.

Note that in this blog, we denote:

  • entropy of $P$ as $H(P)$
  • entropy of $Q$ as $H(Q)$
  • cross entropy of $P$ and $Q$ as $H(P, Q)$.
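
The following is a minimal numerical sketch (Python with NumPy; illustrative code, not from the original post) of the cross-entropy formula and of the claim that $H(P, Q)$ is minimized at $Q = P$, where it equals $H(P)$. The helpers `entropy` and `cross_entropy` and the distributions `p` and `q` are arbitrary example choices.

```python
import numpy as np

def entropy(p):
    """Entropy H(P) = -sum_x P(x) log P(x), in nats."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """Cross entropy H(P, Q) = -sum_x P(x) log Q(x), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

p = np.array([0.1, 0.4, 0.5])   # target distribution P (illustrative values)
q = np.array([0.8, 0.1, 0.1])   # approximation Q of P

print(cross_entropy(p, q))      # H(P, Q): large, since Q fits P poorly
print(cross_entropy(p, p))      # H(P, P): the minimum over Q ...
print(entropy(p))               # ... which equals the entropy H(P)
```

Any `q` different from `p` gives a strictly larger cross-entropy, which is exactly the minimum-at-$Q = P$ statement above. Using `np.log2` instead of `np.log` reports the same quantities in bits rather than nats.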
2. Kullback-Leibler divergence
  • KL divergence, or relative entropy, is a method to measure the dissimilarity of two probability distributions $P$ and $Q$, denoted by $D_{KL}(P \| Q)$. It is defined as:

    $$D_{KL}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$

    In the discrete domain, KL divergence is written as:

    $$D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

    Unlike cross-entropy, KL divergence is the average number of extra bits needed to encode the data when we use the distribution $Q$ instead of the true distribution $P$.

  • We can rewrite $D_{KL}(P \| Q)$ as:

    $$D_{KL}(P \| Q) = \sum_{x} P(x) \log P(x) - \sum_{x} P(x) \log Q(x) = H(P, Q) - H(P)$$
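
As a quick numerical sanity check of this rewrite (again an illustrative Python/NumPy sketch, not code from the original post), the direct sum $\sum_{x} P(x) \log \frac{P(x)}{Q(x)}$ and the form $H(P, Q) - H(P)$ agree:

```python
import numpy as np

def entropy(p):
    """H(P) = -sum_x P(x) log P(x) (same helper as in the previous sketch)."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x)."""
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)), in nats."""
    return np.sum(p * np.log(p / q))

p = np.array([0.1, 0.4, 0.5])   # true distribution P
q = np.array([0.8, 0.1, 0.1])   # approximation Q

print(kl_divergence(p, q))                 # direct definition
print(cross_entropy(p, q) - entropy(p))    # rewrite H(P, Q) - H(P): same value
```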

Theorem 1: $D_{KL}(P \| Q) \geq 0$, with equality iff $P = Q$.

Proof:

To prove this, we use Jensen's inequality for a convex function $f$:

$$f\left(\sum_{i} \lambda_i x_i\right) \leq \sum_{i} \lambda_i f(x_i)$$

where $\lambda_i \geq 0$ and $\sum_{i} \lambda_i = 1$.

We have, applying this inequality to the convex function $f(t) = -\log t$ with weights $\lambda_x = P(x)$ and points $t_x = \frac{Q(x)}{P(x)}$:

$$D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} = \sum_{x} P(x) \left(-\log \frac{Q(x)}{P(x)}\right) \geq -\log\left(\sum_{x} P(x) \frac{Q(x)}{P(x)}\right) = -\log \sum_{x} Q(x) = -\log 1 = 0$$

Since $-\log t$ is strictly convex, equality holds iff $\frac{Q(x)}{P(x)}$ is the same for all $x$, which together with $\sum_{x} P(x) = \sum_{x} Q(x) = 1$ gives $P = Q$.
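
A small randomized check (an illustrative Python/NumPy sketch, not from the original post, and not a substitute for the proof) makes Theorem 1 concrete: the KL divergence between random distributions never drops below zero, and it is exactly zero when the two distributions coincide.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)), in nats."""
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(0)
for _ in range(10_000):
    # Draw two random distributions over 5 outcomes from a Dirichlet prior
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    assert kl_divergence(p, q) >= -1e-12   # Theorem 1: never negative (tolerance for rounding)

print(kl_divergence(p, p))                 # equality case P = Q: exactly 0.0
```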

2.1. Jensen-Shannon divergence
  • We can see that both cross-entropy and KL divergence are asymmetric, so neither of them can be used as a symmetric distance measure between two distributions. The JS divergence measure is therefore built on top of KL divergence:

    $$D_{JS}(P \| Q) = \frac{1}{2} D_{KL}\left(P \,\Big\|\, \frac{P + Q}{2}\right) + \frac{1}{2} D_{KL}\left(Q \,\Big\|\, \frac{P + Q}{2}\right)$$

  • JS divergence calculates the dissimilarity of two distributions $P$ and $Q$ by averaging the dissimilarity of $P$ vs. $\frac{P + Q}{2}$ and of $Q$ vs. $\frac{P + Q}{2}$. The better $P$ and $Q$ match, the smaller the KL divergences of both $P$ and $Q$ with $\frac{P + Q}{2}$ become, approaching 0.

  • Using the identity $D_{KL}(P \| Q) = H(P, Q) - H(P)$ from Section 2, we can rewrite $D_{JS}(P \| Q)$ as:

    $$D_{JS}(P \| Q) = \frac{1}{2} \left[ H\left(P, \frac{P + Q}{2}\right) - H(P) \right] + \frac{1}{2} \left[ H\left(Q, \frac{P + Q}{2}\right) - H(Q) \right]$$

    This rewrite is checked numerically in the sketch after the upper-bound proof below.

  • Next, we prove that JS divergence has an upper limit. We start with:

    $$D_{KL}\left(P \,\Big\|\, \frac{P + Q}{2}\right) = \sum_{x} P(x) \log \frac{2 P(x)}{P(x) + Q(x)}$$

    We have

    $$\frac{P(x)}{P(x) + Q(x)} \leq 1 \quad \text{for all } x \quad (\text{since } Q(x) \geq 0),$$

    so

    $$D_{KL}\left(P \,\Big\|\, \frac{P + Q}{2}\right) \leq \sum_{x} P(x) \log 2 = \log 2.$$

    Equality holds when $Q(x) = 0$ for all values of $x$ with $P(x) > 0$.

    Similarly, we have:

    $$D_{KL}\left(Q \,\Big\|\, \frac{P + Q}{2}\right) \leq \log 2,$$

    with equality when $P(x) = 0$ for all values of $x$ with $Q(x) > 0$.

    Then

    $$D_{JS}(P \| Q) \leq \frac{1}{2} \log 2 + \frac{1}{2} \log 2 = \log 2,$$

    so the JS divergence is bounded above by $\log 2$, and the maximum is reached when $P$ and $Q$ have disjoint supports.
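
The sketch below (Python/NumPy; illustrative code, not from the original post) checks the three facts of this section numerically: the cross-entropy rewrite of $D_{JS}(P \| Q)$, its symmetry in $P$ and $Q$, and the $\log 2$ upper bound, which is attained by distributions with disjoint supports. Zero probabilities are handled with the convention $0 \log 0 = 0$.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x)/Q(x)), with the convention 0 log 0 = 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def entropy(p):
    """H(P), skipping zero-probability outcomes."""
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]))

def cross_entropy(p, q):
    """H(P, Q), skipping outcomes with P(x) = 0."""
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def js_divergence(p, q):
    """D_JS(P || Q) = 0.5 * D_KL(P || M) + 0.5 * D_KL(Q || M), with M = (P + Q) / 2."""
    m = (p + q) / 2
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = np.array([0.1, 0.4, 0.5, 0.0])
q = np.array([0.8, 0.1, 0.1, 0.0])
m = (p + q) / 2

# The cross-entropy rewrite gives the same value as the KL-based definition
print(js_divergence(p, q))
print(0.5 * (cross_entropy(p, m) - entropy(p)) + 0.5 * (cross_entropy(q, m) - entropy(q)))

# Symmetry: D_JS(P || Q) == D_JS(Q || P), unlike KL divergence
print(js_divergence(q, p))

# Upper bound: disjoint supports attain log 2 ≈ 0.6931
p_far = np.array([0.5, 0.5, 0.0, 0.0])
q_far = np.array([0.0, 0.0, 0.5, 0.5])
print(js_divergence(p_far, q_far), np.log(2))
```

With base-2 logarithms the same bound reads $D_{JS}(P \| Q) \leq 1$ bit.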
