ML&PR_4: Multinomial Variables | 2.2.
2.2. Multinomial Variables
Binary variables can only describe quantities that take one of two possible values. Now I will introduce the case of $K$ possible values. For convenience, we represent such variables by a $K$-dimensional vector $\mathbf{x}$ in which one element $x_k$ equals $1$ and all remaining elements equal $0$.
For instance, with $K = 6$, we represent an observation with $x_3 = 1$ as $\mathbf{x} = (0, 0, 1, 0, 0, 0)^{\mathrm{T}}$.
Obviously we have $\sum_{k=1}^{K} x_k = 1$.
If we denote the probability of $x_k = 1$ by the parameter $\mu_k$, the distribution of $\mathbf{x}$ is:

$$p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}$$

where $\boldsymbol{\mu} = (\mu_1, \dots, \mu_K)^{\mathrm{T}}$, $\mu_k \geq 0$, and $\sum_k \mu_k = 1$.
It means the distribution is normalized:

$$\sum_{\mathbf{x}} p(\mathbf{x} \mid \boldsymbol{\mu}) = \sum_{k=1}^{K} \mu_k = 1$$

and its mean is:

$$\mathbb{E}[\mathbf{x} \mid \boldsymbol{\mu}] = \sum_{\mathbf{x}} p(\mathbf{x} \mid \boldsymbol{\mu})\, \mathbf{x} = (\mu_1, \dots, \mu_K)^{\mathrm{T}} = \boldsymbol{\mu}$$
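To make this concrete, here is a minimal numpy sketch (the value of $\boldsymbol{\mu}$ and the helper name `p_x_given_mu` are my own illustrative choices, not from the text). It evaluates $p(\mathbf{x} \mid \boldsymbol{\mu})$ for every possible one-hot vector and checks the normalization and the mean derived above.

```python
import numpy as np

mu = np.array([0.1, 0.2, 0.3, 0.4])          # assumed example parameters, sum to 1
K = len(mu)

def p_x_given_mu(x, mu):
    """p(x | mu) = prod_k mu_k**x_k for a one-hot vector x."""
    return np.prod(mu ** x)

one_hot_vectors = np.eye(K)                  # all K possible 1-of-K vectors
probs = np.array([p_x_given_mu(x, mu) for x in one_hot_vectors])

print(probs.sum())                           # 1.0: the distribution is normalized
print(one_hot_vectors.T @ probs)             # equals mu: E[x | mu] = mu
```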
Now consider a data set $\mathcal{D} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ of $N$ independent observations. We can show that the likelihood function is:

$$p(\mathcal{D} \mid \boldsymbol{\mu}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mu_k^{x_{nk}} = \prod_{k=1}^{K} \mu_k^{\left(\sum_n x_{nk}\right)} = \prod_{k=1}^{K} \mu_k^{m_k}$$

where $m_k = \sum_{n=1}^{N} x_{nk}$ is the number of observations for which $x_k = 1$; these counts are the sufficient statistics for this distribution.
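The sufficient statistics $m_k$ are simple counts, as the following sketch illustrates (the sample size, seed, and variable names are assumed for illustration). It draws $N$ one-hot observations, forms $m_k = \sum_n x_{nk}$, and evaluates the log-likelihood $\sum_k m_k \ln \mu_k$.

```python
import numpy as np

rng = np.random.default_rng(0)               # assumed seed for reproducibility
mu = np.array([0.1, 0.2, 0.3, 0.4])          # assumed true parameters
N = 1000

# N draws of a 1-of-K variable, stored as one-hot rows x_{nk}
data = rng.multinomial(1, mu, size=N)

m = data.sum(axis=0)                         # sufficient statistics m_k = sum_n x_{nk}
log_lik = np.sum(m * np.log(mu))             # log p(D | mu) = sum_k m_k log mu_k
print(m, log_lik)
```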
We can also consider the joint distribution of the quantities $m_1, \dots, m_K$, conditioned on the parameters $\boldsymbol{\mu}$ and the total number $N$ of observations:

$$\mathrm{Mult}(m_1, \dots, m_K \mid \boldsymbol{\mu}, N) = \binom{N}{m_1\, m_2 \cdots m_K} \prod_{k=1}^{K} \mu_k^{m_k}$$

This formula is known as the multinomial distribution, where the normalization coefficient is the number of ways of partitioning $N$ objects into $K$ groups of sizes $m_1, \dots, m_K$:

$$\binom{N}{m_1\, m_2 \cdots m_K} = \frac{N!}{m_1!\, m_2! \cdots m_K!}$$

and note that the variables $m_k$ are subject to the constraint:

$$\sum_{k=1}^{K} m_k = N$$
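As a sanity check on the formula (the counts used here are an assumed example), the sketch below evaluates the multinomial probability both directly from the coefficient $N!/(m_1! \cdots m_K!)$ and via `scipy.stats.multinomial`.

```python
import numpy as np
from math import factorial
from scipy.stats import multinomial

mu = np.array([0.1, 0.2, 0.3, 0.4])          # assumed parameters
m = np.array([2, 1, 3, 4])                   # assumed example counts, sum_k m_k = N
N = int(m.sum())

coef = factorial(N) / np.prod([factorial(int(mk)) for mk in m])   # N! / (m_1! ... m_K!)
direct = coef * np.prod(mu ** m)                                  # Mult(m | mu, N) from the formula
via_scipy = multinomial.pmf(m, n=N, p=mu)                         # same quantity via scipy

print(direct, via_scipy)                     # the two values agree
```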
2.2.1. The Dirichlet distribution
We now introduce a prior distribution for the parameters $\boldsymbol{\mu}$ of the multinomial distribution. By inspection of the form of the multinomial distribution, we see that the conjugate prior is given by:

$$p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$$

where $0 \leq \mu_k \leq 1$, $\sum_k \mu_k = 1$, and $\boldsymbol{\alpha}$ denotes $(\alpha_1, \dots, \alpha_K)^{\mathrm{T}}$.
We can normalize this prior to obtain:

$$\mathrm{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$$

where $\Gamma(\cdot)$ is the gamma function and $\alpha_0 = \sum_{k=1}^{K} \alpha_k$. This is called the Dirichlet distribution.
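A short sketch of working with this prior in practice (the concentration values $\boldsymbol{\alpha}$ are an assumed example): it draws samples from $\mathrm{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha})$ and evaluates the density with `scipy.stats.dirichlet`, confirming that samples lie on the probability simplex.

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 4.0, 5.0])       # assumed concentration parameters

samples = dirichlet.rvs(alpha, size=5, random_state=0)
print(samples.sum(axis=1))                   # each sample sums to 1 (lies on the simplex)

mu = np.array([0.1, 0.2, 0.3, 0.4])          # a point on the simplex
print(dirichlet.pdf(mu, alpha))              # density Dir(mu | alpha) at that point
```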
Multiplying the prior by the likelihood function $p(\mathcal{D} \mid \boldsymbol{\mu})$, we obtain the posterior distribution for $\boldsymbol{\mu}$:

$$p(\boldsymbol{\mu} \mid \mathcal{D}, \boldsymbol{\alpha}) \propto p(\mathcal{D} \mid \boldsymbol{\mu})\, p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k + m_k - 1}$$
Normalizing this expression, we obtain another Dirichlet distribution:

$$p(\boldsymbol{\mu} \mid \mathcal{D}, \boldsymbol{\alpha}) = \mathrm{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha} + \mathbf{m}) = \frac{\Gamma(\alpha_0 + N)}{\Gamma(\alpha_1 + m_1) \cdots \Gamma(\alpha_K + m_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k + m_k - 1}$$

where $\mathbf{m} = (m_1, \dots, m_K)^{\mathrm{T}}$.
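Conjugacy makes the Bayesian update trivial in code, as this minimal sketch shows (all variable names and values are illustrative assumptions): the posterior parameters are just $\alpha_k + m_k$, and the posterior mean of $\mu_k$ is $(\alpha_k + m_k)/(\alpha_0 + N)$.

```python
import numpy as np

alpha = np.array([2.0, 3.0, 4.0, 5.0])           # prior Dir(mu | alpha), assumed values
m = np.array([10, 25, 30, 35])                   # observed counts m_k, assumed values

alpha_post = alpha + m                           # posterior is Dir(mu | alpha + m)
posterior_mean = alpha_post / alpha_post.sum()   # E[mu_k | D] = (alpha_k + m_k) / (alpha_0 + N)

print(alpha_post)
print(posterior_mean)
```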
Reference:
- Section 2.2 | Pattern Recognition and Machine Learning | C. M. Bishop.