
ML&PR_4: Multinomial Variables | 2.2.

08/05/2020 23:12

2.2. Multinomial Variables

  • Binary variables can only describe quantities that take one of two possible values. Now I will introduce the case of a discrete variable that can take one of $K$ possible values. For convenience, we represent it by a $K$-dimensional vector $\mathbf{x}$ in which one element $x_k$ equals $1$ and all remaining elements equal $0$. For instance, with $K = 6$ we can represent $x_3 = 1$ as:
    $$\mathbf{x} = (0, 0, 1, 0, 0, 0)^{\mathrm{T}}$$
    Obviously we have $\sum_{k=1}^{K} x_k = 1$.
  • If we denote the probability of $x_k = 1$ by the parameter $\mu_k$, the distribution of $\mathbf{x}$ is:
    $$p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}$$
    where $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)^{\mathrm{T}}$, $\mu_k \geq 0$ and $\sum_k \mu_k = 1$. It means:
    $$\sum_{\mathbf{x}} p(\mathbf{x} \mid \boldsymbol{\mu}) = \sum_{k=1}^{K} \mu_k = 1$$
    and:
    $$\mathbb{E}[\mathbf{x} \mid \boldsymbol{\mu}] = \sum_{\mathbf{x}} p(\mathbf{x} \mid \boldsymbol{\mu})\, \mathbf{x} = \boldsymbol{\mu}$$
  • Now consider a data set $\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ of $N$ independent observations; we can show that the likelihood is:
    $$p(\mathcal{D} \mid \boldsymbol{\mu}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mu_k^{x_{nk}} = \prod_{k=1}^{K} \mu_k^{m_k}$$
    where $m_k = \sum_{n} x_{nk}$.
  • We can also consider the joint distribution of the quantities $m_1, \ldots, m_K$, conditioned on $\boldsymbol{\mu}$ and on the total number $N$ of observations:
    $$\mathrm{Mult}(m_1, \ldots, m_K \mid \boldsymbol{\mu}, N) = \binom{N}{m_1\, m_2 \ldots m_K} \prod_{k=1}^{K} \mu_k^{m_k}$$
    This is known as the multinomial distribution, where:
    $$\binom{N}{m_1\, m_2 \ldots m_K} = \frac{N!}{m_1!\, m_2! \cdots m_K!}$$
    and note that:
    $$\sum_{k=1}^{K} m_k = N$$
    A quick numerical check of these formulas is sketched below.
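
A minimal Python sketch to check the formulas above numerically; the values of $\boldsymbol{\mu}$ (with $K = 6$), the example vector $\mathbf{x}$, and $N$ are illustrative assumptions, not taken from the text:

```python
import math
from itertools import product

mu = [0.1, 0.2, 0.3, 0.1, 0.2, 0.1]            # assumed: mu_k >= 0, sum_k mu_k = 1
K = len(mu)

def p_x(x, mu):
    """p(x | mu) = prod_k mu_k^{x_k} for a one-hot (1-of-K) vector x."""
    return math.prod(muk ** xk for muk, xk in zip(mu, x))

x = [0, 0, 1, 0, 0, 0]                          # the example vector with x_3 = 1
print(p_x(x, mu))                               # equals mu_3 = 0.3

# Summing over all K one-hot vectors gives sum_k mu_k = 1.
one_hots = [[int(j == k) for j in range(K)] for k in range(K)]
print(sum(p_x(v, mu) for v in one_hots))        # ~1.0

def mult_pmf(m, N, mu):
    """Mult(m_1, ..., m_K | mu, N) = N! / (m_1! ... m_K!) * prod_k mu_k^{m_k}."""
    coef = math.factorial(N)
    for mk in m:
        coef //= math.factorial(mk)             # exact: the coefficient is an integer
    return coef * math.prod(muk ** mk for muk, mk in zip(mu, m))

# The pmf sums to 1 over all count vectors (m_1, ..., m_K) with sum_k m_k = N.
N = 4
print(sum(mult_pmf(m, N, mu)
          for m in product(range(N + 1), repeat=K) if sum(m) == N))   # ~1.0
```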

ML&PR_3: Binary Variables | 2.1.

05/05/2020 14:41

2.1. Binary variables

  • Imagine flipping a coin; the outcome is either 'heads' or 'tails'. We represent it by a variable $x \in \{0, 1\}$, with $x = 1$ representing 'heads' and $x = 0$ representing 'tails'. We call $x$ a binary variable.
  • The probability of $x = 1$ will be denoted by the parameter $\mu$, where $0 \leq \mu \leq 1$:
    $$p(x = 1 \mid \mu) = \mu$$
    and $p(x = 0 \mid \mu) = 1 - \mu$. The probability distribution of $x$ is the Bernoulli distribution:
    $$\mathrm{Bern}(x \mid \mu) = \mu^{x} (1 - \mu)^{1 - x}$$
    And we have:
    $$\mathbb{E}[x] = \mu, \qquad \mathrm{var}[x] = \mu(1 - \mu)$$
  • In general, suppose we flip the coin $N$ times and $x = 1$ appears $m$ times. The distribution of $m$ is called the Binomial distribution:
    $$\mathrm{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^{m} (1 - \mu)^{N - m}$$
    where
    $$\binom{N}{m} = \frac{N!}{(N - m)!\, m!}$$
    And:
    $$\mathbb{E}[m] = N\mu, \qquad \mathrm{var}[m] = N\mu(1 - \mu)$$
    A small numerical check is sketched below.
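
A minimal Python sketch checking the Bernoulli and Binomial formulas above; $\mu = 0.3$ and $N = 10$ are illustrative assumptions:

```python
import math

mu, N = 0.3, 10                                  # assumed illustrative values

def bern(x, mu):
    """Bern(x | mu) = mu^x * (1 - mu)^(1 - x) for x in {0, 1}."""
    return mu ** x * (1 - mu) ** (1 - x)

print(bern(1, mu), bern(0, mu))                  # mu and 1 - mu

def binom(m, N, mu):
    """Bin(m | N, mu) = C(N, m) * mu^m * (1 - mu)^(N - m)."""
    return math.comb(N, m) * mu ** m * (1 - mu) ** (N - m)

pmf = [binom(m, N, mu) for m in range(N + 1)]
mean = sum(m * p for m, p in enumerate(pmf))
var = sum((m - mean) ** 2 * p for m, p in enumerate(pmf))
print(sum(pmf))                                  # ~1.0
print(mean, N * mu)                              # E[m] = N * mu
print(var, N * mu * (1 - mu))                    # var[m] = N * mu * (1 - mu)
```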

UML_5: Agnostic PAC Learning | 3.2.1.

28/04/2020 01:36

3.2.1. Agnostic PAC Learning

  • In previous articles of the UML series, we saw that the realizability assumption requires that there exists $h^{\star} \in \mathcal{H}$ such that $L_{\mathcal{D}, f}(h^{\star}) = 0$. In practice, however, labels are not completely determined by the features. We therefore relax the realizability assumption by replacing the "target labeling function" with something more flexible: a data-labels generating distribution.
  • We redefine $\mathcal{D}$ to be a probability distribution over $\mathcal{X} \times \mathcal{Y}$. So, $\mathcal{D}$ consists of two parts: a marginal distribution $\mathcal{D}_x$ over unlabeled domain points and a conditional probability $\mathcal{D}((x, y) \mid x)$ over labels for each domain point.
  • True Error Revised:
    $$L_{\mathcal{D}}(h) \stackrel{\mathrm{def}}{=} \underset{(x, y) \sim \mathcal{D}}{\mathbb{P}}[h(x) \neq y]$$
  • Goal: We wish to find some hypothesis $h : \mathcal{X} \to \mathcal{Y}$ that minimizes $L_{\mathcal{D}}(h)$.
  • The Bayes Optimal Predictor:
    • Given any probability distribution $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$, the best label predicting function from $\mathcal{X}$ to $\{0, 1\}$ is:
      $$f_{\mathcal{D}}(x) = \begin{cases} 1 & \text{if } \mathbb{P}[y = 1 \mid x] \geq 1/2 \\ 0 & \text{otherwise} \end{cases}$$
      A toy sketch of this predictor and the revised true error is given below.
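
A minimal Python sketch of the Bayes optimal predictor and the revised true error, assuming a tiny hand-picked distribution over three domain points (all probabilities below are illustrative assumptions):

```python
from itertools import product

# Toy distribution D over X x {0, 1}: p_x is the marginal D_x over three
# domain points, and eta[x] = P[y = 1 | x] is the conditional label probability.
p_x = {'a': 0.5, 'b': 0.3, 'c': 0.2}
eta = {'a': 0.9, 'b': 0.4, 'c': 0.5}

def bayes_predictor(x):
    """f_D(x) = 1 exactly when P[y = 1 | x] >= 1/2, else 0."""
    return 1 if eta[x] >= 0.5 else 0

def true_error(h):
    """L_D(h) = P_{(x, y) ~ D}[h(x) != y], computed exactly on the toy D."""
    return sum(p_x[x] * ((1 - eta[x]) if h(x) == 1 else eta[x]) for x in p_x)

print(true_error(bayes_predictor))               # the Bayes error

# No deterministic predictor does better: enumerate all 2^3 labelings of X.
best = min(true_error(lambda x, lbl=dict(zip(p_x, labels)): lbl[x])
           for labels in product([0, 1], repeat=len(p_x)))
print(best)                                      # equals the Bayes error above
```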

UML_4: PAC Learning | 3.1.

26/04/2020 23:23

3.1. PAC learning

  • In UML_3, we have shown that for a finite hypothesis class $\mathcal{H}$ and a sufficiently large training sample $S$ (drawn i.i.d. from the distribution $\mathcal{D}$ and labeled by the target function $f$), the output of $\mathrm{ERM}_{\mathcal{H}}(S)$ will be probably approximately correct. Now, we define Probably Approximately Correct (PAC) learning.
  • Definition (PAC learnability): A hypothesis class $\mathcal{H}$ is PAC learnable if there exist a function $m_{\mathcal{H}} : (0, 1)^{2} \to \mathbb{N}$ and a learning algorithm with the following property:
    • For every $\epsilon, \delta \in (0, 1)$, every distribution $\mathcal{D}$ over $\mathcal{X}$, and every labeling function $f : \mathcal{X} \to \{0, 1\}$, if the realizability assumption (introduced in UML_3) holds with respect to $\mathcal{H}, \mathcal{D}, f$, then when running the learning algorithm on $m \geq m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$ and labeled by $f$, the algorithm returns a hypothesis $h$ such that, with probability of at least $1 - \delta$ (over the choice of the examples), $L_{\mathcal{D}, f}(h) \leq \epsilon$. A small simulation of this guarantee for a finite class is sketched below.
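
A minimal Python simulation of the PAC guarantee for a finite hypothesis class, assuming a toy threshold class on a uniform domain and the finite-class sample-complexity bound $m_{\mathcal{H}}(\epsilon, \delta) = \lceil \ln(|\mathcal{H}| / \delta) / \epsilon \rceil$ from UML_3; every concrete value here is an illustrative assumption:

```python
import math, random

# Toy realizable setup: X = {0, ..., 99} with uniform D, H = threshold
# functions h_t(x) = 1[x >= t], and a target f that belongs to H.
domain = list(range(100))
H = [lambda x, t=t: int(x >= t) for t in range(101)]
f = H[30]

def true_error(h):
    """L_{D, f}(h) under the uniform marginal distribution over the domain."""
    return sum(h(x) != f(x) for x in domain) / len(domain)

def erm(sample):
    """Return some h in H minimizing the empirical error on the sample."""
    return min(H, key=lambda h: sum(h(x) != y for x, y in sample))

eps, delta = 0.1, 0.05
m = math.ceil(math.log(len(H) / delta) / eps)    # m_H(eps, delta) for finite H

# Over many i.i.d. samples of size m, ERM should fail (true error > eps)
# with frequency well below delta.
random.seed(0)
trials = 200
failures = sum(
    true_error(erm([(x, f(x)) for x in random.choices(domain, k=m)])) > eps
    for _ in range(trials))
print(m, failures / trials)                      # failure rate should be < delta
```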

ML&PR_2: Principal Component Analysis: Maximum variance formulation | 12.1.1.

17/04/2020 21:52

12.1. Principal Component Analysis (PCA)

  • PCA is a technique that is widely used for dimensionality reduction, lossy data compression, feature extraction, and data visualization.