UML_5: Agnostic PAC Learning

3.2.1. Agnostic PAC Learning
  • In previous articles of the UML series, we saw that the realizability assumption requires that there exist $h^\star \in \mathcal{H}$ such that $L_{(\mathcal{D},f)}(h^\star) = 0$. In practice, however, labels do not depend completely on the features. We therefore relax the realizability assumption by replacing the "target labeling function" with something more flexible: a data-labels generating distribution.

  • We redefine $\mathcal{D}$ to be a probability distribution over $\mathcal{X} \times \mathcal{Y}$. So $\mathcal{D}$ consists of two parts: a marginal distribution $\mathcal{D}_x$ over unlabeled domain points and a conditional probability $\mathcal{D}((x,y) \mid x)$ over labels for each domain point, as worked through in the sketch below.
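To make the decomposition concrete, here is a minimal Python sketch; the toy distribution and all names in it are illustrative, not from the original text. A finite joint distribution over $\mathcal{X} \times \mathcal{Y}$ is split into its marginal over $\mathcal{X}$ and the conditional probability of the label given $x$:

```python
# A minimal sketch with an illustrative toy distribution (not from the
# original text): a joint distribution D over X x Y, stored as a dict,
# decomposed into its marginal D_x and the conditional P[y = 1 | x].
joint = {(0, 0): 0.10, (0, 1): 0.20,   # D(x, y) over X = {0,1,2}, Y = {0,1}
         (1, 0): 0.25, (1, 1): 0.05,
         (2, 0): 0.15, (2, 1): 0.25}

marginal = {}                          # D_x(x) = sum_y D(x, y)
for (x, y), p in joint.items():
    marginal[x] = marginal.get(x, 0.0) + p

p_y1 = {x: joint[(x, 1)] / marginal[x] for x in marginal}  # P[y = 1 | x]

print(marginal)   # {0: 0.30..., 1: 0.30..., 2: 0.40...}
print(p_y1)       # {0: 0.66..., 1: 0.16..., 2: 0.625}
```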

  • True Error, Revised:

    $$L_{\mathcal{D}}(h) \;\stackrel{\text{def}}{=}\; \underset{(x,y)\sim\mathcal{D}}{\mathbb{P}}\left[h(x) \neq y\right] \;\stackrel{\text{def}}{=}\; \mathcal{D}\left(\{(x,y) : h(x) \neq y\}\right)$$
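With the definition pinned down, a short self-contained sketch (reusing the same illustrative toy distribution; the hypothesis chosen is arbitrary) shows how $L_{\mathcal{D}}(h)$ can be computed exactly for a finite joint distribution:

```python
# A minimal sketch, assuming the toy joint distribution from the previous
# example; the hypothesis h below is arbitrary and purely illustrative.
joint = {(0, 0): 0.10, (0, 1): 0.20,   # D(x, y) over X = {0,1,2}, Y = {0,1}
         (1, 0): 0.25, (1, 1): 0.05,
         (2, 0): 0.15, (2, 1): 0.25}

def true_error(h, joint):
    """L_D(h) = sum of D(x, y) over all pairs (x, y) with h(x) != y."""
    return sum(p for (x, y), p in joint.items() if h(x) != y)

h = lambda x: 1 if x == 0 else 0       # some candidate hypothesis
print(true_error(h, joint))            # 0.10 + 0.05 + 0.25 = 0.40
```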

  • Goal: We wish to find some hypothesis $h : \mathcal{X} \to \mathcal{Y}$ that minimizes $L_{\mathcal{D}}(h)$.

  • The Bayes Optimal Predictor:

    • Given any probability distribution $\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$, the best label predicting function from $\mathcal{X}$ to $\{0,1\}$ is:

      $$f_{\mathcal{D}}(x) = \begin{cases} 1 & \text{if } \mathbb{P}[y = 1 \mid x] \geq 1/2 \\ 0 & \text{otherwise} \end{cases}$$

    We can show that $f_{\mathcal{D}}$ is optimal, meaning that for every classifier $g : \mathcal{X} \to \{0,1\}$, $L_{\mathcal{D}}(f_{\mathcal{D}}) \leq L_{\mathcal{D}}(g)$.

    Proof: Fix any $x \in \mathcal{X}$ and condition on it.

    Suppose $\mathbb{P}[y = 1 \mid x] = \alpha$, so we also have $\mathbb{P}[y = 0 \mid x] = 1 - \alpha$. If we choose $g(x) = 1$, the conditional probability of error is $1 - \alpha$. And if we choose $g(x) = 0$, the conditional probability of error is $\alpha$.

    We want the probability of error to be as small as possible. So if $\alpha \geq 1/2$, we choose $g(x) = 1$; similarly, in the opposite case we choose $g(x) = 0$. This choice is exactly $f_{\mathcal{D}}$, and since it minimizes the error conditioned on every $x$, taking the expectation over $x$ proves that $f_{\mathcal{D}}$ is optimal.
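The pointwise argument can also be checked numerically. The sketch below (same illustrative toy distribution as before) builds $f_{\mathcal{D}}$ from $\mathbb{P}[y = 1 \mid x]$ and exhaustively verifies that no deterministic predictor on this small domain achieves a lower true error:

```python
# A minimal sketch, reusing the illustrative toy distribution: construct the
# Bayes optimal predictor and verify its optimality by brute force.
from itertools import product

joint = {(0, 0): 0.10, (0, 1): 0.20,
         (1, 0): 0.25, (1, 1): 0.05,
         (2, 0): 0.15, (2, 1): 0.25}
xs = sorted({x for x, _ in joint})

def true_error(predict, joint):
    """L_D(h) for a predictor given as a dict x -> label."""
    return sum(p for (x, y), p in joint.items() if predict[x] != y)

marginal = {x: joint[(x, 0)] + joint[(x, 1)] for x in xs}
p_y1 = {x: joint[(x, 1)] / marginal[x] for x in xs}       # P[y = 1 | x]
f_bayes = {x: 1 if p_y1[x] >= 0.5 else 0 for x in xs}     # the Bayes predictor

# Enumerate all 2^|X| deterministic predictors g : X -> {0, 1}.
best = min(true_error(dict(zip(xs, bits)), joint)
           for bits in product([0, 1], repeat=len(xs)))
print(true_error(f_bayes, joint), best)   # both ~0.30: f_D attains the minimum
```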

  • Agnostic PAC Learning: A hypothesis class $\mathcal{H}$ is agnostic PAC learnable if there exist a function $m_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}$ and a learning algorithm with the following property:

    • For every $\epsilon, \delta \in (0,1)$ and every distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, when running the algorithm on $m \geq m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$, the algorithm returns a hypothesis $h$ such that, with probability of at least $1 - \delta$ (over the choice of the $m$ training examples):

      $$L_{\mathcal{D}}(h) \leq \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon$$

      If the realizability assumption holds, then $\min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') = 0$, so agnostic PAC learning generalizes the definition of PAC learning. If it does not hold, no learner can guarantee an arbitrarily small error; agnostic PAC learning can still declare success as long as its error is not much larger than the best error achievable within $\mathcal{H}$. This contrasts with PAC learning, in which the learner is required to achieve a small absolute error, not an error relative to the best in the class.
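As a rough numerical illustration of this guarantee, here is a simulation sketch; the noisy threshold distribution, the class of threshold classifiers, and all parameters are arbitrary assumptions, not from the original text. ERM's true error approaches the best error in the class as the sample grows:

```python
# A minimal sketch, assuming an illustrative non-realizable distribution:
# x ~ Uniform[0, 1), y = 1[x >= 0.5] flipped with probability 0.1, so the
# best achievable error within H is roughly the 10% noise rate.
import random

random.seed(0)

def sample(m):
    """Draw m i.i.d. examples from the toy distribution D."""
    data = []
    for _ in range(m):
        x = random.random()
        y = 1 if x >= 0.5 else 0
        if random.random() < 0.1:       # 10% label noise: not realizable
            y = 1 - y
        data.append((x, y))
    return data

# H: threshold classifiers h_t(x) = 1[x >= t] on a coarse grid of thresholds.
H = [lambda x, t=t: 1 if x >= t else 0 for t in [i / 10 for i in range(11)]]

def error(h, S):
    """Fraction of examples in S that h misclassifies."""
    return sum(h(x) != y for x, y in S) / len(S)

S = sample(1000)                            # training set
h_erm = min(H, key=lambda h: error(h, S))   # the ERM rule over H

T = sample(100_000)                     # large fresh sample to estimate L_D
print("L_D(ERM)  ~", error(h_erm, T))   # close to the best error in H
print("best in H ~", min(error(h, T) for h in H))   # ~0.10, the noise rate
```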

Reference:

  • Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.