UML_6: Learning via Uniform Convergence
Recall that in previous posts we discussed the realizable assumption and ERM learning. We hope that a hypothesis $h$, when minimizing the empirical error on $S$, also has a small error with respect to $\mathcal{D}$. In other words, we need the empirical risks of all members of $\mathcal{H}$ to be good approximations of their true risks.

Def 1 ($\epsilon$-representative sample): A training set $S$ is called $\epsilon$-representative (with respect to $\mathcal{H}$, the loss function $\ell$, and the distribution $\mathcal{D}$) if
$$\forall h \in \mathcal{H}, \quad |L_S(h) - L_{\mathcal{D}}(h)| \le \epsilon.$$

Lemma 1: Assume that a training set $S$ is $\frac{\epsilon}{2}$-representative. Then any $h_S \in \operatorname{argmin}_{h \in \mathcal{H}} L_S(h)$ satisfies
$$L_{\mathcal{D}}(h_S) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon.$$

- This lemma implies that to show the ERM rule is an agnostic PAC learner, it suffices to show that with probability of at least $1 - \delta$ over the random choice of a training set $S$, it will be an $\frac{\epsilon}{2}$-representative training set.
- The proof is in the Appendix; a small simulation of the lemma is sketched just below.
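The following is a minimal sketch (not part of the original notes; the class size, sample size, and risk values are arbitrary illustrative choices) that checks Lemma 1 empirically: whenever a sampled training set happens to be $\frac{\epsilon}{2}$-representative, the ERM hypothesis is within $\epsilon$ of the best true risk.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, m, k = 0.2, 50, 5                                    # illustrative accuracy, sample size, |H|

violations = 0
for _ in range(5000):
    true_risk = rng.uniform(0.1, 0.9, size=k)             # L_D(h) for each of the k hypotheses
    losses = rng.binomial(1, true_risk, size=(m, k))      # 0/1 loss of every hypothesis on every example
    emp_risk = losses.mean(axis=0)                         # L_S(h)
    if np.max(np.abs(emp_risk - true_risk)) <= eps / 2:    # S is (eps/2)-representative
        h_S = np.argmin(emp_risk)                          # ERM output
        if true_risk[h_S] > true_risk.min() + eps:
            violations += 1

print("violations of Lemma 1:", violations)                # always 0: the implication is deterministic
```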
Def 2 (Uniform Convergence): A hypothesis class $\mathcal{H}$ has the uniform convergence property if there exists a function $m_{\mathcal{H}}^{UC}: (0,1)^2 \to \mathbb{N}$ such that for every $\epsilon, \delta \in (0,1)$ and for every probability distribution $\mathcal{D}$, if $S$ is a sample of $m \ge m_{\mathcal{H}}^{UC}(\epsilon, \delta)$ examples drawn i.i.d. according to $\mathcal{D}$, then, with probability of at least $1 - \delta$, $S$ is $\epsilon$-representative.

- The function $m_{\mathcal{H}}^{UC}$ measures the minimal sample complexity of obtaining the uniform convergence property: how many examples we need to ensure that, with probability of at least $1 - \delta$, the sample would be $\epsilon$-representative.
- The term uniform here refers to having a fixed sample size that works for all members of $\mathcal{H}$ and over all possible probability distributions.
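To make the definition concrete, here is a small Monte Carlo sketch (not from the original notes; the data model and all parameter values are hypothetical) that estimates, for a toy finite class of threshold classifiers under the 0-1 loss, the probability that an i.i.d. sample $S$ is $\epsilon$-representative.

```python
import numpy as np

rng = np.random.default_rng(0)

thresholds = np.linspace(0.1, 0.9, 9)   # toy finite class: h_t(x) = 1[x >= t]
true_t, noise = 0.5, 0.1                # labels: y = 1[x >= 0.5], flipped with probability 0.1
eps = 0.1

def true_risks():
    # L_D(h_t) = P(h_t(x) != y), available in closed form for this toy model.
    return np.array([abs(t - true_t) * (1 - 2 * noise) + noise for t in thresholds])

def is_representative(m):
    x = rng.uniform(0, 1, m)
    y = (x >= true_t).astype(int) ^ (rng.uniform(0, 1, m) < noise).astype(int)
    emp = np.array([np.mean((x >= t).astype(int) != y) for t in thresholds])
    return np.all(np.abs(emp - true_risks()) <= eps)

for m in (20, 100, 500):
    prob = np.mean([is_representative(m) for _ in range(2000)])
    print(f"m = {m:4d}: estimated P[S is eps-representative] ~ {prob:.3f}")
```

As $m$ grows, the estimated probability approaches 1, which is exactly what the uniform convergence property asks for.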
Corollary 1: If a class $\mathcal{H}$ has the uniform convergence property with a function $m_{\mathcal{H}}^{UC}$, then the class is agnostically PAC learnable with sample complexity $m_{\mathcal{H}}(\epsilon, \delta) \le m_{\mathcal{H}}^{UC}(\epsilon/2, \delta)$. Furthermore, the $\mathrm{ERM}_{\mathcal{H}}$ paradigm is a successful agnostic PAC learner for $\mathcal{H}$.

In this section, we will show that uniform convergence holds if $\mathcal{H}$ is a finite hypothesis class.

- We need to find a sample size $m$ that guarantees that for any $\mathcal{D}$, with probability of at least $1 - \delta$ over the choice of $S = (z_1, \dots, z_m)$ sampled i.i.d. from $\mathcal{D}$, we have $|L_S(h) - L_{\mathcal{D}}(h)| \le \epsilon$ for all $h \in \mathcal{H}$. That is,
$$\mathcal{D}^m\big(\{S : \forall h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| \le \epsilon\}\big) \ge 1 - \delta.$$
- Equivalently, we need to show that
$$\mathcal{D}^m\big(\{S : \exists h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}\big) < \delta.$$
- Rewrite the left-hand side as
$$\{S : \exists h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\} = \bigcup_{h \in \mathcal{H}} \{S : |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}.$$
Applying the union bound (mentioned in UML_3), we obtain:
$$\mathcal{D}^m\big(\{S : \exists h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}\big) \le \sum_{h \in \mathcal{H}} \mathcal{D}^m\big(\{S : |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}\big). \tag{1}$$

Recall that $L_{\mathcal{D}}(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)]$ and $L_S(h) = \frac{1}{m}\sum_{i=1}^{m} \ell(h, z_i)$ (where $\ell$ is the loss function). Because each $z_i$ is sampled i.i.d. from $\mathcal{D}$, the empirical average $L_S(h)$ tends to $L_{\mathcal{D}}(h)$ as $m \to \infty$; this is the law of large numbers. To measure how the gap $|L_S(h) - L_{\mathcal{D}}(h)|$ depends on $\epsilon$, $\delta$, and the sample size $m$, we use Hoeffding's inequality.

Lemma 2 (Hoeffding's inequality): Let $\theta_1, \dots, \theta_m$ be a sequence of i.i.d. random variables and assume that for all $i$, $\mathbb{E}[\theta_i] = \mu$ and $\mathbb{P}[a \le \theta_i \le b] = 1$. Then, for any $\epsilon > 0$,
$$\mathbb{P}\left[\left|\frac{1}{m}\sum_{i=1}^{m} \theta_i - \mu\right| > \epsilon\right] \le 2\exp\left(\frac{-2m\epsilon^2}{(b-a)^2}\right).$$
The proof is in the Appendix.

From Lemma 2, applied with $\theta_i = \ell(h, z_i)$ (so $\mu = L_{\mathcal{D}}(h)$ and, for a loss bounded in $[0,1]$, $a = 0$ and $b = 1$), each term on the right-hand side of formula $(1)$ is at most $2\exp(-2m\epsilon^2)$, and summing over $h \in \mathcal{H}$ yields
$$\mathcal{D}^m\big(\{S : \exists h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}\big) \le 2|\mathcal{H}|\exp(-2m\epsilon^2),$$
which is at most $\delta$ whenever $m \ge \frac{\log(2|\mathcal{H}|/\delta)}{2\epsilon^2}$.
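Before stating the corollary, here is a small Monte Carlo sketch (not from the original notes; the true risk, sample size, and $\epsilon$ are arbitrary illustrative values) checking Hoeffding's inequality for a single hypothesis with a $[0,1]$-bounded loss: the empirical frequency of the event $|L_S(h) - L_{\mathcal{D}}(h)| > \epsilon$ stays below $2\exp(-2m\epsilon^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = 0.3          # true risk L_D(h) of a hypothetical fixed hypothesis h
m = 200           # sample size
eps = 0.1         # deviation of interest
trials = 20_000   # number of independent training sets S

# Each trial draws m i.i.d. 0/1 losses with mean mu and computes the empirical risk.
losses = rng.binomial(1, mu, size=(trials, m))
empirical_risks = losses.mean(axis=1)

empirical_tail = np.mean(np.abs(empirical_risks - mu) > eps)
hoeffding_bound = 2 * np.exp(-2 * m * eps**2)

print(f"empirical P[|L_S - L_D| > eps] ~ {empirical_tail:.4f}")
print(f"Hoeffding bound                = {hoeffding_bound:.4f}")
```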
Corollary 2: Let $\mathcal{H}$ be a finite hypothesis class and let $\ell: \mathcal{H} \times Z \to [0,1]$ be a loss function. Then $\mathcal{H}$ has the uniform convergence property with
$$m_{\mathcal{H}}^{UC}(\epsilon, \delta) \le \left\lceil \frac{\log(2|\mathcal{H}|/\delta)}{2\epsilon^2} \right\rceil,$$
and $\mathcal{H}$ is agnostically PAC learnable using the ERM algorithm with
$$m_{\mathcal{H}}(\epsilon, \delta) \le m_{\mathcal{H}}^{UC}(\epsilon/2, \delta) \le \left\lceil \frac{2\log(2|\mathcal{H}|/\delta)}{\epsilon^2} \right\rceil.$$
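The bounds in Corollary 2 are easy to evaluate. Below is a small helper (my own sketch; the class size, $\epsilon$, and $\delta$ are just illustrative numbers) that computes both sample complexity bounds for a hypothetical finite class.

```python
import math

def m_uc(H_size: int, eps: float, delta: float) -> int:
    """Uniform convergence bound: ceil(log(2|H|/delta) / (2 * eps^2))."""
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps**2))

def m_pac(H_size: int, eps: float, delta: float) -> int:
    """Agnostic PAC bound via Corollary 1: m_H(eps, delta) <= m_UC(eps/2, delta)."""
    return m_uc(H_size, eps / 2, delta)

if __name__ == "__main__":
    H_size, eps, delta = 1000, 0.05, 0.01
    print("m_UC(eps, delta) <=", m_uc(H_size, eps, delta))   # examples needed for eps-representativeness
    print("m_H(eps, delta)  <=", m_pac(H_size, eps, delta))  # examples needed for ERM to be eps-optimal
```

Note how the bound grows only logarithmically with $|\mathcal{H}|$ but quadratically in $1/\epsilon$.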
Appendix
(Lemma 1) Given that a training set $S$ is $\frac{\epsilon}{2}$-representative and $h_S \in \operatorname{argmin}_{h \in \mathcal{H}} L_S(h)$, we have $L_{\mathcal{D}}(h_S) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$. Proof: For every $h \in \mathcal{H}$,
$$L_{\mathcal{D}}(h_S) \le L_S(h_S) + \frac{\epsilon}{2} \le L_S(h) + \frac{\epsilon}{2} \le L_{\mathcal{D}}(h) + \frac{\epsilon}{2} + \frac{\epsilon}{2} = L_{\mathcal{D}}(h) + \epsilon.$$
The first and third inequalities use $\frac{\epsilon}{2}$-representativeness, and the second holds because $h_S$ minimizes $L_S$.
(Hoeffding's inequality) Let $\theta_1, \dots, \theta_m$ be a sequence of i.i.d. random variables. Assume $\mathbb{E}[\theta_i] = \mu$ and $\mathbb{P}[a \le \theta_i \le b] = 1$ for every $i$. For any $\epsilon > 0$:
$$\mathbb{P}\left[\left|\frac{1}{m}\sum_{i=1}^{m} \theta_i - \mu\right| > \epsilon\right] \le 2\exp\left(\frac{-2m\epsilon^2}{(b-a)^2}\right).$$
Proof: Denote $X_i = \theta_i - \mathbb{E}[\theta_i]$ and $\bar{X} = \frac{1}{m}\sum_{i=1}^{m} X_i$, so that $\bar{X} = \frac{1}{m}\sum_{i=1}^{m}\theta_i - \mu$ and each $X_i$ takes values in an interval of length $b - a$ with mean zero. For every $\lambda > 0$ and $\epsilon > 0$:
$$\mathbb{P}[\bar{X} \ge \epsilon] = \mathbb{P}[e^{\lambda \bar{X}} \ge e^{\lambda \epsilon}] \le e^{-\lambda \epsilon}\, \mathbb{E}[e^{\lambda \bar{X}}].$$
The second step uses Markov's inequality, stated below. Because of the i.i.d. property, we have:
$$\mathbb{E}[e^{\lambda \bar{X}}] = \mathbb{E}\left[\prod_{i=1}^{m} e^{\lambda X_i / m}\right] = \prod_{i=1}^{m} \mathbb{E}[e^{\lambda X_i / m}].$$
By Hoeffding's lemma, for every $i$:
$$\mathbb{E}[e^{\lambda X_i / m}] \le e^{\lambda^2 (b-a)^2 / (8m^2)}.$$
Combining the last three displays, we obtain
$$\mathbb{P}[\bar{X} \ge \epsilon] \le e^{-\lambda \epsilon} \prod_{i=1}^{m} e^{\lambda^2 (b-a)^2 / (8m^2)} = e^{-\lambda \epsilon + \lambda^2 (b-a)^2 / (8m)}.$$
Setting $\lambda = \frac{4m\epsilon}{(b-a)^2}$, this rewrites as $\mathbb{P}[\bar{X} \ge \epsilon] \le e^{-2m\epsilon^2/(b-a)^2}$. In a similar way, we can show that $\mathbb{P}[\bar{X} \le -\epsilon] \le e^{-2m\epsilon^2/(b-a)^2}$ for any $\epsilon > 0$. Therefore, we obtain
$$\mathbb{P}\left[\left|\frac{1}{m}\sum_{i=1}^{m}\theta_i - \mu\right| \ge \epsilon\right] \le 2e^{-2m\epsilon^2/(b-a)^2}.$$
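As a quick sanity check on the choice of $\lambda$ above (my addition, not part of the original proof), the following snippet verifies symbolically that the exponent $-\lambda\epsilon + \lambda^2(b-a)^2/(8m)$ is minimized at $\lambda = \frac{4m\epsilon}{(b-a)^2}$, where it equals $-\frac{2m\epsilon^2}{(b-a)^2}$.

```python
import sympy as sp

lam, eps, m, a, b = sp.symbols("lam eps m a b", positive=True)
exponent = -lam * eps + lam**2 * (b - a)**2 / (8 * m)

# Minimize over lambda by solving d(exponent)/d(lambda) = 0.
lam_star = sp.solve(sp.diff(exponent, lam), lam)[0]

# Both differences simplify to 0, confirming the optimal lambda and the resulting exponent.
print(sp.simplify(lam_star - 4 * m * eps / (b - a) ** 2))
print(sp.simplify(exponent.subs(lam, lam_star) + 2 * m * eps**2 / (b - a) ** 2))
```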
(Markov's inequality) For a nonnegative random variable $X$ and any $t > 0$:
$$\mathbb{P}[X \ge t] \le \frac{\mathbb{E}[X]}{t}.$$

(Hoeffding's lemma) Given a random variable $X$ with $\mathbb{E}[X] = 0$ and $\mathbb{P}[a \le X \le b] = 1$. For any $\lambda > 0$:
$$\mathbb{E}[e^{\lambda X}] \le e^{\lambda^2 (b-a)^2 / 8}.$$
Proof:
Because $f(x) = e^{\lambda x}$ is a convex function, for every $x \in [a, b]$:
$$e^{\lambda x} \le \frac{b - x}{b - a} e^{\lambda a} + \frac{x - a}{b - a} e^{\lambda b}.$$
Setting $x = X$ yields:
$$e^{\lambda X} \le \frac{b - X}{b - a} e^{\lambda a} + \frac{X - a}{b - a} e^{\lambda b}.$$
Take the expectation (recall $\mathbb{E}[X] = 0$):
$$\mathbb{E}[e^{\lambda X}] \le \frac{b}{b - a} e^{\lambda a} + \frac{-a}{b - a} e^{\lambda b}.$$
We need to prove
$$\frac{b}{b - a} e^{\lambda a} + \frac{-a}{b - a} e^{\lambda b} \le e^{\lambda^2 (b - a)^2 / 8}.$$
We first reformulate the left-hand side of this inequality. Set $h = \lambda(b - a)$ and $p = \frac{-a}{b - a}$, so that $\lambda a = -ph$, $\lambda b = (1 - p)h$, and
$$\frac{b}{b - a} e^{\lambda a} + \frac{-a}{b - a} e^{\lambda b} = (1 - p)e^{-ph} + pe^{(1 - p)h} = e^{L(h)}, \quad \text{where } L(h) = -ph + \log(1 - p + pe^{h}).$$
Since $\frac{\lambda^2(b - a)^2}{8} = \frac{h^2}{8}$, the problem turns into showing $L(h) \le \frac{h^2}{8}$. We have:
$$L(0) = 0, \qquad L'(0) = 0, \qquad L''(h) \le \frac{1}{4} \text{ for every } h$$
(because $L''(h) = \frac{(1 - p)\,pe^{h}}{(1 - p + pe^{h})^2} = u(1 - u) \le \frac{1}{4}$, with $u = \frac{pe^{h}}{1 - p + pe^{h}}$).
So, with Taylor's theorem, there exists $\xi \in [0, h]$ such that
$$L(h) = L(0) + hL'(0) + \frac{h^2}{2}L''(\xi) \le \frac{h^2}{8}.$$
And we are done.
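For a numerical cross-check of the key inequality $L(h) \le \frac{h^2}{8}$ (my addition, not part of the original proof), the snippet below evaluates the gap on a grid of $p \in (0, 1)$ and $h \ge 0$; it stays nonnegative up to floating-point error.

```python
import numpy as np

def L(h, p):
    # L(h) = -p*h + log(1 - p + p*exp(h)) from the proof of Hoeffding's lemma.
    return -p * h + np.log(1 - p + p * np.exp(h))

hs = np.linspace(0, 5, 501)
ps = np.linspace(0.01, 0.99, 99)
H, P = np.meshgrid(hs, ps)

gap = H**2 / 8 - L(H, P)
print("min of h^2/8 - L(h) over the grid:", gap.min())  # >= 0 (up to float rounding)
```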