UML_6: Learning via Uniform Convergence
Recall that in previous posts we discussed the realizable assumption and ERM learning. We hope that a hypothesis $h$, when minimizing the empirical error on $S$, also has a small error with respect to $\mathcal{D}$. In other words, we need the empirical risks of all members of $\mathcal{H}$ to be good approximations of their true risks.

Def 1 ($\epsilon$-representative sample): A training set $S$ is called $\epsilon$-representative (with respect to $\mathcal{H}$, the loss function $\ell$, and the distribution $\mathcal{D}$) if
$$\forall h \in \mathcal{H}, \quad |L_S(h) - L_{\mathcal{D}}(h)| \le \epsilon.$$

Lemma 1: Assume that a training set $S$ is $\frac{\epsilon}{2}$-representative. Then any $h_S \in \operatorname{argmin}_{h \in \mathcal{H}} L_S(h)$ satisfies
$$L_{\mathcal{D}}(h_S) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon.$$

- This lemma implies that to show the ERM rule is an agnostic PAC learner, it suffices to show that with probability of at least $1 - \delta$ over the random choice of a training set $S$, it will be an $\frac{\epsilon}{2}$-representative training set.
- The proof is in the Appendix; a small simulation of the lemma is sketched just below.
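The following is a minimal sketch (not part of the original notes; the class size, sample size, and risk values are arbitrary illustrative choices) that checks Lemma 1 empirically: whenever a sampled training set happens to be $\frac{\epsilon}{2}$-representative, the ERM hypothesis is within $\epsilon$ of the best true risk.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, m, k = 0.2, 50, 5                                    # illustrative accuracy, sample size, |H|

violations = 0
for _ in range(5000):
    true_risk = rng.uniform(0.1, 0.9, size=k)             # L_D(h) for each of the k hypotheses
    losses = rng.binomial(1, true_risk, size=(m, k))      # 0/1 loss of every hypothesis on every example
    emp_risk = losses.mean(axis=0)                         # L_S(h)
    if np.max(np.abs(emp_risk - true_risk)) <= eps / 2:    # S is (eps/2)-representative
        h_S = np.argmin(emp_risk)                          # ERM output
        if true_risk[h_S] > true_risk.min() + eps:
            violations += 1

print("violations of Lemma 1:", violations)                # always 0: the implication is deterministic
```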
Def 2 (Uniform Convergence): A hypothesis class $\mathcal{H}$ has the uniform convergence property if there exists a function $m_{\mathcal{H}}^{UC}: (0,1)^2 \to \mathbb{N}$ such that for every $\epsilon, \delta \in (0,1)$ and for every probability distribution $\mathcal{D}$, if $S$ is a sample of $m \ge m_{\mathcal{H}}^{UC}(\epsilon, \delta)$ examples drawn i.i.d. according to $\mathcal{D}$, then, with probability of at least $1 - \delta$, $S$ is $\epsilon$-representative.

- The function $m_{\mathcal{H}}^{UC}$ measures the minimal sample complexity of obtaining the uniform convergence property: how many examples we need to ensure that, with probability of at least $1 - \delta$, the sample would be $\epsilon$-representative.
- The term uniform here refers to having a fixed sample size that works for all members of $\mathcal{H}$ and over all possible probability distributions.
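To make the definition concrete, here is a small Monte Carlo sketch (not from the original notes; the data model and all parameter values are hypothetical) that estimates, for a toy finite class of threshold classifiers under the 0-1 loss, the probability that an i.i.d. sample $S$ is $\epsilon$-representative.

```python
import numpy as np

rng = np.random.default_rng(0)

thresholds = np.linspace(0.1, 0.9, 9)   # toy finite class: h_t(x) = 1[x >= t]
true_t, noise = 0.5, 0.1                # labels: y = 1[x >= 0.5], flipped with probability 0.1
eps = 0.1

def true_risks():
    # L_D(h_t) = P(h_t(x) != y), available in closed form for this toy model.
    return np.array([abs(t - true_t) * (1 - 2 * noise) + noise for t in thresholds])

def is_representative(m):
    x = rng.uniform(0, 1, m)
    y = (x >= true_t).astype(int) ^ (rng.uniform(0, 1, m) < noise).astype(int)
    emp = np.array([np.mean((x >= t).astype(int) != y) for t in thresholds])
    return np.all(np.abs(emp - true_risks()) <= eps)

for m in (20, 100, 500):
    prob = np.mean([is_representative(m) for _ in range(2000)])
    print(f"m = {m:4d}: estimated P[S is eps-representative] ~ {prob:.3f}")
```

As $m$ grows, the estimated probability approaches 1, which is exactly what the uniform convergence property asks for.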
Corollary 1: If a class $\mathcal{H}$ has the uniform convergence property with a function $m_{\mathcal{H}}^{UC}$, then the class is agnostically PAC learnable with sample complexity $m_{\mathcal{H}}(\epsilon, \delta) \le m_{\mathcal{H}}^{UC}(\epsilon/2, \delta)$. Furthermore, the $\mathrm{ERM}_{\mathcal{H}}$ paradigm is a successful agnostic PAC learner for $\mathcal{H}$.

In this section, we will show that uniform convergence holds if $\mathcal{H}$ is a finite hypothesis class.

- We need to find a sample size $m$ that guarantees that for any $\mathcal{D}$, with probability of at least $1 - \delta$ over the choice of $S = (z_1, \dots, z_m)$ sampled i.i.d. from $\mathcal{D}$, we have $|L_S(h) - L_{\mathcal{D}}(h)| \le \epsilon$ for all $h \in \mathcal{H}$. That is,
$$\mathcal{D}^m\big(\{S : \forall h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| \le \epsilon\}\big) \ge 1 - \delta.$$
- Equivalently, we need to show that
$$\mathcal{D}^m\big(\{S : \exists h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}\big) < \delta.$$
- Rewrite the left-hand side as
$$\{S : \exists h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\} = \bigcup_{h \in \mathcal{H}} \{S : |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}.$$
Applying the union bound (mentioned in UML_3), we obtain:
$$\mathcal{D}^m\big(\{S : \exists h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}\big) \le \sum_{h \in \mathcal{H}} \mathcal{D}^m\big(\{S : |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}\big). \tag{1}$$

Recall that $L_{\mathcal{D}}(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)]$ and $L_S(h) = \frac{1}{m}\sum_{i=1}^{m} \ell(h, z_i)$ (where $\ell$ is the loss function). Because each $z_i$ is sampled i.i.d. from $\mathcal{D}$, the empirical average $L_S(h)$ tends to $L_{\mathcal{D}}(h)$ as $m \to \infty$; this is the law of large numbers. To measure how the gap $|L_S(h) - L_{\mathcal{D}}(h)|$ depends on $\epsilon$, $\delta$, and the sample size $m$, we use Hoeffding's inequality.

Lemma 2 (Hoeffding's inequality): Let $\theta_1, \dots, \theta_m$ be a sequence of i.i.d. random variables and assume that for all $i$, $\mathbb{E}[\theta_i] = \mu$ and $\mathbb{P}[a \le \theta_i \le b] = 1$. Then, for any $\epsilon > 0$,
$$\mathbb{P}\left[\left|\frac{1}{m}\sum_{i=1}^{m} \theta_i - \mu\right| > \epsilon\right] \le 2\exp\left(\frac{-2m\epsilon^2}{(b-a)^2}\right).$$
The proof is in the Appendix.

From Lemma 2, applied with $\theta_i = \ell(h, z_i)$ (so $\mu = L_{\mathcal{D}}(h)$ and, for a loss bounded in $[0,1]$, $a = 0$ and $b = 1$), each term on the right-hand side of formula $(1)$ is at most $2\exp(-2m\epsilon^2)$, and summing over $h \in \mathcal{H}$ yields
$$\mathcal{D}^m\big(\{S : \exists h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}\big) \le 2|\mathcal{H}|\exp(-2m\epsilon^2),$$
which is at most $\delta$ whenever $m \ge \frac{\log(2|\mathcal{H}|/\delta)}{2\epsilon^2}$.
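Before stating the corollary, here is a small Monte Carlo sketch (not from the original notes; the true risk, sample size, and $\epsilon$ are arbitrary illustrative values) checking Hoeffding's inequality for a single hypothesis with a $[0,1]$-bounded loss: the empirical frequency of the event $|L_S(h) - L_{\mathcal{D}}(h)| > \epsilon$ stays below $2\exp(-2m\epsilon^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = 0.3          # true risk L_D(h) of a hypothetical fixed hypothesis h
m = 200           # sample size
eps = 0.1         # deviation of interest
trials = 20_000   # number of independent training sets S

# Each trial draws m i.i.d. 0/1 losses with mean mu and computes the empirical risk.
losses = rng.binomial(1, mu, size=(trials, m))
empirical_risks = losses.mean(axis=1)

empirical_tail = np.mean(np.abs(empirical_risks - mu) > eps)
hoeffding_bound = 2 * np.exp(-2 * m * eps**2)

print(f"empirical P[|L_S - L_D| > eps] ~ {empirical_tail:.4f}")
print(f"Hoeffding bound                = {hoeffding_bound:.4f}")
```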
Corollary 2: Let $\mathcal{H}$ be a finite hypothesis class and let $\ell: \mathcal{H} \times Z \to [0,1]$ be a loss function. Then $\mathcal{H}$ has the uniform convergence property with
$$m_{\mathcal{H}}^{UC}(\epsilon, \delta) \le \left\lceil \frac{\log(2|\mathcal{H}|/\delta)}{2\epsilon^2} \right\rceil,$$
and $\mathcal{H}$ is agnostically PAC learnable using the ERM algorithm with
$$m_{\mathcal{H}}(\epsilon, \delta) \le m_{\mathcal{H}}^{UC}(\epsilon/2, \delta) \le \left\lceil \frac{2\log(2|\mathcal{H}|/\delta)}{\epsilon^2} \right\rceil.$$
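The bounds in Corollary 2 are easy to evaluate. Below is a small helper (my own sketch; the class size, $\epsilon$, and $\delta$ are just illustrative numbers) that computes both sample complexity bounds for a hypothetical finite class.

```python
import math

def m_uc(H_size: int, eps: float, delta: float) -> int:
    """Uniform convergence bound: ceil(log(2|H|/delta) / (2 * eps^2))."""
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps**2))

def m_pac(H_size: int, eps: float, delta: float) -> int:
    """Agnostic PAC bound via Corollary 1: m_H(eps, delta) <= m_UC(eps/2, delta)."""
    return m_uc(H_size, eps / 2, delta)

if __name__ == "__main__":
    H_size, eps, delta = 1000, 0.05, 0.01
    print("m_UC(eps, delta) <=", m_uc(H_size, eps, delta))   # examples needed for eps-representativeness
    print("m_H(eps, delta)  <=", m_pac(H_size, eps, delta))  # examples needed for ERM to be eps-optimal
```

Note how the bound grows only logarithmically with $|\mathcal{H}|$ but quadratically in $1/\epsilon$.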
Appendix
(Lemma 1) Given that a training set $S$ is $\frac{\epsilon}{2}$-representative and $h_S \in \operatorname{argmin}_{h \in \mathcal{H}} L_S(h)$, we have $L_{\mathcal{D}}(h_S) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$. Proof: For every $h \in \mathcal{H}$,
$$L_{\mathcal{D}}(h_S) \le L_S(h_S) + \frac{\epsilon}{2} \le L_S(h) + \frac{\epsilon}{2} \le L_{\mathcal{D}}(h) + \frac{\epsilon}{2} + \frac{\epsilon}{2} = L_{\mathcal{D}}(h) + \epsilon.$$
The first and third inequalities use $\frac{\epsilon}{2}$-representativeness, and the second holds because $h_S$ minimizes $L_S$.
(Hoeffding's inequality) Let $\theta_1, \dots, \theta_m$ be a sequence of i.i.d. random variables. Assume $\mathbb{E}[\theta_i] = \mu$ and $\mathbb{P}[a \le \theta_i \le b] = 1$ for every $i$. For any $\epsilon > 0$:
$$\mathbb{P}\left[\left|\frac{1}{m}\sum_{i=1}^{m} \theta_i - \mu\right| > \epsilon\right] \le 2\exp\left(\frac{-2m\epsilon^2}{(b-a)^2}\right).$$
Proof: Denote $X_i = \theta_i - \mathbb{E}[\theta_i]$ and $\bar{X} = \frac{1}{m}\sum_{i=1}^{m} X_i$, so that $\bar{X} = \frac{1}{m}\sum_{i=1}^{m}\theta_i - \mu$ and each $X_i$ takes values in an interval of length $b - a$ with mean zero. For every $\lambda > 0$ and $\epsilon > 0$:
$$\mathbb{P}[\bar{X} \ge \epsilon] = \mathbb{P}[e^{\lambda \bar{X}} \ge e^{\lambda \epsilon}] \le e^{-\lambda \epsilon}\, \mathbb{E}[e^{\lambda \bar{X}}].$$
The second step uses Markov's inequality, stated below. Because of the i.i.d. property, we have:
$$\mathbb{E}[e^{\lambda \bar{X}}] = \mathbb{E}\left[\prod_{i=1}^{m} e^{\lambda X_i / m}\right] = \prod_{i=1}^{m} \mathbb{E}[e^{\lambda X_i / m}].$$
By Hoeffding's lemma, for every $i$:
$$\mathbb{E}[e^{\lambda X_i / m}] \le e^{\lambda^2 (b-a)^2 / (8m^2)}.$$
Combining the last three displays, we obtain
$$\mathbb{P}[\bar{X} \ge \epsilon] \le e^{-\lambda \epsilon} \prod_{i=1}^{m} e^{\lambda^2 (b-a)^2 / (8m^2)} = e^{-\lambda \epsilon + \lambda^2 (b-a)^2 / (8m)}.$$
Setting $\lambda = \frac{4m\epsilon}{(b-a)^2}$, this rewrites as $\mathbb{P}[\bar{X} \ge \epsilon] \le e^{-2m\epsilon^2/(b-a)^2}$. In a similar way, we can show that $\mathbb{P}[\bar{X} \le -\epsilon] \le e^{-2m\epsilon^2/(b-a)^2}$ for any $\epsilon > 0$. Therefore, we obtain
$$\mathbb{P}\left[\left|\frac{1}{m}\sum_{i=1}^{m}\theta_i - \mu\right| \ge \epsilon\right] \le 2e^{-2m\epsilon^2/(b-a)^2}.$$
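As a quick sanity check on the choice of $\lambda$ above (my addition, not part of the original proof), the following snippet verifies symbolically that the exponent $-\lambda\epsilon + \lambda^2(b-a)^2/(8m)$ is minimized at $\lambda = \frac{4m\epsilon}{(b-a)^2}$, where it equals $-\frac{2m\epsilon^2}{(b-a)^2}$.

```python
import sympy as sp

lam, eps, m, a, b = sp.symbols("lam eps m a b", positive=True)
exponent = -lam * eps + lam**2 * (b - a)**2 / (8 * m)

# Minimize over lambda by solving d(exponent)/d(lambda) = 0.
lam_star = sp.solve(sp.diff(exponent, lam), lam)[0]

# Both differences simplify to 0, confirming the optimal lambda and the resulting exponent.
print(sp.simplify(lam_star - 4 * m * eps / (b - a) ** 2))
print(sp.simplify(exponent.subs(lam, lam_star) + 2 * m * eps**2 / (b - a) ** 2))
```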
(Markov's inequality) For a nonnegative random variable $X$ and any $t > 0$:
$$\mathbb{P}[X \ge t] \le \frac{\mathbb{E}[X]}{t}.$$

(Hoeffding's lemma) Given a random variable $X$ with $\mathbb{E}[X] = 0$ and $\mathbb{P}[a \le X \le b] = 1$. For any $\lambda > 0$:
$$\mathbb{E}[e^{\lambda X}] \le e^{\lambda^2 (b-a)^2 / 8}.$$
Proof:
Because $f(x) = e^{\lambda x}$ is a convex function, for every $x \in [a, b]$:
$$e^{\lambda x} \le \frac{b - x}{b - a} e^{\lambda a} + \frac{x - a}{b - a} e^{\lambda b}.$$
Setting $x = X$ yields:
$$e^{\lambda X} \le \frac{b - X}{b - a} e^{\lambda a} + \frac{X - a}{b - a} e^{\lambda b}.$$
Take the expectation (recall $\mathbb{E}[X] = 0$):
$$\mathbb{E}[e^{\lambda X}] \le \frac{b}{b - a} e^{\lambda a} + \frac{-a}{b - a} e^{\lambda b}.$$
We need to prove
$$\frac{b}{b - a} e^{\lambda a} + \frac{-a}{b - a} e^{\lambda b} \le e^{\lambda^2 (b - a)^2 / 8}.$$
We first reformulate the left-hand side of this inequality. Set $h = \lambda(b - a)$ and $p = \frac{-a}{b - a}$, so that $\lambda a = -ph$, $\lambda b = (1 - p)h$, and
$$\frac{b}{b - a} e^{\lambda a} + \frac{-a}{b - a} e^{\lambda b} = (1 - p)e^{-ph} + pe^{(1 - p)h} = e^{L(h)}, \quad \text{where } L(h) = -ph + \log(1 - p + pe^{h}).$$
Since $\frac{\lambda^2(b - a)^2}{8} = \frac{h^2}{8}$, the problem turns into showing $L(h) \le \frac{h^2}{8}$. We have:
$$L(0) = 0, \qquad L'(0) = 0, \qquad L''(h) \le \frac{1}{4} \text{ for every } h$$
(because $L''(h) = \frac{(1 - p)\,pe^{h}}{(1 - p + pe^{h})^2} = u(1 - u) \le \frac{1}{4}$, with $u = \frac{pe^{h}}{1 - p + pe^{h}}$).
So, with Taylor's theorem, there exists $\xi \in [0, h]$ such that
$$L(h) = L(0) + hL'(0) + \frac{h^2}{2}L''(\xi) \le \frac{h^2}{8}.$$
And we are done.
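For a numerical cross-check of the key inequality $L(h) \le \frac{h^2}{8}$ (my addition, not part of the original proof), the snippet below evaluates the gap on a grid of $p \in (0, 1)$ and $h \ge 0$; it stays nonnegative up to floating-point error.

```python
import numpy as np

def L(h, p):
    # L(h) = -p*h + log(1 - p + p*exp(h)) from the proof of Hoeffding's lemma.
    return -p * h + np.log(1 - p + p * np.exp(h))

hs = np.linspace(0, 5, 501)
ps = np.linspace(0.01, 0.99, 99)
H, P = np.meshgrid(hs, ps)

gap = H**2 / 8 - L(H, P)
print("min of h^2/8 - L(h) over the grid:", gap.min())  # >= 0 (up to float rounding)
```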