
UML_6: Learning via Uniform Convergence

UML I.4 Learning via Uniform Convergence

  • Recall that in previous posts we discussed the realizable assumption and ERM learning. We hope that a hypothesis $h_S$, which minimizes the empirical risk on the training set $S$, also does well with respect to the true risk $L_{\mathcal{D}}(h_S)$. In other words, we need the empirical risks of all members of $\mathcal{H}$ to be good approximations of their true risks.

  • Def 1 ($\epsilon$-representative sample): A training set $S$ is called $\epsilon$-representative (w.r.t. domain $Z$, hypothesis class $\mathcal{H}$, loss function $\ell$, and distribution $\mathcal{D}$) if
    $$\forall h \in \mathcal{H}, \quad |L_S(h) - L_{\mathcal{D}}(h)| \le \epsilon.$$

  • Lemma 1: Assume that a training set $S$ is $\frac{\epsilon}{2}$-representative. Then any output of the ERM rule, i.e. any $h_S \in \operatorname{argmin}_{h \in \mathcal{H}} L_S(h)$, satisfies
    $$L_{\mathcal{D}}(h_S) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon.$$

    • This lemma implies that to show the ERM rule is an agnostic PAC learner, it suffices to show that with probability of at least $1-\delta$ over the random choice of a training set, it will be an $\frac{\epsilon}{2}$-representative training set.
    • The proof is in Appendix 1.
  • Def 2 (Uniform Convergence): A hypothesis class $\mathcal{H}$ has the uniform convergence property (w.r.t. a domain $Z$ and a loss function $\ell$) if there exists a function $m_{\mathcal{H}}^{UC}: (0,1)^2 \to \mathbb{N}$ such that for every $\epsilon, \delta \in (0,1)$ and for every probability distribution $\mathcal{D}$ over $Z$, if $S$ is a sample of $m \ge m_{\mathcal{H}}^{UC}(\epsilon, \delta)$ examples drawn i.i.d. according to $\mathcal{D}$, then, with probability of at least $1-\delta$, $S$ is $\epsilon$-representative.

    • The function $m_{\mathcal{H}}^{UC}$ measures the minimal sample complexity of obtaining the uniform convergence property, i.e. how many examples we need to ensure that, with probability of at least $1-\delta$, the sample would be $\epsilon$-representative.
    • The term uniform here refers to having a fixed sample size that works for all members of $\mathcal{H}$ and over all possible probability distributions over the domain.
  • Corollary 1: If a class $\mathcal{H}$ has the uniform convergence property with a function $m_{\mathcal{H}}^{UC}$, then the class is agnostically PAC learnable with sample complexity $m_{\mathcal{H}}(\epsilon, \delta) \le m_{\mathcal{H}}^{UC}(\epsilon/2, \delta)$. Furthermore, the $ERM_{\mathcal{H}}$ paradigm is a successful agnostic PAC learner for $\mathcal{H}$.
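    To make the ERM rule over a finite class concrete, here is a minimal Python sketch (a toy example of my own, not from the book): the class consists of 21 threshold classifiers on $[0, 1]$, ERM picks the hypothesis with the smallest empirical 0-1 loss on $S$, and a large held-out sample stands in for the true risk under $\mathcal{D}$. The thresholds, the label-noise rate, and the sample sizes are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite hypothesis class: threshold classifiers h_t(x) = 1[x >= t]
thresholds = np.linspace(0.0, 1.0, 21)          # |H| = 21

def zero_one_loss(t, x, y):
    """Empirical 0-1 loss of the threshold-t classifier on the sample (x, y)."""
    return np.mean((x >= t).astype(int) != y)

def sample(m):
    """Draw m examples from a toy distribution D: x ~ Uniform[0, 1], labels set by
    the threshold 0.3 and flipped with probability 0.1 (so the setting is agnostic)."""
    x = rng.uniform(0.0, 1.0, m)
    y = (x >= 0.3).astype(int)
    flip = rng.uniform(size=m) < 0.1
    return x, np.where(flip, 1 - y, y)

# ERM on a training set S of size m: pick the empirical risk minimizer
m = 500
x_train, y_train = sample(m)
emp_risks = np.array([zero_one_loss(t, x_train, y_train) for t in thresholds])
h_S = thresholds[np.argmin(emp_risks)]

# Approximate the true risks L_D(h) with a large independent sample
x_test, y_test = sample(200_000)
true_risks = np.array([zero_one_loss(t, x_test, y_test) for t in thresholds])

print(f"ERM threshold: {h_S:.2f}, L_D(h_S) ~ {zero_one_loss(h_S, x_test, y_test):.3f}")
print(f"best in class:      min_h L_D(h) ~ {true_risks.min():.3f}")
```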

  • In this section, we will show that the uniform convergence property holds if $\mathcal{H}$ is a finite hypothesis class.

    • We need to find a sample size $m$ that guarantees that for any $\mathcal{D}$, with probability of at least $1-\delta$ over the choice of $S = (z_1, \ldots, z_m)$ sampled i.i.d. from $\mathcal{D}$, we have $\forall h \in \mathcal{H}, \ |L_S(h) - L_{\mathcal{D}}(h)| \le \epsilon$.
    • Equivalently, we need to show that
      $$\mathcal{D}^m\big(\{S : \exists h \in \mathcal{H}, \ |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}\big) < \delta.$$
    • Rewrite the set on the left-hand side as
      $$\{S : \exists h \in \mathcal{H}, \ |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\} = \bigcup_{h \in \mathcal{H}} \{S : |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}.$$

    Applying the union bound (mentioned in UML_3), we obtain:

      $$\mathcal{D}^m\big(\{S : \exists h \in \mathcal{H}, \ |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}\big) \le \sum_{h \in \mathcal{H}} \mathcal{D}^m\big(\{S : |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}\big). \quad (1)$$

    • Recall that $L_{\mathcal{D}}(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)]$ and $L_S(h) = \frac{1}{m}\sum_{i=1}^m \ell(h, z_i)$ (where $\ell$ is the loss function). Because each $z_i$ is sampled i.i.d. from $\mathcal{D}$, the empirical average $L_S(h)$ tends to $L_{\mathcal{D}}(h)$ as $m \to \infty$. This is the law of large numbers.

    • The law of large numbers is only an asymptotic statement. To measure the gap between $L_S(h)$ and $L_{\mathcal{D}}(h)$, which depends on $\epsilon$, $\delta$, and the sample size $m$, we use Hoeffding's inequality.

    • Lemma 2 (Hoeffding's inequality): Let $\theta_1, \ldots, \theta_m$ be a sequence of i.i.d. random variables and assume that for all $i$, $\mathbb{E}[\theta_i] = \mu$ and $\mathbb{P}[a \le \theta_i \le b] = 1$. Then, for any $\epsilon > 0$,

      $$\mathbb{P}\left[\left|\frac{1}{m}\sum_{i=1}^m \theta_i - \mu\right| > \epsilon\right] \le 2\exp\left(-\frac{2m\epsilon^2}{(b-a)^2}\right).$$

      The proof is in Appendix 2.

    • From Lemma 2 (applied to $\theta_i = \ell(h, z_i)$, assuming the loss takes values in $[0, 1]$), each term on the right-hand side of formula (1) satisfies

      $$\mathcal{D}^m\big(\{S : |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}\big) \le 2\exp(-2m\epsilon^2),$$

      and combining with (1) yields

      $$\mathcal{D}^m\big(\{S : \exists h \in \mathcal{H}, \ |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}\big) \le 2|\mathcal{H}|\exp(-2m\epsilon^2),$$

      which is at most $\delta$ whenever $m \ge \frac{\log(2|\mathcal{H}|/\delta)}{2\epsilon^2}$.

  • Corollary 2: Let $\mathcal{H}$ be a finite hypothesis class and let $\ell: \mathcal{H} \times Z \to [0, 1]$ be a loss function. Then $\mathcal{H}$ has the uniform convergence property with

      $$m_{\mathcal{H}}^{UC}(\epsilon, \delta) \le \left\lceil \frac{\log(2|\mathcal{H}|/\delta)}{2\epsilon^2} \right\rceil$$

      and $\mathcal{H}$ is agnostically PAC learnable using the ERM algorithm with sample complexity

      $$m_{\mathcal{H}}(\epsilon, \delta) \le m_{\mathcal{H}}^{UC}(\epsilon/2, \delta) \le \left\lceil \frac{2\log(2|\mathcal{H}|/\delta)}{\epsilon^2} \right\rceil.$$
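      To get a feel for these bounds, here is a small numerical sketch (my own illustration, not from the book) evaluating the two sample-size expressions above for a few values of $|\mathcal{H}|$, $\epsilon$, and $\delta$; note the merely logarithmic dependence on $|\mathcal{H}|$.

```python
import math

def m_uc(h_size, eps, delta):
    """Uniform convergence sample size: ceil(log(2|H|/delta) / (2 eps^2))."""
    return math.ceil(math.log(2 * h_size / delta) / (2 * eps ** 2))

def m_agnostic_pac(h_size, eps, delta):
    """Agnostic PAC sample size via ERM: m_UC(eps/2, delta) = ceil(2 log(2|H|/delta) / eps^2)."""
    return m_uc(h_size, eps / 2, delta)

for h_size in (10, 1_000, 10 ** 6):
    for eps, delta in ((0.1, 0.05), (0.01, 0.05)):
        print(f"|H|={h_size:>7}, eps={eps}, delta={delta}: "
              f"m_UC={m_uc(h_size, eps, delta):>9}, m_PAC={m_agnostic_pac(h_size, eps, delta):>9}")
```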

Appendix
  1. Given that a training set $S$ is $\frac{\epsilon}{2}$-representative and $h_S \in \operatorname{argmin}_{h \in \mathcal{H}} L_S(h)$, we have

     $$L_{\mathcal{D}}(h_S) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon.$$

    Proof: For every $h \in \mathcal{H}$,

     $$L_{\mathcal{D}}(h_S) \le L_S(h_S) + \frac{\epsilon}{2} \le L_S(h) + \frac{\epsilon}{2} \le L_{\mathcal{D}}(h) + \frac{\epsilon}{2} + \frac{\epsilon}{2} = L_{\mathcal{D}}(h) + \epsilon,$$

     where the first and third inequalities use the $\frac{\epsilon}{2}$-representativeness of $S$, and the second uses the fact that $h_S$ is an empirical risk minimizer.

  2. (Hoeffding's inequality) Let $\theta_1, \ldots, \theta_m$ be a sequence of i.i.d. random variables and let $\bar{X} = \frac{1}{m}\sum_{i=1}^m \theta_i$. Assume $\mathbb{E}[\bar{X}] = \mu$ and $\mathbb{P}[a \le \theta_i \le b] = 1$ for every $i$. For any $\epsilon > 0$:

     $$\mathbb{P}\big[|\bar{X} - \mu| > \epsilon\big] \le 2\exp\left(-\frac{2m\epsilon^2}{(b-a)^2}\right).$$

    Proof:

    • Denote $X_i = \theta_i - \mathbb{E}[\theta_i]$ and $\bar{X} = \frac{1}{m}\sum_{i=1}^m X_i$. For every $\epsilon > 0$ and $\lambda > 0$:

      $$\mathbb{P}[\bar{X} \ge \epsilon] = \mathbb{P}\big[e^{\lambda \bar{X}} \ge e^{\lambda \epsilon}\big] \le e^{-\lambda \epsilon}\, \mathbb{E}\big[e^{\lambda \bar{X}}\big]. \quad (2)$$

      The equality uses the monotonicity of the exponential function; the inequality is Markov's inequality, whose proof is in Appendix 3.

    • Because of the i.i.d. property, we have:

      $$\mathbb{E}\big[e^{\lambda \bar{X}}\big] = \mathbb{E}\left[\prod_{i=1}^m e^{\lambda X_i / m}\right] = \prod_{i=1}^m \mathbb{E}\big[e^{\lambda X_i / m}\big]. \quad (3)$$

    • By Hoeffding's lemma (Appendix 4), since each $X_i$ has zero mean and lies in an interval of length $b - a$, for every $i$:

      $$\mathbb{E}\big[e^{\lambda X_i / m}\big] \le e^{\lambda^2 (b-a)^2 / (8 m^2)}. \quad (4)$$

    • Combining (2), (3), and (4), we obtain

      $$\mathbb{P}[\bar{X} \ge \epsilon] \le e^{-\lambda \epsilon} \prod_{i=1}^m e^{\lambda^2 (b-a)^2 / (8 m^2)} = e^{-\lambda \epsilon + \lambda^2 (b-a)^2 / (8m)}.$$

    • Setting $\lambda = \frac{4 m \epsilon}{(b-a)^2}$, we can rewrite the bound as $\mathbb{P}[\bar{X} \ge \epsilon] \le e^{-2 m \epsilon^2 / (b-a)^2}$.

    • In a similar way, we can show that $\mathbb{P}[\bar{X} \le -\epsilon] \le e^{-2 m \epsilon^2 / (b-a)^2}$ for any $\epsilon > 0$.

    • Therefore, by the union bound, we obtain $\mathbb{P}\big[\left|\frac{1}{m}\sum_{i=1}^m \theta_i - \mu\right| > \epsilon\big] \le 2 e^{-2 m \epsilon^2 / (b-a)^2}$.
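      As a quick sanity check of this bound (a simulation of my own, not part of the book's proof), the sketch below estimates the deviation probability for Bernoulli variables by Monte Carlo and compares it with $2 e^{-2 m \epsilon^2}$ (here $b - a = 1$); the particular values of $m$, $\epsilon$, and the number of trials are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, m, eps = 0.5, 100, 0.1        # Bernoulli(mu) variables, so a = 0, b = 1
n_trials = 200_000

# Empirical frequency of a deviation of the sample mean larger than eps
samples = rng.binomial(1, mu, size=(n_trials, m))
deviations = np.abs(samples.mean(axis=1) - mu)
empirical = np.mean(deviations > eps)

# Hoeffding bound: 2 * exp(-2 m eps^2 / (b - a)^2) with b - a = 1
bound = 2 * np.exp(-2 * m * eps ** 2)

print(f"empirical P[|mean - mu| > eps] ~= {empirical:.4f}")
print(f"Hoeffding bound                =  {bound:.4f}")
```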

  3. (Markov's inequality) For a nonnegative random variable $X$ and any $a > 0$:

     $$\mathbb{P}[X \ge a] \le \frac{\mathbb{E}[X]}{a}.$$
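     A quick numerical illustration (my own toy check, with an arbitrarily chosen distribution): for an exponential random variable with mean 1, the observed tail probabilities indeed stay below $\mathbb{E}[X]/a$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # nonnegative, E[X] = 1

for a in (1.0, 2.0, 5.0):
    # Compare the empirical tail probability with the Markov bound E[X]/a
    print(f"a={a}: P[X >= a] ~= {np.mean(x >= a):.4f} <= E[X]/a = {x.mean() / a:.4f}")
```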

  4. (Hoeffding's lemma) Given a random variable $X$ with $\mathbb{E}[X] = 0$ and $\mathbb{P}[a \le X \le b] = 1$. For any $\lambda > 0$:

     $$\mathbb{E}\big[e^{\lambda X}\big] \le e^{\lambda^2 (b-a)^2 / 8}.$$

    Proof

    • Because $f(x) = e^{\lambda x}$ is a convex function, for every $\alpha \in [0, 1]$: $f(\alpha a + (1 - \alpha) b) \le \alpha f(a) + (1 - \alpha) f(b)$.

    • Setting $\alpha = \frac{b - x}{b - a}$ (so that $x = \alpha a + (1 - \alpha) b$) yields:

      $$e^{\lambda x} \le \frac{b - x}{b - a} e^{\lambda a} + \frac{x - a}{b - a} e^{\lambda b}.$$

      Taking the expectation and using $\mathbb{E}[X] = 0$:

      $$\mathbb{E}\big[e^{\lambda X}\big] \le \frac{b}{b - a} e^{\lambda a} - \frac{a}{b - a} e^{\lambda b}.$$

    • We need to prove that

      $$\frac{b}{b - a} e^{\lambda a} - \frac{a}{b - a} e^{\lambda b} \le e^{\lambda^2 (b-a)^2 / 8}.$$

    • We first reformulate the left-hand side of the inequality above:

      Set $h = \lambda (b - a)$, $p = \frac{-a}{b - a}$, and $L(h) = -hp + \log(1 - p + p e^h)$. Then the left-hand side equals $e^{L(h)}$, so the problem turns into showing that $L(h) \le \frac{h^2}{8}$.

      We have:

      • $L(0) = 0$,
      • $L'(0) = 0$,
      • $L''(h) \le \frac{1}{4}$ for every $h$ (because $L''(h) = t(1 - t)$ with $t = \frac{p e^h}{1 - p + p e^h}$, and $t(1 - t) \le \frac{1}{4}$ for any $t$).

      So, by Taylor's theorem, there is some $\xi \in [0, h]$ with

      $$L(h) = L(0) + h L'(0) + \frac{h^2}{2} L''(\xi) \le \frac{h^2}{8}.$$

      And we are done.
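      As a numerical sanity check of the lemma (my own sketch; the endpoints $a = -0.3$, $b = 0.7$ and the values of $\lambda$ are arbitrary), the snippet below evaluates $\mathbb{E}[e^{\lambda X}]$ exactly for a two-point, zero-mean random variable and compares it with $e^{\lambda^2 (b-a)^2 / 8}$.

```python
import numpy as np

# Two-point, zero-mean, bounded random variable:
# X = b with probability p, X = a with probability 1 - p, chosen so that E[X] = 0.
a, b = -0.3, 0.7
p = -a / (b - a)          # here p = 0.3, and p*b + (1-p)*a = 0

for lam in (0.5, 1.0, 2.0, 4.0):
    mgf = p * np.exp(lam * b) + (1 - p) * np.exp(lam * a)   # E[exp(lam * X)], exact
    bound = np.exp(lam ** 2 * (b - a) ** 2 / 8)             # Hoeffding's lemma bound
    print(f"lambda={lam}: E[exp(lam X)] = {mgf:.4f} <= bound = {bound:.4f}")
```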

 

Reference:
  1. Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. (Chapter 4: Learning via Uniform Convergence.)