Machine Learning Foundations Week 2 Notes in Coursera

Learning to Answer Yes/No

Perceptron Hypothesis Set

Example: Credit Approval Problem Revisited

A Simple Hypothesis Set: the ‘Perceptron’

  • For $x=(x_1,x_2,\ldots,x_d)$, the ‘features of the customer’, compute a weighted ‘score’ and

    <center>$approve \quad credit \quad if \quad \sum_{i=1}^{d} w_i x_i>threshold$</center>

    <center>$deny \quad credit \quad if \quad \sum_{i=1}^{d} w_i x_i<threshold$</center>

  • $y: \{+1(\text{good}), -1(\text{bad})\}$, 0 ignored; the linear formulas $h \in \mathcal{H}$ are

    <center>$h(x)=\text{sign}\left(\left(\sum_{i=1}^{d}w_i x_i\right)-threshold\right)$</center>

This is historically called the ‘perceptron’ hypothesis.

Vector Form of Perceptron Hypothesis

$$
\begin{split}
h(x)&=\text{sign}\left(\left(\sum_{i=1}^{d}w_i x_i\right)-threshold\right)\\
&=\text{sign}\left(\left(\sum_{i=1}^{d}w_i x_i\right)+\underbrace{(-threshold)}_{w_0}\cdot\underbrace{(+1)}_{x_0}\right)\\
&=\text{sign}\left(\sum_{i=0}^{d}w_i x_i\right)\\
&=\text{sign}(w^Tx)
\end{split}
$$

  • each ‘tall’ $w$ represents a hypothesis $h$ and is multiplied with the ‘tall’ $x$; we will use the tall versions to simplify notation (a small sketch follows)
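As a concrete illustration, here is a minimal Python sketch of the vector-form hypothesis, assuming $x$ has already been prepended with $x_0 = +1$; the function name `perceptron_h` and the sample numbers are hypothetical, chosen only for this example.

```python
import numpy as np

def perceptron_h(w, x):
    """Perceptron hypothesis h(x) = sign(w^T x) with 'tall' vectors:
    x is prepended with x0 = +1, and w0 = -threshold absorbs the threshold."""
    return 1 if np.dot(w, x) > 0 else -1

# example: threshold 0.5 folded into w0, two features
w = np.array([-0.5, 0.3, 0.7])   # (w0, w1, w2) = (-threshold, w1, w2)
x = np.array([1.0, 0.2, 0.9])    # (x0, x1, x2) = (+1, feature values)
print(perceptron_h(w, x))        # score = 0.19 > 0, so +1: approve credit
```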

Q: What does $h$ look like?

Perceptrons in $\mathbb{R}^2$

In $\mathbb{R}^2$, each $h$ is a line that splits the plane into a positive (+1) side and a negative (−1) side; hence perceptrons ⇔ linear (binary) classifiers.

Perceptron Learning Algorithm (PLA)

Select g from H

  • want: $g \approx f$ (hard when $f$ is unknown)
  • almost necessary: $g \approx f$ on $\mathcal{D}$, ideally $g(x_n) = f(x_n) = y_n$
  • difficult: $\mathcal{H}$ is of infinite size
  • idea: start from some $g_0$, and ‘correct’ its mistakes on $\mathcal{D}$

Perceptron Learning Algorithm

For $t = 0, 1, \ldots$

1. find a mistake of $w_t$, called $(x_{n(t)}, y_{n(t)})$:

   $\text{sign}\left(w_t^T x_{n(t)}\right) \ne y_{n(t)}$

2. (try to) correct the mistake by

   $w_{t+1} \leftarrow w_t + y_{n(t)} x_{n(t)}$

…until no more mistakes

return the last $w$ (called $w_{PLA}$) as $g$
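A one-line check (my own filling-in, not from the slides) of why the update pushes the score toward the correct sign: since $y_{n(t)}^2 = 1$,

$$
y_{n(t)} w_{t+1}^T x_{n(t)} = y_{n(t)}\left(w_t + y_{n(t)} x_{n(t)}\right)^T x_{n(t)} = y_{n(t)} w_t^T x_{n(t)} + ||x_{n(t)}||^2 \ge y_{n(t)} w_t^T x_{n(t)}
$$

so each correction moves the mistaken example's score in the right direction, though it may create new mistakes elsewhere.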

My question: why is the determinant like a vector?

Practical Implementation of PLA

Cyclic PLA
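A minimal Python sketch of the cyclic variant, under this note's conventions ($x_0 = +1$ already prepended, $w_0 = 0$); the name `cyclic_pla`, the `max_updates` guard, and the toy data are my own assumptions, not from the lecture.

```python
import numpy as np

def cyclic_pla(X, y, max_updates=10_000):
    """Cyclic PLA: scan the examples in a fixed order, correct each
    mistake as it is found, and halt after one full clean pass.

    X: (N, d+1) array with x0 = +1 already prepended
    y: (N,) labels in {+1, -1}
    Returns the final weights and the number of updates made.
    """
    w = np.zeros(X.shape[1])                # start from w0 = 0
    updates = 0
    while updates < max_updates:            # guard: plain PLA never halts on non-separable data
        clean_pass = True
        for x_n, y_n in zip(X, y):
            if np.sign(w @ x_n) != y_n:     # sign(w^T x) != y: a mistake (sign(0) counts)
                w = w + y_n * x_n           # correct it: w <- w + y x
                updates += 1
                clean_pass = False
        if clean_pass:                      # a whole cycle with no mistakes: done
            break
    return w, updates

# toy linearly separable data: label = sign(x1 - x2)
X = np.array([[1, 2.0, 0.5], [1, 0.3, 1.5], [1, 1.0, 0.2], [1, 0.1, 0.9]])
y = np.array([1.0, -1.0, 1.0, -1.0])
w_pla, n_updates = cyclic_pla(X, y)
print(w_pla, n_updates)
```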

Some Remaining Issues of PLA

Algorithmic: halt (no mistake)?

  • naive cyclic: ??
  • random cyclic: ??
  • other variant: ??

Learning: $g \approx f$?

  • on $\mathcal{D}$, if PLA halts: yes (no mistakes)
  • outside $\mathcal{D}$: ??
  • if not halting: ??

Guarantee of PLA

Linear Separability

  • if PLA halts (i.e., no more mistakes), then (as a necessary condition) $\mathcal{D}$ allows some $w$ to make no mistakes
  • call such a $\mathcal{D}$ linearly separable

PLA Fact: $w_t$ Gets More Aligned with $w_f$

  • $w_f$ is perfect, hence every $x_n$ is correctly classified and away from the boundary: $y_{n(t)}w_f^T x_{n(t)}\ge \underset{n}{min}\, y_n w_f^T x_n>0$

  • $w_f^T w_t$ increases by updating with any $(x_{n(t)}, y_{n(t)})$: $$ \begin{split} w_f^Tw_{t+1}&=w_f^T(w_t+y_{n(t)}x_{n(t)})\\ &\ge w_f^Tw_{t}+\underset{n}{min}\, y_n w_f^T x_n\\ &> w_f^Tw_{t}+0 \end{split} $$

$w_t$ appears more aligned with $w_f$.

Q: what about the lengths of the vectors? (the inner product could grow simply because $||w_t||$ grows)

PLA Fact: $w_t$ Does Not Grow Too Fast

$w_t$ changes only when there is a mistake

$\Leftrightarrow \text{sign}(w_t^T x_{n(t)}) \ne y_{n(t)} \Leftrightarrow y_{n(t)} w_t^T x_{n(t)} \le 0$

  • a mistake ‘limits’ the growth of $||w_t||^2$, even when updating with the ‘longest’ $x_n$: $$ \begin{split} ||w_{t+1}||^2&=||w_t+y_{n(t)}x_{n(t)}||^2\\ &=||w_t||^2+2y_{n(t)}w_t^Tx_{n(t)}+||y_{n(t)}x_{n(t)}||^2\\ &\le ||w_t||^2+0+||y_{n(t)}x_{n(t)}||^2\\ &\le||w_t||^2+\underset{n}{max}||y_nx_n||^2 \end{split} $$

Starting from $w_0 = 0$, after $T$ mistake corrections, $$ \frac{w_f^T w_T}{||w_f||\,||w_T||}\ge \sqrt{T}\cdot constant $$
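To make the constant explicit (this combination step is my own filling-in; it matches the Novikoff theorem below): write $\rho=\underset{n}{min}\, y_n \frac{w_f^T}{||w_f||} x_n$ and $R^2=\underset{n}{max}\,||x_n||^2$. The first fact gives $w_f^T w_T \ge T\rho\,||w_f||$ and the second gives $||w_T|| \le \sqrt{T}R$, so

$$
\frac{w_f^T w_T}{||w_f||\,||w_T||}\ge \frac{T\rho\,||w_f||}{||w_f||\cdot\sqrt{T}R}=\sqrt{T}\cdot\frac{\rho}{R}
$$

and since the left side is a cosine and thus at most 1, PLA must halt within $T \le (R/\rho)^2$ corrections.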

Novikoff theorem

ref: Li Hang, Statistical Learning Methods, p. 31

Suppose the training set $T=\{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$ is linearly separable, where $x_i \in \mathcal{X} = \mathbb{R}^n$, $y_i \in \mathcal{Y} = \{-1,+1\}$, $i = 1, 2, \ldots, N$. Then:

(1) There exists a hyperplane $\hat{w}_{opt} \cdot \hat{x} = w_{opt} \cdot x + b_{opt} = 0$ with $||\hat{w}_{opt}|| = 1$ that separates the training set completely and correctly; moreover, there exists $\gamma > 0$ such that for all $i = 1, 2, \ldots, N$: $y_i(\hat{w}_{opt} \cdot \hat{x}_i) = y_i(w_{opt} \cdot x_i + b_{opt}) \ge \gamma$

(2) Let $R=\underset{1\le i\le N}{max}||\hat{x}_i||$; then the number of misclassifications $k$ made by the perceptron algorithm on the training set satisfies the inequality $$ k\le \left(\frac{R}{\gamma}\right)^2 $$
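A quick empirical sanity check of the bound (my own sketch, reusing the hypothetical `cyclic_pla` from the Cyclic PLA section; the margin of any separating hyperplane, not just the optimal one, yields a valid if looser bound):

```python
rng = np.random.default_rng(0)
w_true = np.array([0.0, 1.0, -1.0])            # a known separator: label = sign(x1 - x2)
X = np.hstack([np.ones((200, 1)), rng.uniform(-1, 1, (200, 2))])
X = X[np.abs(X[:, 1] - X[:, 2]) > 0.1]         # keep a visible margin around the separator
y = np.sign(X @ w_true)

w_pla, k = cyclic_pla(X, y)                    # count PLA's actual updates k
R = np.max(np.linalg.norm(X, axis=1))          # R = max_i ||x_hat_i||
gamma = np.min(y * (X @ w_true)) / np.linalg.norm(w_true)   # margin of w_true
print(f"k = {k}, bound (R/gamma)^2 = {(R / gamma) ** 2:.1f}")
```

In practice $k$ is usually far below the bound; the theorem only guarantees finiteness.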

Non-Separable Data

More about PLA

Cons: $\mathcal{D}$ may not be linearly separable, and it is not fully clear how long halting takes.

Learning with Noisy Data

Line with Noise Tolerance

NP-hard: finding the line with the fewest mistakes on a noisy $\mathcal{D}$ is NP-hard in general.

Pocket Algorithm

Modify PLA to keep the best weights seen so far (the fewest mistakes on $\mathcal{D}$) ‘in the pocket’, returning them after enough iterations; a sketch follows.
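A minimal Python sketch of the pocket idea (the names `pocket_pla` and `n_mistakes` are hypothetical): take PLA-style corrections on randomly picked mistakes, but only replace the pocketed weights when the new candidate makes fewer mistakes on all of $\mathcal{D}$.

```python
import numpy as np

def n_mistakes(w, X, y):
    """Number of examples misclassified by sign(w^T x)."""
    return int(np.sum(np.sign(X @ w) != y))

def pocket_pla(X, y, max_updates=100, seed=0):
    """Pocket algorithm: run PLA-style corrections, but keep the
    best-so-far weights (fewest mistakes on D) 'in the pocket'."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    w_pocket, best = w.copy(), n_mistakes(w, X, y)
    for _ in range(max_updates):
        wrong = np.flatnonzero(np.sign(X @ w) != y)
        if wrong.size == 0:                  # perfect on D: stop early
            break
        n = rng.choice(wrong)                # pick a random mistake
        w = w + y[n] * X[n]                  # the usual PLA correction
        e = n_mistakes(w, X, y)
        if e < best:                         # better than the pocket?
            w_pocket, best = w.copy(), e     # then pocket it
    return w_pocket
```

Each step costs a full pass over $\mathcal{D}$ to evaluate the candidate, which is why the pocket algorithm runs noticeably slower than plain PLA.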

Summary

  • Perceptron Hypothesis Set

    hyperplanes / linear classifiers in $\mathbb{R}^d$

  • Perceptron Learning Algorithm (PLA)

    correct mistakes and improve iteratively

  • Guarantee of PLA

    no mistakes eventually, if $\mathcal{D}$ is linearly separable

  • Non-Separable Data

    hold somewhat ‘best’ weights in pocket