Synopsis of “Machine Learning”. Mathematical analysis. Gradient descent



Recalling mathematical analysis


Function Continuity and Derivative


Let $E \subseteq \mathbb{R}$, let $a$ be a limit point of the set $E$ (i.e. $\forall \varepsilon > 0 \;\; |(a - \varepsilon, a + \varepsilon) \cap E| = \infty$), and let $f \colon E \to \mathbb{R}$.

Definition 1 (Cauchy definition of the limit of a function):

A function $f \colon E \to \mathbb{R}$ tends to $A$ as $x$ tends to $a$ if

$$\forall \varepsilon > 0 \;\; \exists \delta > 0 \;\; \forall x \in E \;\; (0 < |x - a| < \delta \Rightarrow |f(x) - A| < \varepsilon).$$


Notation: $\lim\limits_{E \ni x \to a} f(x) = A$.

Definition 2:

  1. An interval is a set of the form $]a, b[\; := \{x \in \mathbb{R} \mid a < x < b\}$;
  2. An interval containing a point $x \in \mathbb{R}$ is called a neighborhood of this point.
  3. A punctured neighborhood of a point is a neighborhood of that point from which the point itself is excluded.

Notation:

  1. $V(x)$ or $U(x)$ denotes a neighborhood of the point $x$;
  2. $\overset{\circ}{U}(x)$ denotes a punctured neighborhood of the point $x$;
  3. $U_E(x) := E \cap U(x)$, $\overset{\circ}{U}_E(x) := E \cap \overset{\circ}{U}(x)$.

Definition 3 (function limit through neighborhoods):


$$\lim\limits_{E \ni x \to a} f(x) = A \;:=\; \forall V_{\mathbb{R}}(A) \;\; \exists \overset{\circ}{U}_E(a) \;\; \bigl(f(\overset{\circ}{U}_E(a)) \subset V_{\mathbb{R}}(A)\bigr).$$


Definitions 1 and 3 are equivalent.
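
As a quick numerical illustration of Definition 1 (the function, the point, and the choice of $\delta$ below are my own example, not part of the synopsis), take $f(x) = \frac{\sin x}{x}$ on $E = \mathbb{R} \setminus \{0\}$, $a = 0$, $A = 1$; since $|\sin x / x - 1| \leq x^2 / 6$, the choice $\delta = \sqrt{6\varepsilon}$ satisfies the $\varepsilon$-$\delta$ condition, which the sketch below checks on a few sample points.

```python
import math

# Illustration of the Cauchy limit definition (example chosen here, not in the source):
# f(x) = sin(x)/x on E = R \ {0} tends to A = 1 as x -> 0.
f = lambda x: math.sin(x) / x
A = 1.0

for eps in (1e-1, 1e-3, 1e-6):
    delta = math.sqrt(6 * eps)          # works because |sin(x)/x - 1| <= x^2/6
    xs = [delta * t for t in (0.9, 0.5, 0.1, -0.5, -0.9)]   # points with 0 < |x - a| < delta
    assert all(abs(f(x) - A) < eps for x in xs)
    print(f"eps={eps:g}: delta={delta:.3g} works on the sampled points")
```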

Definition 4 (continuity of a function at a point):

  1. $f \colon E \to \mathbb{R}$ is continuous at $a \in E$ $:=$

     $\forall V(f(a)) \;\; \exists U_E(a) \;\; (f(U_E(a)) \subset V(f(a)))$;
  2. $f \colon E \to \mathbb{R}$ is continuous at $a \in E$ $:=$

     $\forall \varepsilon > 0 \;\; \exists \delta > 0 \;\; \forall x \in E \;\; (|x - a| < \delta \Rightarrow |f(x) - f(a)| < \varepsilon)$.

Definitions 3 and 4 show that

($f \colon E \to \mathbb{R}$ is continuous at $a \in E$, where $a$ is a limit point of $E$) $\Leftrightarrow$ ($\lim\limits_{E \ni x \to a} f(x) = f(a)$).

Definition 5:

A function $f \colon E \to \mathbb{R}$ is called continuous on the set $E$ if it is continuous at every point of the set $E$.

Definition 6:

  1. A function $f \colon E \to \mathbb{R}$ defined on a set $E \subset \mathbb{R}$ is called differentiable at a point $a \in E$ that is a limit point of the set $E$, if there exists a function $A \cdot (x - a)$, linear with respect to the increment $x - a$ of the argument [the differential of the function $f$ at the point $a$], such that the increment $f(x) - f(a)$ of the function $f$ can be represented as

     $f(x) - f(a) = A \cdot (x - a) + o(x - a) \quad \text{as } x \to a, \; x \in E.$
  2. The value

     $f'(a) = \lim\limits_{E \ni x \to a} \frac{f(x) - f(a)}{x - a}$

     is called the derivative of the function $f$ at the point $a$.

Also,

$$f'(x) = \lim_{\substack{h \to 0 \\ x + h,\, x \in E}} \frac{f(x + h) - f(x)}{h}.$$
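
A minimal numerical sketch of Definition 6 (the function $f(x) = x^3$ and the point $x = 2$ are illustrative choices of mine): the difference quotient approaches the derivative $f'(2) = 12$ as $h \to 0$.

```python
# Difference quotient from Definition 6 approaching the derivative
# (example function chosen here, not in the source).
def f(x):
    return x ** 3          # f'(x) = 3x^2, so f'(2) = 12

x = 2.0
for h in (1e-1, 1e-3, 1e-5):
    quotient = (f(x + h) - f(x)) / h
    print(f"h={h:g}:  (f(x+h) - f(x)) / h = {quotient:.6f}")   # tends to 12
```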



Definition 7:

  1. A point $x_0 \in E \subset \mathbb{R}$ is called a point of local maximum (minimum), and the value of the function at it a local maximum (minimum) of the function $f \colon E \to \mathbb{R}$, if there exists $U_E(x_0)$ such that

     $\forall x \in U_E(x_0) \;\; f(x) \leq f(x_0)$ (respectively, $f(x) \geq f(x_0)$).
  2. Points of local maximum and minimum are called points of local extremum, and the values of the function at them are called local extrema of the function.
  3. An extremum point $x_0 \in E$ of the function $f \colon E \to \mathbb{R}$ is called an interior extremum point if $x_0$ is a limit point both of the set $E_- = \{x \in E \mid x < x_0\}$ and of the set $E_+ = \{x \in E \mid x > x_0\}$.

Lemma 1 (Fermat):

If a function $f \colon E \to \mathbb{R}$ is differentiable at an interior extremum point $x_0 \in E$, then its derivative at this point is zero: $f'(x_0) = 0$.

Proposition 1 (Rolle's theorem):
If a function $f \colon [a, b] \to \mathbb{R}$ is continuous on the segment $[a, b]$, differentiable in the interval $]a, b[$, and $f(a) = f(b)$, then there exists a point $\xi \in \, ]a, b[$ such that $f'(\xi) = 0$.

Theorem 1 (Lagrange finite increment theorem):

If a function $f \colon [a, b] \to \mathbb{R}$ is continuous on the segment $[a, b]$ and differentiable in the interval $]a, b[$, then there exists a point $\xi \in \, ]a, b[$ such that

$$f(b) - f(a) = f'(\xi)(b - a).$$
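
Theorem 1 can be checked numerically on a concrete function (my own example, assuming $f(x) = x^2$ on $[a, b] = [0, 3]$): a point $\xi$ with $f'(\xi) = \frac{f(b) - f(a)}{b - a}$ can be located by bisection, since $g(x) = f'(x) - \frac{f(b) - f(a)}{b - a}$ changes sign on $]a, b[$ in this case.

```python
# Sketch: locating the Lagrange point xi for f(x) = x^2 on [0, 3]
# (the function and the interval are illustrative assumptions).
def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x

a, b = 0.0, 3.0
slope = (f(b) - f(a)) / (b - a)       # = 3
g = lambda x: f_prime(x) - slope      # g(a) < 0 < g(b), g is continuous

lo, hi = a, b
for _ in range(60):                   # plain bisection
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)

xi = (lo + hi) / 2
print(xi)                             # -> 1.5, and indeed f'(1.5) = 3 = slope
```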


Corollary 1 (monotonicity test):
If the derivative of a function is non-negative (respectively, positive) at every point of some interval, then the function is non-decreasing (respectively, increasing) on that interval.

Corollary 2 (criterion for the constancy of a function):
A function continuous on a segment $[a, b]$ is constant on it if and only if its derivative equals zero at every point of the segment $[a, b]$ (or at least of the interval $]a, b[$).

Partial derivatives of a function of several variables


By $\mathbb{R}^m$ we denote the set

$$\mathbb{R}^m = \underbrace{\mathbb{R} \times \mathbb{R} \times \cdots \times \mathbb{R}}_{m} = \{(\omega_1, \omega_2, \ldots, \omega_m), \; \omega_i \in \mathbb{R} \;\; \forall i \in \overline{1, m}\}.$$



Definition 8:

A function $f \colon E \to \mathbb{R}$ defined on a set $E \subset \mathbb{R}^m$ is called differentiable at a point $x \in E$ that is a limit point of the set $E$, if

$$f(x + h) - f(x) = L(x)h + \alpha(x; h), \qquad (1)$$

where $L(x) \colon \mathbb{R}^m \to \mathbb{R}$ is a function linear in $h$ [the differential of the function $f$ at the point $x$, denoted $df(x)$ or $f'(x)$], and $\alpha(x; h) = o(h)$ as $h \to 0$, $x + h \in E$.

Relation (1) can be rewritten as

$$f(x + h) - f(x) = f'(x)h + \alpha(x; h)$$

or

$$\bigtriangleup f(x; h) = df(x)h + \alpha(x; h).$$


If we pass to the coordinate form of the point $x = (x_1, \ldots, x_m)$, the vector $h = (h_1, \ldots, h_m)$, and the linear function $L(x)h = a_1(x)h_1 + \ldots + a_m(x)h_m$, then equality (1) takes the form

$$f(x_1 + h_1, \ldots, x_m + h_m) - f(x_1, \ldots, x_m) = a_1(x)h_1 + \ldots + a_m(x)h_m + o(h) \quad \text{as } h \to 0, \qquad (2)$$

where $a_1(x), \ldots, a_m(x)$ are real numbers associated with the point $x$. We need to find these numbers.

Take the increment vector whose only nonzero coordinate is the $i$-th one:

$$h = h_i e_i = 0 \cdot e_1 + \ldots + 0 \cdot e_{i-1} + h_i \cdot e_i + 0 \cdot e_{i+1} + \ldots + 0 \cdot e_m,$$

where $\{e_1, \ldots, e_m\}$ is a basis in $\mathbb{R}^m$.

For this choice of $h$, (2) gives

$$f(x_1, \ldots, x_{i-1}, x_i + h_i, x_{i+1}, \ldots, x_m) - f(x_1, \ldots, x_i, \ldots, x_m) = a_i(x)h_i + o(h_i) \quad \text{as } h_i \to 0. \qquad (3)$$



From (3) we obtain

$$a_i(x) = \lim_{h_i \to 0} \frac{f(x_1, \ldots, x_{i-1}, x_i + h_i, x_{i+1}, \ldots, x_m) - f(x_1, \ldots, x_i, \ldots, x_m)}{h_i}. \qquad (4)$$


Definition 9:
The limit (4) is called the partial derivative of the function $f(x)$ at the point $x = (x_1, \ldots, x_m)$ with respect to the variable $x_i$. It is denoted

$$\frac{\partial f}{\partial x_i}(x), \quad \partial_i f(x), \quad f'_{x_i}(x).$$



Example 1:

$$f(u, v) = u^3 + v^2 \sin u,$$
$$\partial_1 f(u, v) = \frac{\partial f}{\partial u}(u, v) = 3u^2 + v^2 \cos u,$$
$$\partial_2 f(u, v) = \frac{\partial f}{\partial v}(u, v) = 2v \sin u.$$
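
The formulas of Example 1 can be sanity-checked against the limit (4) by replacing the limit with a difference quotient for a small step $h$ (the point $(u, v) = (1, 2)$ and the step size are my own choices):

```python
import math

# Numerical check of Example 1 via the partial-derivative limit (4):
# freeze one variable and take a one-dimensional difference quotient.
def f(u, v):
    return u ** 3 + v ** 2 * math.sin(u)

u, v, h = 1.0, 2.0, 1e-6
d1_numeric = (f(u + h, v) - f(u, v)) / h     # ~ 3u^2 + v^2 cos(u)
d2_numeric = (f(u, v + h) - f(u, v)) / h     # ~ 2v sin(u)

print(d1_numeric, 3 * u ** 2 + v ** 2 * math.cos(u))
print(d2_numeric, 2 * v * math.sin(u))
```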





Gradient descent


Let $f \colon \mathbb{R}^n \to \mathbb{R}$, where $\mathbb{R}^n = \underbrace{\mathbb{R} \times \mathbb{R} \times \cdots \times \mathbb{R}}_{n} = \{(\theta_1, \theta_2, \ldots, \theta_n), \; \theta_i \in \mathbb{R} \;\; \forall i \in \overline{1, n}\}$.

Definition 10:

The gradient of a function $f \colon \mathbb{R}^n \to \mathbb{R}$ is the vector whose $i$-th element equals $\frac{\partial f}{\partial \theta_i}$:

$$\nabla_\theta f = \left( \begin{array}{c} \frac{\partial f}{\partial \theta_1} \\ \frac{\partial f}{\partial \theta_2} \\ \vdots \\ \frac{\partial f}{\partial \theta_n} \end{array} \right), \quad \theta = (\theta_1, \theta_2, \ldots, \theta_n).$$


The gradient is the direction in which the function increases most rapidly. Consequently, the direction in which it decreases most rapidly is the direction opposite to the gradient, i.e. $-\nabla_\theta f$.
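
A small sketch of Definition 10 (the function $f(\theta) = \theta_1^2 + 3\theta_2^2$ and the evaluation point are illustrative assumptions of mine): the gradient is assembled coordinate by coordinate from the difference quotients of (4).

```python
# Numerical gradient: the i-th component is the partial derivative from (4),
# approximated by a difference quotient (illustrative example).
def grad(f, theta, h=1e-6):
    g = []
    for i in range(len(theta)):
        shifted = list(theta)
        shifted[i] += h
        g.append((f(shifted) - f(theta)) / h)   # i-th partial derivative
    return g

f = lambda t: t[0] ** 2 + 3 * t[1] ** 2         # f(theta) = theta_1^2 + 3*theta_2^2
print(grad(f, [1.0, 2.0]))                      # ~ [2, 12]
```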

The aim of the gradient descent method is to find the extremum (minimum) point of the function.

Denote by $\theta^{(t)}$ the parameter vector of the function at step $t$. The parameter update vector at step $t$ is

$$u^{(t)} = -\eta \nabla_\theta f(\theta^{(t-1)}), \quad \theta^{(t)} = \theta^{(t-1)} + u^{(t)}.$$


In the formula above, the parameter $\eta$ is the learning rate, which controls the size of the step we take in the direction opposite to the gradient. In particular, two opposing problems may arise: if $\eta$ is too small, convergence becomes very slow; if $\eta$ is too large, the steps may overshoot the minimum, so the method oscillates or even diverges (see the example below).
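
Before the worked example, here is a minimal sketch of the update rule above in code; the gradient is supplied analytically, and the quadratic function, the starting point, and the value of $\eta$ are my own illustrative choices, not part of the synopsis.

```python
# Gradient descent: u(t) = -eta * grad f(theta(t-1)),  theta(t) = theta(t-1) + u(t).
def gradient_descent(grad_f, theta0, eta, steps):
    theta = list(theta0)
    for _ in range(steps):
        u = [-eta * g for g in grad_f(theta)]         # update vector u(t)
        theta = [t + du for t, du in zip(theta, u)]   # theta(t) = theta(t-1) + u(t)
    return theta

# Example: f(theta) = theta_1^2 + 3*theta_2^2, so grad f = (2*theta_1, 6*theta_2)
grad_f = lambda t: [2 * t[0], 6 * t[1]]
print(gradient_descent(grad_f, [3.0, 2.0], eta=0.1, steps=100))   # -> close to [0, 0]
```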



Example:
Consider the gradient descent method in the simplest case ($n = 1$), i.e. $f \colon \mathbb{R} \to \mathbb{R}$.
Let $f(x) = x^2$, $\theta^{(0)} = 3$, $\eta = 1$. Then:

$$\frac{\partial f}{\partial x}(x) = 2x \quad \Rightarrow \quad \nabla_\theta f(x) = 2x;$$
$$\theta^{(1)} = \theta^{(0)} - 1 \cdot f'_\theta(\theta^{(0)}) = 3 - 6 = -3;$$
$$\theta^{(2)} = \theta^{(1)} - 1 \cdot f'_\theta(\theta^{(1)}) = -3 + 6 = 3 = \theta^{(0)}.$$

In the case $\eta = 1$ we run into the second of the problems described above: we constantly jump over the extremum point.
Let $\eta = 0.8$. Then:

$$\theta^{(1)} = \theta^{(0)} - 0.8 \cdot f'_\theta(\theta^{(0)}) = 3 - 0.8 \cdot 6 = 3 - 4.8 = -1.8;$$
$$\theta^{(2)} = \theta^{(1)} - 0.8 \cdot f'_\theta(\theta^{(1)}) = -1.8 + 0.8 \cdot 3.6 = -1.8 + 2.88 = 1.08;$$
$$\theta^{(3)} = \theta^{(2)} - 0.8 \cdot f'_\theta(\theta^{(2)}) = 1.08 - 0.8 \cdot 2.16 = 1.08 - 1.728 = -0.648;$$
$$\theta^{(4)} = \theta^{(3)} - 0.8 \cdot f'_\theta(\theta^{(3)}) = -0.648 + 0.8 \cdot 1.296 = -0.648 + 1.0368 = 0.3888;$$
$$\theta^{(5)} = \theta^{(4)} - 0.8 \cdot f'_\theta(\theta^{(4)}) = 0.3888 - 0.8 \cdot 0.7776 = 0.3888 - 0.62208 = -0.23328;$$
$$\theta^{(6)} = \theta^{(5)} - 0.8 \cdot f'_\theta(\theta^{(5)}) = -0.23328 + 0.8 \cdot 0.46656 = -0.23328 + 0.373248 = 0.139968.$$

We can see that the iterates gradually approach the extremum point.
Let $\eta = 0.5$. Then:

$$\theta^{(1)} = \theta^{(0)} - 0.5 \cdot f'_\theta(\theta^{(0)}) = 3 - 0.5 \cdot 6 = 3 - 3 = 0;$$
$$\theta^{(2)} = \theta^{(1)} - 0.5 \cdot f'_\theta(\theta^{(1)}) = 0 - 0.5 \cdot 0 = 0.$$

The extremum point was found in 1 step.
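
The three runs above can be reproduced with a few lines of code (a sketch of this specific example, $f(x) = x^2$, $f'(x) = 2x$, $\theta^{(0)} = 3$):

```python
# Reproducing the one-dimensional example for the three learning rates.
def run(eta, theta=3.0, steps=6):
    trajectory = [theta]
    for _ in range(steps):
        theta = theta - eta * 2 * theta   # theta(t) = theta(t-1) - eta * f'(theta(t-1))
        trajectory.append(theta)
    return trajectory

print(run(1.0))   # oscillates forever: 3, -3, 3, -3, ...
print(run(0.8))   # converges with alternating signs: 3, -1.8, 1.08, -0.648, ...
print(run(0.5))   # reaches the minimum in one step: 3, 0.0, 0.0, ...
```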

Source: https://habr.com/ru/post/474338/

