Synopsis of “Machine Learning”. Mathematical analysis. Gradient descent



Recalling mathematical analysis


Function Continuity and Derivative


Let $E \subseteq \mathbb{R}$, let $a$ be a limit point of the set $E$ (i.e. $\forall \varepsilon > 0 \;\; |(a - \varepsilon, a + \varepsilon) \cap E| = \infty$), and let $f \colon E \to \mathbb{R}$.

Definition 1 (Cauchy definition of the limit of a function):

A function $f \colon E \to \mathbb{R}$ tends to $A$ as $x$ tends to $a$ if

$$\forall \varepsilon > 0 \;\; \exists \delta > 0 \;\; \forall x \in E \;\; (0 < |x - a| < \delta \Rightarrow |f(x) - A| < \varepsilon).$$


Notation: $\lim\limits_{E \ni x \to a} f(x) = A$.

Definition 2:

  1. An interval is a set of the form $]a, b[\; := \{x \in \mathbb{R} \mid a < x < b\}$;
  2. An interval containing a point $x \in \mathbb{R}$ is called a neighborhood of this point.
  3. A punctured neighborhood of a point is a neighborhood of that point from which the point itself is excluded.

Notation:

  1. $V(x)$ or $U(x)$ denotes a neighborhood of the point $x$;
  2. $\overset{\circ}{U}(x)$ denotes a punctured neighborhood of the point $x$;
  3. $U_E(x) := E \cap U(x)$, $\overset{\circ}{U}_E(x) := E \cap \overset{\circ}{U}(x)$.

Definition 3 (function limit through neighborhoods):


$$\lim\limits_{E \ni x \to a} f(x) = A \;:=\; \forall V_{\mathbb{R}}(A) \;\; \exists \overset{\circ}{U}_E(a) \;\; \bigl(f(\overset{\circ}{U}_E(a)) \subset V_{\mathbb{R}}(A)\bigr).$$


Definitions 1 and 3 are equivalent.
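
As a quick numerical illustration of Definition 1 (the function, the point, and the choice of $\delta$ below are my own example, not part of the synopsis), take $f(x) = \frac{\sin x}{x}$ on $E = \mathbb{R} \setminus \{0\}$, $a = 0$, $A = 1$; since $|\sin x / x - 1| \leq x^2 / 6$, the choice $\delta = \sqrt{6\varepsilon}$ satisfies the $\varepsilon$-$\delta$ condition, which the sketch below checks on a few sample points.

```python
import math

# Illustration of the Cauchy limit definition (example chosen here, not in the source):
# f(x) = sin(x)/x on E = R \ {0} tends to A = 1 as x -> 0.
f = lambda x: math.sin(x) / x
A = 1.0

for eps in (1e-1, 1e-3, 1e-6):
    delta = math.sqrt(6 * eps)          # works because |sin(x)/x - 1| <= x^2/6
    xs = [delta * t for t in (0.9, 0.5, 0.1, -0.5, -0.9)]   # points with 0 < |x - a| < delta
    assert all(abs(f(x) - A) < eps for x in xs)
    print(f"eps={eps:g}: delta={delta:.3g} works on the sampled points")
```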

Definition 4 (continuity of a function at a point):

  1. $f \colon E \to \mathbb{R}$ is continuous at $a \in E$ $:=$

     $\forall V(f(a)) \;\; \exists U_E(a) \;\; (f(U_E(a)) \subset V(f(a)))$;
  2. $f \colon E \to \mathbb{R}$ is continuous at $a \in E$ $:=$

     $\forall \varepsilon > 0 \;\; \exists \delta > 0 \;\; \forall x \in E \;\; (|x - a| < \delta \Rightarrow |f(x) - f(a)| < \varepsilon)$.

Definitions 3 and 4 show that

($f \colon E \to \mathbb{R}$ is continuous at $a \in E$, where $a$ is a limit point of $E$) $\Leftrightarrow$ ($\lim\limits_{E \ni x \to a} f(x) = f(a)$).

Definition 5:

A function $f \colon E \to \mathbb{R}$ is called continuous on the set $E$ if it is continuous at every point of the set $E$.

Definition 6:

  1. A function $f \colon E \to \mathbb{R}$ defined on a set $E \subset \mathbb{R}$ is called differentiable at a point $a \in E$ that is a limit point of the set $E$, if there exists a function $A \cdot (x - a)$, linear with respect to the increment $x - a$ of the argument [the differential of the function $f$ at the point $a$], such that the increment $f(x) - f(a)$ of the function $f$ can be represented as

     $f(x) - f(a) = A \cdot (x - a) + o(x - a) \quad \text{as } x \to a, \; x \in E.$
  2. The value

     $f'(a) = \lim\limits_{E \ni x \to a} \frac{f(x) - f(a)}{x - a}$

     is called the derivative of the function $f$ at the point $a$.

Also,

$$f'(x) = \lim_{\substack{h \to 0 \\ x + h,\, x \in E}} \frac{f(x + h) - f(x)}{h}.$$
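
A minimal numerical sketch of Definition 6 (the function $f(x) = x^3$ and the point $x = 2$ are illustrative choices of mine): the difference quotient approaches the derivative $f'(2) = 12$ as $h \to 0$.

```python
# Difference quotient from Definition 6 approaching the derivative
# (example function chosen here, not in the source).
def f(x):
    return x ** 3          # f'(x) = 3x^2, so f'(2) = 12

x = 2.0
for h in (1e-1, 1e-3, 1e-5):
    quotient = (f(x + h) - f(x)) / h
    print(f"h={h:g}:  (f(x+h) - f(x)) / h = {quotient:.6f}")   # tends to 12
```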



Definition 7:

  1. A point $x_0 \in E \subset \mathbb{R}$ is called a point of local maximum (minimum), and the value of the function at it a local maximum (minimum) of the function $f \colon E \to \mathbb{R}$, if there exists $U_E(x_0)$ such that

     $\forall x \in U_E(x_0) \;\; f(x) \leq f(x_0)$ (respectively, $f(x) \geq f(x_0)$).
  2. Points of local maximum and minimum are called points of local extremum, and the values of the function at them are called local extrema of the function.
  3. An extremum point $x_0 \in E$ of the function $f \colon E \to \mathbb{R}$ is called an interior extremum point if $x_0$ is a limit point both of the set $E_- = \{x \in E \mid x < x_0\}$ and of the set $E_+ = \{x \in E \mid x > x_0\}$.

Lemma 1 (Fermat):

If a function $f \colon E \to \mathbb{R}$ is differentiable at an interior extremum point $x_0 \in E$, then its derivative at this point is zero: $f'(x_0) = 0$.

Proposition 1 (Rolle's theorem):
If a function $f \colon [a, b] \to \mathbb{R}$ is continuous on the segment $[a, b]$, differentiable in the interval $]a, b[$, and $f(a) = f(b)$, then there exists a point $\xi \in \, ]a, b[$ such that $f'(\xi) = 0$.

Theorem 1 (Lagrange finite increment theorem):

If a function $f \colon [a, b] \to \mathbb{R}$ is continuous on the segment $[a, b]$ and differentiable in the interval $]a, b[$, then there exists a point $\xi \in \, ]a, b[$ such that

$$f(b) - f(a) = f'(\xi)(b - a).$$
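
Theorem 1 can be checked numerically on a concrete function (my own example, assuming $f(x) = x^2$ on $[a, b] = [0, 3]$): a point $\xi$ with $f'(\xi) = \frac{f(b) - f(a)}{b - a}$ can be located by bisection, since $g(x) = f'(x) - \frac{f(b) - f(a)}{b - a}$ changes sign on $]a, b[$ in this case.

```python
# Sketch: locating the Lagrange point xi for f(x) = x^2 on [0, 3]
# (the function and the interval are illustrative assumptions).
def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x

a, b = 0.0, 3.0
slope = (f(b) - f(a)) / (b - a)       # = 3
g = lambda x: f_prime(x) - slope      # g(a) < 0 < g(b), g is continuous

lo, hi = a, b
for _ in range(60):                   # plain bisection
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)

xi = (lo + hi) / 2
print(xi)                             # -> 1.5, and indeed f'(1.5) = 3 = slope
```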


Corollary 1 (monotonicity test):
If the derivative of a function is non-negative (respectively, positive) at every point of some interval, then the function is non-decreasing (respectively, increasing) on that interval.

Corollary 2 (criterion for the constancy of a function):
A function continuous on a segment $[a, b]$ is constant on it if and only if its derivative equals zero at every point of the segment $[a, b]$ (or at least of the interval $]a, b[$).

Partial derivatives of a function of several variables


By $\mathbb{R}^m$ we denote the set

$$\mathbb{R}^m = \underbrace{\mathbb{R} \times \mathbb{R} \times \cdots \times \mathbb{R}}_{m} = \{(\omega_1, \omega_2, \ldots, \omega_m), \; \omega_i \in \mathbb{R} \;\; \forall i \in \overline{1, m}\}.$$



Definition 8:

A function $f \colon E \to \mathbb{R}$ defined on a set $E \subset \mathbb{R}^m$ is called differentiable at a point $x \in E$ that is a limit point of the set $E$, if

$$f(x + h) - f(x) = L(x)h + \alpha(x; h), \qquad (1)$$

where $L(x) \colon \mathbb{R}^m \to \mathbb{R}$ is a function linear in $h$ [the differential of the function $f$ at the point $x$, denoted $df(x)$ or $f'(x)$], and $\alpha(x; h) = o(h)$ as $h \to 0$, $x + h \in E$.

Relation (1) can be rewritten as

$$f(x + h) - f(x) = f'(x)h + \alpha(x; h)$$

or

$$\bigtriangleup f(x; h) = df(x)h + \alpha(x; h).$$


If we pass to the coordinate form of the point $x = (x_1, \ldots, x_m)$, the vector $h = (h_1, \ldots, h_m)$, and the linear function $L(x)h = a_1(x)h_1 + \ldots + a_m(x)h_m$, then equality (1) takes the form

$$f(x_1 + h_1, \ldots, x_m + h_m) - f(x_1, \ldots, x_m) = a_1(x)h_1 + \ldots + a_m(x)h_m + o(h) \quad \text{as } h \to 0, \qquad (2)$$

where $a_1(x), \ldots, a_m(x)$ are real numbers associated with the point $x$. We need to find these numbers.

Take the increment vector whose only nonzero coordinate is the $i$-th one:

$$h = h_i e_i = 0 \cdot e_1 + \ldots + 0 \cdot e_{i-1} + h_i \cdot e_i + 0 \cdot e_{i+1} + \ldots + 0 \cdot e_m,$$

where $\{e_1, \ldots, e_m\}$ is a basis in $\mathbb{R}^m$.

For this choice of $h$, (2) gives

$$f(x_1, \ldots, x_{i-1}, x_i + h_i, x_{i+1}, \ldots, x_m) - f(x_1, \ldots, x_i, \ldots, x_m) = a_i(x)h_i + o(h_i) \quad \text{as } h_i \to 0. \qquad (3)$$



From (3) we obtain

$$a_i(x) = \lim_{h_i \to 0} \frac{f(x_1, \ldots, x_{i-1}, x_i + h_i, x_{i+1}, \ldots, x_m) - f(x_1, \ldots, x_i, \ldots, x_m)}{h_i}. \qquad (4)$$


Definition 9:
The limit (4) is called the partial derivative of the function $f(x)$ at the point $x = (x_1, \ldots, x_m)$ with respect to the variable $x_i$. It is denoted

$$\frac{\partial f}{\partial x_i}(x), \quad \partial_i f(x), \quad f'_{x_i}(x).$$



Example 1:

$$f(u, v) = u^3 + v^2 \sin u,$$
$$\partial_1 f(u, v) = \frac{\partial f}{\partial u}(u, v) = 3u^2 + v^2 \cos u,$$
$$\partial_2 f(u, v) = \frac{\partial f}{\partial v}(u, v) = 2v \sin u.$$
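
The formulas of Example 1 can be sanity-checked against the limit (4) by replacing the limit with a difference quotient for a small step $h$ (the point $(u, v) = (1, 2)$ and the step size are my own choices):

```python
import math

# Numerical check of Example 1 via the partial-derivative limit (4):
# freeze one variable and take a one-dimensional difference quotient.
def f(u, v):
    return u ** 3 + v ** 2 * math.sin(u)

u, v, h = 1.0, 2.0, 1e-6
d1_numeric = (f(u + h, v) - f(u, v)) / h     # ~ 3u^2 + v^2 cos(u)
d2_numeric = (f(u, v + h) - f(u, v)) / h     # ~ 2v sin(u)

print(d1_numeric, 3 * u ** 2 + v ** 2 * math.cos(u))
print(d2_numeric, 2 * v * math.sin(u))
```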





Gradient descent


Let $f \colon \mathbb{R}^n \to \mathbb{R}$, where $\mathbb{R}^n = \underbrace{\mathbb{R} \times \mathbb{R} \times \cdots \times \mathbb{R}}_{n} = \{(\theta_1, \theta_2, \ldots, \theta_n), \; \theta_i \in \mathbb{R} \;\; \forall i \in \overline{1, n}\}$.

Definition 10:

The gradient of a function $f \colon \mathbb{R}^n \to \mathbb{R}$ is the vector whose $i$-th element equals $\frac{\partial f}{\partial \theta_i}$:

$$\nabla_\theta f = \left( \begin{array}{c} \frac{\partial f}{\partial \theta_1} \\ \frac{\partial f}{\partial \theta_2} \\ \vdots \\ \frac{\partial f}{\partial \theta_n} \end{array} \right), \quad \theta = (\theta_1, \theta_2, \ldots, \theta_n).$$


The gradient is the direction in which the function increases most rapidly. Consequently, the direction in which it decreases most rapidly is the direction opposite to the gradient, i.e. $-\nabla_\theta f$.
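
A small sketch of Definition 10 (the function $f(\theta) = \theta_1^2 + 3\theta_2^2$ and the evaluation point are illustrative assumptions of mine): the gradient is assembled coordinate by coordinate from the difference quotients of (4).

```python
# Numerical gradient: the i-th component is the partial derivative from (4),
# approximated by a difference quotient (illustrative example).
def grad(f, theta, h=1e-6):
    g = []
    for i in range(len(theta)):
        shifted = list(theta)
        shifted[i] += h
        g.append((f(shifted) - f(theta)) / h)   # i-th partial derivative
    return g

f = lambda t: t[0] ** 2 + 3 * t[1] ** 2         # f(theta) = theta_1^2 + 3*theta_2^2
print(grad(f, [1.0, 2.0]))                      # ~ [2, 12]
```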

The aim of the gradient descent method is to find the extremum (minimum) point of the function.

Denote by $\theta^{(t)}$ the parameter vector of the function at step $t$. The parameter update vector at step $t$ is

$$u^{(t)} = -\eta \nabla_\theta f(\theta^{(t-1)}), \quad \theta^{(t)} = \theta^{(t-1)} + u^{(t)}.$$


In the formula above, the parameter $\eta$ is the learning rate, which controls the size of the step we take in the direction opposite to the gradient. In particular, two opposing problems may arise: if $\eta$ is too small, convergence becomes very slow; if $\eta$ is too large, the steps may overshoot the minimum, so the method oscillates or even diverges (see the example below).
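
Before the worked example, here is a minimal sketch of the update rule above in code; the gradient is supplied analytically, and the quadratic function, the starting point, and the value of $\eta$ are my own illustrative choices, not part of the synopsis.

```python
# Gradient descent: u(t) = -eta * grad f(theta(t-1)),  theta(t) = theta(t-1) + u(t).
def gradient_descent(grad_f, theta0, eta, steps):
    theta = list(theta0)
    for _ in range(steps):
        u = [-eta * g for g in grad_f(theta)]         # update vector u(t)
        theta = [t + du for t, du in zip(theta, u)]   # theta(t) = theta(t-1) + u(t)
    return theta

# Example: f(theta) = theta_1^2 + 3*theta_2^2, so grad f = (2*theta_1, 6*theta_2)
grad_f = lambda t: [2 * t[0], 6 * t[1]]
print(gradient_descent(grad_f, [3.0, 2.0], eta=0.1, steps=100))   # -> close to [0, 0]
```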



Example:
Consider the gradient descent method in the simplest case ($n = 1$), i.e. $f \colon \mathbb{R} \to \mathbb{R}$.
Let $f(x) = x^2$, $\theta^{(0)} = 3$, $\eta = 1$. Then:

$$\frac{\partial f}{\partial x}(x) = 2x \quad \Rightarrow \quad \nabla_\theta f(x) = 2x;$$
$$\theta^{(1)} = \theta^{(0)} - 1 \cdot f'_\theta(\theta^{(0)}) = 3 - 6 = -3;$$
$$\theta^{(2)} = \theta^{(1)} - 1 \cdot f'_\theta(\theta^{(1)}) = -3 + 6 = 3 = \theta^{(0)}.$$

In the case $\eta = 1$ we run into the second of the problems described above: we constantly jump over the extremum point.
Let $\eta = 0.8$. Then:

$$\theta^{(1)} = \theta^{(0)} - 0.8 \cdot f'_\theta(\theta^{(0)}) = 3 - 0.8 \cdot 6 = 3 - 4.8 = -1.8;$$
$$\theta^{(2)} = \theta^{(1)} - 0.8 \cdot f'_\theta(\theta^{(1)}) = -1.8 + 0.8 \cdot 3.6 = -1.8 + 2.88 = 1.08;$$
$$\theta^{(3)} = \theta^{(2)} - 0.8 \cdot f'_\theta(\theta^{(2)}) = 1.08 - 0.8 \cdot 2.16 = 1.08 - 1.728 = -0.648;$$
$$\theta^{(4)} = \theta^{(3)} - 0.8 \cdot f'_\theta(\theta^{(3)}) = -0.648 + 0.8 \cdot 1.296 = -0.648 + 1.0368 = 0.3888;$$
$$\theta^{(5)} = \theta^{(4)} - 0.8 \cdot f'_\theta(\theta^{(4)}) = 0.3888 - 0.8 \cdot 0.7776 = 0.3888 - 0.62208 = -0.23328;$$
$$\theta^{(6)} = \theta^{(5)} - 0.8 \cdot f'_\theta(\theta^{(5)}) = -0.23328 + 0.8 \cdot 0.46656 = -0.23328 + 0.373248 = 0.139968.$$

We can see that the iterates gradually approach the extremum point.
Let $\eta = 0.5$. Then:

$$\theta^{(1)} = \theta^{(0)} - 0.5 \cdot f'_\theta(\theta^{(0)}) = 3 - 0.5 \cdot 6 = 3 - 3 = 0;$$
$$\theta^{(2)} = \theta^{(1)} - 0.5 \cdot f'_\theta(\theta^{(1)}) = 0 - 0.5 \cdot 0 = 0.$$

The extremum point was found in 1 step.
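
The three runs above can be reproduced with a few lines of code (a sketch of this specific example, $f(x) = x^2$, $f'(x) = 2x$, $\theta^{(0)} = 3$):

```python
# Reproducing the one-dimensional example for the three learning rates.
def run(eta, theta=3.0, steps=6):
    trajectory = [theta]
    for _ in range(steps):
        theta = theta - eta * 2 * theta   # theta(t) = theta(t-1) - eta * f'(theta(t-1))
        trajectory.append(theta)
    return trajectory

print(run(1.0))   # oscillates forever: 3, -3, 3, -3, ...
print(run(0.8))   # converges with alternating signs: 3, -1.8, 1.08, -0.648, ...
print(run(0.5))   # reaches the minimum in one step: 3, 0.0, 0.0, ...
```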

Source: https://habr.com/ru/post/474338/

