Synopsis on “Machine Learning”. Mathematical statistics. Maximum likelihood method



Let us recall some definitions from mathematical statistics.


Let a probability space $(\Omega, \Sigma, P)$ be given.

Definition 1:

A random variable $\xi = \xi(\omega)$ taking values in a set $S$ equipped with a $\sigma$-algebra of subsets $\Phi$ is any $(\Sigma, \Phi)$-measurable function $\xi \colon \Omega \to S$, i.e., for every $A \subseteq S$, $A \in \Phi$, the condition $\xi^{-1}(A) = \{\omega \in \Omega \colon \xi(\omega) \in A\} \in \Sigma$ is satisfied.

Definition 2:

The sample space is the space of all possible values of an observation or sample, together with a $\sigma$-algebra of measurable subsets of this space.
Notation: $(B, \mathscr{B})$.

Random variables $\xi, \eta, \ldots \colon \Omega \to B$ defined on the probability space $(\Omega, \Sigma, P)$ induce probability measures $P_\xi\{C\} = P\{\xi \in C\}$, $P_\eta\{C\} = P\{\eta \in C\}, \ldots$ on the sample space $(B, \mathscr{B})$. Thus, a sample space carries not a single probability measure but a finite or infinite family of probability measures.

In problems of mathematical statistics, a family of probability measures $\{P_\theta,\ \theta \in \Theta\}$ defined on the sample space is known, and one must determine from the sample which probability measure of this family the sample corresponds to.

Definition 3:

A statistical model is a collection consisting of a sample space and a family of probability measures defined on it.

Notation: $(B, \mathscr{B}, \mathscr{P})$, where $\mathscr{P} = \{P_\theta,\ \theta \in \Theta\}$.

Let $B = \mathbb{R}^n$, and let $(\mathbb{R}^n, \mathscr{B})$ be the sample space.

A sample $X = (x_1, \ldots, x_n)$ can be regarded as a collection of $n$ real numbers. We assign to each element of the sample a probability equal to $\frac{1}{n}$.

Let

$$I_x(B) = \begin{cases} 1, & x \in B \\ 0, & x \notin B \end{cases}$$


Definition 4:

The empirical distribution constructed from the sample $X$ is the probability measure $P_n$:

$$P_n(B) = \frac{1}{n} \sum_{k=1}^{n} I_{x_k}(B)$$


That is, $P_n(B)$ is the ratio of the number of sample elements that belong to $B$ to the total number of sample elements: $P_n(B) = \frac{\nu_n(B)}{n}$, where $\nu_n(B) = \sum\limits_{k=1}^{n} I(x_k \in B)$, $B \in \mathscr{B}$.
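As a quick illustration, here is a minimal sketch of computing $P_n(B)$ for an interval $B$ (the sample values and the choice of $B$ are made up for the example):

```python
import numpy as np

# Made-up sample X = (x_1, ..., x_n)
X = np.array([0.2, 1.5, -0.3, 2.1, 0.9, 1.1])

# Take B to be the interval [0, 1.2]; the comparison gives the indicators I(x_k in B)
indicators = (X >= 0.0) & (X <= 1.2)

# P_n(B) = nu_n(B) / n, the fraction of sample elements falling in B
P_n = indicators.sum() / len(X)
print(P_n)  # 3 of 6 points lie in [0, 1.2] -> 0.5
```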

Definition 5:

The sample moment of order $k$ is defined as

$$\hat{m}_k = \hat{m}_k(X) = \frac{1}{n} \sum_{j=1}^{n} x_j^k$$

$\hat{m}_1 = \overline{X} = \frac{1}{n} \sum\limits_{j=1}^{n} x_j$ is the sample mean.

Definition 6:

The sample central moment of order $k$ is defined by the equality

$$\hat{m}_k^{(0)} = \hat{m}_k^{(0)}(X) = \frac{1}{n} \sum_{j=1}^{n} (x_j - \overline{X})^k$$

$S^2 = S^2(X) = \hat{m}_2^{(0)} = \frac{1}{n} \sum\limits_{j=1}^{n} (x_j - \overline{X})^2$ is the sample variance.
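In code, the sample mean and sample variance from Definitions 5 and 6 are one-liners; a minimal sketch with a made-up sample:

```python
import numpy as np

X = np.array([2.0, 3.5, 1.0, 4.5, 3.0])
n = len(X)

mean = X.sum() / n                 # \hat{m}_1 = \overline{X}
var = ((X - mean) ** 2).sum() / n  # S^2, with divisor n (not n - 1)

# numpy's defaults use the same 1/n convention (ddof=0)
assert np.isclose(mean, np.mean(X))
assert np.isclose(var, np.var(X))
print(mean, var)  # 2.8 1.46
```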

Many machine learning tasks amount to learning, from the available data, a parameter $\theta$ that best describes that data. In mathematical statistics, the maximum likelihood method is often used to solve this kind of problem.

In real life, errors often follow a normal distribution. As partial justification, we state the central limit theorem.

Theorem 1 (CLT):

If random variables $\xi_1, \ldots, \xi_n$ are independent and identically distributed with expectation $M(\xi_i) = a$ and variance $D(\xi_i) = \sigma^2 \in (0, +\infty)$ for all $i \in \overline{1, n}$, then

$$\lim\limits_{n \to \infty} P\left\{\frac{\xi_1 + \xi_2 + \ldots + \xi_n - na}{\sigma \sqrt{n}} \leq x\right\} = F(x) = \frac{1}{\sqrt{2\pi}} \int\limits_{-\infty}^{x} e^{-u^2/2}\, du.$$
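The theorem is easy to check numerically; below is a minimal simulation sketch (the choice of uniform summands and the sample sizes are arbitrary) comparing the empirical distribution of the normalized sums with the standard normal CDF:

```python
import numpy as np

rng = np.random.default_rng(0)

n, trials = 1000, 10_000
# For Uniform(0, 1): a = 1/2, sigma^2 = 1/12
a, sigma = 0.5, np.sqrt(1 / 12)

xi = rng.uniform(0.0, 1.0, size=(trials, n))
z = (xi.sum(axis=1) - n * a) / (sigma * np.sqrt(n))

# Empirical P{Z <= x} should be close to F(x): F(0) = 0.5, F(1) ~ 0.8413
for x in (0.0, 1.0):
    print(x, (z <= x).mean())
```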


Below we formulate the maximum likelihood method and illustrate how it works using the family of normal distributions as an example.

Maximum likelihood method


Suppose that for the statistical model $(B, \mathscr{B}, \mathscr{P} = \{P_\theta,\ \theta \in \Theta\})$ two conditions are satisfied: all measures $P_\theta$ are absolutely continuous with respect to some $\sigma$-finite measure $\mu$, and the corresponding densities $f_\theta(x) = \frac{dP_\theta}{d\mu}(x)$ exist.


Definition 7:

The maximum likelihood estimate (MLE) $\hat{\theta}$ of the parameter $\theta$, corresponding to the sample $X = (x_1, \ldots, x_n)$ with empirical distribution $P_n$, is the value $\theta \in \Theta$ at which the maximum $\max\limits_{\theta \in \Theta} \int \ln f_\theta(x)\, P_n(dx) = \max\limits_{\theta \in \Theta} \frac{1}{n} \sum\limits_{i=1}^{n} \ln f_\theta(x_i)$ is attained.

Definition 8:

The function $\Lambda_\theta(X) = \prod\limits_{i=1}^{n} f_\theta(x_i)$, viewed as a function of $\theta$, is called the likelihood function, and the function $L(X, \theta) = \sum\limits_{i=1}^{n} \ln f_\theta(x_i)$ is called the log-likelihood function.

These functions attain their maxima at the same values of $\theta$, since $\ln x$ is a monotonically increasing function.

Example:

Let $\mathscr{P} = \{N(a, \sigma^2) \mid a \in \mathbb{R},\ \sigma \in (0, +\infty)\}$ be the family of normal distributions with densities $\phi_{a, \sigma^2}(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left\{-\frac{1}{2\sigma^2}(x - a)^2\right\}$. For the sample $X = (x_1, \ldots, x_n)$:

$$\Lambda_{a, \sigma}(X) = \frac{1}{(2\pi)^{\frac{n}{2}} \sigma^n} \exp\left\{-\frac{1}{2\sigma^2} \sum\limits_{i=1}^{n} (x_i - a)^2\right\};$$

$$L(X, (a, \sigma)) = -\frac{n}{2} \ln 2\pi - n \ln \sigma - \frac{1}{2\sigma^2} \sum\limits_{i=1}^{n} (x_i - a)^2;$$

$$\frac{\partial L}{\partial a} = \frac{1}{\sigma^2} \sum\limits_{i=1}^{n} (x_i - a), \quad \frac{\partial L}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum\limits_{i=1}^{n} (x_i - a)^2;$$

$$\frac{\partial L}{\partial a} = 0 \quad \Rightarrow \quad \sum\limits_{i=1}^{n} x_i - na = 0 \quad \Rightarrow \quad \frac{1}{n} \sum\limits_{i=1}^{n} x_i = \overline{X} = \hat{a};$$

$$\frac{\partial L}{\partial \sigma} = 0 \quad \Rightarrow \quad \frac{n}{\sigma} = \frac{1}{\sigma^3} \sum\limits_{i=1}^{n} (x_i - a)^2 \quad \Rightarrow \quad \hat{\sigma} = \sqrt{\frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \overline{X})^2} = \sqrt{S^2}.$$

Thus we have obtained estimates for the mathematical expectation and the variance.
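A minimal numerical check of these formulas (the true parameters below are made up): the closed-form estimates $\hat{a} = \overline{X}$ and $\hat{\sigma} = \sqrt{S^2}$ should coincide with what `scipy.stats.norm.fit`, which also maximizes the likelihood, returns:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
X = rng.normal(loc=2.0, scale=1.5, size=10_000)  # true a = 2, sigma = 1.5

# Closed-form maximum likelihood estimates derived above
a_hat = X.mean()
sigma_hat = np.sqrt(np.mean((X - a_hat) ** 2))  # note the 1/n divisor

# norm.fit maximizes the same likelihood; loc and scale should match
loc, scale = norm.fit(X)
print(a_hat, sigma_hat)
print(loc, scale)
```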

Looking closely at the formula

$$L(X, (a, \sigma)) = -\frac{n}{2} \ln 2\pi - n \ln \sigma - \frac{1}{2\sigma^2} \sum\limits_{i=1}^{n} (x_i - a)^2$$

we can conclude that the function $L(X, (a, \sigma))$ attains its maximum value when $\sum\limits_{i=1}^{n} (x_i - a)^2$ is minimal. In machine learning problems, the least squares method is often used, in which the sum of squared deviations of the predicted values from the true ones is minimized.
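The same connection in code: under a model $y_i = w x_i + b + \varepsilon_i$ with Gaussian noise $\varepsilon_i$, maximizing the likelihood in $(w, b)$ is exactly ordinary least squares. A minimal sketch with made-up data (the parameters $w$ and $b$ are for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 200)
y = 3.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)  # true w = 3, b = 1

# Least squares minimizes sum((y_i - (w * x_i + b))^2),
# which is the Gaussian MLE for (w, b)
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(w, b)  # close to 3 and 1
```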

Source: https://habr.com/ru/post/474478/

