Synopsis on “Machine Learning”. Mathematical statistics. Maximum likelihood method



Let us recall some definitions from mathematical statistics.


Let a probability space $(\Omega, \Sigma, P)$ be given.

Definition 1:

A random variable $\xi = \xi(\omega)$ taking values in a set $S$ equipped with a $\sigma$-algebra of subsets $\Phi$ is any $(\Sigma, \Phi)$-measurable function $\xi \colon \Omega \to S$, i.e., for every $A \subseteq S$, $A \in \Phi$, the condition $\xi^{-1}(A) = \{\omega \in \Omega \colon \xi(\omega) \in A\} \in \Sigma$ is satisfied.

Definition 2:

The sample space is the space of all possible values of an observation or sample, together with a $\sigma$-algebra of measurable subsets of this space.
Notation: $(B, \mathscr{B})$.

Random variables $\xi, \eta, \ldots \colon \Omega \to B$ defined on the probability space $(\Omega, \Sigma, P)$ induce probability measures $P_\xi\{C\} = P\{\xi \in C\}$, $P_\eta\{C\} = P\{\eta \in C\}, \ldots$ on the sample space $(B, \mathscr{B})$. Thus, a sample space carries not a single probability measure but a finite or infinite family of probability measures.

In problems of mathematical statistics, a family of probability measures $\{P_\theta,\ \theta \in \Theta\}$ defined on the sample space is known, and one must determine from the sample which probability measure of this family the sample corresponds to.

Definition 3:

A statistical model is a collection consisting of a sample space and a family of probability measures defined on it.

Notation: $(B, \mathscr{B}, \mathscr{P})$, where $\mathscr{P} = \{P_\theta,\ \theta \in \Theta\}$.

Let $B = \mathbb{R}^n$, and let $(\mathbb{R}^n, \mathscr{B})$ be the sample space.

A sample $X = (x_1, \ldots, x_n)$ can be regarded as a collection of $n$ real numbers. We assign to each element of the sample a probability equal to $\frac{1}{n}$.

Let

$$I_x(B) = \begin{cases} 1, & x \in B \\ 0, & x \notin B \end{cases}$$


Definition 4:

The empirical distribution constructed from the sample $X$ is the probability measure $P_n$:

$$P_n(B) = \frac{1}{n} \sum_{k=1}^{n} I_{x_k}(B)$$


That is, $P_n(B)$ is the ratio of the number of sample elements that belong to $B$ to the total number of sample elements: $P_n(B) = \frac{\nu_n(B)}{n}$, where $\nu_n(B) = \sum\limits_{k=1}^{n} I(x_k \in B)$, $B \in \mathscr{B}$.
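As a quick illustration, here is a minimal sketch of computing $P_n(B)$ for an interval $B$ (the sample values and the choice of $B$ are made up for the example):

```python
import numpy as np

# Made-up sample X = (x_1, ..., x_n)
X = np.array([0.2, 1.5, -0.3, 2.1, 0.9, 1.1])

# Take B to be the interval [0, 1.2]; the comparison gives the indicators I(x_k in B)
indicators = (X >= 0.0) & (X <= 1.2)

# P_n(B) = nu_n(B) / n, the fraction of sample elements falling in B
P_n = indicators.sum() / len(X)
print(P_n)  # 3 of 6 points lie in [0, 1.2] -> 0.5
```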

Definition 5:

The sample moment of order $k$ is defined as

$$\hat{m}_k = \hat{m}_k(X) = \frac{1}{n} \sum_{j=1}^{n} x_j^k$$

$\hat{m}_1 = \overline{X} = \frac{1}{n} \sum\limits_{j=1}^{n} x_j$ is the sample mean.

Definition 6:

The sample central moment of order $k$ is defined by the equality

$$\hat{m}_k^{(0)} = \hat{m}_k^{(0)}(X) = \frac{1}{n} \sum_{j=1}^{n} (x_j - \overline{X})^k$$

$S^2 = S^2(X) = \hat{m}_2^{(0)} = \frac{1}{n} \sum\limits_{j=1}^{n} (x_j - \overline{X})^2$ is the sample variance.
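In code, the sample mean and sample variance from Definitions 5 and 6 are one-liners; a minimal sketch with a made-up sample:

```python
import numpy as np

X = np.array([2.0, 3.5, 1.0, 4.5, 3.0])
n = len(X)

mean = X.sum() / n                 # \hat{m}_1 = \overline{X}
var = ((X - mean) ** 2).sum() / n  # S^2, with divisor n (not n - 1)

# numpy's defaults use the same 1/n convention (ddof=0)
assert np.isclose(mean, np.mean(X))
assert np.isclose(var, np.var(X))
print(mean, var)  # 2.8 1.46
```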

Many machine learning tasks amount to learning, from the available data, a parameter $\theta$ that best describes that data. In mathematical statistics, the maximum likelihood method is often used to solve this kind of problem.

In real life, errors often follow a normal distribution. As partial justification, we state the central limit theorem.

Theorem 1 (CLT):

If random variables $\xi_1, \ldots, \xi_n$ are independent and identically distributed with expectation $M(\xi_i) = a$ and variance $D(\xi_i) = \sigma^2 \in (0, +\infty)$ for all $i \in \overline{1, n}$, then

$$\lim\limits_{n \to \infty} P\left\{\frac{\xi_1 + \xi_2 + \ldots + \xi_n - na}{\sigma \sqrt{n}} \leq x\right\} = F(x) = \frac{1}{\sqrt{2\pi}} \int\limits_{-\infty}^{x} e^{-u^2/2}\, du.$$
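The theorem is easy to check numerically; below is a minimal simulation sketch (the choice of uniform summands and the sample sizes are arbitrary) comparing the empirical distribution of the normalized sums with the standard normal CDF:

```python
import numpy as np

rng = np.random.default_rng(0)

n, trials = 1000, 10_000
# For Uniform(0, 1): a = 1/2, sigma^2 = 1/12
a, sigma = 0.5, np.sqrt(1 / 12)

xi = rng.uniform(0.0, 1.0, size=(trials, n))
z = (xi.sum(axis=1) - n * a) / (sigma * np.sqrt(n))

# Empirical P{Z <= x} should be close to F(x): F(0) = 0.5, F(1) ~ 0.8413
for x in (0.0, 1.0):
    print(x, (z <= x).mean())
```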


Below we formulate the maximum likelihood method and illustrate how it works using the family of normal distributions as an example.

Maximum likelihood method


Suppose that for the statistical model $(B, \mathscr{B}, \mathscr{P} = \{P_\theta,\ \theta \in \Theta\})$ two conditions are satisfied: all measures $P_\theta$ are absolutely continuous with respect to some $\sigma$-finite measure $\mu$, and the corresponding densities $f_\theta(x) = \frac{dP_\theta}{d\mu}(x)$ exist.


Definition 7:

The maximum likelihood estimate (MLE) $\hat{\theta}$ of the parameter $\theta$, corresponding to the sample $X = (x_1, \ldots, x_n)$ with empirical distribution $P_n$, is the value $\theta \in \Theta$ at which the maximum $\max\limits_{\theta \in \Theta} \int \ln f_\theta(x)\, P_n(dx) = \max\limits_{\theta \in \Theta} \frac{1}{n} \sum\limits_{i=1}^{n} \ln f_\theta(x_i)$ is attained.

Definition 8:

The function $\Lambda_\theta(X) = \prod\limits_{i=1}^{n} f_\theta(x_i)$, viewed as a function of $\theta$, is called the likelihood function, and the function $L(X, \theta) = \sum\limits_{i=1}^{n} \ln f_\theta(x_i)$ is called the log-likelihood function.

These functions attain their maxima at the same values of $\theta$, since $\ln x$ is a monotonically increasing function.

Example:

Let $\mathscr{P} = \{N(a, \sigma^2) \mid a \in \mathbb{R},\ \sigma \in (0, +\infty)\}$ be the family of normal distributions with densities $\phi_{a, \sigma^2}(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left\{-\frac{1}{2\sigma^2}(x - a)^2\right\}$. For the sample $X = (x_1, \ldots, x_n)$:

$$\Lambda_{a, \sigma}(X) = \frac{1}{(2\pi)^{\frac{n}{2}} \sigma^n} \exp\left\{-\frac{1}{2\sigma^2} \sum\limits_{i=1}^{n} (x_i - a)^2\right\};$$

$$L(X, (a, \sigma)) = -\frac{n}{2} \ln 2\pi - n \ln \sigma - \frac{1}{2\sigma^2} \sum\limits_{i=1}^{n} (x_i - a)^2;$$

$$\frac{\partial L}{\partial a} = \frac{1}{\sigma^2} \sum\limits_{i=1}^{n} (x_i - a), \quad \frac{\partial L}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum\limits_{i=1}^{n} (x_i - a)^2;$$

$$\frac{\partial L}{\partial a} = 0 \quad \Rightarrow \quad \sum\limits_{i=1}^{n} x_i - na = 0 \quad \Rightarrow \quad \frac{1}{n} \sum\limits_{i=1}^{n} x_i = \overline{X} = \hat{a};$$

$$\frac{\partial L}{\partial \sigma} = 0 \quad \Rightarrow \quad \frac{n}{\sigma} = \frac{1}{\sigma^3} \sum\limits_{i=1}^{n} (x_i - a)^2 \quad \Rightarrow \quad \hat{\sigma} = \sqrt{\frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \overline{X})^2} = \sqrt{S^2}.$$

Thus we have obtained estimates for the mathematical expectation and the variance.
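A minimal numerical check of these formulas (the true parameters below are made up): the closed-form estimates $\hat{a} = \overline{X}$ and $\hat{\sigma} = \sqrt{S^2}$ should coincide with what `scipy.stats.norm.fit`, which also maximizes the likelihood, returns:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
X = rng.normal(loc=2.0, scale=1.5, size=10_000)  # true a = 2, sigma = 1.5

# Closed-form maximum likelihood estimates derived above
a_hat = X.mean()
sigma_hat = np.sqrt(np.mean((X - a_hat) ** 2))  # note the 1/n divisor

# norm.fit maximizes the same likelihood; loc and scale should match
loc, scale = norm.fit(X)
print(a_hat, sigma_hat)
print(loc, scale)
```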

Looking closely at the formula

$$L(X, (a, \sigma)) = -\frac{n}{2} \ln 2\pi - n \ln \sigma - \frac{1}{2\sigma^2} \sum\limits_{i=1}^{n} (x_i - a)^2$$

we can conclude that the function $L(X, (a, \sigma))$ attains its maximum value when $\sum\limits_{i=1}^{n} (x_i - a)^2$ is minimal. In machine learning problems, the least squares method is often used, in which the sum of squared deviations of the predicted values from the true ones is minimized.
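The same connection in code: under a model $y_i = w x_i + b + \varepsilon_i$ with Gaussian noise $\varepsilon_i$, maximizing the likelihood in $(w, b)$ is exactly ordinary least squares. A minimal sketch with made-up data (the parameters $w$ and $b$ are for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 200)
y = 3.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)  # true w = 3, b = 1

# Least squares minimizes sum((y_i - (w * x_i + b))^2),
# which is the Gaussian MLE for (w, b)
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(w, b)  # close to 3 and 1
```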

Source: https://habr.com/ru/post/474478/

