[Mathematical Foundations of Machine Learning] (XIII) Probability and Distributions (II)
2021-07-24 05:45:02 【Binary artificial intelligence】
6 Probability and Distributions (Part 2)
6.6 Conjugacy and the Exponential Family
Many of the "named" probability distributions found in statistics textbooks were developed to model particular types of phenomena; the Gaussian distribution of Section 6.5 is one example. These distributions are also related to each other in complex ways (Leemis and McQueston, 2008). For a beginner in the field, figuring out which distribution to use can be difficult. Moreover, many of these distributions were discovered at a time when statistics and computation were done with pencil and paper, so it is natural to ask which concepts are still meaningful in the computer age (Efron and Hastie, 2016). In the previous section, we saw that many operations can be computed conveniently when the distribution is Gaussian. At this point, it is worth recalling the desiderata for manipulating probability distributions in machine learning:

There is some "closure property" when applying the rules of probability, such as Bayes' theorem. Closure means that applying a particular operation to objects of a class returns an object of the same type.

As we add newly collected data to the dataset, we do not need more parameters to describe the distribution.

Since we want to learn from data, we want parameter estimation to behave well.
It turns out that the class of distributions called the exponential family strikes a good balance between generality and favorable computation and inference properties. Before introducing the exponential family, let us look at three more "named" probability distributions: the Bernoulli distribution (Example 6.8), the binomial distribution (Example 6.9), and the Beta distribution (Example 6.10).
Example 6.8
The Bernoulli distribution is the distribution of a single binary random variable $X$ with state $x \in \{0, 1\}$. It is governed by a single continuous parameter $\mu \in [0, 1]$ that represents the probability of $X = 1$. The Bernoulli distribution $\mathrm{Ber}(\mu)$ is defined as
$p(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}, \quad x \in \{0, 1\}$
$\mathbb{E}[x] = \mu$
$\mathbb{V}[x] = \mu(1-\mu)$
where $\mathbb{E}[x]$ and $\mathbb{V}[x]$ are the mean and variance of the binary random variable $X$.
An example where the Bernoulli distribution can be used is modeling the probability of "heads" when flipping a coin.
Remark:
Rewriting the Bernoulli distribution as above, with the values 0 and 1 representing the Boolean variable and placed in the exponents, is a trick often used in machine learning textbooks. Another occurrence of this trick is in expressing the multinomial distribution.
Figure 6.10 Examples of the binomial distribution for $\mu \in \{0.1, 0.4, 0.75\}$ and $N = 15$.
Example 6.9 Binomial distribution
The binomial distribution is a generalization of the Bernoulli distribution to a distribution over integers (illustrated in Figure 6.10). In particular, the binomial can be used to describe the probability of observing $m$ occurrences of $X = 1$ in a set of $N$ samples from a Bernoulli distribution where $p(X = 1) = \mu \in [0, 1]$. The binomial distribution $\mathrm{Bin}(N, \mu)$ is defined as
$p(m \mid N, \mu) = \binom{N}{m}\mu^{m}(1-\mu)^{N-m}$
$\mathbb{E}[m] = N\mu$
$\mathbb{V}[m] = N\mu(1-\mu)$
where $\mathbb{E}[m]$ and $\mathbb{V}[m]$ are the mean and variance of $m$, respectively.
An example where the binomial distribution can be used is describing the probability of observing $m$ "heads" in $N$ coin-flip experiments when the probability of observing heads in a single experiment is $\mu$.
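The Bernoulli and binomial definitions above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the choice $N = 15$, $\mu = 0.4$ (one of the settings in Figure 6.10) are illustrative assumptions, not part of the original text.

```python
from math import comb


def binomial_pmf(m: int, N: int, mu: float) -> float:
    """p(m | N, mu) = C(N, m) * mu^m * (1 - mu)^(N - m)."""
    return comb(N, m) * mu**m * (1.0 - mu) ** (N - m)


def bernoulli_pmf(x: int, mu: float) -> float:
    """Bernoulli pmf mu^x (1 - mu)^(1 - x), i.e. the N = 1 case of the binomial."""
    return binomial_pmf(x, 1, mu)


N, mu = 15, 0.4
pmf = [binomial_pmf(m, N, mu) for m in range(N + 1)]

total = sum(pmf)                                            # close to 1
mean = sum(m * p for m, p in enumerate(pmf))                # close to N * mu
var = sum((m - mean) ** 2 * p for m, p in enumerate(pmf))   # close to N * mu * (1 - mu)
print(total, mean, var)
```

Summing over all $m$ recovers the stated mean $N\mu$ and variance $N\mu(1-\mu)$ up to floating-point error.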
Figure 6.11 Examples of the Beta distribution for different values of $\alpha$ and $\beta$.
Example 6.10 Beta distribution
We may wish to model a continuous random variable on a finite interval. The Beta distribution is a distribution over a continuous random variable $\mu \in [0, 1]$, which is often used to represent the probability of some binary event (for example, the parameter governing the Bernoulli distribution). The Beta distribution $\mathrm{Beta}(\alpha, \beta)$ (illustrated in Figure 6.11) is itself governed by two parameters $\alpha > 0$, $\beta > 0$ and is defined as
$p(\mu \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\mu^{\alpha-1}(1-\mu)^{\beta-1} \qquad (6.98)$
$\mathbb{E}[\mu] = \frac{\alpha}{\alpha+\beta}, \quad \mathbb{V}[\mu] = \frac{\alpha\beta}{(\alpha+\beta)^{2}(\alpha+\beta+1)}$
where $\Gamma(\cdot)$ is the Gamma function, defined as
$\Gamma(t) := \int_{0}^{\infty} x^{t-1}\exp(-x)\,dx, \quad t > 0$
$\Gamma(t+1) = t\,\Gamma(t)$
Note that the fraction of Gamma functions in (6.98) normalizes the Beta distribution.
Intuitively, $\alpha$ moves probability mass toward 1, whereas $\beta$ moves probability mass toward 0. There are some special cases:
 For $\alpha = 1 = \beta$, we obtain the uniform distribution $\mathcal{U}[0, 1]$.
 For $\alpha, \beta < 1$, we get a bimodal distribution with spikes at 0 and 1.
 For $\alpha, \beta > 1$, the distribution is unimodal.
 For $\alpha, \beta > 1$ and $\alpha = \beta$, the distribution is unimodal, symmetric, and centered in the interval $[0, 1]$, that is, the mean is $1/2$.
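The special cases listed above can be probed with a direct implementation of (6.98). This is a small sketch with the standard library only; the helper names and the midpoint-rule integrator are assumptions made for illustration.

```python
from math import gamma


def beta_pdf(mu: float, alpha: float, beta: float) -> float:
    """Beta density (6.98) for 0 < mu < 1 and alpha, beta > 0."""
    norm = gamma(alpha + beta) / (gamma(alpha) * gamma(beta))
    return norm * mu ** (alpha - 1.0) * (1.0 - mu) ** (beta - 1.0)


def beta_mean_var(alpha: float, beta: float):
    """Closed-form mean and variance of Beta(alpha, beta)."""
    s = alpha + beta
    return alpha / s, alpha * beta / (s**2 * (s + 1.0))


def integrate(f, n: int = 20000) -> float:
    """Midpoint rule on (0, 1); never evaluates f at the endpoints."""
    return sum(f((i + 0.5) / n) for i in range(n)) / n


uniform_value = beta_pdf(0.3, 1.0, 1.0)                  # alpha = beta = 1: uniform density
area = integrate(lambda t: beta_pdf(t, 2.0, 5.0))        # the density integrates to about 1
numeric_mean = integrate(lambda t: t * beta_pdf(t, 2.0, 5.0))
closed_mean, closed_var = beta_mean_var(2.0, 5.0)

# alpha, beta < 1: more density near the boundary than in the middle
spike_near_zero = beta_pdf(0.01, 0.5, 0.5)
middle = beta_pdf(0.5, 0.5, 0.5)
print(uniform_value, area, numeric_mean, closed_mean, spike_near_zero > middle)
```

The numerically integrated mean agrees with $\alpha/(\alpha+\beta)$, and the $\alpha, \beta < 1$ case shows the density piling up near the endpoints.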
Remark:
There is a whole zoo of named distributions, and they are related to each other in different ways (Leemis and McQueston, 2008). It is worth keeping in mind that each named distribution was created for a particular reason but may have other applications. Knowing the reason behind the creation of a particular distribution often gives insight into how best to use it. We introduced the preceding three distributions in order to illustrate the concepts of conjugacy (Section 6.6.1) and exponential families (Section 6.6.3).
6.6.1 Conjugacy
According to Bayes' theorem (6.23), the posterior is proportional to the product of the prior and the likelihood. Specifying the prior can be tricky for two reasons. First, the prior should encapsulate our knowledge about the problem before we see any data, which is often difficult to describe. Second, it is often not possible to compute the posterior distribution analytically. However, there are some priors that are computationally convenient: conjugate priors.
Definition 6.13 Conjugate prior
A prior is conjugate for the likelihood function if the posterior is of the same form/type as the prior.
Conjugacy is particularly convenient because we can calculate the posterior distribution algebraically by updating the parameters of the prior distribution.
Remark:
When considering the geometry of probability distributions, conjugate priors retain the same distance structure as the likelihood (Agarwal and Daumé III, 2010).
Example 6.11 Beta-binomial conjugacy
Consider a binomial random variable $x \sim \mathrm{Bin}(N, \mu)$, where
$p(x \mid N, \mu) = \binom{N}{x}\mu^{x}(1-\mu)^{N-x}, \quad x = 0, 1, \ldots, N$
is the probability of observing $x$ heads in $N$ coin flips, and $\mu$ is the probability of heads on a single flip. We place a Beta prior on the parameter $\mu$, that is, $\mu \sim \mathrm{Beta}(\alpha, \beta)$, where
$p(\mu \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\mu^{\alpha-1}(1-\mu)^{\beta-1}$
If we now observe an outcome $x = h$, that is, we see $h$ heads in $N$ coin flips, we can compute the posterior distribution of $\mu$ as
$p(\mu \mid x = h, N, \alpha, \beta) \propto p(x \mid N, \mu)\,p(\mu \mid \alpha, \beta)$
$\propto \mu^{h}(1-\mu)^{N-h}\,\mu^{\alpha-1}(1-\mu)^{\beta-1}$
$= \mu^{h+\alpha-1}(1-\mu)^{(N-h)+\beta-1}$
$\propto \mathrm{Beta}(h+\alpha, N-h+\beta)$
(Here $a \propto b$ means that $a$ is proportional to $b$, that is, $a = \text{constant} \cdot b$.)
That is, the posterior distribution is a Beta distribution just like the prior, so the Beta prior is conjugate for the parameter $\mu$ in the binomial likelihood function.
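As a sanity check on the derivation above, one can compare the closed-form posterior $\mathrm{Beta}(h+\alpha, N-h+\beta)$ with a posterior obtained by numerically normalizing prior times likelihood. The sketch below assumes hypothetical values $\alpha = 2$, $\beta = 3$, $h = 7$, $N = 10$ and a simple midpoint-rule integrator; none of these appear in the original text.

```python
from math import comb, gamma


def beta_pdf(mu: float, alpha: float, beta: float) -> float:
    """Beta density (6.98)."""
    norm = gamma(alpha + beta) / (gamma(alpha) * gamma(beta))
    return norm * mu ** (alpha - 1.0) * (1.0 - mu) ** (beta - 1.0)


def posterior_params(alpha: float, beta: float, h: int, N: int):
    """Conjugate update: Beta(alpha, beta) prior + h heads in N flips."""
    return alpha + h, beta + (N - h)


def numeric_posterior(mu: float, alpha: float, beta: float, h: int, N: int,
                      n: int = 20000) -> float:
    """Normalize likelihood * prior by midpoint integration over (0, 1)."""
    def unnormalized(t: float) -> float:
        likelihood = comb(N, h) * t**h * (1.0 - t) ** (N - h)
        return likelihood * beta_pdf(t, alpha, beta)

    evidence = sum(unnormalized((i + 0.5) / n) for i in range(n)) / n
    return unnormalized(mu) / evidence


alpha, beta, h, N = 2.0, 3.0, 7, 10          # illustrative values
a_post, b_post = posterior_params(alpha, beta, h, N)
closed_form = beta_pdf(0.6, a_post, b_post)
brute_force = numeric_posterior(0.6, alpha, beta, h, N)
print(a_post, b_post, closed_form, brute_force)
```

The two densities agree to within integration error, which is the practical meaning of conjugacy: the posterior is obtained by updating two numbers rather than by integrating.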
In the following example, we derive a result similar to the Beta-binomial conjugacy result. Here we show that the Beta distribution is a conjugate prior for the Bernoulli distribution.
Example 6.12 Beta-Bernoulli conjugacy
Let $x \in \{0, 1\}$ be distributed according to the Bernoulli distribution with parameter $\theta \in [0, 1]$, that is, $p(x = 1 \mid \theta) = \theta$. This can also be expressed as $p(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$. Let $\theta$ be distributed according to a Beta distribution with parameters $\alpha, \beta$, that is, $p(\theta \mid \alpha, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$.
Multiplying the Beta and the Bernoulli distributions, we get
$p(\theta \mid x, \alpha, \beta) \propto p(x \mid \theta)\,p(\theta \mid \alpha, \beta)$
$\propto \theta^{x}(1-\theta)^{1-x}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}$
$= \theta^{\alpha+x-1}(1-\theta)^{\beta+(1-x)-1}$
$\propto p(\theta \mid \alpha+x, \beta+(1-x))$
The posterior is therefore a Beta distribution with parameters $(\alpha+x, \beta+(1-x))$.
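Applying the single-observation update $(\alpha, \beta) \mapsto (\alpha + x, \beta + (1 - x))$ repeatedly gives a simple online learner for a coin's bias. The following sketch uses a made-up sequence of flips and a uniform $\mathrm{Beta}(1, 1)$ prior; both are assumptions for illustration.

```python
def bernoulli_update(alpha: float, beta: float, x: int):
    """One Beta-Bernoulli update: (alpha, beta) -> (alpha + x, beta + (1 - x))."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return alpha + x, beta + (1 - x)


flips = [1, 0, 1, 1, 0, 1, 1, 1]   # hypothetical data: 6 heads, 2 tails
alpha, beta = 1.0, 1.0             # uniform prior Beta(1, 1)
for x in flips:
    alpha, beta = bernoulli_update(alpha, beta, x)

# Processing the flips one at a time matches the batch (binomial) update
batch = (1.0 + sum(flips), 1.0 + len(flips) - sum(flips))
posterior_mean = alpha / (alpha + beta)
print(alpha, beta, batch, posterior_mean)
```

The sequential result coincides with the Beta-binomial update of Example 6.11 applied once to the whole dataset.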
Table 6.2 Examples of conjugate priors for common likelihood functions.
Table 6.2 lists examples of conjugate priors for the parameters of some standard likelihoods used in probabilistic modeling. Distributions such as the multinomial, inverse Gamma, inverse Wishart, and Dirichlet can be found in any statistical text and are described, for example, in Bishop (2006).
The Beta distribution is the conjugate prior for the parameter $\mu$ in both the binomial and the Bernoulli likelihood. For a Gaussian likelihood function, we can place a conjugate Gaussian prior on the mean. The Gaussian likelihood appears twice in the table because we need to distinguish the univariate from the multivariate case. In the univariate (scalar) case, the inverse Gamma is the conjugate prior for the variance. In the multivariate case, we use an inverse Wishart distribution as the conjugate prior on the covariance matrix. The Dirichlet distribution is the conjugate prior for the multinomial likelihood function. For further details, see Bishop (2006).
6.6.2 Sufficient Statistics
Recall that a statistic of a random variable is a deterministic function of that random variable. For example, if $x = [x_{1}, \ldots, x_{N}]^{\top}$ is a vector of univariate Gaussian random variables, that is, $x_{n} \sim \mathcal{N}(\mu, \sigma^{2})$, then the sample mean $\hat{\mu} = \frac{1}{N}(x_{1} + \cdots + x_{N})$ is a statistic.
Sir Ronald Fisher discovered the notion of sufficient statistics: the idea that there are statistics that contain all available information that can be inferred from data corresponding to the distribution under consideration. In other words, sufficient statistics carry all the information needed to make inference about the population, that is, they are the statistics that are sufficient to represent the distribution.
For a set of distributions parametrized by $\theta$, let $X$ be a random variable with distribution $p(x \mid \theta_{0})$ given an unknown $\theta_{0}$. A vector $\phi(x)$ of statistics is called sufficient statistics for $\theta_{0}$ if it contains all possible information about $\theta_{0}$. "Contains all possible information" means that the probability of $x$ given $\theta$ can be factored into a part that does not depend on $\theta$ and a part that depends on $\theta$ only via $\phi(x)$. The Fisher-Neyman factorization theorem formalizes this notion; we state it in Theorem 6.14 without proof.
Theorem 6.14 (Fisher-Neyman)
Let $X$ have probability density function $p(x \mid \theta)$. Then the statistics $\phi(x)$ are sufficient for $\theta$ if and only if $p(x \mid \theta)$ can be written in the form
$p(x \mid \theta) = h(x)\,g_{\theta}(\phi(x))$
where $h(x)$ is a distribution independent of $\theta$ and $g_{\theta}$ captures all the dependence on $\theta$ via the sufficient statistics $\phi(x)$.
If $p(x \mid \theta)$ does not depend on $\theta$, then $\phi(x)$ is trivially a sufficient statistic for any function $\phi$. The more interesting case is that $p(x \mid \theta)$ depends only on $\phi(x)$ and not on $x$ itself. In this case, $\phi(x)$ is a sufficient statistic for $\theta$.
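To make the factorization concrete, consider $N$ independent Bernoulli observations. Their joint likelihood is $\mu^{s}(1-\mu)^{N-s}$ with $s = \sum_n x_n$, so with $h(x) = 1$ the count $s$ is a sufficient statistic. The sketch below (illustrative data and function names, not from the original text) checks that two different datasets with the same count yield the same likelihood for every $\mu$ tried.

```python
def bernoulli_likelihood(xs, mu: float) -> float:
    """Joint likelihood of i.i.d. Bernoulli observations, term by term."""
    p = 1.0
    for x in xs:
        p *= mu if x == 1 else (1.0 - mu)
    return p


def likelihood_from_statistic(s: int, n: int, mu: float) -> float:
    """g_mu(phi(x)) with phi(x) = number of ones; here h(x) = 1."""
    return mu**s * (1.0 - mu) ** (n - s)


data_a = [1, 1, 0, 0, 1, 0, 1, 1]
data_b = [0, 1, 1, 1, 0, 1, 1, 0]   # different sequence, same number of ones

for mu in (0.2, 0.5, 0.9):
    la = bernoulli_likelihood(data_a, mu)
    lb = bernoulli_likelihood(data_b, mu)
    ls = likelihood_from_statistic(sum(data_a), len(data_a), mu)
    print(mu, la, lb, ls)
```

Once the count is known, the individual sequence carries no further information about $\mu$.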
In machine learning, we consider a finite number of samples from a distribution. One could imagine that for simple distributions (such as the Bernoulli in Example 6.8) we only need a small number of samples to estimate the parameters of the distribution. We could also consider the opposite problem: if we have a set of data (a sample from an unknown distribution), which distribution gives the best fit? A natural question to ask is whether we need more parameters $\theta$ to describe the distribution as we observe more data. It turns out that the answer is yes in general, and this is studied in nonparametric statistics (Wasserman, 2007). A converse question is which class of distributions has finite-dimensional sufficient statistics, that is, the number of parameters needed to describe them does not increase arbitrarily. The answer is exponential family distributions, described in the following section.
6.6.3 Exponential Family
There are three possible levels of abstraction when considering distributions (of discrete or continuous random variables). At the first and most concrete level, we have a particular named distribution with fixed parameters, for example a univariate Gaussian $\mathcal{N}(0, 1)$ with zero mean and unit variance. In machine learning, we often use the second level of abstraction, that is, we fix the parametric form (the univariate Gaussian) and infer the parameters from data. For example, we assume a univariate Gaussian $\mathcal{N}(\mu, \sigma^{2})$ with unknown mean $\mu$ and unknown variance $\sigma^{2}$, and use a maximum likelihood fit to determine the best parameters $(\mu, \sigma^{2})$. We will see an example of this when considering linear regression in Chapter 9. The third level of abstraction is to consider families of distributions, and in this book we consider the exponential family. The univariate Gaussian is an example of a member of the exponential family. Many widely used statistical models, including all the "named" models in Table 6.2, are members of the exponential family. They can all be unified into one concept (Brown, 1986).
Remark:
A brief historical anecdote: like many concepts in mathematics and science, exponential families were independently discovered at the same time by different researchers. In 1935-1936, Edwin Pitman in Tasmania, Georges Darmois in Paris, and Bernard Koopman in New York independently showed that the exponential families are the only families that have finite-dimensional sufficient statistics under repeated independent sampling (Lehmann and Casella, 1998).
An exponential family is a family of probability distributions, parametrized by $\theta \in \mathbb{R}^{D}$, of the form
$p(x \mid \theta) = h(x)\exp\left(\langle \theta, \phi(x) \rangle - A(\theta)\right) \qquad (6.107)$
where $\phi(x)$ is the vector of sufficient statistics. In general, any inner product (Section 3.2) can be used in (6.107); for concreteness, we use the standard dot product here ($\langle \theta, \phi(x) \rangle = \theta^{\top}\phi(x)$). Note that the form of the exponential family is essentially a particular expression of $g_{\theta}(\phi(x))$ in the Fisher-Neyman theorem (Theorem 6.14).
The factor $h(x)$ can be absorbed into the dot product term by adding another entry ($\ln h(x)$) to the vector of sufficient statistics $\phi(x)$ and constraining the corresponding parameter $\theta_{0} = 1$. The term $A(\theta)$ is the normalization constant that ensures that the distribution sums or integrates to one; it is called the log-partition function. A good intuitive notion of exponential families can be obtained by ignoring these two terms and considering exponential families as distributions of the form
$p(x \mid \theta) \propto \exp\left(\theta^{\top}\phi(x)\right)$
For this form of parametrization, the parameters $\theta$ are called the natural parameters. At first glance, exponential families seem to be a mundane transformation obtained by applying the exponential function to a dot product. However, the fact that $\phi(x)$ captures the information about the data has many implications that allow for convenient modeling and efficient computation.
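Example 6.13 Gaussian as exponential family
Consider the univariate Gaussian distribution $\mathcal{N}(\mu, \sigma^{2})$. Let $\phi(x) = \begin{bmatrix} x \\ x^{2} \end{bmatrix}$.
Then, by using the definition of the exponential family,
$p(x \mid \theta) \propto \exp(\theta_{1}x + \theta_{2}x^{2}) \qquad (6.109)$
Setting
$\theta = \left[\frac{\mu}{\sigma^{2}}, -\frac{1}{2\sigma^{2}}\right]^{\top} \qquad (6.110)$
and substituting into (6.109), we obtain
$p(x \mid \theta) \propto \exp\left(\frac{\mu x}{\sigma^{2}} - \frac{x^{2}}{2\sigma^{2}}\right) \propto \exp\left(-\frac{1}{2\sigma^{2}}(x-\mu)^{2}\right)$
Therefore, the univariate Gaussian distribution is a member of the exponential family with sufficient statistics $\phi(x) = \begin{bmatrix} x \\ x^{2} \end{bmatrix}$ and natural parameters $\theta$ given in (6.110).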
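The example above works up to proportionality. To evaluate the full density in the form (6.107), one also needs the log-partition function. Taking $h(x) = 1$, completing the square gives $A(\theta) = -\theta_{1}^{2}/(4\theta_{2}) + \tfrac{1}{2}\log(-\pi/\theta_{2})$; this expression is a standard result that is not stated in the text above and is filled in here as an assumption of the sketch.

```python
import math


def natural_params(mu: float, var: float):
    """Natural parameters (6.110): theta = [mu / var, -1 / (2 var)]."""
    return mu / var, -1.0 / (2.0 * var)


def log_partition(theta1: float, theta2: float) -> float:
    """A(theta) for the univariate Gaussian with h(x) = 1 (requires theta2 < 0)."""
    return -theta1**2 / (4.0 * theta2) + 0.5 * math.log(-math.pi / theta2)


def expfam_pdf(x: float, theta1: float, theta2: float) -> float:
    """exp(<theta, phi(x)> - A(theta)) with phi(x) = [x, x^2]."""
    return math.exp(theta1 * x + theta2 * x * x - log_partition(theta1, theta2))


def gaussian_pdf(x: float, mu: float, var: float) -> float:
    """Usual parametrization of the univariate Gaussian density."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


mu, var = 1.5, 0.8                     # illustrative values
theta1, theta2 = natural_params(mu, var)
for x in (-1.0, 0.0, 0.5, 2.0):
    print(x, expfam_pdf(x, theta1, theta2), gaussian_pdf(x, mu, var))
```

Both parametrizations return the same density values, confirming that (6.110) is the correct change of parameters.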
Example 6.14 Bernoulli as exponential family
Recall the Bernoulli distribution from Example 6.8:
$p(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}, \quad x \in \{0, 1\}$
This can be written in exponential family form:
$p(x \mid \mu) = \exp\left[\log\left(\mu^{x}(1-\mu)^{1-x}\right)\right]$
$= \exp\left[x\log\mu + (1-x)\log(1-\mu)\right]$
$= \exp\left[x\log\mu - x\log(1-\mu) + \log(1-\mu)\right]$
$= \exp\left[x\log\frac{\mu}{1-\mu} + \log(1-\mu)\right] \qquad (6.113d)$
The last line (6.113d) can be identified as being in exponential family form (6.107) by observing that
$h(x) = 1$
$\theta = \log\frac{\mu}{1-\mu}$
$\phi(x) = x$
$A(\theta) = -\log(1-\mu) = \log(1+\exp(\theta)) \qquad (6.117)$
The relationship between $\theta$ and $\mu$ is invertible, so that
$\mu = \frac{1}{1+\exp(-\theta)} \qquad (6.118)$
The relation (6.118) is used to obtain the right-hand equality in (6.117).
Remark:
The relationship between the original Bernoulli parameter $\mu$ and the natural parameter $\theta$ is known as the sigmoid or logistic function. Observe that $\mu \in (0, 1)$ but $\theta \in \mathbb{R}$, so the sigmoid function squeezes a real value into the range $(0, 1)$. This property is often used in machine learning, for example in logistic regression (Bishop, 2006, Section 4.3.2) and as a nonlinear activation function in neural networks (Goodfellow et al., 2016, Chapter 6).
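The relations (6.117) and (6.118) are easy to verify numerically. The sketch below uses hypothetical helper names (`logit`, `sigmoid`, `log_partition`) and a plain, non-stabilized implementation, which is adequate for moderate values of $\theta$.

```python
import math


def logit(mu: float) -> float:
    """Natural parameter theta = log(mu / (1 - mu)) for 0 < mu < 1."""
    return math.log(mu / (1.0 - mu))


def sigmoid(theta: float) -> float:
    """Inverse map (6.118): mu = 1 / (1 + exp(-theta))."""
    return 1.0 / (1.0 + math.exp(-theta))


def log_partition(theta: float) -> float:
    """A(theta) = log(1 + exp(theta)), the right-hand side of (6.117)."""
    return math.log(1.0 + math.exp(theta))


mu = 0.3
theta = logit(mu)
print(theta, sigmoid(theta))                       # the sigmoid recovers mu
print(log_partition(theta), -math.log(1.0 - mu))   # both sides of (6.117)
```

For very large $|\theta|$ a numerically stable variant would be preferable; that refinement is outside the scope of this sketch.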
It is often not obvious how to find the parametric form of the conjugate distribution of a particular distribution (for example, those in Table 6.2). Exponential families provide a convenient way to find conjugate pairs of distributions. Consider a random variable $X$ that is a member of the exponential family (6.107):
$p(x \mid \theta) = h(x)\exp\left(\langle \theta, \phi(x) \rangle - A(\theta)\right)$
Every member of the exponential family has a conjugate prior (Brown, 1986)
$p(\theta \mid \gamma) = h_{c}(\theta)\exp\left(\left\langle \begin{bmatrix} \gamma_{1} \\ \gamma_{2} \end{bmatrix}, \begin{bmatrix} \theta \\ -A(\theta) \end{bmatrix} \right\rangle - A_{c}(\gamma)\right)$
where $\gamma = \begin{bmatrix} \gamma_{1} \\ \gamma_{2} \end{bmatrix}$ has dimension $\dim(\theta) + 1$.
The sufficient statistics of the conjugate prior are $\begin{bmatrix} \theta \\ -A(\theta) \end{bmatrix}$. Using the general form of conjugate priors for exponential families, we can derive the functional form of the conjugate prior corresponding to a particular distribution.
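Example 6.15
Recall the exponential family form of the Bernoulli distribution (6.113d):
$p(x \mid \mu) = \exp\left[x\log\frac{\mu}{1-\mu} + \log(1-\mu)\right]$
We define $\gamma := [\alpha, \beta+\alpha]^{\top}$ and $h_{c}(\mu) := \frac{1}{\mu(1-\mu)}$. The canonical conjugate prior then has the form
$p(\mu \mid \alpha, \beta) = \frac{1}{\mu(1-\mu)}\exp\left[\alpha\log\frac{\mu}{1-\mu} + (\beta+\alpha)\log(1-\mu) - A_{c}(\gamma)\right] \qquad (6.122)$
Writing $\log\frac{\mu}{1-\mu} = \log\mu - \log(1-\mu)$ and moving the prefactor into the exponent as $-\log\mu - \log(1-\mu)$, equation (6.122) simplifies to
$p(\mu \mid \alpha, \beta) = \exp\left[(\alpha-1)\log\mu + (\beta-1)\log(1-\mu) - A_{c}(\alpha, \beta)\right]$
Putting this in non-exponential family form yields
$p(\mu \mid \alpha, \beta) \propto \mu^{\alpha-1}(1-\mu)^{\beta-1}$
which we identify as the Beta distribution (6.98). In Example 6.12, we assumed that the Beta distribution is the conjugate prior of the Bernoulli distribution and showed that it indeed is. In this example, we derived the form of the Beta distribution by looking at the canonical conjugate prior of the Bernoulli distribution in exponential family form.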
As mentioned in the previous section, the main motivation for studying exponential families is that they have finite-dimensional sufficient statistics. Additionally, conjugate distributions are easy to write down, and they also come from an exponential family. From an inference perspective, maximum likelihood estimation behaves well because empirical estimates of the sufficient statistics are optimal estimates of the population values of the sufficient statistics (recall the mean and covariance of a Gaussian). From an optimization perspective, the log-likelihood function of an exponential family is concave, which allows efficient optimization approaches to be applied (Chapter 7).
6.7 Change of Variables/Inverse Transform
It may seem that there are very many known distributions, but in reality the set of distributions for which we have names is quite limited. Therefore, it is often useful to understand how transformed random variables are distributed. For example, assume that $X$ is a random variable distributed according to the univariate normal distribution $\mathcal{N}(0, 1)$. What is the distribution of $X^{2}$? Another example, which is quite common in machine learning, is: given that $X_{1}$ and $X_{2}$ are univariate standard normal, what is the distribution of $\frac{1}{2}(X_{1} + X_{2})$?
One option for working out the distribution of $\frac{1}{2}(X_{1} + X_{2})$ is to calculate the mean and variance of $X_{1}$ and $X_{2}$ and then combine them. As we saw in Section 6.4.4, we can calculate the mean and variance of the resulting random variable when we consider affine transformations of random variables. However, we may not be able to obtain the functional form of the distribution under the transformation. Furthermore, we may be interested in nonlinear transformations of random variables, for which closed-form expressions are not readily available.
Remark (Notation):
In this section, we will be explicit about random variables and the values they take. Recall that we use capital letters $X$, $Y$ to denote random variables and small letters $x$, $y$ to denote the values in the target space $\mathcal{T}$ that the random variables take. We write the probability mass function of a discrete random variable $X$ as $P(X = x)$. For a continuous random variable $X$ (Section 6.2.2), the probability density function is written as $f(x)$ and the cumulative distribution function is written as $F_{X}(x)$.
We will look at two approaches for obtaining the distribution of a transformed random variable: a direct approach using the definition of the cumulative distribution function, and a change-of-variable approach that uses the chain rule of calculus (Section 5.2.2). The change-of-variable approach is widely used because it provides a "recipe" for computing the distribution that results from a transformation. We will explain the techniques for univariate random variables and only briefly state the results for the general case of multivariate random variables.
Transformations of discrete random variables can be understood directly. Suppose that there is a discrete random variable $X$ with probability mass function $P(X = x)$ (Section 6.2.1) and an invertible function $U(x)$. Consider the transformed random variable $Y := U(X)$ with probability mass function $P(Y = y)$. Then
$P(Y = y) = P(U(X) = y)$ (transformation of interest)
$= P(X = U^{-1}(y))$ (inverse) $\qquad (6.125b)$
where we can observe that $x = U^{-1}(y)$. Therefore, for discrete random variables, transformations directly change the individual events (with the probabilities appropriately transformed).
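For a discrete random variable, (6.125b) amounts to relabeling the outcomes. The sketch below represents a pmf as a dictionary; the helper `transform_pmf`, the example pmf, and the map $U(x) = 2x + 1$ are illustrative assumptions.

```python
def transform_pmf(pmf: dict, U) -> dict:
    """Return the pmf of Y = U(X): P(Y = U(x)) = P(X = x) for invertible U."""
    out = {}
    for x, p in pmf.items():
        y = U(x)
        if y in out:
            # Two states map to the same y, so U has no inverse on the support
            raise ValueError("U is not invertible on the support of X")
        out[y] = p
    return out


pmf_x = {0: 0.2, 1: 0.5, 2: 0.3}                  # hypothetical pmf of X
pmf_y = transform_pmf(pmf_x, lambda x: 2 * x + 1)
print(pmf_y)
```

Each probability is carried over unchanged to the transformed outcome; no extra scaling factor appears, in contrast to the continuous case discussed next.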
6.7.1 Distribution Function Technique
The distribution function technique goes back to first principles. It uses the definition of the cumulative distribution function (cdf) $F_{X}(x) = P(X \leqslant x)$ and the fact that its derivative is the probability density function (pdf) $f(x)$ (Wasserman, 2004, Chapter 2). For a random variable $X$ and a function $U$, we find the pdf of the random variable $Y := U(X)$ by

Finding the cdf:
$F_{Y}(y) = P(Y \leqslant y)$
Differentiating the cdf $F_{Y}(y)$ to get the pdf $f(y)$:
$f(y) = \frac{d}{dy}F_{Y}(y)$
We also need to keep in mind that the domain of the random variable may have changed due to the transformation by $U$.
Example 6.16
Let $X$ be a continuous random variable with probability density function
$f(x) = 3x^{2}, \quad 0 \leqslant x \leqslant 1$
We are interested in finding the pdf of $Y = X^{2}$.
The function $f$ is an increasing function of $x$, and therefore the resulting value of $y$ lies in the interval $[0, 1]$. We obtain
$F_{Y}(y) = P(Y \leqslant y)$ (definition of cdf)
$= P(X^{2} \leqslant y)$ (transformation of interest)
$= P(X \leqslant y^{1/2})$ (inverse)
$= F_{X}(y^{1/2})$ (definition of cdf)
$= \int_{0}^{y^{1/2}} 3t^{2}\,dt$ (cdf as a definite integral)
$= \left[t^{3}\right]_{t=0}^{t=y^{1/2}}$ (result of integration)
$= y^{3/2}, \quad 0 \leqslant y \leqslant 1$
Therefore, the cdf of $Y$ for $0 \leqslant y \leqslant 1$ is
$F_{Y}(y) = y^{3/2}$
To obtain the pdf, we differentiate the cdf. For $0 \leqslant y \leqslant 1$,
$f(y) = \frac{d}{dy}F_{Y}(y) = \frac{3}{2}y^{1/2}$
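The two steps of Example 6.16 can be reproduced numerically: integrate $f(x) = 3x^{2}$ up to $y^{1/2}$ to get the cdf of $Y$, then differentiate. The sketch below uses a midpoint rule and a central finite difference; both numerical choices and all function names are assumptions for illustration.

```python
def f_X(x: float) -> float:
    """pdf of X from Example 6.16."""
    return 3.0 * x * x if 0.0 <= x <= 1.0 else 0.0


def cdf_Y_numeric(y: float, n: int = 10000) -> float:
    """F_Y(y) = P(X <= sqrt(y)), by midpoint integration of f_X on [0, sqrt(y)]."""
    upper = y ** 0.5
    h = upper / n
    return sum(f_X((i + 0.5) * h) for i in range(n)) * h


def cdf_Y_closed(y: float) -> float:
    """Closed-form cdf F_Y(y) = y^(3/2) on [0, 1]."""
    return y ** 1.5


def pdf_Y_closed(y: float) -> float:
    """Closed-form pdf f(y) = (3/2) y^(1/2) on [0, 1]."""
    return 1.5 * y ** 0.5


def pdf_Y_numeric(y: float, eps: float = 1e-6) -> float:
    """Central finite difference of the closed-form cdf."""
    return (cdf_Y_closed(y + eps) - cdf_Y_closed(y - eps)) / (2.0 * eps)


y = 0.5
print(cdf_Y_numeric(y), cdf_Y_closed(y))
print(pdf_Y_numeric(y), pdf_Y_closed(y))
```

Both comparisons agree to several decimal places, matching the derivation step by step.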
In Example 6.16, we considered a strictly monotonically increasing function $f(x) = 3x^{2}$. This means that we could compute an inverse function (functions that have inverses are called bijective functions, Section 2.7). In general, we require that the function of interest $y = U(x)$ has an inverse $x = U^{-1}(y)$. A useful result can be obtained by taking the cumulative distribution function $F_{X}(x)$ of a random variable $X$ and using it as the transformation $U(x)$. This leads to the following theorem.
Theorem 6.15
Let $X$ be a continuous random variable with a strictly monotonic cumulative distribution function $F_{X}(x)$. Then the random variable $Y$ defined as
$Y := F_{X}(X)$
has a uniform distribution.
Theorem 6.15 is known as the probability integral transform. It is used to derive algorithms for sampling from distributions by transforming the result of sampling from a uniform random variable (Bishop, 2006). The algorithm works by first generating a sample from a uniform distribution and then transforming it by the inverse cdf (assuming this is available) to obtain a sample from the desired distribution. The probability integral transform is also used for testing whether a sample comes from a particular distribution (Lehmann and Romano, 2005). The idea that the output of a cdf is uniformly distributed also forms the basis of copulas (Nelsen, 2006).
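The sampling algorithm described above can be sketched as follows. The exponential distribution is used purely as an illustration because its cdf $F(x) = 1 - \exp(-\lambda x)$ has a closed-form inverse; the rate, seed, and sample size are arbitrary choices not taken from the text.

```python
import math
import random

RATE = 2.0  # hypothetical rate parameter of the target exponential distribution


def exponential_cdf(x: float, rate: float = RATE) -> float:
    """F(x) = 1 - exp(-rate * x) for x >= 0."""
    return 1.0 - math.exp(-rate * x)


def exponential_inverse_cdf(u: float, rate: float = RATE) -> float:
    """F^{-1}(u) = -log(1 - u) / rate for 0 <= u < 1."""
    return -math.log(1.0 - u) / rate


rng = random.Random(0)                               # fixed seed for repeatability
uniforms = [rng.random() for _ in range(100_000)]    # step 1: uniform samples
samples = [exponential_inverse_cdf(u) for u in uniforms]  # step 2: inverse cdf

sample_mean = sum(samples) / len(samples)            # should be near 1 / RATE

# Probability integral transform: F(X) should look uniform on [0, 1]
pit = [exponential_cdf(x) for x in samples]
pit_mean = sum(pit) / len(pit)                       # should be near 0.5
frac_below_quarter = sum(v <= 0.25 for v in pit) / len(pit)  # should be near 0.25
print(sample_mean, pit_mean, frac_below_quarter)
```

The first check illustrates inverse transform sampling and the last two illustrate Theorem 6.15, all up to Monte Carlo error.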
6.7.2 Change of Variables
The distribution function technique in Section 6.7.1 is derived from first principles, based on the definition of the cdf and using properties of inverses, differentiation, and integration. This argument from first principles relies on two facts:
 We can transform the cdf of $Y$ into an expression that is a cdf of $X$.
 We can differentiate the cdf to obtain the pdf.
Let us break down the reasoning step by step, with the goal of understanding the more general change-of-variables approach in Theorem 6.16.
Remark:
The name "change of variables" comes from the idea of changing the variable of integration when faced with a difficult integral. For univariate functions, we use the substitution rule of integration,
$\int f(g(x))\,g'(x)\,dx = \int f(u)\,du, \quad \text{where } u = g(x) \qquad (6.133)$
The derivation of this rule is based on the chain rule of calculus (5.32) and on applying the fundamental theorem of calculus twice. The fundamental theorem of calculus formalizes the fact that integration and differentiation are in some sense "inverses" of each other. An intuitive understanding of the rule can be obtained by thinking (loosely) about small changes (differentials) to the equation $u = g(x)$, that is, by considering $\Delta u = g'(x)\Delta x$ as a differential of $u = g(x)$. By substituting $u = g(x)$, the argument inside the integral on the right-hand side of (6.133) becomes $f(g(x))$. By pretending that the term $du$ can be approximated by $du \approx \Delta u = g'(x)\Delta x$, and that $dx \approx \Delta x$, we obtain (6.133).
Consider a univariate random variable $X$ and an invertible function $U$, which gives us another random variable $Y = U(X)$. We assume that the random variable $X$ has states $x \in [a, b]$. By the definition of the cdf, we have
$F_{Y}(y) = P(Y \leqslant y)$
We are interested in a function $U$ of the random variable,
$P(Y \leqslant y) = P(U(X) \leqslant y)$
where we assume that the function $U$ is invertible. An invertible function on an interval is either strictly increasing or strictly decreasing. If $U$ is strictly increasing, then its inverse $U^{-1}$ is also strictly increasing. By applying the inverse $U^{-1}$ to the arguments of $P(U(X) \leqslant y)$, we obtain
$P(U(X) \leqslant y) = P(U^{-1}(U(X)) \leqslant U^{-1}(y)) = P(X \leqslant U^{-1}(y)) \qquad (6.136)$
The rightmost term in (6.136) is an expression of the cdf of $X$. Recalling the definition of the cdf in terms of the pdf, we obtain
$P(X \leqslant U^{-1}(y)) = \int_{a}^{U^{-1}(y)} f(x)\,dx$
Now we have an expression for the cdf of $Y$ in terms of $x$:
$F_{Y}(y) = \int_{a}^{U^{-1}(y)} f(x)\,dx \qquad (6.138)$
To obtain the pdf, we differentiate (6.138) with respect to $y$:
$f(y) = \frac{d}{dy}F_{Y}(y) = \frac{d}{dy}\int_{a}^{U^{-1}(y)} f(x)\,dx \qquad (6.139)$
Note that the integral on the right-hand side is with respect to $x$, but we need an integral with respect to $y$ because we are differentiating with respect to $y$. Using (6.133), we get the substitution
$\int f(U^{-1}(y))\,(U^{-1})'(y)\,dy = \int f(x)\,dx, \quad \text{where } x = U^{-1}(y) \qquad (6.140)$
Using (6.140) on the right-hand side of (6.139) gives
$f(y) = \frac{d}{dy}\int_{a}^{U^{-1}(y)} f_{x}(U^{-1}(y))\,(U^{-1})'(y)\,dy$
We then recall that differentiation is a linear operator, and we use the subscript $x$ to remind ourselves that $f_{x}(U^{-1}(y))$ is a function of $x$ and not of $y$. Invoking the fundamental theorem of calculus again (the derivative of an integral with respect to its upper limit) gives
$f(y) = f_{x}(U^{-1}(y)) \cdot \left(\frac{d}{dy}U^{-1}(y)\right)$
Recall that we assumed that $U$ is a strictly increasing function. For decreasing functions, the same derivation produces a negative sign. We introduce the absolute value of the derivative so that the same expression holds for both increasing and decreasing $U$:
$f(y) = f_{x}(U^{-1}(y)) \cdot \left|\frac{d}{dy}U^{-1}(y)\right| \qquad (6.143)$
This is called the change-of-variable technique. The term $\left|\frac{d}{dy}U^{-1}(y)\right|$ in (6.143) measures how much a unit volume changes when applying $U$ (see also the definition of the Jacobian in Section 5.3).
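Formula (6.143) translates almost literally into code. The sketch below takes the density of $X$, the inverse map $U^{-1}$, and its derivative as inputs, all supplied by the caller. It reuses the density $f(x) = 3x^{2}$ from Example 6.16 for an increasing map $Y = X^{2}$ and adds a decreasing map $Y = 1 - X$ as a further illustration that is not in the original text.

```python
def change_of_variable_pdf(f_x, u_inv, du_inv_dy):
    """Build f_Y(y) = f_X(U^{-1}(y)) * |d/dy U^{-1}(y)|, following (6.143)."""
    def f_y(y: float) -> float:
        return f_x(u_inv(y)) * abs(du_inv_dy(y))
    return f_y


def f_x(x: float) -> float:
    """pdf of X from Example 6.16."""
    return 3.0 * x * x if 0.0 <= x <= 1.0 else 0.0


# Increasing map Y = X^2: U^{-1}(y) = sqrt(y), derivative 1 / (2 sqrt(y))
f_y_square = change_of_variable_pdf(f_x, lambda y: y ** 0.5, lambda y: 0.5 / y ** 0.5)

# Decreasing map Y = 1 - X: U^{-1}(y) = 1 - y, derivative -1 (hence the absolute value)
f_y_flip = change_of_variable_pdf(f_x, lambda y: 1.0 - y, lambda y: -1.0)

print(f_y_square(0.25), 1.5 * 0.25 ** 0.5)      # both equal (3/2) sqrt(y)
print(f_y_flip(0.25), 3.0 * (1.0 - 0.25) ** 2)  # both equal 3 (1 - y)^2
```

The increasing case reproduces the result $\frac{3}{2}y^{1/2}$ obtained with the distribution function technique, and the decreasing case shows why the absolute value is needed to keep the density non-negative.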
Remark:
Compared with the discrete case in (6.125b), we have an additional factor $\left|\frac{d}{dy}U^{-1}(y)\right|$. The continuous case requires more care because $P(Y = y) = 0$ for all $y$. The probability density function $f(y)$ cannot be described as the probability of an event involving $y$.
So far in this section, we have been studying univariate change of variables. The case of multivariate random variables is analogous but more complicated, because the absolute value cannot be used for multivariate functions. Instead, we use the determinant of the Jacobian matrix. Recall from (5.58) that the Jacobian is a matrix of partial derivatives and that the existence of a nonzero determinant shows that we can invert the Jacobian. Recall from the discussion in Section 4.1 that the determinant arises because our differentials (cubes of volume) are transformed into parallelepipeds by the Jacobian. Let us summarize the preceding discussion in the following theorem, which gives us a recipe for multivariate change of variables.
Theorem 6.16
Let $f(x)$ be the value of the probability density of the multivariate continuous random variable $X$. If the vector-valued function $y = U(x)$ is differentiable and invertible for all values within the domain of $x$, then for the corresponding values of $y$, the probability density of $Y = U(X)$ is given by
$f(y) = f_{x}(U^{-1}(y)) \cdot \left|\det\left(\frac{\partial}{\partial y}U^{-1}(y)\right)\right|$
The key point of this theorem is that a change of variables for a multivariate random variable follows the same procedure as the univariate change of variables. First we work out the inverse transform and substitute it into the density of $x$. Then we calculate the determinant of the Jacobian and multiply the result. The following example illustrates the case of a bivariate random variable.
Example 6.17
Consider a bivariate random variable $X$ with states $x = \begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix}$ and probability density function
$f\left(\begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix}\right) = \frac{1}{2\pi}\exp\left(-\frac{1}{2}\begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix}^{\top}\begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix}\right)$
We use the change-of-variable technique from Theorem 6.16 to derive the effect of a linear transformation (Section 2.7) of the random variable. Consider a matrix $A \in \mathbb{R}^{2 \times 2}$ defined as
$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$
We are interested in finding the probability density function of the transformed bivariate random variable $Y$ with states $y = Ax$.
Recall that for change of variables, we require the inverse transformation of $x$ as a function of $y$. Since we consider linear transformations, the inverse transformation is given by the matrix inverse (see Section 2.2.2). For $2 \times 2$ matrices, we can write out the formula explicitly:
$\begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix} = A^{-1}\begin{bmatrix} y_{1} \\ y_{2} \end{bmatrix} = \frac{1}{ad-bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}\begin{bmatrix} y_{1} \\ y_{2} \end{bmatrix}$
Observe that $ad - bc$ is the determinant of $A$ (Section 4.1). The corresponding probability density function is given by
$f(x) = f(A^{-1}y) = \frac{1}{2\pi}\exp\left(-\frac{1}{2}y^{\top}A^{-\top}A^{-1}y\right) \qquad (6.148)$
The partial derivative of a matrix times a vector with respect to the vector is the matrix itself (Section 5.5), and therefore
$\frac{\partial}{\partial y}A^{-1}y = A^{-1}$
Recall from Section 4.1 that the determinant of the inverse is the inverse of the determinant, so that the determinant of the Jacobian matrix is
$\det\left(\frac{\partial}{\partial y}A^{-1}y\right) = \frac{1}{ad-bc} \qquad (6.150)$
We can now apply the change-of-variable formula from Theorem 6.16 by multiplying (6.148) by the absolute value of (6.150), which yields
$f(y) = f(x)\left|\det\left(\frac{\partial}{\partial y}A^{-1}y\right)\right| = \frac{1}{2\pi}\exp\left(-\frac{1}{2}y^{\top}A^{-\top}A^{-1}y\right)|ad-bc|^{-1}$
Example 6.17 is based on a bivariate random variable, which allows us to compute the matrix inverse easily, but the preceding relation also holds in higher dimensions.
Remark:
We saw in Section 6.5 that the density $f(x)$ in (6.148) is actually the standard Gaussian distribution, and the transformed density $f(y)$ is a bivariate Gaussian with covariance $\Sigma = AA^{\top}$.
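The claim that the transformed density is Gaussian with covariance $\Sigma = AA^{\top}$ can be checked numerically for a specific matrix. The sketch below uses plain Python lists for the $2 \times 2$ linear algebra; the matrix $A$, the test points, and the helper names are arbitrary illustrative choices.

```python
import math


def inverse_2x2(M):
    """Return (M^{-1}, det M) for a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det


def matvec(M, v):
    """Matrix-vector product for a 2x2 matrix and a length-2 vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]


def std_normal_pdf_2d(x):
    """Standard bivariate Gaussian density from Example 6.17."""
    return math.exp(-0.5 * (x[0] ** 2 + x[1] ** 2)) / (2.0 * math.pi)


def pdf_y_change_of_variable(y, A):
    """f(y) = f_x(A^{-1} y) * |det A|^{-1}, as derived in Example 6.17."""
    A_inv, det_A = inverse_2x2(A)
    return std_normal_pdf_2d(matvec(A_inv, y)) / abs(det_A)


def pdf_y_gaussian(y, A):
    """Zero-mean bivariate Gaussian density with covariance Sigma = A A^T."""
    (a, b), (c, d) = A
    sigma = [[a * a + b * b, a * c + b * d], [a * c + b * d, c * c + d * d]]
    sigma_inv, det_sigma = inverse_2x2(sigma)
    z = matvec(sigma_inv, y)
    quad = y[0] * z[0] + y[1] * z[1]
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det_sigma))


A = [[2.0, 1.0], [0.5, 3.0]]   # hypothetical invertible matrix
for y in ([0.0, 0.0], [1.0, -2.0], [-0.5, 0.7]):
    print(y, pdf_y_change_of_variable(y, A), pdf_y_gaussian(y, A))
```

The two columns agree because $\det(AA^{\top}) = (\det A)^{2}$ and $(AA^{\top})^{-1} = A^{-\top}A^{-1}$.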
We will use the ideas in this chapter to describe probabilistic modeling in Section 8.4 and to introduce a graphical language in Section 8.5. We will see direct machine learning applications of these ideas in Chapters 9 and 11.
Translated from:
"Mathematics for Machine Learning" by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong
Copyright notice
This article was written by [Binary artificial intelligence]. Please include a link to the original when reposting. Thank you.
https://chowdera.com/2021/06/20210621151155268g.html