[Mathematical Foundations of Machine Learning] (XIII) Probability and Distributions (II)

2021-07-24 05:45:02 Binary artificial intelligence

6 Probability and Distributions (Part 2)

6.6 Conjugacy and the Exponential Family

Many of the probability distributions "with names" that we find in statistics textbooks were discovered to model particular types of phenomena. For example, we have seen the Gaussian distribution in Section 6.5. These distributions are also related to each other in complex ways (Leemis and McQueston, 2008). For a beginner in the field, it can be difficult to decide which distribution to use. In addition, many of these distributions were discovered at a time when statistics and computation were done with pencil and paper. It is natural to ask which concepts are still meaningful in the computing age (Efron and Hastie, 2016). In the previous section, we saw that many operations can be computed conveniently when the distribution is Gaussian. At this point, it is worth recalling the desiderata for manipulating probability distributions in machine learning:

  1. There is some "closure property" when applying the rules of probability, e.g., Bayes' theorem. Closure means that applying a particular operation to an object of a given class returns an object of the same class.

  2. As we add newly collected data to the dataset, we do not need more parameters to describe the distribution.

  3. Since we are interested in learning from data, we want parameter estimation to behave well.

It turns out that the class of distributions called the exponential family strikes a good balance between generality and favorable computation and inference properties. Before we introduce the exponential family, let us look at three more "named" probability distributions: the Bernoulli distribution (Example 6.8), the binomial distribution (Example 6.9), and the Beta distribution (Example 6.10).

Example 6.8
The Bernoulli distribution is a distribution for a single binary random variable $X$ with state $x \in \{0, 1\}$. It is governed by a single continuous parameter $\mu \in [0, 1]$ that represents the probability of $X = 1$. The Bernoulli distribution $\text{Ber}(\mu)$ is defined as

$$p(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}, \quad x \in \{0, 1\}$$
$$\mathbb{E}[x] = \mu$$
$$\mathbb{V}[x] = \mu(1-\mu)$$
where $\mathbb{E}[x]$ and $\mathbb{V}[x]$ are the mean and variance of the binary random variable $X$.

An example where the Bernoulli distribution can be used is modeling the probability of "heads" when flipping a coin.

Remark.
Rewriting the Bernoulli distribution as above, where Boolean variables are encoded as the numerical values 0 or 1 and placed in the exponents, is a trick that is often used in machine learning textbooks. The same trick appears when expressing the multinomial distribution.

Figure 6.10 Examples of the binomial distribution for $\mu \in \{0.1, 0.4, 0.75\}$ and $N = 15$.

Example 6.9 Binomial distribution

The binomial distribution is a generalization of the Bernoulli distribution to a distribution over integers (illustrated in Figure 6.10). In particular, the binomial can be used to describe the probability of observing $m$ occurrences of $X = 1$ in a set of $N$ samples from a Bernoulli distribution where $p(X = 1) = \mu \in [0, 1]$. The binomial distribution $\text{Bin}(N, \mu)$ is defined as
$$p(m \mid N, \mu) = \binom{N}{m} \mu^{m}(1-\mu)^{N-m}$$
$$\begin{aligned}\mathbb{E}[m] &= N\mu \\ \mathbb{V}[m] &= N\mu(1-\mu)\end{aligned}$$
where $\mathbb{E}[m]$ and $\mathbb{V}[m]$ are the mean and variance of $m$, respectively.

An example where the binomial distribution can be used is describing the probability of observing $m$ "heads" in $N$ coin-flip experiments, where the probability of observing heads in a single experiment is $\mu$.
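The binomial formulas can be checked numerically. The following is a minimal sketch in plain Python; the helper name `binomial_pmf` and the values $N = 15$, $\mu = 0.4$ are assumptions chosen for illustration, not part of the original text.

```python
from math import comb


def binomial_pmf(m: int, n: int, mu: float) -> float:
    """p(m | N, mu): probability of m outcomes X = 1 in n Bernoulli(mu) trials."""
    return comb(n, m) * mu**m * (1.0 - mu) ** (n - m)


# Illustrative values (assumed for this sketch).
n, mu = 15, 0.4
pmf = [binomial_pmf(m, n, mu) for m in range(n + 1)]

# Mean and variance computed directly from the probability mass function.
mean = sum(m * p for m, p in enumerate(pmf))
var = sum((m - mean) ** 2 * p for m, p in enumerate(pmf))

print(f"sum of pmf = {sum(pmf):.6f}")  # should be close to 1
print(f"mean       = {mean:.6f}  vs  N*mu        = {n * mu:.6f}")
print(f"variance   = {var:.6f}  vs  N*mu*(1-mu) = {n * mu * (1 - mu):.6f}")
```

Setting `n = 1` recovers the Bernoulli distribution of Example 6.8, since then `binomial_pmf(1, 1, mu)` is simply `mu`.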
Figure 6.11 Examples of the Beta distribution for different values of $\alpha$ and $\beta$.

Example 6.10 Beta distribution

We may wish to model a continuous random variable on a finite interval. The Beta distribution is a distribution over a continuous random variable $\mu \in [0, 1]$, which is often used to represent the probability of some binary event (e.g., the parameter governing the Bernoulli distribution). The Beta distribution $\text{Beta}(\alpha, \beta)$ (illustrated in Figure 6.11) is itself governed by two parameters $\alpha > 0$, $\beta > 0$ and is defined as
$$p(\mu \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \mu^{\alpha-1}(1-\mu)^{\beta-1} \qquad (6.98)$$
$$\mathbb{E}[\mu] = \frac{\alpha}{\alpha+\beta}, \quad \mathbb{V}[\mu] = \frac{\alpha\beta}{(\alpha+\beta)^{2}(\alpha+\beta+1)}$$

where $\Gamma(\cdot)$ is the Gamma function, defined as
$$\begin{aligned}\Gamma(t) &:= \int_{0}^{\infty} x^{t-1} \exp(-x)\, \mathrm{d}x, \quad t > 0 \\ \Gamma(t+1) &= t\,\Gamma(t).\end{aligned}$$

Note that the fraction of Gamma functions in (6.98) normalizes the Beta distribution.

Intuitively, $\alpha$ moves probability mass toward 1, whereas $\beta$ moves probability mass toward 0. There are some special cases:

  • For $\alpha = 1 = \beta$, we obtain the uniform distribution $\mathcal{U}[0, 1]$.
  • For $\alpha, \beta < 1$, we get a bimodal distribution with spikes at 0 and 1.
  • For $\alpha, \beta > 1$, the distribution is unimodal.
  • For $\alpha, \beta > 1$ and $\alpha = \beta$, the distribution is unimodal, symmetric, and centered in the interval $[0, 1]$, i.e., the mode and the mean are at $1/2$.
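The density (6.98) and its moments can be checked with a few lines of Python. This is a rough sketch, not a library-grade implementation: the helper names `beta_pdf` and `midpoint_integral`, the midpoint rule, and the values $\alpha = 2$, $\beta = 5$ are all assumptions made for illustration.

```python
from math import gamma


def beta_pdf(mu: float, alpha: float, beta: float) -> float:
    """Density of Beta(alpha, beta) at mu in (0, 1), following (6.98)."""
    norm = gamma(alpha + beta) / (gamma(alpha) * gamma(beta))
    return norm * mu ** (alpha - 1) * (1.0 - mu) ** (beta - 1)


def midpoint_integral(f, n: int = 20_000) -> float:
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) for i in range(n))


alpha, beta = 2.0, 5.0  # illustrative values (assumed)
total = midpoint_integral(lambda m: beta_pdf(m, alpha, beta))
mean = midpoint_integral(lambda m: m * beta_pdf(m, alpha, beta))
var = midpoint_integral(lambda m: (m - mean) ** 2 * beta_pdf(m, alpha, beta))

print("integral :", total)  # should be close to 1
print("mean     :", mean, "vs", alpha / (alpha + beta))
print("variance :", var, "vs",
      alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)))
print("uniform  :", beta_pdf(0.3, 1.0, 1.0))  # alpha = beta = 1 case
```

The midpoint rule is used here only because it never evaluates the density at the endpoints 0 and 1, where it can diverge for $\alpha, \beta < 1$.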

Remark.

There is a whole zoo of distributions with names, and they are related to each other in different ways (Leemis and McQueston, 2008). It is worth keeping in mind that each named distribution was created for a particular reason but may have other applications. Knowing the reason behind the creation of a particular distribution often gives insight into how best to use it. We introduced the preceding three distributions to be able to illustrate the concepts of conjugacy (Section 6.6.1) and exponential families (Section 6.6.3).

6.6.1 Conjugacy

According to Bayes' theorem (6.23), the posterior is proportional to the product of the prior and the likelihood. Specifying the prior can be tricky for two reasons. First, the prior should encapsulate our knowledge about the problem before we see any data, which is often difficult to describe. Second, it is often not possible to compute the posterior distribution analytically. However, there are some priors that are computationally convenient: conjugate priors.

Definition 6.13 (Conjugate prior)

A prior is conjugate for the likelihood function if the posterior is of the same form/type as the prior.

Conjugacy is particularly convenient because we can compute the posterior distribution algebraically by updating the parameters of the prior distribution.

Remark.
When considering the geometry of probability distributions, conjugate priors retain the same distance structure as the likelihood (Agarwal and Daumé III, 2010).

Example 6.11 Beta-binomial conjugacy

Consider a binomial random variable $x \sim \operatorname{Bin}(N, \mu)$, where
$$p(x \mid N, \mu) = \binom{N}{x} \mu^{x}(1-\mu)^{N-x}, \quad x = 0, 1, \ldots, N$$

is the probability of finding $x$ "heads" in $N$ coin flips, where $\mu$ is the probability of heads in a single flip. We place a Beta prior on the parameter $\mu$, that is, $\mu \sim \text{Beta}(\alpha, \beta)$, where
$$p(\mu \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \mu^{\alpha-1}(1-\mu)^{\beta-1}$$

If we now observe some outcome $x = h$, that is, we see $h$ heads in $N$ coin flips, we compute the posterior distribution on $\mu$ as
$$\begin{aligned}p(\mu \mid x=h, N, \alpha, \beta) &\propto p(x \mid N, \mu)\, p(\mu \mid \alpha, \beta) \\ &\propto \mu^{h}(1-\mu)^{(N-h)} \mu^{\alpha-1}(1-\mu)^{\beta-1} \\ &= \mu^{h+\alpha-1}(1-\mu)^{(N-h)+\beta-1} \\ &\propto \operatorname{Beta}(h+\alpha, N-h+\beta)\end{aligned}$$
(Here $a \propto b$ means that $a$ is proportional to $b$, i.e., $a = \text{constant} \cdot b$.)

That is, the posterior distribution is a Beta distribution like the prior, so the Beta prior is conjugate for the parameter $\mu$ in the binomial likelihood function.
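The conjugate update in Example 6.11 amounts to adding counts to the prior parameters. Below is a small sketch of this, assuming illustrative prior parameters and data; the function names are hypothetical helpers. It also checks numerically that likelihood times prior, divided by the claimed posterior density, is the same constant for every $\mu$.

```python
from math import comb, gamma


def beta_pdf(mu, alpha, beta):
    """Density of Beta(alpha, beta) at mu, as in (6.98)."""
    norm = gamma(alpha + beta) / (gamma(alpha) * gamma(beta))
    return norm * mu ** (alpha - 1) * (1.0 - mu) ** (beta - 1)


def binomial_pmf(x, n, mu):
    """Binomial likelihood p(x | N, mu)."""
    return comb(n, x) * mu**x * (1.0 - mu) ** (n - x)


def beta_binomial_update(alpha, beta, h, n):
    """Posterior Beta parameters after observing h heads in n flips."""
    return alpha + h, beta + (n - h)


alpha, beta = 2.0, 2.0  # prior parameters (assumed for illustration)
h, n = 7, 10            # observed data (assumed for illustration)
post_alpha, post_beta = beta_binomial_update(alpha, beta, h, n)

# If the posterior really is Beta(h + alpha, N - h + beta), this ratio
# must not depend on mu (it equals the marginal likelihood of the data).
ratios = [
    binomial_pmf(h, n, mu) * beta_pdf(mu, alpha, beta)
    / beta_pdf(mu, post_alpha, post_beta)
    for mu in (0.1, 0.3, 0.5, 0.7, 0.9)
]

print("posterior parameters:", post_alpha, post_beta)
print("ratios:", ratios)
```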

In the following example, we will derive a result that is similar to the Beta-binomial conjugacy result. Here we will show that the Beta distribution is a conjugate prior for the Bernoulli distribution.

Example 6.12 Beta-Bernoulli conjugacy

Let $x \in \{0, 1\}$ be distributed according to the Bernoulli distribution with parameter $\theta \in [0, 1]$, that is, $p(x=1 \mid \theta) = \theta$. This can also be expressed as $p(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$. Let $\theta$ be distributed according to a Beta distribution with parameters $\alpha, \beta$, that is, $p(\theta \mid \alpha, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$.

Multiplying the Beta and the Bernoulli distributions, we get
$$\begin{aligned}p(\theta \mid x, \alpha, \beta) &\propto p(x \mid \theta)\, p(\theta \mid \alpha, \beta) \\ &\propto \theta^{x}(1-\theta)^{1-x} \theta^{\alpha-1}(1-\theta)^{\beta-1} \\ &= \theta^{\alpha+x-1}(1-\theta)^{\beta+(1-x)-1} \\ &\propto p(\theta \mid \alpha+x, \beta+(1-x))\end{aligned}$$

The last line is the Beta distribution with parameters $(\alpha+x, \beta+(1-x))$.
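The update $(\alpha, \beta) \mapsto (\alpha + x, \beta + (1 - x))$ can be applied one observation at a time, using each posterior as the prior for the next observation. The sketch below assumes independent coin flips with a shared parameter $\theta$; the data and the uniform prior $\text{Beta}(1, 1)$ are made up for illustration.

```python
def bernoulli_update(alpha, beta, x):
    """One conjugate update: Beta(alpha, beta) prior, observation x in {0, 1}."""
    return alpha + x, beta + (1 - x)


flips = [1, 0, 1, 1, 0, 1]  # hypothetical data, 1 = heads
alpha, beta = 1.0, 1.0      # uniform prior Beta(1, 1) (assumed)

for x in flips:
    # The posterior after each flip becomes the prior for the next one.
    alpha, beta = bernoulli_update(alpha, beta, x)

print("posterior parameters:", alpha, beta)
print("posterior mean of theta:", alpha / (alpha + beta))
```

After processing all flips, the parameters equal the prior parameters plus the number of heads and tails, which matches the batch update of Example 6.11.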

Table 6.2 Conjugate priors for common likelihood functions.

Table 6.2 lists examples of conjugate priors for the parameters of some standard likelihoods used in probabilistic modeling. Distributions such as the multinomial, inverse Gamma, inverse Wishart, and Dirichlet can be found in any statistical text and are described, for example, in Bishop (2006).

The Beta distribution is the conjugate prior for the parameter $\mu$ in both the binomial and the Bernoulli likelihood. For a Gaussian likelihood function, we can place a conjugate Gaussian prior on the mean. The Gaussian likelihood appears twice in the table because we need to distinguish the univariate from the multivariate case. In the univariate (scalar) case, the inverse Gamma is the conjugate prior for the variance. In the multivariate case, we use a conjugate inverse Wishart distribution as a prior on the covariance matrix. The Dirichlet distribution is the conjugate prior for the multinomial likelihood function. For further details, see Bishop (2006).

6.6.2 Sufficient Statistics

Recall that a statistic of a random variable is a deterministic function of that random variable. For example, if $\boldsymbol{x} = [x_{1}, \ldots, x_{N}]^{\top}$ is a vector of univariate Gaussian random variables, that is, $x_{n} \sim \mathcal{N}(\mu, \sigma^{2})$, then the sample mean $\hat{\mu} = \frac{1}{N}(x_{1} + \cdots + x_{N})$ is a statistic.

Sir Ronald Fisher discovered the notion of sufficient statistics: the idea that there are statistics that contain all available information that can be inferred from data corresponding to the distribution under consideration. In other words, sufficient statistics carry all the information needed to make inference about the population, that is, they are the statistics that are sufficient to represent the distribution.

For a set of distributions parametrized by $\theta$, let $X$ be a random variable with distribution $p(x \mid \theta_0)$ given an unknown $\theta_0$. A vector $\phi(x)$ of statistics is called a sufficient statistic for $\theta_0$ if it contains all possible information about $\theta_0$. More formally, "contains all possible information" means that the probability of $x$ given $\theta$ can be factored into a part that does not depend on $\theta$ and a part that depends on $\theta$ only via $\phi(x)$. The Fisher-Neyman factorization theorem formalizes this notion; we state it in Theorem 6.14 without proof.

Theorem 6.14 (Fisher-Neyman)

Let $X$ have probability density function $p(x \mid \theta)$. Then the statistic $\phi(x)$ is sufficient for $\theta$ if and only if $p(x \mid \theta)$ can be written in the form
$$p(x \mid \theta) = h(x)\, g_{\theta}(\phi(x))$$

where $h(x)$ does not depend on $\theta$, and $g_{\theta}$ captures all the dependence on $\theta$ via the sufficient statistic $\phi(x)$.

If $p(x \mid \theta)$ does not depend on $\theta$, then $\phi(x)$ is trivially a sufficient statistic for any function $\phi$. The more interesting case is when $p(x \mid \theta)$ depends only on $\phi(x)$ and not on $x$ itself. In this case, $\phi(x)$ is a sufficient statistic for $\theta$.
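As a concrete illustration that is not taken from the original text: for $N$ independent Bernoulli observations, the joint likelihood is $\mu^{\sum_n x_n}(1-\mu)^{N - \sum_n x_n}$, so it depends on the data only through $\phi(\boldsymbol{x}) = \sum_n x_n$ (with $h(\boldsymbol{x}) = 1$ in the factorization). The sketch below, with made-up data and hypothetical helper names, shows two different samples with the same sum giving the same likelihood for every $\mu$.

```python
def bernoulli_likelihood(data, mu):
    """Joint likelihood of independent Bernoulli(mu) observations."""
    lik = 1.0
    for x in data:
        lik *= mu**x * (1.0 - mu) ** (1 - x)
    return lik


def likelihood_from_statistic(n, s, mu):
    """The same likelihood computed only from N and phi(x) = sum of x_n."""
    return mu**s * (1.0 - mu) ** (n - s)


sample_a = [1, 1, 0, 0, 1]
sample_b = [0, 1, 1, 1, 0]  # different order, same number of ones

for mu in (0.2, 0.5, 0.8):
    print(
        mu,
        bernoulli_likelihood(sample_a, mu),
        bernoulli_likelihood(sample_b, mu),
        likelihood_from_statistic(len(sample_a), sum(sample_a), mu),
    )
```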

In machine learning, we consider a finite number of samples from a distribution. One could imagine that for simple distributions (such as the Bernoulli distribution in Example 6.8) we only need a small number of samples to estimate the parameters of the distribution. We could also consider the opposite problem: if we have a set of data (a sample from an unknown distribution), which distribution gives the best fit? A natural question to ask is whether we need more parameters $\theta$ to describe the distribution as we observe more data. It turns out that the answer is yes in general, and this is studied in nonparametric statistics (Wasserman, 2007). A converse question is which class of distributions has finite-dimensional sufficient statistics, that is, the number of parameters needed to describe them does not increase arbitrarily. The answer is exponential family distributions, described in the following section.

6.6.3 Exponential Family

There are three possible levels of abstraction we can have when considering distributions (of discrete or continuous random variables). At the first level (the most concrete one), we have a particular named distribution with fixed parameters, for example a univariate Gaussian $\mathcal{N}(0, 1)$ with zero mean and unit variance. In machine learning, we often use the second level of abstraction, that is, we fix the parametric form (the univariate Gaussian) and infer the parameters from data. For example, we assume a univariate Gaussian $\mathcal{N}(\mu, \sigma^{2})$ with unknown mean $\mu$ and unknown variance $\sigma^{2}$, and use a maximum likelihood fit to determine the best parameters $(\mu, \sigma^{2})$. We will see an example of this when considering linear regression in Chapter 9. A third level of abstraction is to consider families of distributions, and in this book, we consider the exponential family. The univariate Gaussian is an example of a member of the exponential family. Many of the widely used statistical models, including all the "named" models in Table 6.2, are members of the exponential family. They can all be unified into one concept (Brown, 1986).

Remark.
A brief historical anecdote: like many concepts in mathematics and science, exponential families were independently discovered at the same time by different researchers. In the years 1935-1936, Edwin Pitman in Tasmania, Georges Darmois in Paris, and Bernard Koopman in New York independently showed that the exponential families are the only families that enjoy finite-dimensional sufficient statistics under repeated independent sampling (Lehmann and Casella, 1998).

An exponential family is a family of probability distributions, parametrized by $\boldsymbol{\theta} \in \mathbb{R}^{D}$, of the form
$$p(\boldsymbol{x} \mid \boldsymbol{\theta}) = h(\boldsymbol{x}) \exp\left(\langle\boldsymbol{\theta}, \boldsymbol{\phi}(\boldsymbol{x})\rangle - A(\boldsymbol{\theta})\right) \qquad (6.107)$$

where $\boldsymbol{\phi}(\boldsymbol{x})$ is the vector of sufficient statistics. In general, any inner product (Section 3.2) can be used in (6.107), and for concreteness we use the standard dot product here ($\langle\boldsymbol{\theta}, \boldsymbol{\phi}(\boldsymbol{x})\rangle = \boldsymbol{\theta}^{\top} \boldsymbol{\phi}(\boldsymbol{x})$). Note that the form of the exponential family is essentially a particular expression of $g_{\theta}(\phi(x))$ in the Fisher-Neyman theorem (Theorem 6.14).

The factor $h(\boldsymbol{x})$ can be absorbed into the dot product term by adding another entry ($\ln h(\boldsymbol{x})$) to the vector of sufficient statistics $\boldsymbol{\phi}(\boldsymbol{x})$ and constraining the corresponding parameter $\theta_0 = 1$. The term $A(\boldsymbol{\theta})$ is the normalization constant that ensures that the distribution sums up or integrates to one, and is called the log-partition function. A good intuitive notion of exponential families can be obtained by ignoring these two terms and considering exponential families as distributions of the form
$$p(\boldsymbol{x} \mid \boldsymbol{\theta}) \propto \exp\left(\boldsymbol{\theta}^{\top} \boldsymbol{\phi}(\boldsymbol{x})\right)$$

For this form of parametrization, the parameters $\boldsymbol{\theta}$ are called the natural parameters. At first glance, it seems that exponential families are a mundane transformation obtained by applying the exponential function to the result of a dot product. However, there are many implications that allow for convenient modeling and efficient computation, based on the fact that $\boldsymbol{\phi}(\boldsymbol{x})$ captures the information about the data.

Example 6.13 Gaussian as exponential family

Consider the univariate Gaussian distribution $\mathcal{N}(\mu, \sigma^{2})$. Let $\boldsymbol{\phi}(x) = \begin{bmatrix} x \\ x^{2} \end{bmatrix}$.
Then, by using the definition of the exponential family,
$$p(x \mid \boldsymbol{\theta}) \propto \exp\left(\theta_{1} x + \theta_{2} x^{2}\right) \qquad (6.109)$$
Setting

$$\boldsymbol{\theta} = \left[\frac{\mu}{\sigma^{2}}, -\frac{1}{2\sigma^{2}}\right]^{\top} \qquad (6.110)$$

and substituting into (6.109), we obtain
$$p(x \mid \boldsymbol{\theta}) \propto \exp\left(\frac{\mu x}{\sigma^{2}} - \frac{x^{2}}{2\sigma^{2}}\right) \propto \exp\left(-\frac{1}{2\sigma^{2}}(x-\mu)^{2}\right)$$

Therefore, the univariate Gaussian distribution is a member of the exponential family with sufficient statistic $\boldsymbol{\phi}(x) = \begin{bmatrix} x \\ x^{2} \end{bmatrix}$ and natural parameters $\boldsymbol{\theta}$ given in (6.110).
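The mapping (6.110) between $(\mu, \sigma^2)$ and the natural parameters can be sketched in Python as follows. The function names and the numbers $\mu = 1.5$, $\sigma^2 = 0.5$ are illustrative assumptions. The check at the end confirms that the normalized Gaussian density and $\exp(\theta_1 x + \theta_2 x^2)$ differ only by a factor that does not depend on $x$.

```python
from math import exp, pi, sqrt


def natural_params(mu, sigma2):
    """(mu, sigma^2) -> (theta_1, theta_2) as in (6.110)."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)


def moment_params(theta1, theta2):
    """Inverse mapping: (theta_1, theta_2) -> (mu, sigma^2)."""
    sigma2 = -1.0 / (2.0 * theta2)
    return theta1 * sigma2, sigma2


def gaussian_pdf(x, mu, sigma2):
    """Normalized univariate Gaussian density."""
    return exp(-((x - mu) ** 2) / (2.0 * sigma2)) / sqrt(2.0 * pi * sigma2)


mu, sigma2 = 1.5, 0.5  # illustrative values (assumed)
theta1, theta2 = natural_params(mu, sigma2)

# The ratio of the density to exp(theta^T phi(x)) should be constant in x.
ratios = [
    gaussian_pdf(x, mu, sigma2) / exp(theta1 * x + theta2 * x * x)
    for x in (-1.0, 0.0, 0.7, 2.0)
]

print("natural parameters:", theta1, theta2)
print("recovered (mu, sigma^2):", moment_params(theta1, theta2))
print("ratios:", ratios)
```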

Example 6.14 Bernoulli as exponential family

Recall the Bernoulli distribution from Example 6.8:
$$p(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}, \quad x \in \{0, 1\}$$
This can be written in exponential family form:
$$\begin{aligned}p(x \mid \mu) &= \exp\left[\log\left(\mu^{x}(1-\mu)^{1-x}\right)\right] \\ &= \exp\left[x \log \mu + (1-x) \log(1-\mu)\right] \\ &= \exp\left[x \log \mu - x \log(1-\mu) + \log(1-\mu)\right] \\ &= \exp\left[x \log \frac{\mu}{1-\mu} + \log(1-\mu)\right] \qquad (6.113d)\end{aligned}$$

The last line (6.113d) can be identified as being in exponential family form (6.107) by observing that
$$h(x) = 1$$
$$\theta = \log \frac{\mu}{1-\mu}$$
$$\phi(x) = x$$
$$A(\theta) = -\log(1-\mu) = \log(1 + \exp(\theta)) \qquad (6.117)$$

The relationship between $\theta$ and $\mu$ is invertible, so that
$$\mu = \frac{1}{1 + \exp(-\theta)} \qquad (6.118)$$
The relation (6.118) is used to obtain the right-hand equality in (6.117).

Remark.
The relationship between the original Bernoulli parameter $\mu$ and the natural parameter $\theta$ is known as the sigmoid or logistic function. Observe that $\mu \in (0, 1)$ but $\theta \in \mathbb{R}$, and therefore the sigmoid function squeezes a real value into the range $(0, 1)$. This property is useful in machine learning; for example, it is used in logistic regression (Bishop, 2006, Section 4.3.2) and as a nonlinear activation function in neural networks (Goodfellow et al., 2016, Chapter 6).
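The identities (6.117) and (6.118) are easy to verify numerically. The sketch below uses hypothetical helper names (`logit`, `sigmoid`, `log_partition`, `bernoulli_expfam`) and an arbitrary $\mu = 0.3$; it is meant as an illustration, not a numerically robust implementation for very large $|\theta|$.

```python
from math import exp, log


def logit(mu):
    """Natural parameter theta = log(mu / (1 - mu))."""
    return log(mu / (1.0 - mu))


def sigmoid(theta):
    """Inverse map (6.118): mu = 1 / (1 + exp(-theta))."""
    return 1.0 / (1.0 + exp(-theta))


def log_partition(theta):
    """A(theta) = log(1 + exp(theta)), see (6.117)."""
    return log(1.0 + exp(theta))


def bernoulli_expfam(x, theta):
    """Exponential family form with h(x) = 1 and phi(x) = x."""
    return exp(theta * x - log_partition(theta))


mu = 0.3  # illustrative value (assumed)
theta = logit(mu)

print("theta:", theta, " sigmoid(theta):", sigmoid(theta))
print("A(theta):", log_partition(theta), " -log(1 - mu):", -log(1.0 - mu))
for x in (0, 1):
    # Exponential family form versus the original mu^x (1 - mu)^(1 - x).
    print(x, bernoulli_expfam(x, theta), mu**x * (1.0 - mu) ** (1 - x))
```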

It is often not obvious how to find the parametric form of the conjugate distribution of a particular distribution (for example, those in Table 6.2). Exponential families provide a convenient way to find conjugate pairs of distributions. Consider a random variable $X$ that is a member of the exponential family (6.107):

$$p(\boldsymbol{x} \mid \boldsymbol{\theta}) = h(\boldsymbol{x}) \exp\left(\langle\boldsymbol{\theta}, \boldsymbol{\phi}(\boldsymbol{x})\rangle - A(\boldsymbol{\theta})\right)$$

Every member of the exponential family has a conjugate prior (Brown, 1986)
$$p(\boldsymbol{\theta} \mid \boldsymbol{\gamma}) = h_{c}(\boldsymbol{\theta}) \exp\left(\left\langle \begin{bmatrix} \gamma_{1} \\ \gamma_{2} \end{bmatrix}, \begin{bmatrix} \boldsymbol{\theta} \\ -A(\boldsymbol{\theta}) \end{bmatrix} \right\rangle - A_{c}(\boldsymbol{\gamma})\right)$$

where $\boldsymbol{\gamma} = \begin{bmatrix} \gamma_{1} \\ \gamma_{2} \end{bmatrix}$ has dimension $\operatorname{dim}(\boldsymbol{\theta}) + 1$.

The sufficient statistics of the conjugate prior are $\begin{bmatrix} \boldsymbol{\theta} \\ -A(\boldsymbol{\theta}) \end{bmatrix}$. By using the general form of conjugate priors for exponential families, we can derive the functional forms of conjugate priors corresponding to particular distributions.

Example 6.15

Recall the exponential family form of the Bernoulli distribution (6.113d):
$$p(x \mid \mu) = \exp\left[x \log \frac{\mu}{1-\mu} + \log(1-\mu)\right]$$

We define $\boldsymbol{\gamma} := [\alpha, \beta+\alpha]^{\top}$ and $h_{c}(\mu) := \frac{1}{\mu(1-\mu)}$. Then the canonical conjugate prior has the form
$$p(\mu \mid \alpha, \beta) = \frac{1}{\mu(1-\mu)} \exp\left[\alpha \log \frac{\mu}{1-\mu} + (\beta+\alpha) \log(1-\mu) - A_{c}(\boldsymbol{\gamma})\right] \qquad (6.122)$$
Equation (6.122) simplifies to

$$p(\mu \mid \alpha, \beta) = \exp\left[(\alpha-1) \log \mu + (\beta-1) \log(1-\mu) - A_{c}(\alpha, \beta)\right]$$

Writing this in non-exponential family form yields

$$p(\mu \mid \alpha, \beta) \propto \mu^{\alpha-1}(1-\mu)^{\beta-1}$$

which we identify as the Beta distribution (6.98). In Example 6.12, we assumed that the Beta distribution is the conjugate prior of the Bernoulli distribution and showed that it was indeed the conjugate prior. In this example, we derived the form of the Beta distribution by looking at the canonical conjugate prior of the Bernoulli distribution in exponential family form.

As mentioned in the previous section, the main motivation for exponential families is that they have finite-dimensional sufficient statistics. Additionally, conjugate distributions are easy to write down, and the conjugate distributions also come from an exponential family. From an inference perspective, maximum likelihood estimation behaves nicely because empirical estimates of sufficient statistics are optimal estimates of the population values of sufficient statistics (recall the mean and covariance of a Gaussian). From an optimization perspective, the log-likelihood function of the exponential family is concave, which allows efficient optimization approaches to be applied (Chapter 7).
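For the univariate Gaussian, the remark about empirical estimates of sufficient statistics can be made concrete: the maximum likelihood estimates of $\mu$ and $\sigma^2$ follow from the sample averages of $\boldsymbol{\phi}(x) = [x, x^2]^\top$. The sketch below uses a tiny made-up dataset and a hypothetical function name; note that the resulting variance estimate is the maximum likelihood one (dividing by $N$), not the unbiased one.

```python
def gaussian_mle_from_sufficient_stats(data):
    """Maximum likelihood (mu, sigma^2) from the averages of x and x^2."""
    n = len(data)
    mean_x = sum(data) / n                  # empirical E[x]
    mean_x2 = sum(x * x for x in data) / n  # empirical E[x^2]
    return mean_x, mean_x2 - mean_x**2


data = [1.0, 2.0, 3.0, 4.0]  # hypothetical observations
mu_hat, sigma2_hat = gaussian_mle_from_sufficient_stats(data)
print("mu_hat:", mu_hat, " sigma2_hat:", sigma2_hat)
```

Only the two running averages need to be stored, regardless of how many observations arrive, which is the practical meaning of a finite-dimensional sufficient statistic.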

6.7 Change of Variables/Inverse Transform

It may seem that there are very many known distributions, but in reality the set of distributions for which we have names is quite limited. Therefore, it is often useful to understand how transformed random variables are distributed. For example, assume that $X$ is a random variable distributed according to the univariate normal distribution $\mathcal{N}(0, 1)$. What is the distribution of $X^2$? Another example, which is quite common in machine learning, is: given that $X_1$ and $X_2$ are univariate standard normal, what is the distribution of $\frac{1}{2}(X_1 + X_2)$?

One option for working out the distribution of $\frac{1}{2}(X_1 + X_2)$ is to calculate the mean and variance of $X_1$ and $X_2$ and then combine them. As we saw in Section 6.4.4, we can calculate the mean and variance of the resulting random variables when we consider affine transformations of random variables. However, we may not be able to obtain the functional form of the distribution under transformations. Furthermore, we may be interested in nonlinear transformations of random variables, for which closed-form expressions are not readily available.
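A quick simulation illustrates the "combine mean and variance" option. The sketch assumes that $X_1$ and $X_2$ are independent, which the text does not state explicitly; under that assumption the affine rules give mean $0$ and variance $\frac{1}{4}(1 + 1) = \frac{1}{2}$. The seed and sample size are arbitrary choices.

```python
import random

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 200_000

# Draw independent standard normal pairs (independence is an assumption).
samples = [0.5 * (rng.gauss(0.0, 1.0) + rng.gauss(0.0, 1.0)) for _ in range(n)]

mean = sum(samples) / n
var = sum((s - mean) ** 2 for s in samples) / n

print("empirical mean    :", mean)  # expected to be near 0
print("empirical variance:", var)   # expected to be near 0.5
```

The simulation only confirms the first two moments; it says nothing about the functional form of the distribution, which is the gap the techniques in this section address.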

Remark (Notation).

In this section, we will be explicit about random variables and the values they take. Hence, recall that we use capital letters $X$, $Y$ to denote random variables and small letters $x$, $y$ to denote the values in the target space $\mathcal{T}$ that the random variables take. We write the probability mass function (pmf) of a discrete random variable $X$ as $P(X = x)$. For continuous random variables $X$ (Section 6.2.2), the probability density function (pdf) is written as $f(x)$ and the cumulative distribution function (cdf) is written as $F_X(x)$.

We will look at two approaches for obtaining distributions of transformations of random variables: a direct approach using the definition of a cumulative distribution function, and a change-of-variable approach that uses the chain rule of calculus (Section 5.2.2). The change-of-variable approach is widely used because it provides a "recipe" for computing the resulting distribution due to a transformation. We will explain the techniques for univariate random variables and only briefly state the results for the general case of multivariate random variables.

Transformations of discrete random variables can be understood directly. Suppose that there is a discrete random variable $X$ with pmf $P(X = x)$ (Section 6.2.1) and an invertible function $U(x)$. Consider the transformed random variable $Y := U(X)$ with pmf $P(Y = y)$. Then
$$\begin{aligned}P(Y=y) &= P(U(X)=y) && \text{transformation of interest} \\ &= P\left(X=U^{-1}(y)\right) && \text{inverse} \qquad (6.125b)\end{aligned}$$

where we can observe that $x = U^{-1}(y)$. Therefore, for discrete random variables, transformations directly change the individual events (with the probabilities appropriately transformed).
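For a discrete random variable, (6.125b) simply relabels the events. A minimal sketch follows, assuming the pmf is stored as a dictionary from values to probabilities; the pmf, the helper `transform_pmf`, and the map $U(x) = 2x + 1$ are made up for illustration.

```python
def transform_pmf(pmf_x, u):
    """pmf of Y = U(X) for an invertible U: P(Y = U(x)) = P(X = x)."""
    pmf_y = {}
    for x, p in pmf_x.items():
        y = u(x)
        if y in pmf_y:
            # Two values of x map to the same y, so U has no inverse here.
            raise ValueError("U is not invertible on the support of X")
        pmf_y[y] = p
    return pmf_y


pmf_x = {0: 0.2, 1: 0.5, 2: 0.3}                   # hypothetical pmf of X
pmf_y = transform_pmf(pmf_x, lambda x: 2 * x + 1)  # hypothetical U
print(pmf_y)
```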

6.7.1 Distribution Function Technique

The distribution function technique goes back to first principles and uses the definition of a cdf, $F_{X}(x) = P(X \leqslant x)$, and the fact that its derivative is the pdf $f(x)$ (Wasserman, 2004, Chapter 2). For a random variable $X$ and a function $U$, we find the pdf of the random variable $Y := U(X)$ by:

  1. Finding the cdf:
    $$F_{Y}(y) = P(Y \leqslant y)$$

  2. Differentiating the cdf $F_{Y}(y)$ to get the pdf $f(y)$:

$$f(y) = \frac{\mathrm{d}}{\mathrm{d}y} F_{Y}(y)$$

We also need to keep in mind that the domain of the random variable may have changed due to the transformation by $U$.

Example 6.16

Let $X$ be a continuous random variable with probability density function
$$f(x)=3 x^{2}, \quad 0 \leqslant x \leqslant 1$$
We want to find the pdf of $Y=X^{2}$.

The function $f$ is increasing in $x$, and since $x \in [0,1]$, the resulting value $y = x^2$ also lies in the interval $[0,1]$. We obtain:
$$
\begin{aligned}
F_{Y}(y) &= P(Y \leqslant y) && \text{definition of cdf} \\
&= P\left(X^{2} \leqslant y\right) && \text{transformation} \\
&= P\left(X \leqslant y^{\frac{1}{2}}\right) && \text{inverse} \\
&= F_{X}\left(y^{\frac{1}{2}}\right) && \text{definition of cdf} \\
&= \int_{0}^{y^{\frac{1}{2}}} 3 t^{2} \,\mathrm{d} t && \text{cdf as a definite integral} \\
&= \left[t^{3}\right]_{t=0}^{t=y^{\frac{1}{2}}} && \text{result of integration} \\
&= y^{\frac{3}{2}}, \quad 0 \leqslant y \leqslant 1
\end{aligned}
$$

Therefore, for $0 \leqslant y \leqslant 1$, the cdf of $Y$ is
$$F_{Y}(y)=y^{\frac{3}{2}}$$

To obtain the pdf, we differentiate the cdf. For $0 \leqslant y \leqslant 1$,
$$f(y)=\frac{\mathrm{d}}{\mathrm{d} y} F_{Y}(y)=\frac{3}{2} y^{\frac{1}{2}}$$
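
The result of Example 6.16 can be checked numerically. The sketch below follows the two steps of the distribution function technique with a midpoint rule and a finite difference; the helper names, the step sizes, and the test points are illustrative assumptions, not part of the book.

```python
import math


def pdf_x(x):
    """pdf of X from Example 6.16: f(x) = 3x^2 on [0, 1]."""
    return 3.0 * x * x


def cdf_y_numeric(y, n=10_000):
    """Step 1: F_Y(y) = P(X <= sqrt(y)), integrated with the midpoint rule."""
    upper = math.sqrt(y)
    h = upper / n
    return sum(pdf_x((i + 0.5) * h) for i in range(n)) * h


def pdf_y_numeric(y, eps=1e-4):
    """Step 2: differentiate the cdf with a central finite difference."""
    return (cdf_y_numeric(y + eps) - cdf_y_numeric(y - eps)) / (2 * eps)


for y in (0.25, 0.5, 0.81):
    print(
        f"y={y}: cdf {cdf_y_numeric(y):.6f} vs {y ** 1.5:.6f}, "
        f"pdf {pdf_y_numeric(y):.6f} vs {1.5 * math.sqrt(y):.6f}"
    )
```

Each printed pair compares the numerical value with the closed forms $y^{3/2}$ and $\frac{3}{2} y^{1/2}$ derived above.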

In Example 6.16, we considered the strictly monotonically increasing function $f(x)=3x^2$. This means that we could compute an inverse function (functions that have an inverse are called bijective functions, Section 2.7). In general, we require that the function of interest $y=U(x)$ has an inverse $x=U^{-1}(y)$. A useful result is obtained by taking the cumulative distribution function $F_X(x)$ of a random variable $X$ and using it as the transformation $U(x)$. This leads to the following theorem.

Theorem 6.15

Let $X$ be a continuous random variable with a strictly monotonic cumulative distribution function $F_X(x)$. Then the random variable $Y$ defined as
$$Y:=F_{X}(X)$$
has a uniform distribution.

Theorem 6.15 is known as the probability integral transform. It is used to derive algorithms for sampling from a distribution by transforming samples drawn from a uniform random variable (Bishop, 2006). The algorithm first generates a sample from a uniform distribution and then transforms it with the inverse of the cumulative distribution function (assuming this is available) to obtain a sample from the desired distribution. The probability integral transform is also used in hypothesis tests of whether a sample comes from a particular distribution (Lehmann and Romano, 2005). The idea that the output of the cdf is uniformly distributed also forms the basis of copulas (Nelsen, 2006).
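
The sampling algorithm described above can be sketched as follows. The exponential distribution is chosen here only because its cdf, $F(x) = 1 - e^{-\lambda x}$, has a closed-form inverse; the rate, the sample size, the seed, and all function names are illustrative assumptions.

```python
import math
import random


def exponential_cdf(x, rate):
    """cdf of the exponential distribution: F(x) = 1 - exp(-rate * x)."""
    return 1.0 - math.exp(-rate * x)


def exponential_inverse_cdf(u, rate):
    """Inverse cdf: solve u = 1 - exp(-rate * x) for x."""
    return -math.log(1.0 - u) / rate


def sample_exponential(n, rate, seed=0):
    """Draw uniform samples and push them through the inverse cdf."""
    rng = random.Random(seed)
    return [exponential_inverse_cdf(rng.random(), rate) for _ in range(n)]


rate = 2.0
samples = sample_exponential(100_000, rate)
sample_mean = sum(samples) / len(samples)  # should be close to 1 / rate

# Round trip in the direction of Theorem 6.15: F_X(X) should look uniform on [0, 1].
u_values = [exponential_cdf(x, rate) for x in samples]
u_mean = sum(u_values) / len(u_values)  # should be close to 0.5

print(sample_mean, u_mean)
```

The sample mean should land near $1/\lambda$, and the mean of $F_X(X)$ near $\frac{1}{2}$, the mean of a uniform distribution on $[0, 1]$.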

6.7.2 Change of variables

The distribution function technique in Section 6.7.1 is derived from first principles. It is based on the definition of the cdf and uses the properties of inverses, differentiation, and integration. The argument rests on two facts:

  1. We can transform the cdf of $Y$ into an expression that is a cdf of $X$.
  2. We can differentiate the cdf to obtain the pdf.

Let us go through the reasoning step by step, with the goal of understanding the more general change-of-variable approach in Theorem 6.16.

Remark
The name "change of variables" comes from the idea of changing the variable of integration when faced with a difficult integral. For univariate functions, we use the substitution rule of integration:
$$\int f(g(x)) g^{\prime}(x) \,\mathrm{d} x=\int f(u) \,\mathrm{d} u, \quad \text{where} \quad u=g(x) \qquad (6.133)$$

The derivation of this rule is based on the chain rule of calculus (5.32) and on applying the fundamental theorem of calculus twice. The fundamental theorem of calculus formalizes the fact that integration and differentiation are, in a sense, "inverses" of each other. An intuitive understanding of the rule can be obtained by thinking (loosely) about small changes (differentials) to the equation $u = g(x)$, that is, by considering $\Delta u = g^{\prime}(x) \Delta x$ as a differential of $u = g(x)$. By substituting $u = g(x)$, the argument inside the integral on the right-hand side of (6.133) becomes $f(g(x))$. By pretending that the term $\mathrm{d}u$ can be approximated by $\mathrm{d}u \approx \Delta u = g^{\prime}(x) \Delta x$, and that $\mathrm{d}x \approx \Delta x$, we obtain (6.133).

Consider a univariate random variable $X$ and an invertible function $U$, which gives us another random variable $Y = U(X)$. We assume that the random variable $X$ has states $x \in [a, b]$. By the definition of the cdf, we have
$$F_{Y}(y)=P(Y \leqslant y)$$
We are interested in a function $U$ of the random variable:
$$P(Y \leqslant y)=P(U(X) \leqslant y)$$

We assume that the function $U$ is invertible. An invertible function on an interval is either strictly increasing or strictly decreasing. If $U$ is strictly increasing, then its inverse $U^{-1}$ is also strictly increasing. By applying the inverse $U^{-1}$ to both sides of the inequality inside $P(U(X) \leqslant y)$, we obtain
$$P(U(X) \leqslant y)=P\left(U^{-1}(U(X)) \leqslant U^{-1}(y)\right)=P\left(X \leqslant U^{-1}(y)\right) \qquad (6.136)$$

The right-most term in (6.136) is an expression of the cdf of $X$. Recalling the definition of the cdf in terms of the pdf, we obtain
$$P\left(X \leqslant U^{-1}(y)\right)=\int_{a}^{U^{-1}(y)} f(x) \,\mathrm{d} x$$

We now have an expression of the cdf of $Y$ in terms of $x$:
$$F_{Y}(y)=\int_{a}^{U^{-1}(y)} f(x) \,\mathrm{d} x \qquad (6.138)$$

To obtain the pdf, we differentiate (6.138) with respect to $y$:
$$f(y)=\frac{\mathrm{d}}{\mathrm{d} y} F_{Y}(y)=\frac{\mathrm{d}}{\mathrm{d} y} \int_{a}^{U^{-1}(y)} f(x) \,\mathrm{d} x \qquad (6.139)$$

Note that the integral on the right-hand side is with respect to $x$, but we need an integral with respect to $y$ because we are differentiating with respect to $y$. Using (6.133), we get the substitution
$$\int f\left(U^{-1}(y)\right) \left(U^{-1}\right)^{\prime}(y) \,\mathrm{d} y=\int f(x) \,\mathrm{d} x, \quad \text{where} \quad x=U^{-1}(y) \qquad (6.140)$$

Using (6.140) on the right-hand side of (6.139) gives
$$f(y)=\frac{\mathrm{d}}{\mathrm{d} y} \int_{a}^{U^{-1}(y)} f_{x}\left(U^{-1}(y)\right) \left(U^{-1}\right)^{\prime}(y) \,\mathrm{d} y$$

We then recall that differentiation is a linear operator, and we use the subscript $x$ to remind ourselves that $f_x(U^{-1}(y))$ is a function of $x$ and not of $y$. Invoking the fundamental theorem of calculus again (the derivative of an integral with respect to its upper limit) gives
$$f(y)=f_{x}\left(U^{-1}(y)\right) \cdot\left(\frac{\mathrm{d}}{\mathrm{d} y} U^{-1}(y)\right)$$

Recall that we assumed that $U$ is a strictly increasing function. For decreasing functions, the same derivation produces a negative sign. We introduce the absolute value of the derivative so that the same expression holds for both increasing and decreasing $U$:
$$f(y)=f_{x}\left(U^{-1}(y)\right) \cdot\left|\frac{\mathrm{d}}{\mathrm{d} y} U^{-1}(y)\right| \qquad (6.143)$$

This is called the change-of-variable technique. The term $\left|\frac{\mathrm{d}}{\mathrm{d} y} U^{-1}(y)\right|$ in (6.143) measures how much a unit volume changes when $U$ is applied (see also the definition of the Jacobian in Section 5.3).
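
Formula (6.143) can be evaluated numerically, as in the minimal sketch below. The helper `change_of_variable_pdf` is a hypothetical name, and the derivative of $U^{-1}$ is approximated by a central finite difference. The first call revisits Example 6.16; the second call uses a decreasing map, $Y = -\log X$ with $X$ uniform on $(0, 1)$, which is an illustrative assumption rather than an example from the text, to show why the absolute value is needed.

```python
import math


def change_of_variable_pdf(pdf_x, u_inverse, y, eps=1e-6):
    """Evaluate f(y) = f_x(U^{-1}(y)) * |d/dy U^{-1}(y)| as in (6.143)."""
    # Central finite difference approximation of the derivative of U^{-1}.
    derivative = (u_inverse(y + eps) - u_inverse(y - eps)) / (2 * eps)
    return pdf_x(u_inverse(y)) * abs(derivative)


# Increasing case (Example 6.16): f(x) = 3x^2 on [0, 1], Y = X^2, U^{-1}(y) = sqrt(y).
increasing = change_of_variable_pdf(lambda x: 3 * x * x, math.sqrt, 0.49)

# Decreasing case (assumed example): X uniform on (0, 1), Y = -log(X),
# so U^{-1}(y) = exp(-y) and the density of Y should be exp(-y).
decreasing = change_of_variable_pdf(lambda x: 1.0, lambda y: math.exp(-y), 1.3)

print(increasing, 1.5 * math.sqrt(0.49))
print(decreasing, math.exp(-1.3))
```

Without the absolute value, the decreasing case would return a negative number, which cannot be a density.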

Remark
In comparison to the discrete case in (6.125b), we have an additional factor $\left|\frac{\mathrm{d}}{\mathrm{d} y} U^{-1}(y)\right|$. The continuous case requires more care because $P(Y=y)=0$ for all $y$. The probability density function $f(y)$ cannot be interpreted as the probability of an event involving $y$.

So far in this section, we have studied the univariate change of variables. The case of multivariate random variables is analogous, but it is complicated by the fact that the absolute value cannot be used for multivariate functions. Instead, we use the determinant of the Jacobian matrix. Recall from (5.58) that the Jacobian is a matrix of partial derivatives, and that a nonzero determinant shows that we can invert the Jacobian. Recall also the discussion in Section 4.1: the determinant arises because our differentials (cubes of volume) are transformed into parallelepipeds by the Jacobian. We summarize the preceding discussion in the following theorem, which gives us a recipe for the multivariate change of variables.

Theorem 6.16

Let $f(\boldsymbol{x})$ be the value of the probability density of the multivariate continuous random variable $X$. If the vector-valued function $\boldsymbol{y} = U(\boldsymbol{x})$ is differentiable and invertible for all values within the domain of $\boldsymbol{x}$, then for the corresponding values of $\boldsymbol{y}$, the probability density of $Y = U(X)$ is given by
$$f(\boldsymbol{y})=f_{\boldsymbol{x}}\left(U^{-1}(\boldsymbol{y})\right) \cdot\left|\operatorname{det}\left(\frac{\partial}{\partial \boldsymbol{y}} U^{-1}(\boldsymbol{y})\right)\right|$$

The key point of this theorem is that the change of variables for a multivariate random variable follows the same procedure as in the univariate case. First, we work out the inverse transformation and substitute it into the density of $\boldsymbol{x}$. Then we calculate the determinant of the Jacobian and multiply the two to obtain the result. The following example illustrates the case of a bivariate random variable.

Example 6.17

Consider a bivariate random variable $X$ with states $\boldsymbol{x}=\left[\begin{array}{l}x_{1} \\x_{2}\end{array}\right]$ and probability density function
$$f\left(\left[\begin{array}{l}x_{1} \\x_{2}\end{array}\right]\right)=\frac{1}{2 \pi} \exp \left(-\frac{1}{2}\left[\begin{array}{l}x_{1} \\x_{2}\end{array}\right]^{\top}\left[\begin{array}{l}x_{1} \\x_{2}\end{array}\right]\right)$$

We use the change-of-variable technique from Theorem 6.16 to derive the effect of a linear transformation (Section 2.7) of the random variable. Consider a matrix $\boldsymbol{A} \in \mathbb{R}^{2 \times 2}$ defined as
$$\boldsymbol{A}=\left[\begin{array}{ll}a & b \\c & d\end{array}\right]$$
We are interested in the probability density function of the transformed bivariate random variable $Y$ with states $\boldsymbol{y}=\boldsymbol{A}\boldsymbol{x}$.

Recall that for the change of variables we require the inverse transformation, which expresses $\boldsymbol{x}$ as a function of $\boldsymbol{y}$. Since we consider a linear transformation, the inverse transformation is given by the matrix inverse (see Section 2.2.2). For $2 \times 2$ matrices, we can write out the formula explicitly:
$$\left[\begin{array}{l}x_{1} \\x_{2}\end{array}\right]=\boldsymbol{A}^{-1}\left[\begin{array}{l}y_{1} \\y_{2}\end{array}\right]=\frac{1}{a d-b c}\left[\begin{array}{cc}d & -b \\-c & a\end{array}\right]\left[\begin{array}{l}y_{1} \\y_{2}\end{array}\right]$$

Observe that $ad - bc$ is the determinant of $\boldsymbol{A}$ (Section 4.1). The corresponding probability density function is
$$f(\boldsymbol{x})=f\left(\boldsymbol{A}^{-1} \boldsymbol{y}\right)=\frac{1}{2 \pi} \exp \left(-\frac{1}{2} \boldsymbol{y}^{\top} \boldsymbol{A}^{-\top} \boldsymbol{A}^{-1} \boldsymbol{y}\right) \qquad (6.148)$$

The partial derivative of a matrix times a vector with respect to the vector is the matrix itself (Section 5.5), and therefore
$$\frac{\partial}{\partial \boldsymbol{y}} \boldsymbol{A}^{-1} \boldsymbol{y}=\boldsymbol{A}^{-1}$$

Recall from Section 4.1 that the determinant of the inverse of a matrix is the inverse of its determinant, so the determinant of the Jacobian matrix is
$$\operatorname{det}\left(\frac{\partial}{\partial \boldsymbol{y}} \boldsymbol{A}^{-1} \boldsymbol{y}\right)=\frac{1}{a d-b c} \qquad (6.150)$$

We can now apply the change-of-variable formula from Theorem 6.16 by multiplying (6.148) with the absolute value of (6.150), which yields
$$\begin{aligned}f(\boldsymbol{y}) &=f(\boldsymbol{x})\left|\operatorname{det}\left(\frac{\partial}{\partial \boldsymbol{y}} \boldsymbol{A}^{-1} \boldsymbol{y}\right)\right| \\&=\frac{1}{2 \pi} \exp \left(-\frac{1}{2} \boldsymbol{y}^{\top} \boldsymbol{A}^{-\top} \boldsymbol{A}^{-1} \boldsymbol{y}\right)|a d-b c|^{-1}\end{aligned}$$

Although Example 6.17 is based on a bivariate random variable, which allows us to compute the matrix inverse easily, the preceding relation also holds in higher dimensions.

Remark
We saw in Section 6.5 that the density $f(\boldsymbol{x})$ in (6.148) is actually the standard Gaussian distribution, and that the transformed density $f(\boldsymbol{y})$ is a bivariate Gaussian with covariance $\boldsymbol{\Sigma}=\boldsymbol{A} \boldsymbol{A}^{\top}$.
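
The statement that $f(\boldsymbol{y})$ is a Gaussian with covariance $\boldsymbol{A}\boldsymbol{A}^{\top}$ can be checked numerically for the $2 \times 2$ case. The sketch below compares the change-of-variable density from Example 6.17 with the zero-mean Gaussian density written directly in terms of $\boldsymbol{\Sigma}=\boldsymbol{A}\boldsymbol{A}^{\top}$. The matrix entries, the evaluation point, and the function names are arbitrary illustrative choices.

```python
import math


def standard_gaussian_pdf(x1, x2):
    """Bivariate standard Gaussian density from Example 6.17."""
    return math.exp(-0.5 * (x1 * x1 + x2 * x2)) / (2 * math.pi)


def transformed_pdf(a, b, c, d, y1, y2):
    """f(y) for y = A x via the change of variables, A = [[a, b], [c, d]]."""
    det = a * d - b * c
    if det == 0:
        raise ValueError("A must be invertible")
    # x = A^{-1} y, using the explicit 2x2 inverse.
    x1 = (d * y1 - b * y2) / det
    x2 = (-c * y1 + a * y2) / det
    # Multiply by |det(A^{-1})| = 1 / |ad - bc|.
    return standard_gaussian_pdf(x1, x2) / abs(det)


def gaussian_pdf_with_cov_aat(a, b, c, d, y1, y2):
    """Zero-mean bivariate Gaussian density with covariance Sigma = A A^T."""
    s11 = a * a + b * b
    s12 = a * c + b * d
    s22 = c * c + d * d
    det_sigma = s11 * s22 - s12 * s12
    # y^T Sigma^{-1} y, using the explicit 2x2 inverse of Sigma.
    quad = (s22 * y1 * y1 - 2 * s12 * y1 * y2 + s11 * y2 * y2) / det_sigma
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det_sigma))


params = (2.0, 1.0, -1.0, 3.0)  # arbitrary invertible A
point = (0.5, -1.2)             # arbitrary evaluation point y
lhs = transformed_pdf(*params, *point)
rhs = gaussian_pdf_with_cov_aat(*params, *point)
print(lhs, rhs)
```

The two values agree up to floating-point error because $\boldsymbol{\Sigma}^{-1} = \boldsymbol{A}^{-\top}\boldsymbol{A}^{-1}$ and $\sqrt{\det \boldsymbol{\Sigma}} = |ad - bc|$.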

We will use the ideas in this chapter to describe probabilistic modeling in Section 8.4 and to introduce graphical models in Section 8.5. We will see direct machine learning applications of these ideas in Chapters 9 and 11.

Translated from:
Mathematics for Machine Learning, by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong


Copyright notice
This article was written by [Binary artificial intelligence]. Please include a link to the original when reposting. Thank you.
https://chowdera.com/2021/06/20210621151155268g.html