# Popular understanding of LDA topic model

2020-12-07 12:51:14

# Easy to understand LDA Theme model

## 0 Preface

In my impression , I first heard that “LDA” The term , It's because of rickjin stay 2013 year 3 A month written by LDA Popular science series , It's called LDA Math gossip , I always wanted to look like , Remember to print once , But I don't know because the preface of this document is too long （ Now I realize that “ matting ” It's all deep understanding LDA The basis of , But if no one helps beginners to get a general idea 、 Grasp the primary and secondary 、 ideas , It's easy to get into LDA In the details of ）, Or because there are so many mathematical details , As a result, I haven't read it completely .

2013 year 12 month , In my organization Machine Learning A Book Club The first 8 period On ,@ Summer powder _ Baidu Talk about the theory and Algorithm Research of sorting learning in machine learning ,@ Intoxicated 2011 Then we will talk about the understanding of the theme model . Once again, I came across the theme model , At that time, it seems that Shen Bo only told an example of Wang Feng writing lyrics , Still don't understand LDA What kind of thing is it （ But I understand LDA after , Look at Shen Bo's theme model again PPT It's going to be great ）.

Until yesterday afternoon , Machine learning class   The first 12 Next time in class , Zou finished LDA after , To really understand LDA It turned out to be such a thing ！ After class , Strike while the iron is hot , Look again LDA Math gossip , I found that the documents that I couldn't read before were all smoother when I read them again , See most of it in one breath . After reading most of them , The idea is clear , Know to understand LDA, It can be divided into the following 5 A step ：

1. A function ：gamma function
2. Four distributions ： The binomial distribution 、 Multiple distribution 、beta Distribution 、Dirichlet Distribution
3. A concept and an idea ： Conjugate prior and Bayesian framework
4. Two models ：pLSA、LDA（ In this article The first 4 part This paper ）
5. One sample ：Gibbs sampling

## 1 gamma function

### 1.0 Overall grasp LDA

About LDA There are two meanings , One is linear discriminant analysis （Linear Discriminant Analysis）, One is the probability theme model ： Implied Dirichlet distribution （Latent Dirichlet Allocation, abbreviation LDA）, This article talks about the latter .

in addition , Let me start with a few words LDA The whole idea of , Otherwise, I'm afraid you've been watching it for a long time , Too long a prelude , But still because I didn't see LDA The shadow of “ flighty and impetuous ”, So I don't want to continue to see . therefore , I'll give you a reassuring pill first , After understanding the overall framework , Let's step by step , Expand to discuss .

according to wiki Introduction on ,LDA from Blei, David M.、Ng, Andrew Y.、Jordan On 2003 in , It's a kind of Theme model , It can Set documents The topic of each document in is given in the form of probability distribution , So we can extract their themes by analyzing some documents （ Distribution ） After coming out , Then you can according to the theme （ Distribution ） Do topic clustering or text categorization . meanwhile , It is a typical word bag model , A document is made up of a set of words , There is no sequential relationship between words .

Besides , A document can contain multiple topics , Each word in the document is generated by one of the themes .

How do humans generate documents ？LDA These three authors give a simple example in the original paper . Suppose, for example, that these topics are given in advance ：Arts、Budgets、Children、Education, Then through learning and training , Get each topic Topic Corresponding words . As shown in the figure below ：

Then choose one of the above topics with a certain probability , Then choose a word under that topic with a certain probability , Keep repeating these two steps , Finally, an article is generated as shown in the figure below （ The words with different colors correspond to the words under different themes in the above figure ）：

And when we see an article , Often like to speculate how this article is generated , We may think that the author first identifies several themes of this article , Then choose words and sentences around these themes , Express in writing .

LDA That's what I want to do ： According to a given document , Back to its theme distribution .

Generally speaking , It can be assumed that Human beings have written various articles based on the above document generation process , Now a small group of people want computers to use LDA Do one thing. ： Your computer speculates and analyzes what topics each article on the network has written , And the probability of each topic in each article （ Theme distribution ） What is it .

however , It is such a seemingly ordinary LDA, At one time, it scared away many beginners who wanted to explore its internal principles . Where is the difficulty , Difficult is difficult LDA There are too many mathematical knowledge points involved .

stay LDA In the model , A document is generated as follows ：

• Distribution from Dirichlet Sampling to generate documents i The theme distribution of
• From the polynomial distribution of the subject Sampling to generate documents i The first j The theme of a word
• Distribution from Dirichlet Sample to generate theme The corresponding word distribution
• From the polynomial distribution of words Finally, the words are generated

among , similar Beta Distribution is the conjugate prior probability distribution of binomial distribution , And the Dirichlet distribution （Dirichlet Distribution ） Is the conjugate prior probability distribution of a polynomial distribution .

Besides ,LDA The graph model structure of is shown in the figure below （ similar Bayesian network structure ）：

Okay , Pretty good , In a matter of 6 The sentence sums up the whole LDA The subject thought of ！ But it's just short 6 Sentence , But the binomial distribution appears continuously or repeatedly 、 Polynomial distribution 、beta Distribution 、 Dirichlet distribution （Dirichlet Distribution ）、 Conjugate prior probability distribution 、 sampling , So, please. , What are these ？

Let's explain the binomial distribution 、 Multiple distribution 、beta Distribution 、Dirichlet  Distribute this 4 Distribution .

•   The binomial distribution （Binomial distribution）.

Binomial distribution is advanced from Bernoulli distribution . Bernoulli distribution , Also known as two-point distribution or 0-1 Distribution , It's a discrete random distribution , There are only two kinds of random variables , Non positive means negative {+,-}. And the binomial distribution is repeated n The second Bernoulli experiment , Write it down as . in short , Just one experiment , It's the Bernoulli distribution , I did it over and over again n Time , It's a binomial distribution . The probability density function of binomial distribution is ：

about k = 0, 1, 2, ..., n, Among them It's binomial coefficient （ This is the origin of the name of binomial distribution ）, It's also recorded as . Think back to the little probability knowledge that I learned in high school ： You must have memorized the binomial coefficient Namely .

• Multiple distribution , It's the case that binomial distribution extends to multi dimensions .

Multiple distribution means that the value of random variable in a single test is no longer 0-1 Of , But there are many discrete values possible （1,2,3...,k）. Like throwing 6 The experiment of dice on all sides ,N The results of this experiment are subject to K=6 The multiple distribution of . among

The probability density function of the multiple distribution is ：

• Beta Distribution , The conjugate prior distribution of binomial distribution .

The given parameters and , The value range is [0,1] Random variable of x The probability density function of ：

among ：

,

notes ： It's called gamma function , I'll elaborate on .

• Dirichlet Distribution , yes beta Spread in high dimensions .

Dirichlet The density function form of the distribution follows beta The density function of the distribution is the same ：

among

thus , We can see Binomial distribution and multiple distribution Very similar ,Beta The distribution and Dirichlet  Distribution Very similar , And as for Beta Distribution is the conjugate prior probability distribution of binomial distribution , And the Dirichlet distribution （Dirichlet Distribution ） Is the conjugate prior probability distribution of a polynomial distribution This is explained below .

OK, Next , Let's follow the idea at the beginning of this article ：“ A function ：gamma function , Four distributions ： The binomial distribution 、 Multiple distribution 、beta Distribution 、Dirichlet Distribution , Plus a concept and an idea ： Conjugate prior and Bayesian framework , Two models ：pLSA、LDA（ file - The theme , The theme - words ）, One sample ：Gibbs sampling ” Elaborate step by step , Try to give the reader a clear and complete LDA.

（ Of course , If you don't want to go into the details behind it , Just want to grasp the whole LDA The subject thought of , Skip to this article The first 4 part , After reading the 4 After part , If you want to go deep into the details behind the principle , You can go back here and start to see ）

### 1.1 gamma function

Let's consider a question first （ This problem 1 Including the following questions 2- problem 4 All from LDA Math gossip ）：

1. problem 1 A random variable
2. Put this n After sorting out random variables, we get order statistics
3. And then ask What is the distribution of .

To solve this problem , You can try to calculate Fall in the range [x,x+Δx] Probability . That is to find the value of the following formula ：

First , hold [0,1] The interval is divided into three sections [0,x),[x,x+Δx],(x+Δx,1], Then consider the simple case ： That is, the assumption n There is only 1 I'm in the zone [x,x+Δx] Inside , Because of the number in this range X(k) It's No k Big , therefore [0,x) There should be k−1 Number ,(x+Δx,1] There should be n−k Number . As shown in the figure below ：

So the problem turns into the following events E：

For the above events E, Yes ：

among ,o(Δx) Express Δx The higher order is infinitesimal . obviously , Because of different permutations , namely n One of the numbers falls on [x,x+Δx] There are n To plant and take , The remaining n−1 There are k−1 One falls on [0,x) There are Combinations of , So and events E There are equivalent events individual .

If there is 2 The number falls in the range [x,x+Δx] Well ？ As shown in the figure below ：

Something like an event E, about 2 The number falls in the range [x,x+Δx] Events E’：

Yes ：

From the above events E、 event E‘ in , It can be seen that , Just fall in [x,x+Δx] There is more than one number in , Then the probability of the corresponding event is o(Δx). So there is ：

To get The probability density function of by ：

thus , The questions raised at the beginning of this section have been resolved . But watch carefully The probability density function of , It is found that the final result of the formula is factorial , Think of the extension of factorials in real numbers function ：

Will the combination of the two have a wonderful effect ？ in consideration of Has the following properties ：

So it will be substituted into The probability density function of in , Available ：

And then take ,, transformation obtain ：

If you are familiar with beta Distributed friends , May exclaim ： wow , Actually launched beta Distribution ！

## 2 beta Distribution

### 2.1 beta Distribution

In probability theory ,beta A group defined in Continuous probability distribution of interval , There are two parameters and , And .

beta The probability density function of the distribution is ：

Among them That is function ：

A random variable X Obey the parameter beta Distribution usually writes ：.

### 2.2 Beta-Binomial conjugate

In retrospect 1.1 The question raised at the beginning of the stanza ：“ problem 1 A random variable , Put this n After sorting out random variables, we get order statistics , And then ask What is the distribution of .” If , Let's add some observation data to this problem , become problem 2

• , The corresponding order statistic is , Need to guess ;
• There is Ratio p Small , Ratio Big ;
• that , Excuse me, What is the distribution of .

according to “Yi There is Ratio Small , Ratio Big ”, In other words ,Yi There is Ratio Small , Ratio Big , therefore yes pass the civil examinations Large number .

according to 1.1 The final conclusion of stanza “ Just fall in [x,x+Δx] There is more than one number in , Then the probability of the corresponding event is o(Δx)”, And then push event obedience beta Distribution , Thus it The probability density function of is ：

Be familiar with Bayesian methods （ Nothing unfamiliar , See This article The first part ） My friend's heart is estimated to be guilty of “ Mutter ” 了 , This is a Bayesian thinking process ？

1. To guess , Before obtaining certain observation data , We are right. My understanding of ：, This is called Prior distribution ;
2. And then in order to get this result “  There is Ratio p Small , Ratio Big ”, in the light of It's done The second Bernoulli experiment , therefore Obey binomial distribution ;
3. Given the data provided from After knowledge , The posteriori distribution of becomes .

In retrospect The fixed pattern of Bayesian thinking ：

• Prior distribution  + Sample information    Posterior distribution

The above thinking pattern means , The newly observed sample information will modify people's previous cognition of things . In other words , Before getting new sample information , People are right. Our cognition is a prior distribution , Getting new sample information after , People are right. The cognition of is .

Analogy to the present problem , We can also try to write down ：

among It's a binomial distribution Count of .

A more general , For nonnegative real numbers and , We have the following relations

For this The observed data fit the binomial distribution , The prior distribution and the posterior distribution of parameters are Beta Distribution The situation of , Namely Beta-Binomial conjugate . In other words ,Beta Distribution is the conjugate prior probability distribution of binomial distribution .

Binomial distribution and Beta Distribution is a conjugate distribution which means , If we are the parameter of binomial distribution p The prior distribution chosen is Beta Distribution , So in order to p The posterior distribution obtained by Bayesian estimation for the binomial distribution of parameters still obeys Beta Distribution .

Besides , How to understand parameters and What's the meaning of the expression ？ You can think of shape parameters , The popular but not strict understanding is , and Joint control Beta Function of distribution “ Looks like ”： There are many strange shapes , Tall and thin , As shown in the figure below ：

### 2.3 Conjugate prior distribution

What is conjugation ？ The yoke means to bind 、 control , Conjugation is understood literally , It's a common constraint , Or each other .

In Bayesian probability theory , If the posterior probability P(θ|x) And a priori probability p(θ) Satisfy the same law of distribution , that , Prior distribution and posterior distribution are called conjugate distribution , meanwhile , The prior distribution is called the conjugate prior distribution of likelihood function .

such as , Some observation data obey probability distribution P(θ) when , When new X Data time , We usually encounter the following problems ：

• According to the new observation data X, Update parameters θ？
• To what extent parameters can be changed based on new observations θ, namely

• When we reevaluate θ When , Give the new parameter value θ The new probability distribution of , namely P(θ|x).

in fact , According to the Bayes formula ：

among ,P(x|θ) To express in anticipation θ For the parameter of x A probability distribution , It can be found directly ,P(θ) It's original θ A probability distribution .
therefore , If we choose P(x|θ) Conjugate prior as of P(θ) The distribution of , that P(x|θ) multiply P(θ), And then the normalized result P(θ|x) Follow and P(θ) In the same way . let me put it another way , The prior distribution is P(θ), The posterior distribution is P(θ|x), The prior distribution and the posterior distribution belong to the same distribution family , So the distribution group is θ The conjugate prior distribution of （ family ）.

for instance . Throw a non-uniform coin , You can use the parameter θ The Bernoulli model of ,θ The probability that the coin is positive , So the result x The distribution form of is ：

Its conjugate prior is beta Distribution , There are two parameters and , It's called super parameter （hyperparameters）. And these two parameters determine θ Parameters , Its Beta The distribution form is

And then calculate the posterior probability

Normalize this equation to get another Beta Distribution , So we can prove that Beta The distribution is really the conjugate prior distribution of Bernoulli distribution .

### 2.4 from beta The distribution extends to Dirichlet Distribution

Next , Let's investigate beta A property of distribution .

If , Then there are ：

Notice the right integral of the final result of the above formula

It's similar to the probability distribution , And for this distribution there are

So as to obtain

As the result of the

Finally, bring this result into The formula of , obtain ：

The final result means that for Beta Random variable of distribution , The average is （ expect ） It can be used To estimate . Besides , Dirichlet Dirichlet The distribution has a similar conclusion , That is, if , It can also be proved that the following conclusions are true ：

What is Dirichlet Distribution ？ Simple understanding Dirichlet Distribution is a set of continuous multivariate probability distribution , It's multivariable beta Distribution . In memory of the German mathematician John · Peter · Gustaf · Legena · Dirichlet （Peter Gustav Lejeune Dirichlet） And name . Dirichlet distribution is often used as a prior probability of Bayesian statistics .

## 3 Dirichlet Distribution

### 3.1 Dirichlet Distribution

according to wikipedia Introduction on , dimension K ≥ 2（x1,x2…xK-1 dimension , common K individual ） The Dirichlet distribution of parameters α1, ..., αK > 0 On 、 Based on Euclidean space RK-1 There is a probability density function in leberg measure , Defined as ：

among , Quite a lot beta function

And

Besides ,x1+x2+…+xK-1+xK=1,x1,x2…xK-1>0, And in (K-1) On the simplex of dimension , The probability density of other areas is 0.

Of course , It can also be defined as Dirichlet Distribution

Among them be called Dirichlet Normalized coefficient of distribution ：

And according to Dirichlet The integral of the distribution is 1（ The basic nature of probability ）, You can get ：

### 3.2 Dirichlet-Multinomial conjugate

below , stay 2.2 Section question 2 On the basis of this, we will continue to deepen , extraction problem 3.

• ,
• The corresponding order statistics after sorting ,
• ask What is the joint distribution of ？

To simplify the calculation , take x3 Satisfy x1+x2+x3=1, But only x1,x2 It's a variable. , As shown in the figure below ：

Thus there are ：

Then we get and we get The joint distribution of is ：

Observe the final result of the above formula , It can be seen that the above distribution is actually 3 In the form of Dirichlet Distribution

Make , So the distribution density can be written as

This is the general form of 3 dimension Dirichlet Distribution , Even if Extend to a set of nonnegative real numbers , The above probability distribution is also well defined .

take Dirichlet The probability density function of the distribution is logarithmic , Draw symmetry Dirichlet The image of the distribution is shown in the figure below （ Intercepted from wikipedia On ）：

Above picture , take K=3, There are two independent parameters x1,x2, They correspond to the two axes in the graph , The third parameter always satisfies x3=1-x1-x2 And α1=α2=α3=α, The figure shows the parameters α from α=(0.3, 0.3, 0.3) Change to (2.0, 2.0, 2.0) The change of the probability of time to the value .

For the sake of argument Dirichlet Distribution Is the conjugate prior probability distribution of a polynomial distribution , Let's continue with the above questions 3 On the basis of that , Put forward problem 4.

1. problem 4  , The corresponding order statistics after sorting
2. Make ,,（ Here p3 Non variable , Just for the convenience of expression ）, Now guess ;
3. ,Yi Fall to ,,  The number of three intervals is m1,m2,m3,m=m1+m2+m3;
4.   Ask posterior distribution What is the distribution of .

For the convenience of discussion , remember , And , According to known conditions “,Yi Fall to ,,  The number of three intervals is m1,m2”, Available The difference is this m+n The number of Big 、 The first Large number . therefore , Posterior distribution Should be , That is to say, the generalized form is ：.

alike , According to the logic of Bayesian reasoning , The above process can be sorted out as follows ：

1.   We have to guess the parameters , Its prior distribution is ;
2.   data Yi Fall into three sections ,,  The number of is , therefore Obey multiple distribution
3.   Given the knowledge provided from the data after , The posteriori distribution of becomes

The intuitive expression of the above Bayesian analysis process is ：

Make , Can handle From the set of integers to the set of real numbers , The more general expression is as follows ：

For this The observed data fit the multiple distribution , The prior distribution and the posterior distribution of parameters are Dirichlet Distribution The situation of , Namely Dirichlet-Multinomial conjugate . In other words , So far it has been proved that Dirichlet The distribution is really Is the conjugate prior probability distribution of a polynomial distribution .

signify , If we are a parameter of a multiple distribution p The prior distribution chosen is Dirichlet Distribution , So in order to p The posterior distribution obtained by Bayesian estimation for the multiple distribution of parameters still obeys Dirichlet Distribution .

further , In general form Dirichlet The distribution is defined as follows ：

And for a given and , Its multiple distribution is ：

The conclusion is that ：Dirichlet Distribution And multiple distribution It's conjugation .

## 4 Theme model LDA

Before starting the following journey , First of all, let's summarize some of the most important achievements we have achieved at present ：

• By No 2.2 section , We know beta Distribution Is the conjugate prior probability distribution of binomial distribution ：
•   For nonnegative real numbers and , We have the following relations

among It's a binomial distribution Count of . For this kind of observed data, it conforms to binomial distribution , The prior distribution and the posterior distribution of parameters are Beta Distribution , Namely Beta-Binomial conjugate .

• By 3.2 section , We know the Dirichlet distribution （Dirichlet Distribution ） Is the conjugate prior probability distribution of a polynomial distribution ：
•   hold From the set of integers to the set of real numbers , The more general expression is as follows ：

For this kind of observed data, it is consistent with multiple distribution , The prior distribution and the posterior distribution of parameters are Dirichlet Distribution , Namely Dirichlet-Multinomial conjugate . ”

• And the fixed pattern of Bayesian thinking ：
• Prior distribution  + Sample information    Posterior distribution

The above thinking pattern means , The newly observed sample information will modify people's previous cognition of things . In other words , Before getting new sample information , People are right. Our cognition is a prior distribution , Getting new sample information after , People are right. The cognition of is .

• By the way Frequency school and Bayesian school have different ways of thinking ：
• The frequency faction puts the parameters that need to be inferred θ As a fixed unknown constant , That's probability Although unknown , But at least a certain value , meanwhile , sample X Is random , So the frequency school focuses on the sample space , Most of the probability calculations are for samples X The distribution of ;
• and Bayesian school On the contrary , What they think is to be estimated Parameters It's a random variable , Obey a certain distribution , And the sample X Is constant , Because the sample is fixed , So they focus on parameters The distribution of .

OK, In the end boss——LDA Before the model , Then understand the basic model step by step ：Unigram model、mixture of unigrams model, And with LDA The closest pLSA Model .

For ease of description , First define some variables ：

• To express words , Indicates the number of all words （ Fixed value ）
• Show the theme , It's the number of themes （ Give in advance , Fixed value ）
• Means corpus , Among them Is the number of documents in the corpus （ Fixed value ）
• Represents a document , Among them Represents the number of words in a document （ A random variable ）

### 4.1 Each basic model

4.1.1 Unigram model

For documents , use To express words The prior probability of , Generating documentation The probability of is ：

Its graph model is （ Painted in the picture w Represents an observable variable ,N Represents a total of N Word ,M Express M document ）：

Or for ：

unigram model Suppose the words in the text obey Multinomial Distribution , And we already know Multinomial The prior distribution of the distribution is Dirichlet Distribution .
In the picture above Indicates the... Observed in the text n Word ,n∈[1,N] Indicates that there are N Word . Add a box to indicate repetition , That is, there are N A random variable like this . among ,p and α It's an implied unknown variable ：

• p Words obey Multinomial Parameters of the distribution
• α yes Dirichlet Distribution （ namely Multinomial A prior distribution of distribution ） Parameters of .

commonly α Given beforehand by experience ,p Learn from the words in the observed text , Indicates the probability of each word appearing in the text .

4.1.2 Mixture of unigrams model

The generation process of this model is ： Choose a theme for a document first , Then generate the document according to the topic , All the words in the document come from a theme . Suppose the subject has , Generating documentation The probability of is ：

Its graph model is （ Painted in the picture w Represents an observable variable , Not colored z Represents an unknown hidden variable ,N Represents a total of N Word ,M Express M document ）：

### 4.2 PLSA Model

Ah , Twenty five thousand on the long march , After the long matting in front , At last, we are approaching LDA Model ！ because Follow LDA The closest model is the following one pLSA Model , I understand pLSA After the model , To LDA The model is just one step away —— to pLSA Plus the Bayesian framework , That is LDA.

4.2.1 pLSA Generate documents under the model

OK, Above Mixture of unigrams model in , Let's assume that a document has only one theme generated , But in practice , An article often has multiple themes , It's just that the probability of each topic appearing in the document is different . For example, in the document of introducing a country , Often separately from education 、 economic 、 Transportation and other topics . So in pLSA in , How are documents generated

Suppose you want to write M document , Because a document is made up of different words , So you need to identify the words in every position in every document .

Let's suppose you have K An optional theme , Yes V An optional word , Let's play a game of dice .

• 1. Suppose you Every time you write a document, you make one K Face “ file - The theme ” dice （ Throw this dice to get K Any one of the themes ）, and K individual V Face “ The theme - Term ” dice （ Each die corresponds to a theme ,K One dice corresponds to the previous one K A theme , And every face of the dice should choose the words ,V Face to face V An optional word ）.
• For example, we can make K=3, That is, making 1 One contains 3 A theme “ file - The theme ” dice , this 3 A theme can be ： education 、 economic 、 traffic . Then make V = 3, Make 3 One has 3 Face “ The theme - Term ” dice , among , Educational theme of dice 3 The individual words can be ： university 、 teacher 、 Course , The economic theme of dice 3 The individual words can be ： market 、 Enterprises 、 Finance , Traffic theme dice 3 The individual words can be ： high-speed rail 、 automobile 、 The plane .

• 2. Every time I write a word , Throw it first “ file - The theme ” Dice choose the theme , After getting the result of the theme , Use the one corresponding to the topic result “ The theme - Term ” dice , Throw the dice to choose the words to write .
• Throw it first “ file - The theme ” The dice of , hypothesis （ With a certain probability ） The theme is education , So the next step is to throw away the education theme sieve ,（ With a certain probability ） Get a word corresponding to the educational theme sieve ： university .
• The above process of dice production is simplified to ：“ First, choose the topic with a certain probability , Then choose words with a certain probability ”. in fact , At the beginning, there are themes to choose from 3 individual ： education 、 economic 、 traffic , So why choose the theme of education ？ In fact, it is randomly selected , But this random follows a certain probability distribution . For example, the probability of choosing an educational topic is 0.5, The probability of choosing an economic theme is 0.3, The probability of choosing a traffic theme is 0.2, that this 3 The probability distribution of each topic is { education ：0.5, economic ：0.3, traffic ：0.2}, We put Various themes z In the document d The probability distribution that appears in be called Theme distribution , And it's a multiple distribution .
• alike , After randomly selecting educational themes from the theme distribution , Still face 3 Word ： university 、 teacher 、 Course , this 3 Every word can be chosen , But the probability of their being selected is different . For example, the probability of university being chosen is 0.5, The probability of teacher being chosen is 0.3, The probability of a course being chosen is 0.2, that this 3 The probability distribution of a word is { university ：0.5, teacher ：0.3, Course ：0.2}, We put Every word w stay The theme z Next Probability distribution of occurrence be called Word distribution , The word distribution is also a multiple distribution .
• therefore , Topic selection and word selection are two random processes , First from the theme distribution { education ：0.5, economic ：0.3, traffic ：0.2} Take out the theme ： education , Then from the distribution of words corresponding to the educational theme { university ：0.5, teacher ：0.3, Course ：0.2} To extract words from ： university .
• 3. Last , You keep throwing “ file - The theme ” Dice and ” The theme - Term “ dice , repeat N Time （ produce N Word ）, Complete a document , Repeat this method of generating a document M Time , Then finish M document .

Above The abstraction of the process is PLSA The document generation model of . In the process , We don't pay attention to the order between words , therefore pLSA It's a word bag method . say concretely , The model assumes a set of co-occurrence (co-occurrence) Word items are associated with an implied topic category . At the same time define ：

• Indicates the probability of a document being selected in a large number of documents .
• To express words In a given document The probability of occurrence in .
• How to calculate it ？ For massive documents , After segmenting all documents , Get a list of words , So each document is a collection of words . For each word , Dividing the number of times it appears in the document by the total number of words in the document is the probability of its occurrence in the document .
• To express a specific theme In a given document The probability of the next occurrence .
• To express a specific word In a given topic The probability of the next occurrence , The more closely related to the theme , Its conditional probability The bigger it is .

Use the above 1、3、4 A probability , We can then follow the steps below to get “ file - Term ” The generation model of ：

1. According to the probability Choose a document
2. Select document after , From the subject distribution, according to probability Choose an implicit topic category
3. selected after , From word distribution according to probability Choose a word

therefore pLSA The whole process of generating documents in is to select the theme of generating documents , Determine the theme generating words .

4.2.1 According to the document, we can deduce its theme distribution

In turn, , Now that the document has been generated , So how to deduce the theme based on the generated good documents ？ This uses the documents you see to infer their hidden themes （ Distribution ） The process of （ In fact, it is the reverse process of generating documents ）, It's the purpose of theme modeling ： Automatically discover topics in the document set （ Distribution ）.

In other words , Human beings have written various articles based on the document generation model , And then it's lost to Computer , What the computer sees is an article that has been written . Now the computer needs to sum up the theme of an article according to a series of words seen in an article , Then we get the different probability of each topic ： Theme distribution . The document d And the words w Is observable , But the theme z But hidden .

As shown in the figure below （ Painted in the picture d、w Represents an observable variable , Not colored z Represents an unknown hidden variable ,N Represents a total of N Word ,M Express M document ）：

Above picture , file d And words w It's the sample we got （ Sample randomised , Parameters are unknown but fixed , therefore pLSA Belongs to the frequency school . Different from the following LDA in ： Sample fixation , Parameter unknown but not fixed , It's a random variable , Obey a certain distribution , therefore LDA Belongs to the Bayesian school ）, Observable result , therefore For any document , Its It is known. .

So that we can according to A lot of known documents - Item information , Train to produce documents - The theme and The theme - Term , As shown in the following formula ：

Therefore, the generation probability of each word in the document is ：

because It can be calculated in advance , and and Unknown , therefore That's what we're going to estimate （ value ）, Popular point theory , To maximize this θ.

How to estimate it , The commonly used parameter estimation methods are maximum likelihood estimation MLE、 Maximum post validation estimate MAP、 Bayes estimates and so on . Because the parameters to be estimated contain hidden variables z, So we can think about EM Algorithm .

4.2.1.1 EM A brief introduction to the algorithm

EM Algorithm , Its full name is Expectation-maximization algorithm, For the expected maximum algorithm , The basic idea is this ： First, randomly select a value to initialize the value to be estimated , Then iterate to find the better Make its likelihood function likelihood  More than the original Be big . In other words , Let's say that now we have , Want to ask for , bring

EM The key is to find A lower bound of （ notes ：, among ,X Represents a random variable that has been observed ）, And then constantly maximize this lower bound , By constantly solving the lower bound The maximization of , So we can approach the likelihood function of the required solution .

therefore EM The general steps of the algorithm are ：

• 1. Randomly selected or initialized according to prior knowledge ;
• 2. Repeat the following two steps
• ① Give the current parameter estimate , Calculate the likelihood function Lower bound of
• ② Reevaluate parameters θ, The o , bring
• 3. After the second step , If convergence （ namely convergence ） Then exit the algorithm , Otherwise, go back to the second step .

The above process is like in a two-dimensional plane , There are two curves that don't intersect , A curve is on （ It's called upper curve for short ）, A curve is under （ For short, the curve ）, The lower curve is the lower bound of the upper curve . Now I don't know about the curve , Only the lower curve is known , In order to find the highest point of the curve , We're trying to increase the bottom curve , Make the lower curve approach the upper curve continuously , The lower curve reaches the local maximum at a certain point and is equal to the value of the upper curve at this point , Record the value , Then continue to increase the lower curve , Look for the same value on the lower curve as on the upper curve , Iterate to convergence （ namely convergence ） stop it , Thus, the local maximum value of the current lower curve is taken as the global maximum value of the upper curve （ In other words ,EM The algorithm does not guarantee to find the global optimal value ）. As shown in the figure below ：

Here is a detailed introduction .

Suppose there is a training set , contain m Individual samples , Hope to find the model of this group of data p(x,z) Parameters of .

Then the objective function is established by maximum likelihood estimation -- Log likelihood function ：

here ,z It's a hidden random variable , It is very difficult to find the estimation of parameters directly . Our strategy is to build Lower bound of , And find the maximum value of the lower bound ; Repeat the process , Until it converges to the local maximum .

Make Qi yes z A certain distribution of ,Qi≥0, And combine Jensen inequality , Yes ：

In order to find the tightest lower bound , We can make the above equal sign hold , And if you want the equal sign to hold, the condition is ：

In other words , There is the following formula ：, And because of ：

So you can get it. ：

The resulting EM The overall framework of the algorithm is as follows ：

OK,EM The algorithm will be explained in the blog post later . Next , go back to pLSA On the estimation of parameters .

4.2.1.2 EM Algorithm estimate pLSA Two unknown parameters of

First try Describe two unknown variables to be estimated from the perspective of matrix and .

• Suppose to use Means a glossary In the theme A multiple distribution on , be It can be expressed as a vector , Every element A word item Appears in the theme The probability of , namely

• use Represent all themes In the document A multiple distribution on , be It can be expressed as a vector , Every element Show the theme Appears in the document The probability of , namely

such , Clever handle and It's converted into two matrices . In other words , Finally, the parameters we need to solve are these two matrices ：

Because words and words are independent of each other , So the whole document N The distribution of words is ：

Again because Documents and documents are also independent of each other , So the distribution of words in the whole corpus is （ The whole corpus M document , Every document N Word ）：

among , A word item In the document The frequency of words in , Represents a document di The total number of middle words , Obviously there is .
So we can get the log likelihood function of the word distribution of the whole corpus （ There is a small mistake in the following formula , The right one is ：N by M,M by N）：

Now? , We need to maximize the above logarithmic likelihood function to solve the parameter and . For this maximum likelihood estimation with hidden variables , have access to EM Algorithm .EM Algorithm , There are two steps ： First E-step, after M-step.

• E-step： Suppose the parameters are known , Calculate the posterior probability of the hidden variable .

Using Bayes rule , You can get ：

• M-step： Take the posterior probability of the hidden variable , To maximize the log likelihood function of the sample distribution , Solve the corresponding parameters .

Look at the log likelihood function that we got before Result , because Document length It can be calculated separately , So removing it doesn't affect the maximum likelihood function . Besides , according to E-step Calculated results of , hold Plug in , So we just need to maximize the following function    that will do （ There is a small mistake in the following formula , The right one is ：N by M,M by N）

This is an extreme value problem of multivariate function , And the following constraints are known （ There is a small mistake in the following formula , The right one is ：M by N）

Friends familiar with convex optimization should know , In general, we deal with the extremum problem with constraints , The usual method is the Lagrange multiplier method , That is, by introducing Lagrange multipliers, constraints and variables （ The goal is ） Functions merge together , It can be transformed into an unconstrained extremum problem .

Here we introduce two Lagrange Multiplier and , So I can write the Lagrange function （ There is a small mistake in the following formula , The right one is ：N by M,M by N）

Because we want the parameter of the solution to be and , So they were right and Finding partial derivatives , Then let the partial derivative be equal to 0, obtain （ There is a small mistake in the following formula , The right one is ：N by M,M by N）

Eliminate the Lagrange multiplier , Finally, the parameters can be estimated and （ There is a small mistake in the following formula , The right one is ：N by M,M by N）

Sum up , stay pLSA in ：

1. because and Unknown , therefore We use it EM Algorithm to estimate This The value of the parameter .
2. Then , use A word item Appears in the theme The probability of , namely , use Show the theme Appears in the document The probability of , namely , So that Transformed into “ The theme - Term ” matrix Φ（ Theme generating words ）, hold Transformed into “ file - The theme ” matrix Θ（ Document generation theme ）.
3. Finally, we find out .

### 4.3 LDA Model

in fact , I understand pLSA Model , It's almost understood LDA Model , because LDA Is in the pLSA On the basis of layer Bayesian framework , namely LDA Namely pLSA Bayes version of （ It is because LDA It's Bayesian , So we need to consider historical prior knowledge , Two prior parameters added ）.

4.3.1 pLSA Follow LDA Comparison of ： Generate documents and parameter estimates

stay pLSA In the model , We Follow the steps below to get “ file - Term ” The generation model of ：

1. According to the probability Choose a document
2. Select document after , Determine the theme distribution of the article
3. From the subject distribution, according to probability Choose an implicit topic category
4. selected after , Determine the word distribution under the theme
5. From word distribution according to probability Choose a word

below , Let's compare what we said at the beginning of this article LDA How is a document generated in the model ：

1. According to a priori probability Choose a document
2. Distribution from Dirichlet （ namely Dirichlet Distribution ） Sampling to generate documents The theme distribution of , In other words , Theme distribution From the super parameter to Of Dirichlet Distribution generation
3. From the polynomial distribution of the subject Sampling to generate documents The first j The theme of a word
4. Distribution from Dirichlet （ namely Dirichlet Distribution ） Sample to generate theme The corresponding word distribution , In other words , Word distribution The parameter is Of Dirichlet Distribution generation
5. From the polynomial distribution of words Finally, the words are generated

As can be seen from the above two processes ,LDA stay PLSA On the basis of , Add two... For theme distribution and word distribution Dirichlet transcendental .

Go ahead and explain PLSA The example of . As mentioned earlier , stay PLSA in , Topic selection and word selection are two random processes , First from the theme distribution { education ：0.5, economic ：0.3, traffic ：0.2} Take out the theme ： education , Then from the corresponding word distribution of the subject { university ：0.5, teacher ：0.3, Course ：0.2} To extract words from ： university .

And in the LDA in , Topic selection and word selection are still two random processes , It's still possible to start with theme distribution { education ：0.5, economic ：0.3, traffic ：0.2} Take out the theme ： education , Then from the corresponding word distribution of the subject { university ：0.5, teacher ：0.3, Course ：0.2} To extract words from ： university .

that PLSA Follow LDA The difference lies in where ？ The difference is that ：

• PLSA in , Topic distribution and word distribution are unique affirmatory , To be able to clearly point out the theme distribution may be { education ：0.5, economic ：0.3, traffic ：0.2}, Word distribution may be { university ：0.5, teacher ：0.3, Course ：0.2}.
• But in LDA in , Topic distribution and word distribution are no longer unique , namely It is impossible to give a definite answer to . For example, the theme distribution might be { education ：0.5, economic ：0.3, traffic ：0.2}, It could be { education ：0.6, economic ：0.2, traffic ：0.2}, We're not sure what it is （ I don't know ）, Because it's random and changeable . But how to change , And still obey a certain distribution , That is, theme distribution and word distribution Dirichlet A priori random determination .

See this , You may be in a mess , You say facing multiple themes or words , The probability of each topic or word being selected is different , So the topic or word is randomly selected , It's easy to understand . But now you say that topic distribution and word distribution are also uncertain , What's going on ？ Can't , Who's calling Blei wait forsomeone “ Force ” to PLSA We have a Bayesian framework , It is because LDA yes PLSA Bayes version of , So topic distribution and word distribution are randomly given by prior knowledge .

further , You'll find that ：

• pLSA in , After the theme distribution and word distribution are determined , With a certain probability （） Select the specific theme and word item respectively , Generate good documents . Then, according to the generated documents, we can deduce the theme distribution 、 When words are distributed , Final use EM Algorithm （ Maximum likelihood estimation idea ） Two unknowns have been solved, but fixed The value of the parameter ：（ from It's a transformation ） and （ from It's a transformation ）.
• file d Generate themes z Probability , The theme z Produce words w The probabilities of are two fixed values .
• Take a document d Generate themes z Example . Given a document d, The theme distribution is certain , such as { P(zi|d), i = 1,2,3 } It could be {0.4,0.5,0.1}, Express z1、z2、z3, this 3 Themes are documented d The probability of selection is a fixed value ：P(z1|d) = 0.4、P(z2|d) = 0.5、P(z3|d) = 0.1, As shown in the figure below （ Screenshot from Shen Bo PPT On ）：

• But in the Bayesian framework LDA in , We No longer Think Theme distribution （ The probability distribution of each topic in the document ） And word distribution （ The words are in Under a certain theme Probability distribution of occurrence ） It's the only certain （ It's a random variable , But there are many possibilities . But a document must correspond to a topic distribution and a word distribution , What shall I do? ？LDA Two for them Dirichlet A priori parameter , This Dirichlet A priori is Some article The document randomly extracts a topic distribution and word distribution .
• file d Generate themes z（ Accurately speaking , It's actually Dirichlet A priori is a document d Generate theme distribution Θ, Then according to the theme distribution Θ Generate themes z） Probability , The theme z Produce words w The probabilities of both are no longer certain values , It's a random variable .
• Let's lift the document again d Specific themes z Example . Given a document d, Now there are several themes z1、z2、z3, Their Theme distribution { P(zi|d), i = 1,2,3 } May be {0.4,0.5,0.1}, It could be {0.2,0.2,0.6}, namely These themes are d The probability of selection is no longer considered a certain value , May be P(z1|d) = 0.4、P(z2|d) = 0.5、P(z3|d) = 0.1, It could be P(z1|d) = 0.2、P(z2|d) = 0.2、P(z3|d) = 0.6 wait , We are not sure which value set the theme distribution is （ Why? ？ This is the core idea of Bayesian school , Treat unknown parameters as random variables , No longer considered a certain value ）, But its prior distribution is dirichlet Distribution , So you can use... From an infinite number of topic distributions dirichlet A priori random Extract a topic and distribute it . As shown in the figure below （ Screenshot from Shen Bo PPT On ）：

In other words ,LDA stay pLSA On the basis of these two parameters Two parameters of prior distribution are added （ Bayesization ）： A priori distribution of a topic distribution Dirichlet Distribution , And the prior distribution of a word distribution Dirichlet Distribution .

Sum up ,LDA It's really just pLSA Bayes version of , After document generation , Both of them should infer the theme distribution and word distribution according to the documents （ That is to say, both of them are designed to estimate the generation topic of a given document , The probability of generating words for a given topic ）, It's just that the parameter inference method is different , stay pLSA The maximum likelihood estimation is used to infer two unknown fixed parameters , and LDA Then make these two parameters into random variables , And add dirichlet transcendental .

therefore ,pLSA Follow LDA The essential difference is that they use different ideas to estimate unknown parameters , The former uses the idea of frequency school , The latter uses the idea of Bayesian school .
like , I went to a friend's house ：

• According to the idea of frequency school , I estimate that the probability of his being at home is 1/2, The probability of not being at home is 1/2, It's a fixed value .
• and According to the idea of Bayes , The probability that he is not at home is no longer considered a fixed value 1/2, It's a random variable . For example, according to our experience （ For example, at the end of the day ）, What's the probability of his being at home 0.6, But this 0.6 Either it's totally certain , It could be 0.7. such , Bayes can't give the exact value of parameters （0.3,0.4,0.6,0.7,0.8,0.9 It's possible ）, But at least understand In which range or which values （0.6,0.7,0.8,0.9） More likely , Which range or value （0.3,0.4） Is unlikely to . further , In Bayesian estimation , Multiple estimates of the parameters follow a certain prior distribution , Then according to the data obtained from practice （ For example, I keep running at his home on weekends ）, Constant correction of previous parameter estimates , From prior distribution to posterior distribution .

OK, I believe it has been explained clearly . If it's in machine learning class face-to-face, Better explanation and communication .

4.3.2 LDA Further understanding of the document generation process

Above said ,LDA in , Theme distribution  ——  such as { P(zi), i =1,2,3 } be equal to {0.4,0.5,0.1} or {0.2,0.2,0.6} ——  By dirichlet A priori given , Not based on documents . therefore ,LDA In the process of generating documents , First from dirichlet A priori “ Random ” Extract the theme distribution , Then from the theme distribution “ Random ” Take out the theme , Finally, from the word distribution corresponding to the determined theme “ Random ” Extract words .

that ,dirichlet What is the priori “ Random ” Extract the theme distribution ？

in fact , from dirichlet In the distribution, the topic distribution is randomly selected , This process is not completely random . To make it clear , Let's go back to dirichlet Distribution . in fact , If we take 3 An event , You can build a three-dimensional coordinate system , similar xyz Three dimensional coordinate system , here , We put 3 I've got a coordinate axis for p1、p2、p3, As shown in the figure below ：

In the space divided by this three-dimensional axis , Every coordinate point (p1,p2,p3) It corresponds to a theme distribution , And a certain point (p1,p2,p3) The size of represents 3 A theme z1、z2、z3 The probability of occurrence （ Because the sum of the probabilities of each topic is 1, therefore p1+p2+p3 = 1, And p1、p2、p3 this 3 The maximum value of each point is 1）. such as (p1,p2,p3) = (0.4,0.5,0.1) It corresponds to the theme distribution { P(zi), i =1,2,3 } = {0.4,0.5,0.1}.

You can imagine , There are many such points in the space (p1,p2,p3), It means there are many themes to choose from , that dirichlet How to choose theme distribution ？ Put down the diagonal triangle above , Map to the plane of the base , Then we get some color pictures as shown below （3 In a color picture , Each point corresponds to a theme distribution , The height represents the distribution of a subject dirichlet Probability of distribution selection , And Choose a different ,dirichlet The distribution tends to be different ）：

Let's look at the picture on the left , Height stands for dirichlet Distribution select a coordinate point (p1,p2,p3)（ This point is a theme distribution ） The probability of . As shown in the figure below , A point on three vertices of a projected triangle ：A=(0.9,0.05,0.05)、B=(0.05,0.9,0.05)、C=(0.05,0.05,0.9) The respective theme distribution is dirichlet The probability of distribution selection is very high , And the two points inside the plane triangle ：D、E The corresponding theme distribution is dirichlet The probability of distribution selection is very small .

So just say dirichlet The distribution is a random selection of any topic distribution , But there is still P(A) = P(B) = P(C) >> P(D) = P(E), namely dirichlet Distribution or “ A preference for ” Some themes are distributed . as for dirichlet Parameters of the distribution How to decide dirichlet In the shape of a distribution , It can be downloaded from dirichlet Thinking about the definition and formula of distribution .

Besides , Let's just say “ Random ” The theme selection is also based on the theme distribution “ Random ” selection , Random here is not exactly random , It is based on the probability value of each topic . For example, when dirichlet A priori is a document d Generated theme distribution { P(zi), i =1,2,3 } yes {0.4,0.5,0.1} when , So the theme z2 In the document d The probability of occurrence in 0.5. therefore , Extract theme from theme distribution , This process is not completely random , Instead, it is extracted according to the probability value of each topic .

4.3.3 pLSA Follow LDA Comparison of probability maps of

Next , In contrast LDA Follow pLSA The probability model graph model of , On the left is pLSA, The picture on the right is LDA（ The picture on the right is not very standard ,z Follow w All in lowercase ,  among , Shadow circles represent observable variables , Non shadow circles represent hidden variables , The arrows indicate the conditional dependence between the two variables conditional dependency, Box represents repeated sampling , The number in the lower right corner of the box represents the number of repeated samples ）：

It corresponds to LDA, Only W / w Is the observed variable , Others are hidden variables or parameters , among ,Φ The distribution of words ,Θ Show theme distribution ,  It's the theme distribution Θ Prior distribution （ namely Dirichlet Distribution ） Parameters of , It's word distribution Φ Prior distribution （ namely Dirichlet Distribution ） Parameters of ,N Represents the total number of words in the document ,M Represents the total number of documents .

therefore , For a document d Every word in ,LDA According to prior knowledge Determine the theme distribution of a document θ, Then from the document corresponding to multiple distribution （ Theme distribution ）θ Take a theme z, Then according to prior knowledge Determine the word distribution of the current topic ϕ, Then from the theme z The corresponding multiple distribution （ Word distribution ）ϕ Take a word from w. Then repeat the process N Time , It's a document d.

In other words ：

1. Suppose that there are M An article , Under every article Topic Of Theme distribution It's a The parameter is Of Dirichlet Sampling in the prior distribution Of Multinomial Distribution , Every Topic Under the Word distribution It's a The parameter is Of Dirichlet Sampling in the prior distribution Of Multinomial Distribution .
2. For the article No n Word , First of all, from each topic that appears in the article Multinomial Distribution （ Theme distribution ） Select or sample a topic , And then on this subject Corresponding Lexical Multinomial Distribution （ Word distribution ） Choose or sample a word . Repeat the random generation process , until M All articles have been generated .

Sum up ,M Documents will correspond to M Independent Dirichlet-Multinomial Conjugate structure ,K individual topic Will correspond to K Independent Dirichlet-Multinomial Conjugate structure .

• among ,→θ→z Represents the subject of all words in the generated document , obviously  →θ The corresponding is Dirichlet Distribution ,θ→z The corresponding is Multinomial Distribution , So the whole is one Dirichlet-Multinomial Conjugate structure , As shown in the figure below ：
• Allied ,→φ→w, Easy to see , here β→φ The corresponding is  Dirichlet Distribution , φ→w The corresponding is Multinomial Distribution , So the whole is a Dirichlet-Multinomial Conjugate structure , As shown in the figure below ：

4.3.4 pLSA Follow LDA Comparison of parameter estimation methods

It's a comparison of pLSA Follow LDA Different processes for generating documents , below , Let's turn it around , Assume that the document has generated , Back to its theme distribution . that , What's the difference between them in estimating unknown parameters ？

• stay pLSA in , We use EM Algorithm to estimate “ The theme - Term ” matrix Φ（ from Convert to get ） and “ file - The theme ” matrix Θ（ from Convert to get ） These two parameters , And both of these parameters are fixed values , It's just unknown , The idea used is actually maximum likelihood estimation MLE.
• And in the LDA in , It is estimated that Φ、Θ These two Unknown parameters can be used as variations (Variational inference)-EM Algorithm , It can also be used. gibbs sampling , The idea of the former is Maximum posterior estimate MAPMAP And MLE similar , All take unknown parameters as fixed values , The idea of the latter is Bayesian estimation . Bayes is right MAP An extension of , But it has something to do with MAP There are essential differences , That is to say, Bayesian estimation regards the parameters to be estimated as random variables subject to a priori distribution .
• Another example of Bayesian estimation . Suppose there are only two kinds of universities in China ： Science and engineering and liberal arts , The ratio of the number of these two schools is 1:1, among , The proportion of men and women in science and Engineering 7:1, The ratio of men to women in Liberal Arts 1:7. One day you were randomly thrown to a campus by aliens , Ask you what is the possible ratio of men to women in the school ？ then , You actually went around the campus , What you see 5 All of us are men , At this time, I will ask you again what is the sex ratio in this campus ？
1. Because at the beginning , Have prior knowledge , So the sex ratio in this school is either 7:1, Or 1:7, namely P( The ratio is 7:1) = 1/2,P( The ratio is 1:7) = 1/2.
2. And then see 5 Re estimate the sex ratio after boys , In fact, it is to ask for P( The proportion 7:1|5 A boy ）= ？,P( The proportion 1:7|5 A boy ) = ？
3. Use the Bayesian formula , Available ：P( The proportion 7:1|5 A boy ) = P( The proportion 7:1)*P(5 A boy | The proportion 7:1) / P(5 A boy ),P(5 A boy ) yes 5 A boy's prior probability , It has nothing to do with school , So it's a constant ; Allied ,P( The proportion 1:7|5 A boy ) = P(( The proportion 1:7)*P(5 A boy | The proportion 1:7)/P(5 A boy ).
4. Finally, compare the above two equations , Available ：P( The proportion 7:1|5 A boy )/P( The proportion 1:7|5 A boy ) = {P(( The proportion 7:1)*P(5 A boy | The proportion 7:1)} / { P( The proportion 1:7)*P(5 A boy | The proportion 1:7)}.

because LDA Consider the subject distribution and word distribution to be estimated as its prior distribution Dirichlet Random variable of distribution , therefore , stay LDA This estimates the distribution of themes 、 In the process of word distribution , Their prior distribution （ namely Dirichlet Distribution ） Given beforehand , that LDA That is to find their posterior distribution （LDA Available in the gibbs Sampling to solve their posterior distribution , Get expectations ）！

Besides , Don't bother to insert another sentence , stay LDA in , Topic distribution and word distribution are multi distribution , And from above 3.2 We can see that “Dirichlet Distribution is the conjugate prior probability distribution of polynomial distribution ”, So choose Dirichlet Distribution as their conjugate prior distribution . It means parameters for multiple distribution p The prior distribution chosen is Dirichlet Distribution , So in order to p The posterior distribution obtained by Bayesian estimation for the multiple distribution of parameters is still Dirichlet Distribution .

4.3.5 LDA Parameter estimation ：Gibbs sampling

Sort out the LDA The physical process in , Let's take a look at how to learn estimation .

Be similar to pLSA,LDA In the original paper, we use the variation -EM Algorithm estimates unknown parameters , Later, another estimate was found LDA Unknown parameters are better , And the way to do that is ：Gibbs Sampling, Sometimes it's called Gibbs Sampling or Gibbs Sampling , They all mean the same thing .Gibbs Sampling is Markov chain Monte Carlo theory （MCMC） To obtain a series of approximations equal to the specified multidimensional probability distribution （ such as 2 The joint probability distribution of one or more random variables ） The algorithm of observing samples .

OK, Given a set of documents ,w Is a known variable that can be observed , and It's a priori parameter given by experience , Other variables z,θ and φ All of them are unknown hidden variables , We need to learn to estimate according to the observed variables . according to LDA The graph model of , You can write the joint distribution of all variables ：

notes ： In the above formula and below , Equivalent to , Equivalent to , Equivalent to , Equivalent to .

because Generate theme distribution θ, Theme distribution θ Identify specific topics , And Produce word distribution φ、 Word distribution φ Identify specific words , So the above formula is equivalent to Joint probability distribution

among , The first factor It is based on a certain theme Prior distribution parameters of sum word distribution The process of sampling words , The second factor According to the prior distribution parameters of the topic distribution The process of sampling topics , These two factors are Two unknown parameters to be calculated .

Because these two processes are independent , So we can deal with it separately , Crush one by one .

The first factor , It can be based on a certain theme And from a priori distribution The word distribution obtained by sampling Φ produce ：

Because the words in the sample are subject to parameters Independent multiple distribution of , This means that the product of the above pairs of words can be divided into two levels of product of the subject and the pair of words ：

among , complement t In the theme k Is the number of times .

Go back to the first factor . Target distribution We need word distribution Φ integral , And combined with us before 3.1 Section Dirichlet Normalized coefficient of distribution Formula

Available ：

This result can be seen as K individual Dirichlet-Multinomial The product of the model .
Now let's start looking for the second factor . Be similar to Steps for , First write the conditional distribution , And then it's broken down into two parts ：

among , The word for i The document to which it belongs , It's the theme k In the article m Is the number of times .

Distribution of themes Θ You can get points ：

The result of combining the first factor and the second factor , obtain The joint distribution result of is

Next , With joint distribution , Then we can calculate the observable variables by joint distribution w The hidden variable under  z  The conditional distribution of （ Posterior distribution ） To do Bayesian analysis .

In other words , With this joint distribution , Ask for the answer m Of the documents n Word （ Subscript to be The word ） All the conditional probabilities of are easy to find .

Define a few variables first . To remove The word ,,.

then , Exclude the subject assignment of the current word , That is to say, the probability formula for calculating the theme of the current word according to the theme assignment of other words and the observed words is ：

Corrigendum ： in consideration of , So the molecules in the second line of the above formula , Not p(w,z) *p(z), It is p(w|z)*p(z).

And there are ：

The last step , It's based on Markov The state of the chain Get the parameters of theme distribution Θ And the parameters of word distribution Φ.

In other words, according to Bayesian law and Dirichlet transcendental , And what we've got above and The result of the product of two parts , It can be calculated that On every document Topic A posteriori distribution of each Topic The posteriori distribution of the following words is as follows （ As we can see above ： The posterior distribution is the same as their prior distribution , All of them are Dirichlet Distribution

among , It's about making documents m The subject number vector of , It's a theme k The term number vector of .

Besides , Don't forget the above 2.4 Section Dirichlet A property of , as follows ：

“  If , It can also be proved that the following conclusions are true ：

namely ： If , be Any element of The expectation is ：

It can be seen that , Hyperparameters The intuitive meaning of is the pseudo counting of event priors (prior pseudo-count).
therefore , The final solution Dirichlet The distribution expectation is

And then and The result is substituted by the result obtained before The results of , Available ：

Observe the above results carefully , You can find , The right half of the equation is , The value of this probability corresponds to The path probability of . such ,K individual topic Corresponding K Paths ,Gibbs Sampling It's right here K Sampling in paths , As shown in the figure below ：

How wonderful , That's it ,Gibbs Sampling By solving the posterior distribution of topic distribution and word distribution , So we can successfully solve the problem that the two parameters of topic distribution and word distribution are unknown .

After this article was published , Some enthusiastic readers share their own understanding on Weibo LDA What I learned , Also welcome more friends to share your understanding （ For example, under this article , Or comment on Microblogging On ）, To share 、 During the discussion, more people can better understand ：

1. @SiNZeRo：lda If you use em Namely map Estimated . lda The intention is to find the posterior distribution And then take the posterior distribution to do bayesian analysis . such as theta The expectations of the . Instead of introducing a priori as regularization . And finally gibbs sampling In fact, it is not the process of solving Yes explore Posterior distribution To sample To ask for expectation .
2. @ researcher July： Good questions, good suggestions , I've been perfecting these days ！//@ Shuai Guangying s：LDA How to use this thing ？ Where can I use it ？ And that is Gibbs What is the principle of sampling ？ How to implement the code ？ If you use EM To do it , How to implement the code ？ LDA What are the deformation and optimization of the model ？LDA It doesn't apply to what kind of problems ？ All in all , I don't know how to use , How to optimize parameters ？

3. @xiangnanhe： Write very well ,4.1.3 The two pictures in the section are very good , Very intuitive understanding of LDA The model with a priori is better than PLSI More flexible ;PLSI It is easy to fall into local minimum then overfitting.

4. @asker2： Whether it's pLSA in , still LDA in , Theme distribution and word distribution are fixed existence , But it's all unknown .pLSA Follow LDA The difference is that , The way to explore these two unknown parameters is not the same .pLSA It is to find the best parameter that can fit the text （ Distribution ）, This value is considered to be the real parameter . but LDA Think , In fact, we can't solve the topic distribution completely 、 What is the parameter of word distribution , We can only think of them as random variables , By reducing its variance （ Degree of change ） To try to make this random variable more “ exact ”. In other words , We don't want theme distribution anymore 、 The specific value of word distribution , It's the observations generated by these distributions （ Actual text ） To deduce the parameter range of the distribution , That is, to what extent it is possible , In what range it is unlikely that . therefore , In fact, this is the idea of Bayesian analysis , Although it is impossible to give the actual value , But according to experience, we can give a priori distribution that the relative reasonable real value obeys , Then we can solve the posterior distribution from the prior .

5. ..

## 6 References and recommended reading

1. Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. Latent Dirichlet allocation（LDA The original paper ）：http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf.
2. Blei. Probabilistic Topic Models：http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf, Translation of a netizen ：http://www.cnblogs.com/siegfang/archive/2013/01/30/2882391.html;
3. a pile wikipedia, For example, the implied Dirichlet distribution LDA Of wiki：http://zh.wikipedia.org/wiki/%E9%9A%90%E5%90%AB%E7%8B%84%E5%88%A9%E5%85%8B%E9%9B%B7%E5%88%86%E5%B8%83, Dirichlet distributed wiki：http://zh.wikipedia.org/wiki/%E7%8B%84%E5%88%A9%E5%85%8B%E9%9B%B7%E5%88%86%E5%B8%83;
4. From Bayesian method to Bayesian network  ;
5. rickjin Of LDA Math gossip （ The testimonials , Some of the images and formulas in this article are from this document ） Web version ：http://www.flickering.cn/tag/lda/,PDF edition ：http://emma.memect.com/t/9756da9a47744de993d8df13a26e04e38286c9bc1c5a0d2b259c4564c6613298/LDA;
6. Thomas Hofmann.Probabilistic Latent Semantic Indexing（pLSA The original paper ）：http://cs.brown.edu/~th/papers/Hofmann-SIGIR99.pdf;
7. Gregor Heinrich.Parameter estimation for text analysis（ About Gibbs sampling The most precise and detailed exposition ）：http://www.arbylon.net/publications/text-est.pdf;
8. Probabilistic latent semantic analysis (pLSA)：http://blog.tomtung.com/2011/10/plsa/http://blog.tomtung.com/2011/10/plsa/.
9. 《 Second edition of probability theory and mathematical statistics Mao Shisong and others 》, If you forget the statistical distribution , It is suggested to review this book or This article The second part ;
10. General introduction to support vector machines ： understand SVM Three levels of state 》, The second part is about Lagrange function ;
11. Machine learning class No 11 Next time in class , Zou Bo said EM & GMM Of PPT：http://pan.baidu.com/s/1i3zgmzF;
12. Machine learning class No 12 Next time in class , Zou Bo talks about the theme model LDA Of PPT：http://pan.baidu.com/s/1jGghtQm;
13. Theme model pLSA：http://blog.jqian.net/post/plsa.html;
14. Theme model LDA：http://blog.jqian.net/post/lda.html;
15. Search for the secret behind —— On semantic subject computing ：http://www.semgle.com/search-engine-algorithms-mystery-behind-search-on-the-calculation-of-semantic-topic;
16. LDA Of EM deduction ：http://www.cnblogs.com/hebin/archive/2013/04/25/3043575.html;
17. Machine Learning Book club No 8 In the last period , Shen Bo talks about the theme model PPT：http://vdisk.weibo.com/s/zrFL6OXKgKMAf;
18. Latent Dirichlet Allocation （LDA）- David M.Blei：http://www.xperseverance.net/blogs/2012/03/17/;
19. use GibbsLDA do Topic Modeling：http://weblab.com.cityu.edu.hk/blog/luheng/2011/06/24/%E7%94%A8gibbslda%E5%81%9Atopic-modeling/#comment-87;
20. The application of topic model in text mining ：http://net.pku.edu.cn/~zhaoxin/Topic-model-xin-zhao-wayne.pdf;
21. Binomial distribution and multiple distribution ,beta Contrast of distribution ：http://www.cnblogs.com/wybang/p/3206719.html;
22. LDA brief introduction ：http://cos.name/2010/10/lda_topic_model/;
23. LDA Related papers 、 Tool library ：http://site.douban.com/204776/widget/notes/12599608/note/287085506/;
24. A netizen studies LDA What I learned ：http://www.xuwenhao.com/2011/03/20/suggestions-for-programmers-to-learn-lda/;
25. http://blog.csdn.net/hxxiaopei/article/details/7617838;
26. Theme model LDA And recommend it on Weibo & The application of advertising algorithm ：http://www.wbrecom.com/?p=136;
27. LDA One of the inventors Blei My graduation thesis ：http://www.cs.princeton.edu/~blei/papers/Blei2004.pdf;
28. LDA One of the C Realization ：http://www.cs.princeton.edu/~blei/lda-c/index.html;
29. LDA Some other information about ：http://www.xperseverance.net/blogs/2012/03/657/.

## 7 Postscript

This LDA The notes from 11 month 17 Start writing on the afternoon of , To 21 The day is almost finished ,25 The day is basically over , Before and after , Basically finished + Basically finished , It took me nearly 10 Time of day , It needs to be improved . front 5 It's like walking in the woods , The general direction is very clear , But it took a lot of trouble to choose which path , But when finally out of the woods , Go to the top of the mountain , Overlooking the whole forest , Mr. , It's like this , There will be a feeling of total satisfaction ！ Then 5 God , Then slowly start to approach LDA The essence of ：pLSA Bayes version of .

The writing process is difficult but the result is thorough , I hope readers can enjoy it .

Last , Thank you again for the main reference of this article ：LDA The original paper 、pLSA The original paper 、LDA Math gossip 、 Machine learning class No 12 The theme model of the second class PPT, and Parameter estimation for text analysis And so on （ Most of the pictures in this article 、 The formula is taken from these references ）, Because of their creation or sharing , I have a chance to understand and reprocess LDA, Finally, let this article be written .

In the next few days, we will continue to revise and improve this article , If there are any questions , You can comment on ,thanks.

July、 November 21, 2014 .

https://chowdera.com/2020/12/20201207124140815e.html