# Detailed explanation of probability graph model

2021-04-03 02:46:52 mathor

B Station explanation

### Probability graph model

Consider three random variables a,b,c, Its joint probability distribution is ：

• Abstract the above three random variables into Directed graph Medium 3 Nodes
• For each conditional probability , Add corresponding links to the diagram （ arrow ）, The specific rule is ： The starting point of the arrow is the node corresponding to the condition in the conditional probability , The end point of the arrow is the node corresponding to the conclusion in conditional probability , for example P(b\mid a) From a The node leads out an arrow pointing to b node . Specially , For probability P(a), Because it's not conditional probability , So there's no node pointing to it
• If there is a node a To the node b Arrows , It is called node a It's a node. b The parent node , node b It's a node. a The child node of

actually ,P(a,b,c) Through the above three rules, we can make a definite digraph

Probability graph model （Probabilistic Graphical Model） It is a kind of probability model that uses graph to express the relationship between random variables ：

• Use a node to represent one or A group of A random variable
• The edges between nodes represent the probability relations among variables

Depending on the nature of the edge , Probability graph models can be roughly divided into two categories ：

1. Use a directed acyclic graph to represent the dependence between random variables , be called Bayesian network , It is applicable to the causal relationship between random variables
2. Use an undirected graph to represent the correlation between random variables , be called Markov networks , It applies to the relationship between random variables , But it's hard to show

### Conditions are independent

Consider three random variables a,b,c, And suppose that given b,c Under the condition of a The conditional probability distribution of does not depend on b Value , namely

Mark

To give c Under the condition of （ Or say c In the case of observation ）a,b Conditions are independent , In fact, conditional independence can be extended to the range of sets , That is, given a set X Under the condition of ,Y,Z Conditions are independent . When using probability models , Conditional independence plays an important role , It simplifies the structure of the model , It reduces the computation of model training and inference

### Bayesian network

Bayesian network structure \mathcal{G} It's a directed acyclic graph , Each node corresponds to a random variable . If there is a direct dependence between two random variables , Then connect them with an edge with an arrow . Bayesian network structure effectively expresses the conditional independence between features , It assumes that each node is only related to its immediate parent , And independent of other nodes . But usually , Each node has more than one parent , So we're going to X_i The set of parent nodes of is denoted as P_{X_i}

actually , The above bold part of the normalization is ： Bayesian networks assume that each feature is independent of the features expressed by its non descendant nodes . But because it's hard to understand , So I made some changes to it

The typical dependency relationship among three nodes in Bayesian network is shown in the figure below ：

#### The same parent structure

actually , According to the normal joint probability formula , Then there are

According to Bayesian hypothesis , Then there are

To sum up the above two formulas , It takes... To get rid of the same parts {\color{red} {P(X_3\mid X_1,X_2)=P(X_3\mid X_1)}}

Conclusion ：X_1 Under observation ,X_2 and X_3 Independent ;X_1 Without being observed ,X_2 and X_3 Not independent . namely X_2\perp X_3\mid X_1

#### Sequential structure

actually , According to the normal joint probability formula , Then there are

According to Bayesian hypothesis , Then there are

To sum up the above two formulas , It takes... To get rid of the same parts {\color{red} {P(X_2\mid X_1,X_3)=P(X_2\mid X_1)}}

Conclusion ：X_1 Under observation ,X_2 and X_3 Independent ;X_1 Without being observed ,X_2 and X_3 Not independent . namely X_2\perp X_3\mid X_1

#### V Type structure

actually , According to the normal joint probability formula , Then there are

According to Bayesian hypothesis , Then there are

To sum up the above two formulas , It takes... To get rid of the same parts {\color{red} {P(X_3\mid X_2)=P(X_3)}}\Rightarrow {\color{red} {P(X_2,X_3)=P(X_2)P(X_3)}}

Conclusion ：X_1 Under observation ,X_2 and X_3 Not independent ;X_1 Without being observed ,X_2 and X_3 Independent . namely X_2 \not \perp X_3\mid X_1. It can be explained by a popular example ,X_2 and X_3 A man and a woman , When these two people don't have children X_1 when ,X_2 and X_3 It's independent ; When they have children X_1 after ,X_2 and X_3 It's not independent Specially , If X_1 There are children X_4, And X_4 Under observation , No matter what X_1 What is the status ,X_2 and X_3 It's not independent

#### D- Divide

In order to judge whether any two nodes in Bayesian network are independent , We need to use D- Divide .D- Partition is actually a generalization of the three basic topological structures of Bayesian networks , The node relation is extended to the set relation

Let's first summarize the simple node relationship , And then it's extended to the set

There are nodes a,b, If in a and b There is a set of path nodes between V=\{v_1,v_2,...,v_n\}, If all nodes in the node set v_i Satisfy ：

1. v_i To be observed , And v_i The topological structure is the same parent structure or sequential structure
2. v_i Not observed , And v_i The topology is V Type structure

said a,b Independent

notes ：a To b Nodes on a path v_i Actually, there's more than one , But as long as one satisfies any of the above conditions , That path is blocked . also a To b There may be more than one path （ Ignore arrow direction ）, Only when all paths are blocked , Just think a and b Blocked

Generalization to set ： If there is a set of nodes A,B, If you're assembling A From any node in to B Any node in , All meet the above conditions , It's called a collection A,B Independent

As shown in the figure below , In the given X_2 and X_3 Under the circumstances ,X_1 and X_6 It's independent , namely X_1\perp X_6\mid (X_2, X_3). say concretely , from X_1 To X_6 There are two paths ：

1. X_1\to X_2\to X_6, among X_2 It was observed that , And X_2 It's a sequential structure , So the path is blocked
2. X_1\to X_3\to X_6, among X_3 It was observed that , And X_3 It's a sequential structure , So the path is blocked

As shown in the figure below , In the given X_1 and X_6 Under the circumstances ,X_2 and X_3 It's not independent , namely X_2\not \perp X_3\mid (X_1,X_6). say concretely , from X_2 To X_3 There are two paths ：

1. X_2\to X_6\to X_5\to X_3, among X_6 It was observed that , And X_6 yes V Type structure , So the path is not blocked
2. X_2\to X_1\to X_3, among X_1 It was observed that , And X_1 It's the same parent structure , So the path is blocked

Only when all paths are blocked ,X_2 and X_3 It's independent , therefore X_2 and X_3 It's not independent

In understanding the D- After the division , We can simplify the complicated probability formula , as follows ：

use D- Partition simplification ：

1. about P(A\mid C), because T yes V Type structure , And not observed , therefore A and C Independent , so P(A\mid C)=P(A)
2. about P(B\mid A,C), And 1 Empathy , therefore B and A Independent , so P(B\mid A,C)=P(B\mid C)
3. about P(T\mid A,B,C), because C It's the same parent structure , And it has been observed （ Because there is P(C)）, therefore B and T Independent , so P(T\mid A,B,C)=P(T\mid A,C)
4. about P(L\mid C,A,B,T), because A It's the same parent structure , And it has been observed （ Because there is P(A)）, therefore L With the exception of A All nodes are independent , so P(L\mid A)

Therefore, the calculation formula is simplified as

### * Moral map

Reading this section can help you better understand D- Divide . In order to analyze the conditional independence between nodes in digraph , We will use D- Divide , There's nothing wrong with the technology itself , But it's really not suitable for human to do , So we consider turning a digraph into an undirected graph , The connection of the edges represents the relationship between them , The specific steps are as follows ：

1. Find all of them in a digraph V Type structure , Add an undirected edge between its two parent nodes
2. Change all directed edges to undirected edges

The resulting undirected graph is called Moral map （Moral Graph）, The process of connecting parents is called moralization . Based on the moral map can be intuitive , Quickly find the conditional independence between nodes . As shown in the figure below ：

Moral map is a way to judge independence ： Let there be variables in the moral graph x,y And the set of observed variables \mathbb{Z}=\{z_i\}, If the variable x and y Can be shown on a graph \mathbb{Z} Separate , That is to set variables from the moral map \mathbb{Z} After removing ,x and y It belongs to two connected components , be x\perp y\mid \mathbb{Z} establish

It should be noted that , The conditional independence judged by the moral graph must be true in the original digraph , But not the other way around , Some conditional independence in digraph can not be judged from moral graph

### infer

In the graph model , infer （Inference） It's when some variables are observed \mathbb{E}=\{e_1,e_2,...,e_m\} when , Calculate some other variables \mathbb{Q}=\{q_1,q_2,...,q_n\} The posterior probability of P(\mathbb{E}\mid \mathbb{Q}). The inference methods of probability graph model can be divided into two categories ：

1. To infer exactly , It can be divided into Variable elimination and The spread of faith The two methods , But because of the complexity , So the scope of application is limited
2. Approximate inference , This article does not introduce , You can read this if you are interested article

#### Variable elimination （Variable elimination）

Variable elimination algorithm is to solve the edge distribution of a random variable , Get... By eliminating other variables （ Sum other variables for joint probability , Then based on conditional independence, we transform it into conditional probability multiplication of related variables ）

In fact, students who have studied probability theory should be impressed , For discrete random variables , Edge probability is summation ; For continuous random variables , The marginal probability is solved by integral . Here is an example , A random variable a=\{1,1,2,3\},b=\{1,2\},c=\{1,2\}, First, eliminate random variables by summation b, And then eliminate the random variables c, The resulting a The marginal probability of （ Because the decimal point is not accurate enough , So summation is not 1, Do not care about details ）

The following is an example , Explain the workflow of variable elimination in detail . Suppose the goal of inference is to calculate the edge probability P(X_5)

For calculation P(X_5), You just need to eliminate variables by addition X_1,X_2,X_3,X_4 that will do

According to the structure of Bayesian Networks , combination D- The partition algorithm can combine the joint probability P(X_1,X_2,X_3,X_4,X_5) Simplify

Bring it up , Rearrange \sum The location of , Then there are

Definition

\begin{aligned} m_{1,2}(X_2)&=\sum_{X_1}P(X_1)P(X_2\mid X_1)\\ m_{2,3}(X_3)&=\sum_{X_2}P(X_3\mid X_2)m_{1,2}(X_2)\\ m_{4,3}(X_3)&=\sum_{X_4}P(X_4\mid X_3)\\ m_{3,5}(X_5)&=\sum_{X_3}P(X_5\mid X_3)m_{2,3}(X_3)m_{4,3}(X_3) \end{aligned}

among ,m_{1,2}(X_2) Is only X_2 Function of ;m_{2,3}(X_3),m_{4,3}(X_3) Is only X_3 Function of ;m_{3,5}(X_5) Is only X_5 Function of

So there is

\begin{aligned} P(X_{5})&=\sum_{X_{3}} P(X_{5} \mid X_{3}) \sum_{X_{4}} P(X_{4} \mid X_{3}) \sum_{X_{2}} P(X_{3} \mid X_{2}) \sum_{X_{1}} P(X_{1}) P(X_{2} \mid X_{1}) \\ &=\sum_{X_{3}} P(X_{5} \mid X_{3}) \sum_{X_{4}} P(X_{4} \mid X_{3}) \sum_{X_{2}} P(X_{3} \mid X_{2}) m_{1,2}(X_{2}) \\ &=\sum_{X_{3}} P(X_{5} \mid X_{3}) \sum_{X_{4}} P(X_{4} \mid X_{3}) m_{2,3}(X_{3}) \\ &=\sum_{X_{3}} P(X_{5} \mid X_{3}) m_{2,3}(X_{3}) \sum_{X_{4}} P(X_{4} \mid X_{3}) \\ &=\sum_{X_{3}} P(X_{5} \mid X_{3}) m_{2,3}(X_{3}) m_{4,3}(X_{3}) \\ &=m_{3,5}(X_{5}) \end{aligned}

actually ,m_{4,3}(X_3)=\sum_{X_{4}}P(X_4\mid X_3)=1, therefore P(X_5)=\sum_{X_3}P(X_5\mid X_3)m_{2,3}(X_3)

summary ： The variable elimination method uses the distributive law of multiplication to addition , The summation of the product of multiple variables is transformed into the problem of alternately summation and product of some variables

#### The spread of faith （Belief propagation）

The belief propagation algorithm regards the summation operation in the variable elimination method as a message passing process , The problem of repeated calculation in solving multiple marginal distributions is solved

Formally , Hypothesis function m_{i,j}(X_i) Represents a part of the intermediate result of probability summation of random variables , for example \sum_{X_1}P(X_1)P(X_2 \mid X_1) It can be expressed as m_{1,2}(x_2). Then the summation operation of belief propagation algorithm can be expressed by the following formula ：

In the above formula ,\psi(X_i, X_j) Represents a variable X_i and X_j A function of the relevance of , It can be conditional probability or joint probability distribution ;n(i) In the probability graph X_i Adjacent node variable set . In the belief propagation algorithm , Each message passing operation is only related to X_i And its adjacent nodes , The computation of message passing is limited to local parts of the graph

Be careful , In the spread of faith , This function m_{ij}(X_j) It can be expressed as a node X_i towards X_j A message delivered

In the belief propagation algorithm ：

• A node Only after receiving messages from all other nodes To send a message to another node
• The edge probability of the node is proportional to the product of the received messages ：P(X_i)\propto \prod_{k\in n(i)}m_{k,i}(X_i)

If in the graph structure No ring , Then the belief propagation algorithm can complete all message transmission through two steps , And then calculate the marginal distribution of all variables ：

1. Specify a root node , Start from all leaf nodes and pass messages to the root node , Until the root node receives messages from all neighboring nodes .
2. From the root node to the leaf node , Until all leaf nodes receive messages

#### * Popular explanation Belief Propagation The idea of algorithm

The situation a ： A number of soldiers lined up A line , Every soldier Can only communicate with the soldiers next to him , Ask how to Let every soldier know the total number of soldiers

obviously , To any soldier in it , They can only communicate with their neighbors and pass on information . For example, the first one on the far left passes to the second one ：" The second one is on the left 1 personal ". Second to third ：" Third on the left is 2 individual ". All the way to the last one on the far right , The last one got the message that he had... On his left 5 personal

But at this time, only the person on the far right knows the total number （5+1=6）, You can't make other people know the total number yet . Others only know how many people are on his left , How many on the right don't know

Next, the messages are passed from right to left . For example, the first on the far right tells the second on the far right ：" The second one is just him on the right 1 individual "...... After passing forward and backward each time , Everyone has two messages , One is on the left side of the soldier L personal , One is on the right R personal , In the end, everyone can know the total number （L+R+1）

But it's not efficient , How can it be faster ？ We can At the same time, both sides send messages in opposite directions , Because it doesn't affect the workflow

Scenario two ： A number of soldiers stood in their positions , Some soldiers have more than 2 Two adjacent soldiers , There may be 3 One or more . Every soldier Can only communicate with the soldiers next to him , Ask how to Let every soldier know the total number of soldiers

The picture above is already with BP Algorithms are linked . The result of propagation is that each node can get messages from other nodes , So we can calculate the edge function

Scenario three ： A number of soldiers lined up The loop , Every soldier Can only communicate with the soldiers next to him , Ask how to Let every soldier know the total number of soldiers

Because of the loop , If the above information update method is used to determine the total number of people , will As a result, it is impossible to determine when to stop the transmission of information , As a result, it is impossible to determine the number of soldiers

Participation of this paper Tencent cloud media sharing plan , You are welcome to join us , share .

https://chowdera.com/2021/04/20210402125115352J.html