
From Bayesian method to Bayesian network



 

0 Introduction

    In fact, there are many books introducing Bayes' theorem, Bayesian methods, and Bayesian inference, such as A Brief History of Mathematical Statistics and Statistical Decision Theory and Bayesian Analysis by James O. Berger. Chinese materials on Bayesian networks, however, are scarce: there are few Chinese books on the subject and most of the material is in English, and a beginner who is handed a pile of English papers right away often cannot keep reading, either for lack of foundation or because of the language barrier, which is a pity (of course, with a certain foundation, you can then go on to read more English material).

    On the morning of November 9, the 9th session of the machine learning class covered Bayesian networks, distilling the key points: the definition of a Bayesian network, its three structural forms, factor graphs, and the Sum-Product algorithm. Once you know what a Bayesian network is, how it works, and what it is for, I believe you will understand the English papers much better.

    So this article combines the lecture notes with related reference materials, starting from the Bayesian method and focusing on Bayesian networks. It can still be regarded as a reading note or study note; if you find any problems, you are welcome to point them out, thanks.

 

 

1 Bayesian method

    For a long time, the probability that something would or would not happen was taken to be a fixed 0 or 1: either it happens or it does not. People did not think in terms of the probability that something happens and the probability that it does not. Moreover, although the probability θ was unknown, it was assumed to be one definite value. For example, if you asked people at that time: "There is a bag containing some white and black balls; what is the probability θ of drawing a white ball from the bag?" They would answer without even thinking: the probability of drawing a white ball is 1/2 — either you draw a white ball or you don't. That is, θ can take only one value, and no matter how many times you draw, the probability θ of drawing a white ball is always 1/2 and does not change with the observations X.

    This frequentist way of thinking dominated people's minds for a long time, until a figure named Thomas Bayes appeared.

1.1 Bayes method

    Thomas Bayes (1702-1763) was little known during his lifetime; he rarely published papers or books and had little contact with the academic community of his day. In today's terms, Bayes was an obscure amateur scholar working outside the mainstream, yet this "nobody" eventually published a paper titled "An essay towards solving a problem in the doctrine of chances", that is, an essay on solving a problem in the theory of chances. You may think I am about to say: the publication of this paper immediately caused a sensation and secured Bayes's place in the history of science.

            

    In fact, after the paper was published, it had little impact at the time; only in the 20th century did it receive more and more attention. In this respect it resembles Van Gogh: his paintings were worth nothing during his life and became priceless after his death.

    Back to the example above: "There is a bag containing some white and black balls; what is the probability θ of drawing a white ball from the bag?" Bayes held that the probability of drawing a white ball is an uncertain quantity, because chance plays a part. For example, a friend starts a business. You know there are only two outcomes — success or failure — but you cannot help estimating the probability that he succeeds. If you know him well and he has a plan, a clear head, perseverance, and can rally the people around him, you may well put his probability of success at 80% or above. This differs from the earlier "black or white, 0 or 1" way of thinking; it is the Bayesian way of thinking.

    Before going further into the Bayesian method, let us briefly summarize the different ways of thinking of the frequentist school and the Bayesian school:

  • The frequentist school treats the parameter θ to be inferred as a fixed unknown constant: although the probability θ is unknown, it is at least one definite value. Meanwhile, the sample X is random, so the frequentists focus on the sample space, and most probability calculations concern the distribution of the sample X;
  • The Bayesian view is the opposite: they regard the parameter θ as a random variable, while the sample X is fixed. Since the sample is fixed, they focus on the distribution of the parameter θ.

    Relatively speaking, the frequentist idea is easier to understand, so the following focuses on the Bayesian view.

    Since the Bayesian school regards θ as a random variable, to compute the distribution of θ we must first know the unconditional distribution of θ: before there is a sample (or before X is observed), what is the distribution of θ?

    For example, throw a ball onto a billiard table. Where will it land? If the throw is unbiased, the ball has an equal chance of landing anywhere on the table; that is, the probability of the ball landing at any particular position follows a uniform distribution. A distribution of this kind, determined as a basic premise before the experiment, is called the prior distribution, or the unconditional distribution of θ.

    Thus, Bayes and the Bayesian school proposed a fixed pattern of thinking:

  • Prior distribution π(θ) + sample information X  ⇒  posterior distribution π(θ|x)

    This pattern of thinking means that newly observed sample information revises our previous understanding of things. In other words, before obtaining new sample information, our knowledge of θ is the prior distribution π(θ); after obtaining the new sample information X, our knowledge of θ becomes the posterior distribution π(θ|x).

        Here, prior information generally comes from experience and historical data. For example, when Lin Dan plays against another competitor, the commentator usually makes a rough judgment about the outcome of the match based on the results of all of Lin Dan's previous matches. Another example: a factory inspects its products every day in order to assess the defect rate θ. After a while, a lot of historical data accumulates; this historical data is prior knowledge. With this prior knowledge, the factory has a basis for deciding whether a product still needs daily inspection: if past data show that a product's defect rate is only 0.01%, it can be treated as a trustworthy or inspection-exempt product and spot-checked only once or twice a month, saving a great deal of manpower and material resources.

    The posterior distribution π(θ|X) is then generally regarded as the conditional distribution of θ given the sample X, and the value of θ that maximizes π(θ|X) is called the maximum a posteriori (MAP) estimate, analogous to the maximum likelihood estimate (MLE) in classical statistics.
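    To make the "prior + sample ⇒ posterior" pattern and the MAP estimate concrete, here is a minimal sketch in Python for the defect-rate example above. All the numbers (the Beta prior pseudo-counts and today's inspection counts) are invented for illustration; with a Beta prior and binomial data the posterior is again a Beta distribution, so the update is a single line.

```python
# Hypothetical prior on the defect rate theta, encoded as Beta(a, b):
# historical data suggested defects are rare, so most prior mass sits near 0.
a_prior, b_prior = 2, 98          # prior pseudo-counts: 2 defective, 98 good

# New sample information: today's inspection found 1 defective item out of 200.
defective, good = 1, 199

# Conjugate update: Beta prior + binomial likelihood -> Beta posterior.
a_post, b_post = a_prior + defective, b_prior + good

# MAP estimate = posterior mode (a - 1) / (a + b - 2); compare with the classical MLE.
theta_map = (a_post - 1) / (a_post + b_post - 2)
theta_mle = defective / (defective + good)

print(f"posterior: Beta({a_post}, {b_post})")
print(f"MAP estimate of theta: {theta_map:.4f}")   # pulled toward the prior
print(f"MLE of theta:          {theta_mle:.4f}")   # uses only today's sample
```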

    Putting this together: it is as if humans began with little prior knowledge of nature, but through constant observation and experiment obtained more samples and results, making their understanding of the laws of nature ever more thorough. The Bayesian method therefore agrees with the way people think in everyday life and with the way people come to understand nature; through constant development it eventually came to occupy half of the field of statistics, standing side by side with classical statistics.

    Besides the above pattern of thinking, Bayes is known above all for the world-famous Bayes' theorem.

1.2 Bayes theorem

    Before introducing Bayes' theorem, let us first review a few definitions:

  • Conditional probability (also called posterior probability) is the probability of an event A occurring given that another event B has occurred. The conditional probability is written P(A|B), read "the probability of A given B".

For example, for events (subsets) A and B in the same sample space Ω, if an element chosen at random from Ω belongs to B, then the probability that this randomly chosen element also belongs to A is defined as the conditional probability of A given B. Hence P(A|B) = |A∩B|/|B|; dividing numerator and denominator by |Ω| gives P(A|B) = P(A∩B)/P(B).

  • Joint probability is the probability of two events occurring together. The joint probability of A and B is written P(A∩B) or P(A,B).
  • Marginal probability (also called prior probability) is the probability of a single event. The marginal probability is obtained as follows: in the joint probability, the events not needed in the final result are combined into their total probability and eliminated (summing over the total probability for discrete random variables, integrating for continuous random variables). This is called marginalization (marginalization). For example, the marginal probability of A is written P(A), and the marginal probability of B is written P(B).

    Next, consider the quantity P(A|B): the probability that A occurs given that B has occurred.

  1. First, before event B occurs, we have a basic judgment of the probability that event A occurs, called the prior probability of A, written P(A);
  2. Second, after event B occurs, we reassess the probability that event A occurs, called the posterior probability of A, written P(A|B);
  3. Similarly, before event A occurs, we have a basic judgment of the probability that event B occurs, called the prior probability of B, written P(B);
  4. Likewise, after event A occurs, we reassess the probability that event B occurs, called the posterior probability of B, written P(B|A).

    Bayes' theorem is based on the following Bayes formula:

    P(A|B) = P(B|A) P(A) / P(B)
    The derivation of this formula is very simple: it follows from the definition of conditional probability.

    According to the definition of conditional probability, the probability of A occurring given that B has occurred is

    P(A|B) = P(A∩B) / P(B)

    Similarly, the probability of B occurring given that A has occurred is

    P(B|A) = P(A∩B) / P(A)

    Rearranging and combining the two equations, we get:

    P(A|B) P(B) = P(A∩B) = P(B|A) P(A)

 

    Next, dividing both sides of the above equation by P(B) (provided P(B) is nonzero), we obtain the formula expression of Bayes' theorem:

    P(A|B) = P(B|A) P(A) / P(B)

    Therefore, the Bayes formula can be derived directly from the definition of conditional probability: since P(A,B) = P(A)P(B|A) = P(B)P(A|B), we have P(A|B) = P(A)P(B|A) / P(B).
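    As a quick numeric check of the formula, here is a small sketch; the numbers (a 1% base rate and the detector's 90% / 9% hit and false-alarm rates) are invented for illustration. It computes P(B) by the law of total probability and then the posterior P(A|B) via Bayes' theorem.

```python
# Hypothetical events: A = "item is defective", B = "detector flags the item".
p_a = 0.01              # prior P(A)
p_b_given_a = 0.90      # P(B|A): detector flags a defective item
p_b_given_not_a = 0.09  # P(B|not A): false-alarm rate

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B).
p_a_given_b = p_b_given_a * p_a / p_b

print(f"P(B)   = {p_b:.4f}")          # 0.0981
print(f"P(A|B) = {p_a_given_b:.4f}")  # about 0.0917: far smaller than P(B|A) = 0.90
```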

1.3 Application: spelling check

    Anyone who often searches the web knows that when you accidentally type a word that does not exist, the search engine will suggest the word you probably meant to enter. For example, when you type "Julw" into Google, the system guesses your intention: did you mean to search for "July"? As shown in the figure below:

    This is called spelling check. According to an article by a Google employee, Google's spelling check is based on the Bayesian method. Let us look at how the Bayesian method can be used to implement the "spelling check" function.

    When the user enters a word, the spelling may be correct or it may be wrong. Write c for the correct spelling (c for correct) and w for the misspelling (w for wrong). Then what "spelling check" has to do is: given that w has occurred, try to infer c. In other words: knowing w, find, among a number of candidates, the most likely c, i.e. maximize P(c|w).
    According to Bayes' theorem, we have:

    P(c|w) = P(w|c) P(c) / P(w)

    Since for all candidate words c the observed misspelling w is the same, P(w) is the same for every candidate, so we only need to maximize

    P(w|c) P(c)

    That is enough. Here:

  • P(c) is the "probability" that a correct word appears, which can be replaced by its "frequency". If we have a large enough text corpus, the frequency of each word in the corpus is equivalent to its probability of occurrence. The more frequently a word appears, the larger P(c). For example, when you type the wrong word "Julw", the system is more inclined to guess that you wanted to type "July" rather than "Jult", because "July" is more common.
  • P(w|c) is the probability of producing the misspelling w when trying to spell c. To simplify the problem, assume that the closer two words are in form, the more likely the misspelling, and the larger P(w|c). For instance, a misspelling one letter off is more likely than one two letters off. If you meant to type July, misspelling it as Julw (one letter off) is more likely than misspelling it as Jullw (two letters off). Incidentally, this kind of problem is generally called "edit distance"; see the earlier article on this blog.

    Therefore, we compare the frequencies of all words with similar spellings in the corpus and pick the one with the highest frequency; that is the word the user most likely wanted to type. For the detailed calculation procedure and the shortcomings of this method, see the references at the end of this article.
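    The reasoning above fits in a few lines of code. The sketch below follows the same idea as Peter Norvig's well-known spell corrector (reference 11): P(c) is approximated by word frequency in a corpus, and P(w|c) is approximated crudely by preferring candidates at edit distance 1 over those at edit distance 2. The corpus file name big.txt and the simple ranking rule are assumptions for illustration, not Google's actual implementation.

```python
import re
from collections import Counter

# P(c): word frequencies counted from a plain-text corpus (file name is hypothetical).
WORDS = Counter(re.findall(r"[a-z]+", open("big.txt", encoding="utf-8").read().lower()))

def edits1(word):
    """All strings at edit distance 1 from `word` (deletes, transposes, replaces, inserts)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return {w for w in words if w in WORDS}

def correct(word):
    """argmax_c P(w|c) P(c): prefer closer candidates, break ties by corpus frequency P(c)."""
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=WORDS.get)

print(correct("julw"))   # likely "july", if the corpus contains it
```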

 

 

2 Bayesian network

2.1 Definition of a Bayesian network

    A Bayesian network (Bayesian network), also known as a belief network (Belief Network) or a directed acyclic graphical model, is a probabilistic graphical model first proposed by Judea Pearl in 1985. It is a model for handling uncertainty that simulates the causal relationships in human reasoning, and the topology of the network is a directed acyclic graph (DAG).

    The nodes in the directed acyclic graph of a Bayesian network represent random variables; they can be observable variables, or hidden variables, unknown parameters, and so on. Variables or propositions considered causally related (or not conditionally independent) are connected by arrows. If two nodes are connected by a single arrow, one of the nodes is the "cause" (parent) and the other is the "effect" (child), and the two nodes produce a conditional probability value.

    In short, an arrow connecting two nodes indicates that the two random variables are causally related, or not conditionally independent.

    For example, suppose node E directly affects node H, i.e. E→H. Then we draw an arrow from E to H, establishing the directed arc (E,H) from node E to node H, whose weight (connection strength) is expressed by the conditional probability P(H|E), as shown in the figure below:

     In short, drawing the random variables involved in a research system in a directed graph according to whether they are conditionally independent forms a Bayesian network. It is mainly used to describe conditional dependence between random variables, with circles representing random variables (random variables) and arrows representing conditional dependencies (conditional dependencies).

     Let G = (I,E) denote a directed acyclic graph (DAG), where I is the set of all nodes in the graph and E is the set of directed edges, and let X = (Xi), i ∈ I, be the random variables represented by the nodes i of the DAG. If the joint probability of X can be written as

    p(x) = ∏_{i ∈ I} p(x_i | x_{pa(i)})

    then X is said to be a Bayesian network with respect to the directed acyclic graph G, where pa(i) denotes the "causes" of node i, i.e. pa(i) are the parents of i.

    In addition, for any set of random variables, the joint probability can be obtained by multiplying the local conditional probability distributions:

    p(x_1, …, x_K) = p(x_K | x_1, …, x_{K-1}) ⋯ p(x_2 | x_1) p(x_1)

    The figure below shows a simple Bayesian network:

 

 

    Because a causes b, and a and b together cause c, we have

    P(a, b, c) = P(c | a, b) P(b | a) P(a)
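    To make the factorization concrete, here is a minimal sketch for the three-node network a → b, (a, b) → c above; all the numeric CPT values are invented for illustration, and the variables are taken to be binary.

```python
# Conditional probability tables for the network a -> b, (a, b) -> c (numbers invented).
P_a = {0: 0.7, 1: 0.3}                               # P(a)
P_b_given_a = {(0, 0): 0.8, (0, 1): 0.2,             # P(b|a), key = (a, b)
               (1, 0): 0.3, (1, 1): 0.7}
P_c_given_ab = {(0, 0, 0): 0.9, (0, 0, 1): 0.1,      # P(c|a,b), key = (a, b, c)
                (0, 1, 0): 0.5, (0, 1, 1): 0.5,
                (1, 0, 0): 0.4, (1, 0, 1): 0.6,
                (1, 1, 0): 0.1, (1, 1, 1): 0.9}

def joint(a, b, c):
    """P(a, b, c) = P(a) * P(b|a) * P(c|a,b), read directly off the graph."""
    return P_a[a] * P_b_given_a[(a, b)] * P_c_given_ab[(a, b, c)]

# Sanity check: the joint sums to 1 over all 8 assignments.
print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))   # 1.0

# Marginalizing c out recovers P(a, b) = P(a) P(b|a).
print(sum(joint(1, 1, c) for c in (0, 1)), P_a[1] * P_b_given_a[(1, 1)])     # both 0.21
```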

2.2 The three structural forms of a Bayesian network

    Given the Bayesian network shown in the figure below:

    It can be seen intuitively from the figure that:

  • 1. The joint distribution of x1, x2, …, x7 is

    P(x1, …, x7) = P(x1) P(x2) P(x3) P(x4|x1,x2,x3) P(x5|x1,x3) P(x6|x4) P(x7|x4,x5)

  • 2. x1 and x2 are independent (corresponding to head-to-head);
  • 3. x6 and x7 are conditionally independent given x4 (corresponding to tail-to-tail).

    From the figure above, point 1 may be easy to understand, but what do the conditional independences in points 2 and 3 mean? In fact, points 2 and 3 are two of the three structural forms found in Bayesian networks. To make this clear, we need to introduce the concept of D-Separation (D-separation).

    D-separation is a graphical method for determining whether variables are conditionally independent. In other words, for a DAG (directed acyclic graph), the D-separation method can quickly determine whether two nodes are conditionally independent.

2.2.1 Form 1: head-to-head

    The first structure of a Bayesian network is shown in the figure below:

    Here P(a,b,c) = P(a) * P(b) * P(c|a,b) holds. Summing both sides over c and simplifying gives:

    Σ_c P(a,b,c) = P(a) P(b) Σ_c P(c|a,b)   ⇒   P(a,b) = P(a) P(b)

    That is, when c is unknown (unobserved), a and b are blocked (blocked) and therefore independent. This is called head-to-head conditional independence, corresponding to "x1 and x2 are independent" in the figure at the beginning of this section.

2.2.2 Form 2: tail-to-tail

    The second structure of a Bayesian network is shown in the figure below:

    Consider the two cases of c unknown and c known:

  1. When c is unknown, we have P(a,b,c) = P(c) * P(a|c) * P(b|c); from this P(a,b) = P(a)P(b) cannot be derived, i.e. when c is unknown, a and b are not independent.
  2. When c is known, we have P(a,b|c) = P(a,b,c)/P(c); substituting P(a,b,c) = P(c) * P(a|c) * P(b|c) into this equation gives P(a,b|c) = P(a,b,c)/P(c) = P(c) * P(a|c) * P(b|c) / P(c) = P(a|c) * P(b|c), i.e. when c is known, a and b are independent.

    Therefore, given c, a and b are blocked (blocked) and hence independent; this is called tail-to-tail conditional independence, corresponding to "x6 and x7 are conditionally independent given x4" in the figure at the beginning of this section.

2.2.3 Form 3: head-to-tail

    The third structure of a Bayesian network is shown in the figure below:

    Again consider the two cases of c unknown and c known:

  1. When c is unknown, we have P(a,b,c) = P(a) * P(c|a) * P(b|c), from which P(a,b) = P(a)P(b) cannot be derived, i.e. when c is unknown, a and b are not independent.
  2. When c is known, we have P(a,b|c) = P(a,b,c)/P(c), and using P(a,c) = P(a) * P(c|a) = P(c) * P(a|c), this simplifies to P(a,b|c) = P(a) * P(c|a) * P(b|c) / P(c) = P(a,c) * P(b|c) / P(c) = P(a|c) * P(b|c).

    Therefore, given c, a and b are blocked (blocked) and hence independent; this is called head-to-tail conditional independence.
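    The three algebraic arguments above can also be checked numerically. The sketch below builds a small head-to-head example a → c ← b with invented CPT numbers and verifies that P(a,b) = P(a)P(b) while P(a,b|c) ≠ P(a|c)P(b|c); the same pattern of checks applies to the tail-to-tail and head-to-tail structures.

```python
from itertools import product

# Head-to-head structure a -> c <- b, binary variables, invented CPT numbers.
P_a = {0: 0.6, 1: 0.4}
P_b = {0: 0.7, 1: 0.3}
P_c1_given_ab = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.99}   # P(c=1 | a, b)

def joint(a, b, c):
    pc1 = P_c1_given_ab[(a, b)]
    return P_a[a] * P_b[b] * (pc1 if c == 1 else 1 - pc1)

def marg(**fixed):
    """Sum the joint over every variable that is not pinned in `fixed`."""
    total = 0.0
    for a, b, c in product((0, 1), repeat=3):
        values = {"a": a, "b": b, "c": c}
        if all(values[k] == v for k, v in fixed.items()):
            total += joint(a, b, c)
    return total

# c unobserved: a and b are marginally independent.
print(marg(a=1, b=1), marg(a=1) * marg(b=1))          # both 0.12

# c observed: a and b become dependent ("explaining away").
p_ab_c = marg(a=1, b=1, c=1) / marg(c=1)
p_a_c = marg(a=1, c=1) / marg(c=1)
p_b_c = marg(b=1, c=1) / marg(c=1)
print(round(p_ab_c, 4), round(p_a_c * p_b_c, 4))      # 0.2837 vs 0.3414: not equal
```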

    A side note: this head-to-tail structure is in fact a chain network, as shown in the figure below:

    From the explanation of head-to-tail above, we already know that given x_i, the distribution of x_{i+1} is conditionally independent of x_1, x_2, …, x_{i-1}. What does this mean? It means that the distribution of x_{i+1} depends only on x_i and is conditionally independent of the other variables. Put simply, the current state depends only on the previous state and has nothing to do with the states before that. A random process that evolves sequentially in this way is called a Markov chain (Markov chain). And we have:

    P(x_{n+1} | x_1, …, x_n) = P(x_{n+1} | x_n)

    Next, extend the above from single nodes to sets of nodes: for any sets of nodes A, B, C, examine all paths from any node in A to any node in B. For A and B to be conditionally independent given C, all such paths must be blocked (blocked), that is, each path must satisfy one of the following two conditions:

  1. the path contains a node that is "head-to-tail" or "tail-to-tail" with respect to the path, and that node is in C;
  2. the path contains a node that is "head-to-head" with respect to the path, and neither that node nor any of its descendants is in C.

    Finally, here is an example covering the three cases of D-separation (that is, the three structural forms of a Bayesian network), as shown in the figure below:

 

    The left part of the figure above is head-to-tail: given T, A and X are independent. The upper right of the right part is tail-to-tail: given S, L and B are independent. The lower right of the right part is head-to-head: when D is not given, L and B are independent.
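    For larger graphs, instead of tracing the three patterns by hand, d-separation can be tested mechanically. One standard procedure (not specific to this article) is: take the ancestral subgraph of A ∪ B ∪ C, moralize it (connect the parents of every common child and drop edge directions), delete the nodes in C, and check whether A and B are still connected. A compact sketch, assuming the DAG is given as a parent map:

```python
def d_separated(parents, A, B, C):
    """True iff A and B are d-separated given C in the DAG described by `parents`.

    `parents` maps each node to the set of its parents, e.g. {"c": {"a", "b"}}.
    Uses the ancestral-graph + moralization criterion.
    """
    # 1. Keep only A, B, C and all of their ancestors.
    relevant, stack = set(), list(A | B | C)
    while stack:
        n = stack.pop()
        if n not in relevant:
            relevant.add(n)
            stack.extend(parents.get(n, set()))

    # 2. Moralize: undirected parent-child edges, plus edges between co-parents.
    adj = {n: set() for n in relevant}
    for child in relevant:
        ps = parents.get(child, set()) & relevant
        for p in ps:
            adj[p].add(child)
            adj[child].add(p)
        for p in ps:
            for q in ps:
                if p != q:
                    adj[p].add(q)

    # 3. Delete the conditioning set C, then test reachability from A to B.
    seen, stack = set(A - C), list(A - C)
    while stack:
        n = stack.pop()
        for m in adj[n] - C:
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return not (seen & B)

# Head-to-head example a -> c <- b: a and b are independent, but not once c is observed.
parents = {"c": {"a", "b"}}
print(d_separated(parents, {"a"}, {"b"}, set()))    # True
print(d_separated(parents, {"a"}, {"b"}, {"c"}))    # False
```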

2.3 Examples of Bayesian Networks

    Given the Bayesian network as shown in the figure below :

 

 

    The meaning of each word and expression in the figure is as follows:

  • smoking means smoking; its probability is written P(S). lung Cancer means lung cancer; the probability of getting lung cancer given smoking is written P(C|S). X-ray means that a medical X-ray is needed; lung cancer may lead to the need for an X-ray, and smoking may also lead to the need for an X-ray (so smoking is also one of the causes of X-ray). Therefore, the probability of needing an X-ray given lung cancer and smoking is written P(X|C,S).
  • Bronchitis means bronchitis; the probability of getting bronchitis given smoking is P(B|S). dyspnoea means difficulty in breathing; bronchitis can cause breathing difficulties, and lung cancer can also cause breathing difficulties (so lung Cancer is also one of the causes of dyspnoea). The probability of dyspnoea given lung cancer and bronchitis is written P(D|C,B).

    Abbreviate lung Cancer as C, Bronchitis as B, and dyspnoea as D, with C = 0 meaning lung Cancer does not occur and C = 1 meaning it occurs; B equal to 0 (B does not occur) or 1 (B occurs) is analogous to C; likewise, D = 1 means D occurs and D = 0 means D does not occur. We can then obtain a probability table for dyspnoea, as shown in the bottom right corner of the figure above.

2.4 Factor graph

    Returning to the example of Section 2.3, shown again in the figure below:

 

 

    For the figure above: given that a person has difficulty breathing (dyspnoea), what is the probability that he smokes (smoking)? That is, we want P(s | d = 1).

    Let us calculate and derive this step by step:

    P(s | d = 1) ∝ P(s, d = 1)
                 = Σ_{b,x,c} P(s, c, b, x, d = 1)
                 = P(s) Σ_b P(b|s) Σ_x Σ_c P(c|s) P(x|c,s) P(d = 1 | c, b)

    An explanation of the derivation above:

  1. The second line: sum the joint probability over b, x, c (under the condition d = 1) to eliminate b, x, c, obtaining the joint probability of s and d = 1.
  2. The third line: at first, all the variables sit behind Σ over (d = 1, b, x, c) (Σ denotes summation), but since P(s) has nothing to do with d = 1, b, x, c, it can be moved to the front of the expression; likewise, P(b|s) has nothing to do with x and c, so it can be moved out and placed behind Σ_b, leaving Σ_x and Σ_c on the right of the equation.

    In addition, "Variable elimination" in the figure means elimination of variables. To solve this kind of problem better, we now introduce the concept of the factor graph.
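    The derivation above can also be checked by brute-force enumeration over the hidden variables. The CPT numbers below are invented (the figure in Section 2.3 holds the real ones); the code computes P(s = 1 | d = 1) exactly as in the summation above, summing the joint over b, x, c and then normalizing over s.

```python
from itertools import product

# Hypothetical CPTs for the smoking network of Section 2.3 (all numbers invented).
P_s1 = 0.3                                                        # P(smoking = 1)
P_c_s = {1: 0.10, 0: 0.01}                                        # P(cancer = 1 | s)
P_b_s = {1: 0.30, 0: 0.05}                                        # P(bronchitis = 1 | s)
P_x_cs = {(1, 1): 0.9, (1, 0): 0.8, (0, 1): 0.2, (0, 0): 0.1}     # P(x-ray = 1 | c, s)
P_d_cb = {(1, 1): 0.9, (1, 0): 0.7, (0, 1): 0.6, (0, 0): 0.1}     # P(dyspnoea = 1 | c, b)

def bern(p1, v):
    """P(V = v) for a binary variable with P(V = 1) = p1."""
    return p1 if v == 1 else 1 - p1

def joint(s, c, b, x, d):
    """P(s, c, b, x, d) = P(s) P(c|s) P(b|s) P(x|c,s) P(d|c,b)."""
    return (bern(P_s1, s) * bern(P_c_s[s], c) * bern(P_b_s[s], b)
            * bern(P_x_cs[(c, s)], x) * bern(P_d_cb[(c, b)], d))

# P(s, d = 1) = sum of the joint over b, x, c, for each value of s.
p_s_d1 = {s: sum(joint(s, c, b, x, 1) for c, b, x in product((0, 1), repeat=3))
          for s in (0, 1)}

# Normalize over s to get the posterior P(s = 1 | d = 1).
print(round(p_s_d1[1] / (p_s_d1[0] + p_s_d1[1]), 4))
```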

2.4.1 Definition of factor graph

    Wikipedia defines the factor graph as follows: factorizing a global function of several variables yields a product of several local functions; the bipartite graph built from this factorization is called a factor graph (Factor Graph).

    For example, suppose a function g can be factorized as

    g(X_1, …, X_n) = ∏_{j=1}^m f_j(S_j),   where S_j ⊆ {X_1, …, X_n}

    Its corresponding factor graph consists of:

  1. Variable nodes X_1, …, X_n
  2. Factor (function) nodes f_1, …, f_m
  3. Edges, obtained from the factorization as follows: there is an edge between factor (function) node f_j and variable node X_k if and only if X_k ∈ S_j.

    The formal definition is really obscure — I'm sure you didn't quite get it. Put plainly, a factor graph is a probability graph obtained by factorizing a function. It generally contains two kinds of nodes: variable nodes and function nodes. As we know, a global function can be decomposed by factorization into a product of several local functions; these local functions and the relationships among the corresponding variables are what the factor graph reflects.

    For instance, suppose there is a global function whose factorization is:

    g(x1, x2, x3, x4, x5) = fA(x1) fB(x2) fC(x1, x2, x3) fD(x3, x4) fE(x3, x5)

    where fA, fB, fC, fD, fE are the local functions, representing relationships between the variables; they may be conditional probabilities or something else (such as the potential functions in a Markov random field, Markov Random Fields).

    For ease of representation, it can be written as:

    The corresponding factor graph is :

 

    And the above factor graph is equivalent to :

 

    Therefore, in a factor graph, every vertex is either a variable node or a function node, and the edges represent the functional relationships between them.

    For a long time, though, I knew what a factor graph is without knowing what it is for. Why introduce factor graphs, and what are their purpose and significance? In fact, factor graphs, like Bayesian networks and Markov random fields (Markov Random Fields), are a kind of probabilistic graphical model.

    Since Markov random fields have come up, let us also say a few words about directed graphs, undirected graphs, and conditional random fields.

  • We already know that the directed graphical model is also called a Bayesian network (Directed Graphical Models, DGM, Bayesian Network).

  • In some cases, however, it is not appropriate to force a direction onto the edges between certain nodes. Using undirected edges with no direction gives the undirected graphical model (Undirected Graphical Model, UGM), also known as a Markov random field or Markov network (Markov Random Field, MRF, or Markov network).

  • Let X = (X1, X2, …, Xn) and Y = (Y1, Y2, …, Ym) both be joint random variables. If the random variables Y form a Markov random field (MRF) with respect to an undirected graph G = (V,E), then the conditional probability distribution P(Y|X) is called a conditional random field (Conditional Random Field, abbreviated CRF; it may be explained in a later blog post). The figure below shows the undirected graphical model of a linear-chain conditional random field:

    Back to the main thread of this article. In probabilistic graphs, computing the marginal distribution of a variable is a common problem. There are many ways to solve it; one of them is to convert the Bayesian network or Markov random field into a factor graph and then use the sum-product algorithm. In other words, based on the factor graph we can use the sum-product algorithm to efficiently compute the marginal distribution of each variable.

    First, a few examples show how Bayesian networks (and Markov random fields), as well as Markov chains and hidden Markov models, are converted into factor graphs; then, in Section 2.4.2, we will see how to use the sum-product algorithm on the factor graph to compute marginal probability distributions.

    Given the Bayesian network or Markov random field shown in the figure below:

 

    According to the relationships among the variables, we have:

 

 

    The corresponding factor graph is (either of the two factor graphs below can be used to express it):

 

 

    From the example above, the method of constructing a factor graph from a Bayesian network can be summarized as follows (a small sketch of this construction is given after the list):

  • a factor in the Bayesian network corresponds to a node in the factor graph;
  • each variable in the Bayesian network corresponds to an edge or a half edge in the factor graph;
  • a node g and an edge x are connected if and only if the variable x appears in the factor g.
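    Here is a minimal sketch of this construction for the smoking network of Section 2.3: each conditional probability factor becomes a function node, and each variable connects to exactly the factors that mention it. The adjacency dictionary is just one convenient way to hold the bipartite structure, not a prescribed format.

```python
# One factor per CPT of the Bayesian network: factor name -> variables it touches.
factors = {
    "P(s)":     ["s"],
    "P(c|s)":   ["c", "s"],
    "P(b|s)":   ["b", "s"],
    "P(x|c,s)": ["x", "c", "s"],
    "P(d|c,b)": ["d", "c", "b"],
}

# The factor graph is bipartite: a factor node and a variable node are connected
# if and only if the variable appears in that factor.
variable_neighbors = {}
for f, vs in factors.items():
    for v in vs:
        variable_neighbors.setdefault(v, []).append(f)

for v, fs in variable_neighbors.items():
    print(f"variable {v} is connected to factors: {fs}")
# e.g. variable s is connected to factors: ['P(s)', 'P(c|s)', 'P(b|s)', 'P(x|c,s)']
```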

    As another example, for the factor graph obtained from the Markov chain shown in the figure below:

 

    we have:

    P(X_1, X_2, …, X_N) = P(X_1) P(X_2 | X_1) P(X_3 | X_2) ⋯ P(X_N | X_{N-1})

 

 

    And for the factor graph obtained from the hidden Markov model shown in the figure below:

 

 

    we have:

    P(x_1, …, x_N, y_1, …, y_N) = P(x_1) P(y_1 | x_1) ∏_{k=2}^{N} P(x_k | x_{k-1}) P(y_k | x_k)

2.4.2 Sum-product Algorithm

    We already know that, for the factor graph shown in the figure below:

 

    we have:

    Below, let us consider a question: how to obtain the marginal probability distribution from the joint probability distribution.

    First, let us review the definitions of joint probability and marginal probability:

  • Joint probability is the probability that two events occur together. The joint probability of A and B is written P(A∩B) or P(A,B).
  • Marginal probability (also called prior probability) is the probability of a single event. The marginal probability is obtained as follows: in the joint probability, the events not needed in the final result are merged into their total probability and eliminated (summing over the total probability for discrete random variables, integrating for continuous random variables). This is called marginalization (marginalization). The marginal probability of A is written P(A), and the marginal probability of B is written P(B).

    In fact, the marginal probability of a random variable x_k can be obtained from the joint probability of x_1, x_2, x_3, …, x_n; the specific formula is:

    P(x_k) = Σ_{x_1} ⋯ Σ_{x_{k-1}} Σ_{x_{k+1}} ⋯ Σ_{x_n} P(x_1, x_2, …, x_n)

 

    Ah, what is the principle behind this? The principle is simple: sum the joint probability over all the variables other than x_k, and what remains is the probability of x_k!

    Moreover, put differently: if the joint probability can be written as a product of local functions (factors), then the marginal of x_k is obtained by summing that product over all the other variables.

    How can this computation be simplified further? Recall the distributive law we learned in primary school: a*b + a*c = a*(b + c). The left-hand side takes 2 multiplications and 1 addition; the right-hand side takes 1 multiplication and 1 addition. Can we borrow the idea of the distributive law here? Don't worry — read on.

    Suppose we now need to calculate the result of the following expression:

 

 

    Meanwhile, f can be decomposed as follows:

 

 

    Drawing on the distributive law, we can extract common factors:

    Since the marginal probability of a variable equals the product of all the messages passed to it by the functions connected to it, the calculation gives:

    Looking carefully at the calculation above, you can see that it uses a kind of "message passing" viewpoint, and it proceeds in two steps.

    Step one: for the decomposition of f, messages are passed outside the two boxes enclosed by the blue dashed line and the red dashed line:

 

    By calculation, we can get :

 

    Step two: messages are passed inside the two boxes enclosed by the blue dashed line and the red dashed line:

 

 

    Combining the above, we have:

 

 

    That is it: in the calculation above, a probability distribution is written as the product of two factors, and each of these factors can either be decomposed further or computed directly. This method of computing probabilities using the idea of message passing is the sum-product algorithm. As mentioned before, based on the factor graph we can use the sum-product algorithm to efficiently compute the marginal distribution of each variable.

    What exactly is the sum-product algorithm? The sum-product algorithm, also called belief propagation, has two kinds of messages:

  • One is the message from a variable (Variable) to a function (Function), as shown in the figure below:

    Here, the message from variable x to function f is the product of the messages arriving at x from all of its other neighboring functions: μ_{x→f}(x) = ∏_{h ∈ n(x)\{f}} μ_{h→x}(x).

  • The other is the message from a function (Function) to a variable (Variable), as shown in the figure below:

    Here, the message from function f to variable x is obtained by multiplying f by the messages from all of f's other neighboring variables and then summing out every variable except x: μ_{f→x}(x) = Σ_{~x} ( f(X) ∏_{y ∈ n(f)\{x}} μ_{y→f}(y) ).

    Below is the overall framework of the sum-product algorithm:

  • 1. Given the factor graph shown in the figure below:

 

 

  • 2. The message-computation rules of the sum-product algorithm are:

 

 

  • 3. According to the sum-product theorem, if the function f represented by the factor graph has no cycles, then the marginal of any variable x is the product of all the messages arriving at x:

    marginal(x) = ∏_{h ∈ n(x)} μ_{h→x}(x)

    It is worth mentioning that if the factor graph is acyclic, the marginal distribution of any variable can be computed exactly; if the factor graph contains cycles, the sum-product algorithm cannot compute the marginal distributions exactly.
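    To make the message rules concrete, here is a small sketch that runs sum-product on the chain factor graph of a three-variable Markov chain P(x1) P(x2|x1) P(x3|x2); all the numbers are invented. On a chain the messages reduce to one forward and one backward pass, and the product of the messages arriving at a variable gives its (unnormalized) marginal; a brute-force sum over the joint confirms the result.

```python
import numpy as np

# Chain factor graph: f1(x1) - x1 - f2(x1,x2) - x2 - f3(x2,x3) - x3   (numbers invented)
f1 = np.array([0.6, 0.4])            # P(x1)
f2 = np.array([[0.7, 0.3],           # P(x2 | x1): row = x1, column = x2
               [0.1, 0.9]])
f3 = np.array([[0.9, 0.1],           # P(x3 | x2): row = x2, column = x3
               [0.5, 0.5]])

# Forward messages (left to right).  A variable with only two neighboring factors
# simply forwards the message it received from the other side.
mu_f1_x1 = f1
mu_x1_f2 = mu_f1_x1
mu_f2_x2 = mu_x1_f2 @ f2             # sum over x1 of f2(x1, x2) * incoming message
mu_x2_f3 = mu_f2_x2
mu_f3_x3 = mu_x2_f3 @ f3             # sum over x2

# Backward messages (right to left); a leaf variable sends the all-ones message.
mu_x3_f3 = np.ones(2)
mu_f3_x2 = f3 @ mu_x3_f3             # sum over x3
mu_x2_f2 = mu_f3_x2
mu_f2_x1 = f2 @ mu_x2_f2             # sum over x2

def normalize(v):
    return v / v.sum()

# Marginal of a variable = normalized product of all messages arriving at it.
p_x1 = normalize(mu_f1_x1 * mu_f2_x1)
p_x2 = normalize(mu_f2_x2 * mu_f3_x2)
p_x3 = normalize(mu_f3_x3)           # x3 has a single neighboring factor

# Brute-force check: marginalize the full joint directly.
joint = f1[:, None, None] * f2[:, :, None] * f3[None, :, :]
print(p_x2, joint.sum(axis=(0, 2)))  # the two vectors agree: [0.46 0.54]
```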

    For example, consider the Bayesian network shown in the figure below:

 

 

    When it is converted into a factor graph, we get:

 

 

     You can see that if the Bayesian network contains a "ring" (ignoring edge directions), the factor graph constructed from it will also contain cycles. Then, with the message-passing idea, messages would be passed around forever, which is bad for computing probabilities.
    There are three solutions:

  • 1. Delete some edges of the Bayesian network so that it no longer contains an undirected ring.

    For example, given the original Bayesian network shown in the left part of the figure below, removing the edge between C and E makes it a directed graph without an undirected ring again, turning it into the approximate tree structure shown in the right part of the figure:

    The concrete transformation procedure is the maximum-weight spanning tree algorithm MSWT (see page 60 of this PPT); through this algorithm, the relative entropy between the tree's approximate joint probability P'(x) and the original Bayesian network's joint probability P(x) (if you have forgotten what relative entropy is, see: the mathematical derivation in the maximum entropy model) is minimized.

  • 2. Reconstruct the Bayesian network so that it contains no rings.
  • 3. Choose the loopy belief propagation algorithm (you can roughly think of it as a recursive version of the sum-product algorithm). This algorithm usually picks a message in the cycle, assigns it a random initial value, and then runs the sum-product algorithm iteratively. Because of the cycle, the iteration is bound to come back to the message that was given the initial value; that message is then updated and the iteration continues until no message changes any more. The only drawback is that convergence is not guaranteed, although in most cases the algorithm does converge.

    Besides the sum-product algorithm, there is also a max-product algorithm. But once you understand sum-product, you understand the max-product algorithm too, because max-product is obtained from sum-product simply by replacing the summation symbol with the max symbol!

    Finally, the sum-product and max-product algorithms can also be applied to hidden Markov models (hidden Markov models); there may be an opportunity to introduce that later. This concludes the article.

 

 

3 References and recommended readings

 

  1. Thomas Bayes, "An essay towards solving a Problem in the Doctrine of Chances" (Bayes' theorem): http://www.sbs-bvs.be/bsn57/bsn57-6.pdf;
  2. A Brief History of Mathematical Statistics, Chapter 3, Bayesian methods;
  3. Bayesian Statistics, by Mao Shisong;
  4. Search results for "Julw": http://www.gu1234.com/search?hl=zh-CN&site=webhp&source=hp&q=Julw&btnK=Google+%E6%90%9C%E7%B4%A2&gws_rd=ssl;
  5. Beijing October machine learning class, session 9, Zou Bo on Bayesian networks, PPT: http://pan.baidu.com/s/1o69Lp1K;
  6. Related Wikipedia entries, e.g. Bayes' theorem: http://zh.wikipedia.org/zh/%E8%B4%9D%E5%8F%B6%E6%96%AF%E5%AE%9A%E7%90%86, Bayesian network: http://zh.wikipedia.org/wiki/%E8%B2%9D%E6%B0%8F%E7%B6%B2%E8%B7%AF, factor graph (Chinese): http://zh.wikipedia.org/zh/%E5%9B%A0%E5%AD%90%E5%9B%BE, factor graph (English): http://en.wikipedia.org/wiki/Factor_graph;
  7. Statistical Decision Theory and Bayesian Analysis, by James O. Berger;
  8. Bayes' theorem: http://www.guokr.com/question/547339/;
  9. Bayesian inference and its Internet applications (1): introduction to the theorem: http://www.ruanyifeng.com/blog/2011/08/bayesian_inference_part_one.html;
  10. Bayesian inference and its Internet applications (3): spelling check: http://www.ruanyifeng.com/blog/2012/10/spelling_corrector.html;
  11. Google research director Peter Norvig on how spelling correction works: http://norvig.com/spell-correct.html;
  12. http://www.eng.yale.edu/pjk/eesrproj_02/luckenbill_html/node4.html (sum-product);
  13. Pattern Recognition and Machine Learning, Chapter 8, M. Jordan, J. Kleinberg, etc., 2006;
  14. D-Separation (D-separation), PRML 8.22, Graphical Models: http://www.zhujun.me/d-separation-separation-d.html;
  15. An introduction to factor graphs, by Hans-Andrea Loeliger: http://www.robots.ox.ac.uk/~parg/mlrg/papers/factorgraphs.pdf;
  16. http://netclass.csu.edu.cn/jpkc2003/rengongzhineng/rengongzhineng/kejian/ai/ai/chapter4/442.htm;
  17. Bayesian networks in R (2), bnlearn (2): http://site.douban.com/182577/widget/notes/12817482/note/283039795/;
  18. Discussion of the difference between the Bayesian school and the frequentist school: http://www.zhihu.com/question/20587681;
  19. Factor graph, potential function, Template models: http://www.cnblogs.com/549294286/archive/2013/06/06/3121454.html;
  20. Online Bayesian Probit Regression, introducing factor graphs: http://www.doingkong.com/?p=68;
  21. An Introduction to Factor Graphs, Hans-Andrea Loeliger, MLSB 2008: http://people.binf.ku.dk/~thamelry/MLSB08/hal.pdf;
  22. Factor graphs and the sum-product algorithm, Frank R. Kschischang, Brendan J. Frey, etc., 1998: http://filebox.vt.edu/~rmtaylor/Graphical_Modeling/Intro_and_tutorial/Kschischang_ffg_sumproduct.pdf;
  23. A Tutorial on Inference and Learning in Bayesian Networks, Irina Rish: http://www.ee.columbia.edu/~vittorio/Lecture12.pdf;
  24. Probabilistic Graphical Models, Directed GMs: Bayesian Networks: http://www.cs.cmu.edu/~epxing/Class/10708/lectures/lecture2-BNrepresentation.pdf;
  25. A Brief Introduction to Graphical Models and Bayesian Networks, by Kevin Murphy, 1998: http://www.cs.ubc.ca/~murphyk/Bayes/bayes.html;
  26. Probabilistic Models for Unsupervised Learning (a unified view of Bayesian, MAP, ML, as well as FA, EM, PCA, ICA, GMM, HMM, etc.): http://mlg.eng.cam.ac.uk/zoubin/nipstut.pdf;
  27. PRML probabilistic graphical models reading notes: http://vdisk.weibo.com/s/DmxNcM5-7sGS;
  28. December 14, machine learning class session 15, Zou Bo on conditional random fields (CRF), PPT: http://pan.baidu.com/s/1qWBdOD2.
