Language models for word vectors: from n-gram to NNLM and RNNLM

2020-12-08 14:47:25

This topic will be split into three articles, with the following themes:

• word2vec prelude: statistical language models
• word2vec explained in detail
• Other xxx2vec papers and applications

Embedding-related articles will be published later; they may form a separate series or be included in the 《Feature Engineering - Embedding》 series. Welcome to keep following 「Search and Recommendation Wiki」.

1.1、Basic concepts

a) Definition

A language model (Language Model) is an important technique in natural language processing. The most common natural language data is text, which can be viewed as a discrete time series. Suppose a text segment of length T contains the words w_1, w_2, ..., w_T in order; the language model computes the probability of this sequence:
P(w_1, w_2, ..., w_T)
In other words, a language model models the probability distribution over sentences.

Language models can be divided into statistical language models and neural network language models.

b) Probability representation

Suppose S is a meaningful sentence, e.g. "It's sunny today, suitable for outdoor mountain climbing". The sentence can be written as S = w_1, w_2, ..., w_n; for the example sentence: w_1 = today, w_2 = weather, w_3 = sunny, w_4 = suitable, w_5 = outdoors, w_6 = climbing.

Let P(S) denote the probability that this sentence occurs; then:
P(S) = P(w_1, w_2, ..., w_n)
By the chain rule of conditional probability, this can be expanded to:
P(S) = P(w_1, w_2, ..., w_n) = P(w_1) P(w_2|w_1) P(w_3|w_1, w_2) ... P(w_n|w_1, w_2, ..., w_{n-1})
where P(w_1) is the probability of the first word, i.e. the probability that 「today」 appears in the corpus; P(w_2|w_1) is the probability of the second word given the first, i.e. the probability that 「weather」 appears given 「today」; and so on.

P(w_1) and P(w_2|w_1) are easy to estimate, but from P(w_3|w_1, w_2) onward more and more variables are involved, and the computation becomes increasingly complex.
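Concretely, for the example sentence above, the chain-rule decomposition reads:
P(S) = P(today) P(weather|today) P(sunny|today, weather) P(suitable|today, weather, sunny) ... P(climbing|today, weather, sunny, suitable, outdoors)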

1.2、Statistical language models: the N-gram model

a) Markov assumption

The problem above is too complex to solve directly, so the Markov assumption is introduced. Its key idea is the limited-horizon assumption: each state depends only on the n-1 states before it, which is known as an n-order Markov chain.

b) n-gram

Applied to a language model, this means the probability of each word depends only on the preceding n-1 words; such a model is called an n-gram language model. When n=2 it is called a bigram model, and the formula above expands to:
P(S) = P(w_1, w_2, ..., w_n) = P(w_1) P(w_2|w_1) P(w_3|w_2) ... P(w_n|w_{n-1})
With the Markov simplification, computing P(S) becomes much easier. Of course, as n grows the computational cost also grows, but a larger n brings the model closer to the true distribution of the data; n is usually 2, 3, 4, or 5.
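For the same example sentence, the bigram factorization simplifies this to:
P(S) = P(today) P(weather|today) P(sunny|weather) P(suitable|sunny) P(outdoors|suitable) P(climbing|outdoors)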

c) Probability estimation

From the description above, the following points are clear:

• Each sentence can be decomposed into a sequence of words
• Each sentence can be assigned a plausibility probability via the conditional probability formula
• Introducing the Markov assumption simplifies the calculation of this probability

Taking the bigram model as an example, how do we compute P(w_i|w_{i-1})? From probability theory:
P(w_i|w_{i-1}) = \frac{P(w_{i-1}, w_i)}{P(w_{i-1})}
With a sufficiently large corpus, the law of large numbers lets us approximate these probabilities by relative frequencies: the number of occurrences of the word pair <w_{i-1}, w_i> divided by the number of occurrences of w_{i-1} approximates P(w_i|w_{i-1}), so:
P(w_i|w_{i-1}) = \frac{P(w_{i-1}, w_i)}{P(w_{i-1})} = \frac{N(w_{i-1}, w_i)}{N(w_{i-1})}

In general, a statistical language model therefore needs a large enough corpus for the results to be accurate. But a problem remains: if N(w_{i-1}, w_i) = N(w_{i-1}) = 1, or both counts are 0, the resulting estimate is clearly unreasonable. Smoothing techniques are introduced to handle this.
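As a concrete illustration of this counting approach, here is a minimal Python sketch on a toy corpus (illustrative only, not a full pipeline):

```python
from collections import Counter

# Toy corpus, already tokenized; sentence boundaries marked with <s> / </s>
corpus = [
    ["<s>", "today", "the", "weather", "is", "sunny", "</s>"],
    ["<s>", "today", "is", "sunny", "</s>"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    unigram_counts.update(sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))

def bigram_prob(prev_word, word):
    """Maximum-likelihood estimate P(word | prev_word) = N(prev_word, word) / N(prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0  # unseen history: exactly the sparsity problem smoothing addresses
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("today", "the"))   # 0.5
print(bigram_prob("is", "sunny"))    # 1.0
```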

d) Smoothing techniques in the n-gram model

Smoothing techniques address the unreasonable count ratios described in c). Common smoothing techniques include the following (not expanded on here; a small interpolation sketch follows the list, and the rest can be looked up if you are interested):

• Good-Turing estimation
• Katz smoothing
• Jelinek-Mercer smoothing
• Witten-Bell smoothing
• Absolute discounting
• Kneser-Ney smoothing
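As one small illustration, Jelinek-Mercer smoothing linearly interpolates the bigram estimate with the unigram estimate. Below is a minimal sketch, reusing unigram_counts and bigram_prob from the counting example above, with the interpolation weight lam picked by hand here rather than tuned on held-out data as it normally would be:

```python
def interpolated_prob(prev_word, word, lam=0.7):
    """Jelinek-Mercer style interpolation:
    P(word | prev_word) ~ lam * P_ML(word | prev_word) + (1 - lam) * P_ML(word)."""
    total_tokens = sum(unigram_counts.values())
    p_bigram = bigram_prob(prev_word, word)           # maximum-likelihood bigram estimate
    p_unigram = unigram_counts[word] / total_tokens   # maximum-likelihood unigram estimate
    return lam * p_bigram + (1 - lam) * p_unigram

# A bigram never seen in the corpus no longer receives probability 0:
print(interpolated_prob("weather", "sunny"))
```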

Advantages:

• (1) Uses maximum likelihood estimation; the parameters are easy to train
• (2) Fully uses the information in the preceding n-1 words
• (3) Highly interpretable, intuitive, and easy to understand

Disadvantages:

• (1) Lacks long-range dependencies; it can only model the preceding n-1 words
• (2) As n increases, the parameter space grows exponentially
• (3) Data sparsity; the OOV (out-of-vocabulary) problem is unavoidable
• (4) Based purely on statistical frequencies, so generalization is poor

1.3、Neural network language models: NNLM

NNLM paper: 《A Neural Probabilistic Language Model》

The NNLM model has a three-layer network structure, as shown below:

The bottom layer of the network takes the n-1 words that precede the word to be predicted; the goal of NNLM is to use these n-1 words to compute the probability of the t-th word w_t.

The NNLM training objective is:
f(w_t, ..., w_{t-n+1}) = P(w_t|w_1^{t-1})
where w_t is the t-th word and w_1^{t-1} denotes the subsequence from word 1 to word t-1. The model must satisfy two constraints:

• f(w_t, ..., w_{t-n+1}) > 0
• \sum_{i=1}^{|V|} f(i, w_{t-1}, ..., w_{t-n+1}) = 1

In other words: given a word sequence, the model uses the preceding n-1 words to predict the probability distribution of the t-th word.

• Constraint one: every probability value produced by the network must be greater than 0.
• Constraint two: the model is fed the preceding words and must say which word comes next, so its actual output is a |V|-dimensional vector whose components are the probabilities that the next word is each word of the dictionary; these |V| probability values must sum to 1.

The NNLM model can be divided into two parts:

• 1、Feature mapping: map the i-th word of the vocabulary V to an m-dimensional vector C(i) (it can start from a one-hot or randomly initialized vector, since it is trained together with the rest of the model), then concatenate the feature vectors C(w_{t-n+1}), ..., C(w_{t-1}) into an m(n-1)-dimensional vector. Equivalently, C is a mapping from the vocabulary V to a real-valued vector space; this mapping gives the vector representation of each word.
• 2、Conditional probability distribution: a function g maps the input sequence of word vectors (C(w_{t-n+1}), ..., C(w_{t-1})) to a probability distribution y \in R^{|V|}. The output is |V|-dimensional, the same size as the dictionary; the i-th component of y is the probability that the n-th word of the sequence is V_i, i.e. f(i, w_{t-1}, ..., w_{t-n+1}) = g(i, C(w_{t-1}), ..., C(w_{t-n+1}))

The output of NNLM is a softmax function of the form:
P(w_t|w_{t-1}, ..., w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_{i} e^{y_i}}
where y_i is the unnormalized score of the i-th word, computed as:
y = b + Wx + U \tanh(d + Hx)
The model parameters are: \theta = (b, d, W, U, H, C)

• x is the input to the network, the concatenation of the context word vectors: x = (C(w_{t-1}), ..., C(w_{t-n+1}))
• h is the number of neurons in the hidden layer
• d is the bias of the hidden layer
• b is the bias of the output layer
• m is the dimension of each word vector
• W is an optional parameter: if there is no direct connection between the input layer and the output layer, set W = 0
• H is the weight matrix from the input layer to the hidden layer
• U is the weight matrix from the hidden layer to the output layer

Ordinary neural networks do not treat their input as trainable, but here the input x is made of word vectors, which are themselves parameters to be trained. The model's weight parameters and the word vectors are therefore trained simultaneously; when training finishes we obtain both the network weights and the word vectors.
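To make this concrete, below is a minimal PyTorch sketch of the architecture (an illustrative re-implementation, not the paper's original code; vocab_size and the other hyperparameters are placeholder values). The embedding matrix C and the weights H, d, U, b (plus the optional direct connection W) are all ordinary trainable parameters, so the word vectors are learned jointly with the network:

```python
import torch
import torch.nn as nn

class NNLM(nn.Module):
    """Scores y = b + W x + U tanh(d + H x); the softmax over the vocabulary is applied by the loss."""
    def __init__(self, vocab_size, m=100, n=4, h=60, direct=True):
        super().__init__()
        self.C = nn.Embedding(vocab_size, m)            # feature mapping C: V -> R^m
        self.H = nn.Linear((n - 1) * m, h)              # input -> hidden (weight H, bias d)
        self.U = nn.Linear(h, vocab_size)               # hidden -> output (weight U, bias b)
        self.W = nn.Linear((n - 1) * m, vocab_size, bias=False) if direct else None

    def forward(self, context):                         # context: (batch, n-1) word indices
        x = self.C(context).flatten(1)                  # concatenate C(w_{t-n+1}), ..., C(w_{t-1})
        y = self.U(torch.tanh(self.H(x)))               # b + U tanh(d + H x)
        if self.W is not None:
            y = y + self.W(x)                           # optional direct input-to-output term W x
        return y

# Maximizing the log-likelihood is the same as minimizing the cross-entropy of the next word.
model = NNLM(vocab_size=10000)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-5)  # weight decay plays the role of R(theta)

context = torch.randint(0, 10000, (32, 3))   # a batch of 32 contexts of n-1 = 3 words
target = torch.randint(0, 10000, (32,))      # the words w_t to predict
loss = loss_fn(model(context), target)
loss.backward()
optimizer.step()
```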

The NNLM training goal is to maximize the following (regularized) log-likelihood:
L = \frac{1}{T} \sum_{t} \log f(w_t, w_{t-1}, ..., w_{t-n+1}; \theta) + R(\theta)
where \theta denotes all model parameters and R(\theta) is a regularization term (in the paper's experiments, R is a weight-decay penalty applied only to the neural network weights and the word feature matrix, not to the biases).

The parameters are then updated by stochastic gradient ascent on the log-likelihood:
\theta \leftarrow \theta + \epsilon \frac{\partial \log P(w_t|w_{t-1}, ..., w_{t-n+1})}{\partial \theta}
where \epsilon is the learning rate (step size).

PyTorch and TensorFlow reference implementations: https://www.jianshu.com/p/be242ed3f314

1.4、Neural network language models: RNNLM

RNNLM paper: 《Recurrent neural network based language model》

The idea behind RNNLM is relatively simple: it replaces the feed-forward network of NNLM with a recurrent neural network. Its main structure is shown below:

At first glance this diagram may be hard to read; no hurry, let's first fill in some basic RNN knowledge.

A simple RNN has the structure shown in the following figure:

It contains an input layer, a hidden layer, and an output layer. X is the input-layer vector, U is the weight matrix from the input layer to the hidden layer, S is the hidden-layer vector, V is the weight matrix from the hidden layer to the output layer, O is the output-layer vector, and W is the weight matrix that feeds the previous hidden-layer value back into the current hidden layer.

Unrolling the figure above over time gives:

Now it is clearer: at time t the network receives input x_t, the hidden-layer value is s_t, and the output value is o_t. The key point is that s_t depends not only on x_t but also on s_{t-1}.

Returning to the RNNLM structure diagram: INPUT(t) is the input x_t at time t, CONTEXT(t) is the hidden layer at time t (s_t), CONTEXT(t-1) is the hidden-layer value at time t-1 (s_{t-1}), and OUTPUT(t) is the output at time t (o_t).

where:
x(t) = w(t) + s(t-1)
s_j(t) = f(\sum_{i} x_i(t) u_{ji})
y_k(t) = g(\sum_{j} s_j(t) v_{kj})
(here the "+" denotes concatenation: the input x(t) is the current word vector w(t) concatenated with the previous hidden state s(t-1))
where f(z) is the sigmoid function:
f(z) = \frac{1}{1+e^{-z}}

and g(z) is the softmax function:
g(z_m) = \frac{e^{z_m}}{\sum_{k} e^{z_k}}
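To make the recurrence concrete, here is a minimal NumPy sketch of a single RNNLM forward step (a toy illustration with randomly initialized weights; the dimensions are placeholder values):

```python
import numpy as np

vocab_size, hidden_size = 1000, 100
rng = np.random.default_rng(0)

# U maps the input layer to the hidden layer, V maps the hidden layer to the output layer.
# The input layer is the concatenation of the current word (1-of-N) and the previous hidden state.
U = rng.normal(0, 0.1, size=(hidden_size, vocab_size + hidden_size))
V = rng.normal(0, 0.1, size=(vocab_size, hidden_size))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnnlm_step(word_index, s_prev):
    """One time step: x(t) = [w(t); s(t-1)], s(t) = f(U x(t)), y(t) = g(V s(t))."""
    w = np.zeros(vocab_size)
    w[word_index] = 1.0                      # 1-of-N encoding of the current word
    x = np.concatenate([w, s_prev])          # x(t) = w(t) + s(t-1), "+" meaning concatenation
    s = sigmoid(U @ x)                       # s_j(t) = f(sum_i x_i(t) u_ji)
    y = softmax(V @ s)                       # y_k(t) = g(sum_j s_j(t) v_kj)
    return s, y

s = np.full(hidden_size, 0.1)                # s(0) initialized to small values, e.g. 0.1
s, y = rnnlm_step(word_index=42, s_prev=s)   # y is the distribution over the next word
```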

Some implementation details worth noting:

• s(0) is usually initialized to a vector of small values, e.g. 0.1
• The number of hidden-layer units is generally 30-500, chosen according to the amount of training data
• The weights are initialized with small random Gaussian noise
• The optimizer is stochastic gradient descent
• The initial learning rate is set to 0.1; later in training, if there is no significant improvement, the learning rate is halved
• Training usually converges after 10-20 epochs

At each training step, the error vector at the output layer is computed according to the cross-entropy criterion:
error(t) = desired(t) - y(t)

where desired(t) is the target vector (the 1-of-N encoding of the word that actually follows) and y(t) is the network's predicted output.

Finally, the probability of a word is computed as:
P(w_i(t+1) | w(t), s(t-1)) = \begin{cases} \frac{y_{rare}(t)}{C_{rare}} & \text{if } w_i(t+1) \text{ is rare} \\ y_i(t) & \text{otherwise} \end{cases}

where C_{rare} is the number of words in the vocabulary that occur less often than the threshold. In the paper's experiment on the Brown corpus (about 800,000 words), the threshold is set to 5 and the number of hidden-layer units is 100.
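As a small illustration, a hypothetical helper implementing this lookup might look as follows (assuming all rare words share a single output unit at index rare_class_index, and that is_rare_word flags words occurring less often than the threshold):

```python
def next_word_prob(y, word_output_index, is_rare_word, rare_class_index, c_rare):
    """Probability of the next word under the rare-word merging scheme.

    y:                 softmax output vector (frequent words plus one shared 'rare' class)
    word_output_index: output-layer index of the word (for frequent words)
    is_rare_word:      True if the word occurs less often than the frequency threshold
    rare_class_index:  index of the shared rare-word output unit
    c_rare:            number of rare words in the vocabulary
    """
    if is_rare_word:
        # The probability mass of the rare class is shared uniformly among all rare words.
        return y[rare_class_index] / c_rare
    return y[word_output_index]
```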

For an RNNLM reference implementation, see: https://www.jianshu.com/p/f53f606944c6

