
Language models behind word vectors: from n-gram to NNLM and RNNLM

2020-12-08 14:47:25 Thinkgamer_

This topic will be covered in three parts, with the following themes:

  • The prelude to word2vec: statistical language models
  • word2vec explained in detail
  • Other xxx2vec papers and applications

Embedding-related articles will follow; they may become a separate series, or be added to the 《Feature Engineering - Embedding》 series. Welcome to keep following 「Search and Recommendation Wiki」.


1.1、 Basic concepts

a) Definition

A language model (Language model) is an important technique in natural language processing. The most common form of natural-language data is text, and a piece of text can be viewed as a discrete time series. Suppose a text of length $T$ consists of the words $w_1, w_2, ..., w_T$ in order; the language model computes the probability of that sequence:
$P(w_1, w_2, ..., w_T)$
In other words, a language model models the probability distribution over sentences.

Language models fall into two categories: statistical language models and neural network language models.

b) Probability representation

Suppose $S$ is a meaningful sentence, e.g. 「今天 天气 晴朗 适合 户外 爬山」 ("The weather is sunny today, good for outdoor climbing"). The sentence can be written as $S = w_1, w_2, ..., w_n$; for this example, $w_1 =$ 今天 (today), $w_2 =$ 天气 (weather), $w_3 =$ 晴朗 (sunny), $w_4 =$ 适合 (suitable for), $w_5 =$ 户外 (outdoor), $w_6 =$ 爬山 (climbing).

Let $P(S)$ denote the probability that this sentence occurs, i.e.:
$P(S) = P(w_1, w_2, ..., w_n)$
By the chain rule of conditional probability this can be decomposed as:
$P(S) = P(w_1, w_2, ..., w_n) = P(w_1)\,P(w_2|w_1)\,P(w_3|w_1,w_2)\cdots P(w_n|w_1, w_2, ..., w_{n-1})$
where $P(w_1)$ is the probability of the first word, i.e. the probability that 「今天」 appears in the corpus; $P(w_2|w_1)$ is the probability of the second word given the first, i.e. the probability that 「天气」 appears given 「今天」; and so on.

$P(w_1)$ and $P(w_2|w_1)$ are easy to estimate, but from $P(w_3|w_1,w_2)$ onward more and more variables are involved and the computation becomes increasingly expensive.

1.2、 Statistical language model ——N-gram Model

a) Markov assumption

The problem above is too complex to solve directly, so the Markov assumption is introduced. Its key idea is a limited-horizon assumption: each state depends only on the $n-1$ states before it, which gives an $(n-1)$-th order Markov chain.

b)n-gram

Applied to a language model, this means the probability of each word depends only on the $n-1$ words before it; the result is called an $n$-gram language model. When $n=2$ it is a bigram model, and the expansion above simplifies to:
$P(S) = P(w_1, w_2, ..., w_n) = P(w_1)\,P(w_2|w_1)\,P(w_3|w_2)\cdots P(w_n|w_{n-1})$
With the Markov simplification, computing $P(S)$ becomes much easier. Of course, as $n$ grows the computational cost grows with it, but a larger $n$ also brings the model closer to the true distribution of the data; in practice $n$ is usually 2, 3, 4, or 5.
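As an illustration (not from the original article), here is a minimal Python sketch that scores a sentence under a bigram model, given a table of conditional probabilities. The probability values are invented for the example; in practice they come from corpus counts, as described in the next subsection.

```python
import math

# Toy bigram conditional probabilities P(w_i | w_{i-1}); values are invented
# purely for illustration -- real values are estimated from a corpus.
bigram_prob = {
    ("<s>", "今天"): 0.20,
    ("今天", "天气"): 0.30,
    ("天气", "晴朗"): 0.25,
    ("晴朗", "适合"): 0.10,
    ("适合", "户外"): 0.15,
    ("户外", "爬山"): 0.40,
}

def sentence_logprob(words, probs):
    """Return log P(S) = sum_i log P(w_i | w_{i-1}) under a bigram model."""
    logp = 0.0
    prev = "<s>"                         # sentence-start marker
    for w in words:
        p = probs.get((prev, w), 1e-12)  # tiny floor for unseen bigrams
        logp += math.log(p)
        prev = w
    return logp

sentence = ["今天", "天气", "晴朗", "适合", "户外", "爬山"]
print(sentence_logprob(sentence, bigram_prob))   # log-probability of the sentence
```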

c) Probability estimation

From the discussion above, the following points are clear:

  • Each sentence can be decomposed into an ordered sequence of words
  • The probability of each sentence can be computed with the conditional-probability (chain rule) formula, giving a plausibility score for the sentence
  • Introducing the Markov assumption simplifies that computation

Taking the bigram model as an example, how do we compute $P(w_i|w_{i-1})$? From probability theory:
$P(w_i|w_{i-1}) = \frac{P(w_{i-1}, w_i)}{P(w_{i-1})}$
When the corpus is large, by the law of large numbers the count of the word pair $\langle w_{i-1}, w_i \rangle$ divided by the count of $w_{i-1}$ approximates this conditional probability, so:
$P(w_i|w_{i-1}) = \frac{P(w_{i-1}, w_i)}{P(w_{i-1})} \approx \frac{N(w_{i-1}, w_i)}{N(w_{i-1})}$

In general, then, a statistical language model needs a large enough corpus for the estimates to be accurate. But a problem remains: if $N(w_{i-1}, w_i) = N(w_{i-1}) = 1$, or both counts are 0, the resulting estimate is clearly unreasonable. Smoothing techniques are introduced to deal with this.
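The following sketch (my own illustration, not code from the article) estimates bigram probabilities by maximum likelihood from a toy corpus, using exactly the count ratio above.

```python
from collections import Counter

# A toy corpus of pre-tokenized sentences; real models need far more data.
corpus = [
    ["今天", "天气", "晴朗", "适合", "户外", "爬山"],
    ["今天", "天气", "不错"],
    ["天气", "晴朗", "适合", "爬山"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    words = ["<s>"] + sent            # prepend a sentence-start token
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def mle_bigram(prev, word):
    """Maximum-likelihood estimate P(word | prev) = N(prev, word) / N(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(mle_bigram("今天", "天气"))   # 1.0 in this toy corpus
print(mle_bigram("天气", "晴朗"))   # 2/3
print(mle_bigram("天气", "寒冷"))   # 0.0 -- unseen bigram, which motivates smoothing
```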

d) Smoothing techniques for n-gram models

Smoothing techniques address the unreasonable count ratios described in c). Common smoothing techniques include (they are not expanded here; search for them if you are interested — a minimal additive-smoothing sketch follows the list):

  • Additive (Laplace) smoothing
  • Good-Turing estimation
  • Katz smoothing
  • Jelinek-Mercer smoothing
  • Witten-Bell smoothing
  • Absolute discounting
  • Kneser-Ney smoothing
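As an illustration of the simplest of these, here is a sketch of additive (add-$k$) smoothing applied to the bigram estimate. It is my own example, with toy counts and an assumed vocabulary size $|V|$, not code from the article.

```python
from collections import Counter

# Toy counts (in practice computed from a corpus as in the previous sketch).
unigram_counts = Counter({"<s>": 3, "今天": 2, "天气": 3, "晴朗": 2})
bigram_counts = Counter({("今天", "天气"): 2, ("天气", "晴朗"): 2})
vocab_size = 6   # |V|: number of distinct word types (assumed for this example)

def smoothed_bigram(prev, word, k=1.0):
    """Add-k smoothed estimate: (N(prev, word) + k) / (N(prev) + k * |V|)."""
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * vocab_size)

print(smoothed_bigram("天气", "晴朗"))   # seen bigram: probability shrinks below 2/3
print(smoothed_bigram("天气", "寒冷"))   # unseen bigram: small but non-zero probability
```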

e) Advantages and disadvantages of n-gram language models

Advantages:

  • (1) Parameters are estimated by maximum likelihood and are easy to train
  • (2) It fully uses the information of the preceding n-1 words
  • (3) It is highly interpretable, intuitive, and easy to understand

Disadvantages:

  • (1) It cannot capture long-range dependencies; it only models the preceding n-1 words
  • (2) As n grows, the parameter space grows exponentially
  • (3) Data are sparse, and the OOV (out-of-vocabulary) problem is unavoidable
  • (4) It is purely frequency-based, so its generalization ability is poor

1.3、 Neural network language model ——NNLM

The NNLM paper: 《A Neural Probabilistic Language Model》

Download link: http://www.ai.mit.edu/projects/jmlr/papers/volume3/bengio03a/bengio03a.pdf

The NNLM model is a three-layer network, with the structure shown below:
[Figure: NNLM model]

The bottom layer takes the $n-1$ words preceding the output word; the goal of NNLM is to compute, from these $n-1$ words, the probability of the $t$-th word $w_t$.

The training objective of NNLM is:
$f(w_t, ..., w_{t-n+1}) = P(w_t \mid w_1^{t-1})$
where $w_t$ is the $t$-th word and $w_1^{t-1}$ denotes the subsequence from word 1 to word $t-1$. The model must satisfy two constraints:

  • $f(w_t, ..., w_{t-n+1}) > 0$
  • $\sum_{i=1}^{|V|} f(i, w_{t-1}, ..., w_{t-n+1}) = 1$

In other words, given a sequence, the model uses the preceding $n-1$ words to predict the probability of the current ($t$-th) word.

  • Constraint one: every probability value produced by the network must be greater than 0.
  • Constraint two: the network takes the preceding words as input and predicts what the $t$-th word is, so its actual output is a vector whose components are the probabilities that the next word is each word in the vocabulary. These $|V|$ probabilities must sum to 1; one of them is the largest and the others are smaller.

The function learned by NNLM can be decomposed into two parts:

  • 1、 Feature mapping: the $i$-th word of the vocabulary $V$ is mapped to an $m$-dimensional vector $C(i)$ (the initial word vectors can be arbitrary, e.g. derived from one-hot encodings, because they are trained together with the model). The feature vectors $C(w_{t-n+1}), ..., C(w_{t-1})$ are then concatenated into an $m(n-1)$-dimensional vector. In other words, $C$ is a mapping from the vocabulary $V$ into a real vector space, and through this mapping we obtain a vector representation of each word.
  • 2、 Computing the conditional probability distribution: a function $g$ maps the input sequence of word vectors $(C(w_{t-n+1}), ..., C(w_{t-1}))$ to a probability distribution $y \in R^{|V|}$. The output is therefore $|V|$-dimensional, the same size as the vocabulary; the $i$-th component of $y$ is the probability that the $n$-th word of the sequence is the $i$-th word $V_i$ of the vocabulary, i.e.: $f(i, w_{t-1}, ..., w_{t-n+1}) = g(i, C(w_{t-1}), ..., C(w_{t-n+1}))$

The output of NNLM is a softmax, of the form:
$P(w_t|w_{t-1}, ..., w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_{i} e^{y_i}}$
where $y_i$ is the unnormalized score of the $i$-th word, computed as:
$y = b + Wx + U\,\tanh(d + Hx)$
The model parameters are $\theta = (b, d, W, U, H, C)$, where:

  • $x$ is the input to the network, the concatenation $x = (C(w_{t-1}), ..., C(w_{t-n+1}))$
  • $h$ is the number of hidden-layer units
  • $d$ is the bias of the hidden layer
  • $m$ is the dimension of each word vector
  • $W$ is optional: if there is no direct connection from the input layer to the output layer, set $W = 0$
  • $H$ is the weight matrix from the input layer to the hidden layer
  • $U$ is the weight matrix from the hidden layer to the output layer

An ordinary neural network does not treat its input as trainable, but here the input $x$ consists of word vectors, which are themselves parameters to be learned. The network weights and the word vectors are therefore trained simultaneously, and when training finishes we obtain both at the same time.

The training goal of NNLM is to maximize the (penalized) log-likelihood:
$L = \frac{1}{T} \sum_{t} \log f(w_t, w_{t-1}, ..., w_{t-n+1}; \theta) + R(\theta)$
where $\theta$ denotes all model parameters and $R(\theta)$ is a regularization term (in the paper's experiments, $R$ is a weight-decay penalty applied only to the neural network weights and the word feature matrix).

The parameters are then updated by stochastic gradient ascent on the log-likelihood (equivalently, gradient descent on the negative log-likelihood):
$\theta \leftarrow \theta + \epsilon \frac{\partial \log p(w_t|w_{t-1},...,w_{t-n+1})}{\partial \theta}$
where $\epsilon$ is the learning rate (step size).
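To make the structure concrete, here is a minimal PyTorch sketch of the forward computation $y = b + Wx + U\tanh(d + Hx)$ and one gradient step on the negative log-likelihood. It is my own illustration under assumed hyperparameters (vocabulary size, $m$, $h$, window $n$, and the toy word ids are all made up), not the reference implementation linked below; `direct=True` enables the optional $W$ connection described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NNLM(nn.Module):
    """Minimal Bengio-style NNLM: y = b + W x + U tanh(d + H x)."""
    def __init__(self, vocab_size, m=50, h=64, n=3, direct=True):
        super().__init__()
        self.n = n
        self.C = nn.Embedding(vocab_size, m)          # word feature matrix C
        self.H = nn.Linear(m * (n - 1), h)            # input -> hidden (its bias plays the role of d)
        self.U = nn.Linear(h, vocab_size)             # hidden -> output (its bias plays the role of b)
        # optional direct input -> output connection W (direct=False corresponds to W = 0)
        self.W = nn.Linear(m * (n - 1), vocab_size, bias=False) if direct else None

    def forward(self, context):
        # context: LongTensor of shape (batch, n-1) holding the previous n-1 word ids
        x = self.C(context).flatten(start_dim=1)      # concatenate the n-1 word vectors
        y = self.U(torch.tanh(self.H(x)))             # b + U tanh(d + H x)
        if self.W is not None:
            y = y + self.W(x)                         # + W x
        return y                                      # unnormalized scores; softmax gives P(w_t | context)

# One SGD step on a toy batch (word ids are invented for illustration).
model = NNLM(vocab_size=1000)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

context = torch.tensor([[1, 2], [3, 4]])              # previous n-1 = 2 words per example
target = torch.tensor([5, 6])                         # the word w_t to predict
loss = F.cross_entropy(model(context), target)        # negative log-likelihood of w_t
loss.backward()
optimizer.step()                                      # theta <- theta - eps * grad(-log p)
```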

Reference implementations in PyTorch and TensorFlow: https://www.jianshu.com/p/be242ed3f314

1.4、 Neural network language model ——RNNLM

The RNNLM paper: 《Recurrent neural network based language model》

Download link: https://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf

The idea behind RNNLM is relatively simple: its main improvement is to replace the feed-forward network in NNLM with a recurrent one. Its structure is shown below:

[Figure: RNNLM model structure]

At first glance this diagram may not mean much, so let's first review some basic RNN knowledge.

The structure of a simple RNN is shown below:
[Figure: a simple RNN]

It consists of an input layer, a hidden layer, and an output layer: $X$ is the input vector, $U$ is the weight matrix from the input layer to the hidden layer, $S$ is the hidden-layer vector, $V$ is the weight matrix from the hidden layer to the output layer, $O$ is the output vector, and $W$ is the weight applied to the previous hidden-layer value (the recurrent connection).

Unrolling the figure above along the time axis gives:

[Figure: the RNN unrolled over time]

This is much clearer: at time $t$, the network receives input $x_t$, the hidden-layer value is $s_t$, and the output is $o_t$. The key point is that $s_t$ depends not only on $x_t$ but also on $s_{t-1}$.

Now look back at the RNNLM structure diagram: $INPUT(t)$ is the input $x_t$ at time $t$, $CONTEXT(t)$ is the hidden layer at time $t$ ($s_t$), $CONTEXT(t-1)$ is the hidden-layer value at time $t-1$ ($s_{t-1}$), and $OUTPUT(t)$ is the output at time $t$ ($o_t$).

The computations at each time step are:
$x(t) = w(t) + s(t-1)$
$s_j(t) = f\left(\sum_{i} x_i(t)\, u_{ji}\right)$
$y_k(t) = g\left(\sum_{j} s_j(t)\, v_{kj}\right)$
Here $x(t)$ is the input layer; in the paper, the "+" denotes concatenating the one-hot vector $w(t)$ of the current word with the previous hidden (context) state $s(t-1)$.
where $f(z)$ is the sigmoid function:
$f(z) = \frac{1}{1+e^{-z}}$

and $g(z)$ is the softmax function:
$g(z_m) = \frac{e^{z_m}}{\sum_{k} e^{z_k}}$
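The following NumPy sketch implements one RNNLM time step following the equations above, treating $x(t)$ as the concatenation of the one-hot word vector and the previous hidden state. It is my own illustration; the dimensions and random weights are made up.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                       # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnnlm_step(w_t, s_prev, U, V):
    """One RNNLM step: x(t) = [w(t); s(t-1)], s(t) = f(U x), y(t) = g(V s)."""
    x_t = np.concatenate([w_t, s_prev])   # input layer: current word + previous context
    s_t = sigmoid(U @ x_t)                # hidden (context) layer
    y_t = softmax(V @ s_t)                # output layer: distribution over the next word
    return s_t, y_t

vocab_size, hidden = 10, 4                # toy sizes; real vocabularies are far larger
rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (hidden, vocab_size + hidden))   # input -> hidden weights
V = rng.normal(0, 0.1, (vocab_size, hidden))            # hidden -> output weights

s = np.full(hidden, 0.1)                  # s(0) initialized to a small constant vector
w = np.zeros(vocab_size); w[3] = 1.0      # one-hot vector for the current word
s, y = rnnlm_step(w, s, U, V)
print(y.sum())                            # ~1.0: a probability distribution over the next word
```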

There are some details worth noting:

  • $s(0)$ is usually initialized to a vector of small values, e.g. 0.1
  • The number of hidden-layer units is generally 30-500, since the input vocabulary is very large
  • The weights are initialized with small Gaussian random noise
  • The optimization algorithm is stochastic gradient descent
  • The initial learning rate is set to 0.1; as training proceeds, if there is no significant improvement the learning rate is halved
  • Training usually begins to converge after 10-20 epochs

After each epoch, the error vector is computed according to the cross-entropy criterion:
$error(t) = desired(t) - y(t)$

where $desired(t)$ is the target vector (a one-hot encoding of the true next word) and $y(t)$ is the model's predicted output.

Finally, the probability of a word is computed as:
$P(w_i(t+1) \mid w(t), s(t-1)) = \begin{cases} \dfrac{y_{rare}(t)}{C_{rare}} & \text{if } w_i(t+1) \text{ is rare} \\ y_i(t) & \text{otherwise} \end{cases}$

where $C_{rare}$ is the number of words in the vocabulary that occur less often than the threshold. The paper mentions that on the Brown corpus (about 800,000 words) the threshold is set to 5 and the number of hidden units to 100.
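As a small illustration of this rare-word trick (my own sketch, with made-up numbers): all words below the frequency threshold share a single "rare" output unit, and its probability mass is split evenly among them, exactly as in the formula above.

```python
def word_prob(word, y, word_to_index, rare_words, y_rare):
    """Probability of `word` given the model output, per the formula above.

    y           -- output probabilities for the in-vocabulary (non-rare) words
    y_rare      -- probability assigned to the shared 'rare' output unit
    rare_words  -- set of word types whose count is below the threshold
    """
    if word in rare_words:
        return y_rare / len(rare_words)       # y_rare(t) / C_rare
    return y[word_to_index[word]]             # y_i(t)

# Toy example: three frequent words plus two rare ones sharing 10% of the mass.
y = [0.5, 0.3, 0.1]
word_to_index = {"today": 0, "weather": 1, "sunny": 2}
rare_words = {"spelunking", "zymurgy"}
print(word_prob("weather", y, word_to_index, rare_words, y_rare=0.1))   # 0.3
print(word_prob("zymurgy", y, word_to_index, rare_words, y_rare=0.1))   # 0.05
```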

For an RNNLM code implementation, see: https://www.jianshu.com/p/f53f606944c6



