# The Past and Present of Pre-trained Language Models: From Word Embedding to BERT

2021-08-08 15:42:53


This article runs to about 24,619 words and was typed out word by word with no small effort. Please credit the source when reposting:
"The Past and Present of Pre-trained Language Models: From Word Embedding to BERT", by Twenty-three-year-old Youde


BERT has been extremely hot lately; it is probably the most talked-about AI advance of the moment, and the reviews online are glowing. From the standpoint of model innovation it is actually fairly ordinary: the novelty is not large. But the results are simply too good. It set new state-of-the-art results on a great many NLP tasks, and on some of them it blew past the previous best by a wide margin; that is the first key point. The other is BERT's broad generality: most NLP tasks can adopt its similar two-stage recipe to improve performance directly, which is the second key point. Objectively speaking, it is fairest to regard BERT as the great synthesizer of the last two years of progress in NLP.

The theme of this article is the past and present of pre-trained language models. It traces, step by step, how pre-training techniques in NLP developed into the BERT model, so that you can see naturally how BERT's ideas took shape: what its historical lineage is, what it inherited and what it innovated, why its results are so good and what the main reasons are, why its model innovation is nonetheless modest, and why BERT deserves to be called the synthesizer of recent years of NLP progress.

The development of pre-trained language models did not happen overnight; it grew out of word embeddings, sequence-to-sequence models, and Attention.

Sebastian Ruder, a research scientist at DeepMind, summarized the milestone advances in natural language processing since the start of the 21st century, seen from the perspective of neural network techniques, as in the following table:

| Year | Technique |
|------|-----------|
| 2013 | word2vec |
| 2014 | GloVe |
| 2015 | LSTM/Attention |
| 2016 | Self-Attention |
| 2017 | Transformer |
| 2018 | GPT/ELMo/BERT/GNN |
| 2019 | XLNet/RoBERTa/GPT-2/ERNIE/T5 |
| 2020 | GPT-3/ELECTRA/ALBERT |

This article walks through the NLP techniques in the table above one by one. Since most techniques after 2019 are variants of BERT, they are not described further here; interested readers can explore them on their own.

# 1. Pre-training

## 1.1 Pre-training in the image domain

Before introducing pre-training in the image domain, let us first introduce the convolutional neural network (CNN). CNNs are typically used for image classification tasks and consist of multiple layers, with different layers learning different image features: the shallower the layer, the more general the learned features (horizontal and vertical strokes); the deeper the layer, the more task-specific the learned features (face vs. face contour, car vs. car outline), as shown in the figure below:

Now suppose the boss hands us a task: ten pictures each of three categories of subjects, dogs being one of them, and asks us to design a deep neural network that classifies the pictures into the three categories.

Designing a deep neural network from scratch for this task is basically hopeless, because one weakness of deep learning is its huge appetite for data during training, and the boss gave us only 30 pictures in total. That is clearly not enough.

Although the boss gave us little data, we can make use of the huge number of classified and labeled images already on the web. ImageNet, for example, contains 14 million images, all classified and labeled.

The idea of exploiting these existing images on the web is exactly the idea of pre-training. Concretely:

1. Train a model A on the ImageNet dataset.
2. As mentioned above, the features learned by the shallow layers of a CNN are especially general, so we can adapt model A into a model B in either of two ways:
    1. Frozen: the shallow layers reuse model A's parameters and stay fixed; the deep layers are randomly initialized; then the deep-layer parameters are trained on the boss's 30 pictures.
    2. Fine-tuning: the shallow layers reuse model A's parameters and the deep layers are randomly initialized; the whole network is then trained on the boss's 30 pictures, so here the shallow parameters also change as training on the task proceeds.

To summarize image pre-training (refer to the figure above): for a task A with little data, first build a CNN model A using a large amount of existing data. Because the features learned by a CNN's shallow layers are especially general, we then build a CNN model B whose shallow parameters are copied from model A and whose deep parameters are randomly initialized, and train model B on task A's data by freezing or fine-tuning. The resulting model B is the model for task A.
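The frozen vs. fine-tuning distinction can be sketched in a few lines of numpy. This is only a toy illustration, not a real CNN: the two "layers" are plain matrices, and `train_step` uses a made-up stand-in gradient just to show which parameters move.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend "model A" was pre-trained on a large dataset: its shallow
# layer W_shallow_A learned general features we want to reuse.
W_shallow_A = rng.normal(size=(4, 4))

# Build "model B" for the small task: copy the shallow parameters from A,
# randomly initialize the task-specific deep layer.
W_shallow_B = W_shallow_A.copy()
W_deep_B = rng.normal(size=(4, 2))

def train_step(x, W_shallow, W_deep, lr=0.1, frozen=True):
    """One (fake) gradient step; with frozen=True only the deep layer moves."""
    h = np.tanh(x @ W_shallow)                        # shallow features
    fake_grad_deep = h.T @ np.ones((x.shape[0], 2))   # stand-in for a real gradient
    W_deep = W_deep - lr * fake_grad_deep
    if not frozen:                                    # fine-tuning: shallow layer also updates
        fake_grad_shallow = x.T @ np.ones((x.shape[0], 4))
        W_shallow = W_shallow - lr * fake_grad_shallow
    return W_shallow, W_deep

x = rng.normal(size=(30, 4))                          # the 30 pictures from the example
Ws_frozen, _ = train_step(x, W_shallow_B, W_deep_B, frozen=True)
Ws_tuned, _ = train_step(x, W_shallow_B, W_deep_B, frozen=False)

assert np.allclose(Ws_frozen, W_shallow_A)            # frozen: shallow params untouched
assert not np.allclose(Ws_tuned, W_shallow_A)         # fine-tuned: shallow params moved
```

Either way, only the shallow layer benefits from model A; the deep layer must always be learned from the small task's data.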

## 1.2 The idea of pre-training

With image-domain pre-training introduced, we can now state the idea of pre-training in general: the parameters of task A's model A are no longer randomly initialized. Instead, a model B is first pre-trained on a task B; model B's parameters are used to initialize model A, and model A is then trained on task A's data. (Note: model B's own parameters are randomly initialized before pre-training.)

# 2. Language models

To understand pre-trained language models, we first have to understand what a language model is.

Roughly speaking, a language model computes the probability of a sentence. That is, for a word sequence $$w_1,w_2,\cdots,w_n$$, the language model computes the probability of the sequence, namely $$P(w_1,w_2,\cdots,w_n)$$.

Two examples will make this concrete:

1. Given the two sentences "determine the magnetism of this word" and "determine the part of speech of this word", a language model should consider the latter more natural. In mathematical language (the tokens gloss the original Chinese word order): $$P(\text{judge},\text{this},\text{word},\text{of},\text{part-of-speech}) \gt P(\text{judge},\text{this},\text{word},\text{of},\text{magnetism})$$
2. Given the fill-in-the-blank sentence "determine this word's ____", the problem becomes: given the preceding words, what should the next word be? In mathematical language: $$P(\text{part-of-speech}\mid\text{judge},\text{this},\text{word},\text{of}) \gt P(\text{magnetism}\mid\text{judge},\text{this},\text{word},\text{of})$$

From these two examples we can give a more concrete description of a language model: given a sentence $$W=w_1,w_2,\cdots,w_n$$ of $$n$$ words, compute the probability of the sentence $$P(w_1,w_2,\cdots,w_n)$$, or compute the probability of the next word given the preceding context, $$P(w_n|w_1,w_2,\cdots,w_{n-1})$$.

Below we describe two branches of language models: statistical language models and neural network language models.

## 2.1 Statistical language model

The basic idea of a statistical language model is to compute conditional probabilities.

Given a sentence $$W=w_1,w_2,\cdots,w_n$$ of $$n$$ words, the probability $$P(w_1,w_2,\cdots,w_n)$$ is computed by the following formula (the generalized multiplication rule for conditional probabilities, i.e. the chain rule):

\begin{align*} P(w_1,w_2,\cdots,w_n) & = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2)\cdots P(w_n|w_1,w_2,\cdots,w_{n-1}) \\ & = \prod_i P(w_i|w_1,w_2,\cdots,w_{i-1}) \end{align*}

For the sentence "determine the part of speech of this word" from the previous section, the formula above gives:

\begin{align*} & P(\text{judge},\text{this},\text{word},\text{of},\text{part-of-speech}) = \\ & P(\text{judge})\,P(\text{this}\mid\text{judge})\,P(\text{word}\mid\text{judge},\text{this}) \\ & P(\text{of}\mid\text{judge},\text{this},\text{word})\,P(\text{part-of-speech}\mid\text{judge},\text{this},\text{word},\text{of}) \end{align*}

For the other problem from the previous section, namely: given the preceding word sequence "judge, this, word, of", what is the next word? We can directly compute the following probability:

$P(w_{next}\mid\text{judge},\text{this},\text{word},\text{of})\quad\text{Formula (1)}$

where $$w_{next} \in V$$ is the next word in the sequence and $$V$$ is a dictionary (word set) containing $$|V|$$ words.

Formula (1) can be expanded into the following form:

$P(w_{next}\mid\text{judge},\text{this},\text{word},\text{of}) = \frac{count(\text{judge},\text{this},\text{word},\text{of},w_{next})}{count(\text{judge},\text{this},\text{word},\text{of})} \quad\text{Formula (2)}$

With formula (2), we can plug each of the many words in the dictionary $$V$$ in as $$w_{next}$$, and finally take the word with the maximum probability as the candidate for $$w_{next}$$.

If $$|V|$$ is especially large, formula (2) becomes very expensive to compute, but we can bring in the Markov chain idea (here we only sketch how it is used; for the underlying theory of Markov chains, consult other references).

Suppose the dictionary $$V$$ contains the word "Mars". Obviously "Mars" cannot follow "determine this word's ...", so the combination (judge, this, word, of, Mars) does not occur, and there are many words like "Mars".

Going further, notice that we judged the combination (judge, this, word, of, Mars) to be nonexistent essentially because "Mars" cannot follow "word's". That is, we can consider turning formula (1) into:

$P(w_{next}\mid\text{judge},\text{this},\text{word},\text{of}) \approx P(w_{next}\mid\text{word},\text{of})\quad\text{Formula (3)}$

Formula (3) embodies the Markov assumption: suppose $$w_{next}$$ depends only on the $$k$$ words immediately before it. $$k=1$$ gives the bigram language model, and $$k=2$$ the trigram language model.

The formulas rewritten under the Markov assumption are much cheaper to compute. Let us use a simple example to show how the probabilities of a bigram language model are computed.

The bigram language model's formula is:

$P(w_i|w_{i-1})=\frac{count(w_{i-1},w_i)}{count(w_{i-1})}\quad\text{Formula (4)}$

Suppose we have the following text collection (tokenized; the token order glosses the original Chinese):

1. "part-of-speech / is / verb"
2. "judge / word / of / part-of-speech"
3. "magnetism / strong / of / magnet"
4. "Beijing / of / part-of-speech / is / noun"


For this text collection, to compute the probability $$P(\text{part-of-speech}\mid\text{of})$$ via formula (4), we count the number of times "of, part-of-speech" occur together in that order, and divide by the number of occurrences of "of":

$P(\text{part-of-speech}\mid\text{of}) = \frac{count(\text{of},\text{part-of-speech})}{count(\text{of})} = \frac{2}{3}\quad\text{Formula (5)}$

The text collection above was made up by us, but in virtually any real-world text there will be data sparsity; for example, an unseen word appears at test time that never occurred in training.

Because of data sparsity, some probability values will come out as 0 (for the fill-in-the-blank task, no word in the dictionary could then be chosen). To avoid zeros, a smoothing strategy is used: add a non-zero positive number to both numerator and denominator. For example, formula (4) can be changed to:

$P(w_i|w_{i-1}) = \frac{count(w_{i-1},w_i)+1}{count(w_{i-1})+|V|}\quad\text{Formula (6)}$
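Formulas (4) through (6) can be checked directly on the toy corpus above. A minimal sketch in Python (the English tokens gloss the original Chinese words; the helper names are made up for illustration):

```python
from collections import Counter

# Toy corpus from the example, tokenized (token order follows the original Chinese).
corpus = [
    ["part-of-speech", "is", "verb"],
    ["judge", "word", "of", "part-of-speech"],
    ["magnetism", "strong", "of", "magnet"],
    ["Beijing", "of", "part-of-speech", "is", "noun"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
V = len(unigrams)  # dictionary size |V|

def p_bigram(w, prev):
    """Formula (4): P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})."""
    return bigrams[(prev, w)] / unigrams[prev]

def p_bigram_smoothed(w, prev):
    """Formula (6): add-one (Laplace) smoothing avoids zero probabilities."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

assert abs(p_bigram("part-of-speech", "of") - 2 / 3) < 1e-12   # formula (5)
assert p_bigram("magnet", "part-of-speech") == 0               # unseen bigram
assert p_bigram_smoothed("magnet", "part-of-speech") > 0       # smoothed: non-zero
```

The unseen bigram that would have probability 0 under formula (4) gets a small positive probability under formula (6), which is exactly the point of smoothing.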

## 2.2 Neural network language model

The previous section briefly introduced statistical language models and noted, at the end, that they suffer from data sparsity, for which a smoothing method was proposed.

A neural network language model uses a neural network architecture to estimate the distribution of words, and measures the similarity between words by the distance between their word vectors. An unseen word can therefore be estimated via similar words, which avoids the data sparsity problem.

The figure above shows the structure of the neural network language model. Its learning task is: given the $$t-1$$ words preceding a word $$w_t = \text{bert}$$ in a sentence, the network must correctly predict the word "bert", i.e. maximize:

$P(w_t=bert|w_1,w_2,\cdots,w_{t-1};\theta)\quad\text{Formula (7)}$

The neural network language model in the figure has three layers; let us explain each in turn:

1. The first layer is the input layer. Each of the preceding $$t-1$$ words is first represented by its one-hot code (for example, 0001000) and multiplied by a randomly initialized matrix $$Q$$ to obtain its word vector $$C(w_i)$$. The $$t-1$$ word vectors are then concatenated into the input $$x$$, written $$x=(C(w_1),C(w_2),\cdots,C(w_{t-1}))$$.
2. The second layer is the hidden layer, with $$h$$ hidden units; $$H$$ is its weight matrix, so the hidden layer computes $$Hx+d$$, where $$d$$ is a bias term, followed by the $$\tanh$$ activation function.
3. The third layer is the output layer, with $$|V|$$ output nodes in total (the dictionary size). Intuitively, each output node $$y_i$$ is the probability value of one word in the dictionary. The final computation is $$y = softmax(b+Wx+U\tanh(d+Hx))$$, where $$W$$ is the weight matrix directly from the input layer to the output layer and $$U$$ is the parameter matrix from the hidden layer to the output layer.
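The three layers can be sketched as a single forward pass in numpy. The parameter names follow the formulas above; the sizes and context ids are made-up toy values:

```python
import numpy as np

rng = np.random.default_rng(42)
V, m, h, n_ctx = 8, 3, 5, 2        # |V|, embedding dim, hidden units, context words

# Randomly initialized parameters (all learned jointly in the real model)
Q = rng.normal(size=(V, m))        # embedding matrix: row i = C(w_i)
H = rng.normal(size=(h, n_ctx * m)); d = rng.normal(size=h)
U = rng.normal(size=(V, h)); W = rng.normal(size=(V, n_ctx * m)); b = rng.normal(size=V)

def nnlm_forward(context_ids):
    """y = softmax(b + Wx + U tanh(d + Hx)): a probability over the dictionary."""
    x = np.concatenate([Q[i] for i in context_ids])    # x = (C(w_1), ..., C(w_{t-1}))
    logits = b + W @ x + U @ np.tanh(d + H @ x)
    e = np.exp(logits - logits.max())                  # numerically stable softmax
    return e / e.sum()

y = nnlm_forward([3, 5])           # two context word ids
assert y.shape == (V,)
assert abs(y.sum() - 1.0) < 1e-9   # a valid probability distribution over |V| words
```

Training would adjust $$Q, H, d, U, W, b$$ so that the true next word gets high probability; the sketch only shows the forward computation.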

# 3. Word vectors

When describing the neural network language model we mentioned one-hot coding and the word vector $$C(w_i)$$, but did not say concretely what they are.

Since they matter a great deal for the later explanation of BERT, we open a separate chapter here to describe what word vectors are and how they came about.

## 3.1 One-hot encoding

Representing words as vectors is a core technique for bringing deep neural network language models into the field of natural language processing.

In NLP tasks, the training data mostly consists of words or characters; converting them into numerical data the computer can process is essential.

Early on, the approach people came up with was one-hot encoding, as shown in the figure below:

To explain the figure: suppose we have a dictionary $$V$$ containing 8 words, with "time" in position 1 of the dictionary and "banana" in position 8. Under the one-hot representation, the vector for "time" has a 1 in position 1 and 0 everywhere else, and the vector for "banana" has a 1 in position 8 and 0 everywhere else.

However, if we use cosine similarity to compare one-hot vectors, the similarity between any two distinct vectors is always 0; none of them are related to each other. In other words, the one-hot representation cannot capture similarity between words.

## 3.2 Word Embedding

Since the one-hot representation cannot capture similarity between words, it was quickly replaced by the word vector representation. You may now recall the word vector $$C(w_i)$$ in the neural network language model: exactly, that $$C(w_i)$$ is in fact the word's Word Embedding value, which is the core of this section: the word vector.

In the neural network language model we did not explain in detail how word vectors are computed; let us now revisit its architecture:

As shown in the figure above, there is a $$|V| \times m$$ matrix $$Q$$. The matrix $$Q$$ has $$|V|$$ rows, where $$|V|$$ is the dictionary size, and the content of each row is the Word Embedding value of the corresponding word.

The content of $$Q$$ is itself a network parameter that must be learned. At the start of training, $$Q$$ is initialized with random values; once the network is trained, the content of $$Q$$ has been assigned correctly, and each row is the Word Embedding value of one word.

But does this word vector actually capture similarity between words? To answer that, look at how a word vector is computed:

$\begin{bmatrix} 0&0&0&1&0 \end{bmatrix} \begin{bmatrix} 17&24&1\\ 23&5&7\\ 4&6&13\\ 10&12&19\\ 11&18&25 \end{bmatrix} = \begin{bmatrix} 10&12&19 \end{bmatrix} \quad\text{Formula (8)}$

From this computation, the word vector of the 4th word is $$[10\ 12\ 19]$$.

If we again use cosine similarity to compare two words, the result is no longer always 0; it can describe the similarity between two words to some extent.
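Formula (8) and the similarity claim can be verified in a few lines of numpy (the matrix is the one from formula (8)):

```python
import numpy as np

Q = np.array([[17, 24, 1],
              [23, 5, 7],
              [4, 6, 13],
              [10, 12, 19],
              [11, 18, 25]], dtype=float)   # the matrix Q from formula (8)

onehot = np.zeros(5); onehot[3] = 1          # one-hot code of the 4th word
embedding = onehot @ Q                       # formula (8): picks out row 4 of Q
assert np.allclose(embedding, Q[3])          # [10, 12, 19]

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# One-hot vectors of different words are always orthogonal (similarity 0)...
a, b = np.eye(5)[0], np.eye(5)[1]
assert cosine(a, b) == 0
# ...but their embeddings generally are not, so similarity becomes meaningful.
assert cosine(Q[0], Q[1]) != 0
```

Multiplying by a one-hot vector is just a row lookup, which is why frameworks implement the embedding layer as a table lookup rather than an actual matrix multiply.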

The figure below shows some examples found online, and some of them are quite good: once a word is expressed as a Word Embedding, it becomes easy to find other words with similar semantics.

# 4. The Word2Vec model

The hottest tool in 2013 for producing Word Embeddings with a language model was Word2Vec, followed later by GloVe (because GloVe performs similarly to Word2Vec and is likewise not needed for explaining BERT, it is not discussed further). How does Word2Vec work? See the figure below:

Word2Vec's network structure is basically similar to the neural network language model (NNLM); the figure is just drawn a bit differently, so it may not look like it, but they are really siblings. One thing must be pointed out, though: although the network structures are similar, and both perform the language model task, their training methods differ.

Word2Vec has two training methods:

1. The first is called CBOW: the core idea is to remove a word from a sentence and use the words before and after it to predict the removed word;
2. The second is called Skip-gram, the opposite of CBOW: input a word and ask the network to predict its context words.
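The two training methods differ only in what predicts what, which is easiest to see from the (input, target) pairs they generate. A minimal sketch (the helper names `cbow_pairs` and `skipgram_pairs` are made up for illustration):

```python
def cbow_pairs(tokens, window=1):
    """CBOW: (context words, center word); the context predicts the removed word."""
    pairs = []
    for i, center in enumerate(tokens):
        ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if ctx:
            pairs.append((ctx, center))
    return pairs

def skipgram_pairs(tokens, window=1):
    """Skip-gram: (center word, one context word); the center predicts its context."""
    return [(center, c) for ctx, center in cbow_pairs(tokens, window) for c in ctx]

sent = ["judge", "word", "of", "part-of-speech"]
# CBOW hides "of" and predicts it from its neighbors...
assert (["word", "part-of-speech"], "of") in cbow_pairs(sent)
# ...while Skip-gram takes "of" as input and predicts each neighbor.
assert ("of", "word") in skipgram_pairs(sent)
assert ("of", "part-of-speech") in skipgram_pairs(sent)
```

The actual Word2Vec training then feeds these pairs through a shallow network whose input weights become the Word Embedding matrix.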

Now look back: how is the NNLM trained? By inputting the preceding context of a word and predicting that word. That is a significant difference.

Why does Word2Vec do it this way? The reason is simple: Word2Vec's goal differs from NNLM's. NNLM's main task is to learn a network that solves the language model task, and a language model predicts the following word from the preceding context; the Word Embedding is merely an incidental by-product. Word2Vec's goal is different: it purely wants the Word Embedding, which is its main product, so it is free to train the network however it likes.

Why talk about Word2Vec at all? Mainly to introduce the CBOW training method, which is in fact related to BERT; their relationship will be explained later. Of course, the BERT authors never stated this relationship. It is my own guess, and you can judge for yourself after reading this article whether the guess holds.

# 5. Pre-training in natural language processing

This section is inserted here to raise a question: does the Word Embedding practice count as pre-training? It is in fact a standard pre-training process. To see why, we must look at how downstream tasks use Word Embeddings.

Suppose, as shown in the figure above, we have a downstream NLP task such as QA, i.e. question answering: given a question X and a sentence Y, judge whether Y is the correct answer to X.

Assume the network designed for the QA task is as shown in the figure above. We will not dwell on the design: it is understandable at a glance, and not understanding it does not matter, since it is not the point of this article. The point is how this network uses the trained Word Embedding.

The usage is actually the same as in the NNLM described above: each word in the sentence enters as a one-hot vector, which is multiplied by the learned Word Embedding matrix Q, thereby picking out that word's Word Embedding.

At first glance this looks like a mere table lookup, not like pre-training, right? But it is: the Word Embedding matrix Q is precisely the network's parameter matrix mapping the one-hot layer to the embedding layer.

So what is using a Word Embedding equivalent to? It is equivalent to initializing the one-hot-to-embedding layer of the network with the pre-trained parameter matrix Q. This is the same as the shallow-layer pre-training process in the image domain described earlier; the only difference is that a Word Embedding can initialize only the first layer's parameters, and nothing can be done for the higher layers.

Downstream NLP tasks can use the Word Embedding in the same two ways as in images: one is Frozen, where the Word Embedding layer's parameters are fixed; the other is Fine-Tuning, where the Word Embedding layer's parameters are updated along with training on the new dataset.

The above was the typical pre-training practice in NLP before 2018, and Word Embeddings were indeed helpful for many downstream NLP tasks; the help just was never so dazzling as to blind onlookers who forgot to put on their sunglasses.

# 6. RNN and LSTM

Why insert a section on the RNN (Recurrent Neural Network) and the LSTM (Long Short-Term Memory network) here?

Because the ELMo (Embeddings from Language Models) model, introduced next, uses a bidirectional long short-term memory network (Bi-LSTM) in its training.

Of course, this is only a brief introduction; for more depth, consult the abundant references available online.

## 6.1 RNN

Traditional neural networks cannot capture sequential (temporal) information, yet such information is very important in natural language processing tasks.

For example, in the sentence "I ate an apple", the part of speech and meaning of "apple" depend on the preceding words: without "I ate an", "apple" could just as well be the Apple that Jobs built, the one whose logo has a bite taken out of it.

In other words, the advent of the RNN made processing sequential information possible.

The RNN's basic unit structure is shown in the figure below:

The left part of the figure is one timestep of the RNN. Within this timestep, at time $$t$$, the input variable $$x_t$$ passes through a basic RNN module A, which outputs the variable $$h_t$$, and the information at time $$t$$ is passed on to the next time $$t+1$$.

Unrolling the modules in sequence gives the right part of the figure: the RNN is a chain of interconnected basic modules A, each passing the current information on to the next.

The RNN solves the problem of temporal dependency, but "temporal" here generally means short-range. Let us first distinguish short-range from long-range dependencies:

• Short-range dependency: for the fill-in-the-blank "I want to watch a basketball ____", we can easily judge that "basketball" is followed by "game". Such short-range dependency problems suit the RNN very well.
• Long-range dependency: for the fill-in-the-blank "I was born in Jingdezhen, the porcelain capital of China; my primary and secondary schools were close to home; ......; my mother tongue is ____", judging by short-range dependency alone, "my mother tongue is" could be followed by "Chinese", "English", or "French". For a precise answer we must go all the way back to the much earlier statement "I was born in Jingdezhen, the porcelain capital of China" to conclude the answer is "Chinese", and this is exactly the kind of information an RNN struggles to learn.

## 6.2 The vanishing gradient problem in RNNs

Here we briefly explain why the RNN is ill-suited to long-range dependencies.

The figure above shows the RNN model structure; its forward propagation comprises:

• Hidden state: $$h^{(t)} = \sigma (z^{(t)}) = \sigma(Ux^{(t)} + Wh^{(t-1)} + b)$$, where the activation function is typically $$\tanh$$.
• Model output: $$o^{(t)} = Vh^{(t)} + c$$
• Predicted output: $$\hat{y}^{(t)} = \sigma(o^{(t)})$$, where the activation function is typically softmax.
• Model loss: $$L = \sum_{t = 1}^{T} L^{(t)}$$

All timesteps of the RNN share one set of parameters $$U,V,W$$. During backpropagation we need to compute the gradients of $$U,V,W$$; take $$W$$ as an example (assuming the RNN model's loss function is $$L$$):

\begin{align*} \frac{\partial L}{\partial W} & = \sum_{t = 1}^{T} \frac{\partial L}{\partial y^{(T)}} \frac{\partial y^{(T)}}{\partial o^{(T)}} \frac{\partial o^{(T)}}{\partial h^{(T)}} \left( \prod_{k=t + 1}^{T} \frac{\partial h^{(k)}}{\partial h^{(k - 1)}} \right) \frac{\partial h^{(t)}}{\partial W} \\ & = \sum_{t = 1}^{T} \frac{\partial L}{\partial y^{(T)}} \frac{\partial y^{(T)}}{\partial o^{(T)}} \frac{\partial o^{(T)}}{\partial h^{(T)}} \left( \prod_{k=t+1}^{T} \tanh'(z^{(k)})\, W \right) \frac{\partial h^{(t)}}{\partial W} \quad\text{Formula (9)} \end{align*}

In formula (9), $$\left( \prod_{k=t + 1}^{T} \frac{\partial h^{(k)}}{\partial h^{(k - 1)}} \right) = \left( \prod_{k=t+1}^{T} \tanh'(z^{(k)})\, W \right)$$. The derivative of $$\tanh$$ is always less than 1, and the product chains many such timestep factors together; if the principal eigenvalue of $$W$$ is less than 1, the gradient vanishes, and if it is greater than 1, the gradient explodes.

Note that gradient vanishing and gradient explosion mean something different in RNNs than in DNNs.

In an RNN the weights are shared across timesteps, and the final gradient is the sum of the gradients at each timestep, so the total keeps accumulating. Thus the total gradient in an RNN does not vanish, even as individual terms grow ever weaker; it is only the long-range gradient that vanishes. From $$\left( \prod_{k=t+1}^{T} \tanh'(z^{(k)})\, W \right)$$ in formula (9) we can see what RNN "gradient vanishing" really means: the gradient is dominated by the short-range terms ($$t+1$$ close to $$T$$), while the long-range terms ($$t+1$$ far from $$T$$) are tiny, which makes it hard for the model to learn long-range information.
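The dominance of short-range terms can be seen numerically in the scalar case of formula (9), where each factor is $$\tanh'(z_k)\,w$$. A toy sketch (the timestep count and values are arbitrary):

```python
import numpy as np

def gradient_product(w, zs):
    """The product of tanh'(z_k) * w over timesteps, as in formula (9) (scalar case)."""
    g = 1.0
    for z in zs:
        g *= (1 - np.tanh(z) ** 2) * w   # tanh'(z) = 1 - tanh(z)^2 < 1 for z != 0
    return g

zs = [0.5] * 50                          # 50 timesteps with the same pre-activation
vanish = gradient_product(0.9, zs)       # |w| < 1: long-range gradient vanishes
explode = gradient_product(3.0, zs)      # large |w|: long-range gradient explodes

assert abs(vanish) < 1e-6                # essentially zero after 50 steps
assert abs(explode) > 1e6
```

After only 50 timesteps the long-range factor is already negligible in one case and astronomically large in the other, which is exactly why the RNN struggles with long-range dependencies.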

## 6.3 LSTM

To fix the RNN's lack of long-range sequential dependency, the LSTM was proposed. First let us see what the LSTM improves over the RNN:

The figure above shows the LSTM's gated structure (one LSTM timestep). The LSTM's forward propagation comprises:

• Forget gate: decides what information to discard. It receives the previous state $$h_{t-1}$$ and the current input $$x_t$$ and, through a Sigmoid function, outputs a value $$f_t$$ between 0 and 1
• Output: $$f_{t} = \sigma(W_fh_{t-1} + U_fx_{t} + b_f)$$
• Input gate: decides what new information to keep and updates the cell state. Its value is determined by $$h_{t-1}$$ and $$x_t$$: a Sigmoid function yields a value $$i_t$$ between 0 and 1, and a $$\tanh$$ function creates a candidate cell state $$\tilde{C}_t$$
• Output: $$i_{t} = \sigma(W_ih_{t-1} + U_ix_{t} + b_i)$$ , $$\tilde{C}_{t} = \tanh(W_ah_{t-1} + U_ax_{t} + b_a)$$
• Cell state: the old cell state $$C_{t-1}$$ is updated to the new cell state $$C_t$$
• Output: $$C_{t} = C_{t-1}\odot f_{t} + i_{t}\odot \tilde{C}_{t}$$
• Output gate: decides the final output. Its value is determined by $$h_{t-1}$$ and $$x_t$$ through a Sigmoid function, giving a value $$o_t$$ between 0 and 1; a $$\tanh$$ of the cell state then determines the final output
• Output: $$o_{t} = \sigma(W_oh_{t-1} + U_ox_{t} + b_o)$$ , $$h_{t} = o_{t}\odot \tanh(C_{t})$$
• Predicted output: $$\hat{y}_{t} = \sigma(Vh_{t}+c)$$
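The gate equations above can be sketched as a single LSTM timestep in numpy. The weight names mirror the formulas ($$W_f, U_f, b_f$$, etc.); the shapes and inputs are toy values:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM timestep, following the gate equations above."""
    Wf, Uf, bf, Wi, Ui, bi, Wa, Ua, ba, Wo, Uo, bo = params
    f_t = sigmoid(Wf @ h_prev + Uf @ x_t + bf)         # forget gate
    i_t = sigmoid(Wi @ h_prev + Ui @ x_t + bi)         # input gate
    C_tilde = np.tanh(Wa @ h_prev + Ua @ x_t + ba)     # candidate cell state
    C_t = C_prev * f_t + i_t * C_tilde                 # cell state update
    o_t = sigmoid(Wo @ h_prev + Uo @ x_t + bo)         # output gate
    h_t = o_t * np.tanh(C_t)                           # hidden state
    return h_t, C_t, f_t

rng = np.random.default_rng(1)
n, d = 4, 3                                            # hidden size, input size
params = [rng.normal(size=s) for s in [(n, n), (n, d), (n,)] * 4]
h, C, f = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n), params)

assert h.shape == C.shape == (n,)
assert ((f > 0) & (f < 1)).all()                       # gates lie strictly in (0, 1)
```

Running this step repeatedly over a sequence, feeding each output `h, C` back in as `h_prev, C_prev`, gives the unrolled LSTM.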

## 6.4 How the LSTM addresses the RNN's vanishing gradients

Having understood why the RNN's gradient vanishes, let us see how the LSTM addresses the problem.

The RNN's gradient vanishes because, as the gradient propagates, it becomes dominated by short-range terms, making long-range information hard to learn. The specific culprit is the factor $$\prod_{k=t+1}^{T}\frac{\partial h_{k}}{\partial h_{k - 1}}$$: during the iteration, each factor $$\frac{\partial h_{k}}{\partial h_{k - 1}}$$ is either always in [0,1) or always greater than 1.

For the LSTM model, considering $$\prod _{k=t+1}^{T} \frac{\partial C_{k}}{\partial C_{k-1}}$$ instead, we get:

\begin{align*} & \frac{\partial C_{k}}{\partial C_{k-1}} = f_k + \text{other} \\ & \prod _{k=t+1}^{T} \frac{\partial C_{k}}{\partial C_{k-1}} = f_{t+1}f_{t+2}\cdots f_{T} + \text{other} \end{align*}

During the LSTM's iteration, each factor $$\frac{\partial C_{k}}{\partial C_{k-1}}$$ in $$\prod_{k=t+1}^{T} \frac{\partial C_{k}}{\partial C_{k-1}}$$ can independently take a value in [0,1] or greater than 1, because $$f_{k}$$ is trainable. Hence the whole product $$\prod _{k=t+1}^{T} \frac{\partial C_{k}}{\partial C_{k-1}}$$ does not have to shrink monotonically, the long-range gradient does not vanish entirely, and the RNN's vanishing gradient problem is mitigated.

The forget gate value $$f_t$$ can take any value in [0,1], which lets the LSTM alleviate gradient vanishing. It can be pushed toward 1, saturating the forget gate, so that long-range information's gradient does not vanish; or toward 0, in which case the model deliberately cuts off the gradient flow and forgets earlier information.

It is also worth stressing why the LSTM is so elaborate: beyond structurally overcoming the vanishing gradient problem, it gains many more parameters for controlling the model, roughly four times the parameters of an RNN, allowing finer-grained modeling of sequential variables.

Besides, I recall an article noting that at text lengths of around 200, the LSTM already starts to struggle.

# 7. The ELMo model

## 7.1 ELMo pre-training

When explaining Word Embeddings, careful readers will have noticed that these word representations are essentially static: each word has a single fixed word vector that cannot change with the sentence, so they cannot handle polysemy in natural language processing tasks.

As shown in the figure above, consider the polysemous word "bank", which has two common senses; yet when Word Embedding encodes the word "bank", it cannot distinguish between them.

Although the two sentences containing "bank" place the word in different contexts, during language-model training (e.g., Word2Vec) every occurrence, whatever its context, is used to predict the same word "bank". The same word occupies the same row of the parameter matrix, so the two different kinds of contextual information are encoded into the same Word Embedding space, and the embedding cannot separate the different senses of a polysemous word.

So for a word like "bank", the pre-learned Word Embedding mixes several senses together. When a new sentence appears at application time, even if the context (say, the sentence contains words like "money") clearly indicates the "financial institution" sense, the corresponding Word Embedding does not change: it is still a mixture of senses.

ELMo offers a simple and elegant solution to Word Embedding's polysemy problem.

The essential idea of ELMo is: first learn a word's Word Embedding with a language model. At that point polysemous senses are still indistinguishable, but it does not matter. When the Word Embedding is actually used, the word appears in a concrete context, so its Word Embedding representation can be adjusted according to the semantics of the context words. The adjusted Word Embedding expresses the word's meaning in that specific context, which naturally resolves polysemy. ELMo is thus in essence a scheme for dynamically adjusting a Word Embedding according to the current context.

ELMo adopts a typical two-stage process:

1. The first stage pre-trains with a language model;
2. The second stage, on the downstream task, extracts the Word Embeddings of each word from each layer of the pre-trained network and adds them to the downstream task as new features.

The figure above shows the first, pre-training stage. The network structure is a two-layer bidirectional LSTM, and the current training objective is the language-model task of correctly predicting the word $$w_i$$ from its context: the word sequence before $$w_i$$, Context-before, is called the left context, and the word sequence after it, Context-after, the right context.

The forward two-layer LSTM on the left of the figure is the forward encoder; its input is the left-to-right context Context-before, excluding the predicted word $$w_i$$. The backward two-layer LSTM on the right is the reverse encoder; its input is the sentence reversed, right to left, i.e., the context Context-after. Each encoder is a stack of two LSTM layers.

This network structure is in fact very common in NLP. Pre-training it as a language model on a large corpus gives a trained network; afterwards, feeding in a new sentence $$s_{new}$$ yields three corresponding Embeddings for each word in the sentence:

• at the bottom, the word's Word Embedding;
• one level up, the Embedding at the word's position in the first bidirectional LSTM layer, which encodes more of the word's syntactic information;
• one level up again, the Embedding at the word's position in the second LSTM layer, which encodes more of the word's semantic information.

In other words, ELMo's pre-training learns not only each word's Word Embedding but also a two-layer bidirectional LSTM network structure, and both are useful afterwards.

## 7.2 ELMo's Feature-based Pre-Training

The above covered ELMo's first stage: pre-training. Once the network structure is pre-trained, how is it used for downstream tasks?

The figure above shows the downstream usage. Suppose the downstream task is again QA, and the current question is sentence X:

1. First feed sentence X into the pre-trained ELMo network; every word in X then obtains its three Embeddings from the ELMo network;
2. Give each of the three Embeddings a weight a (this weight can be learned), and sum them according to the weights to merge the three Embeddings into one;
3. Use the merged Embedding as the input for the corresponding word of X in the task's own network, i.e., as a supplementary new feature for the downstream task.
4. The answer sentence Y of the QA task shown above is processed in the same way.
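Step 2 above, merging the three Embeddings with learned weights, can be sketched as follows. This is a minimal sketch with made-up numbers; softmax-normalizing the scalar weights and applying a learned task-level scale $$\gamma$$, mimicked here, follows ELMo's usual recipe.

```python
import numpy as np

# One word's three ELMo layer vectors (values made up; real vectors are
# much larger). Row 0 = token embedding, rows 1-2 = the two biLSTM layers.
layers = np.stack([
    np.array([0.1, 0.2, 0.3, 0.4]),   # Word Embedding layer
    np.array([0.5, 0.1, 0.0, 0.2]),   # first biLSTM layer (syntax-heavy)
    np.array([0.3, 0.3, 0.1, 0.1]),   # second biLSTM layer (semantics-heavy)
])

raw_a = np.array([0.2, 1.0, 0.5])          # task-learnable scalars a
a = np.exp(raw_a) / np.exp(raw_a).sum()    # softmax so the weights sum to 1
gamma = 1.0                                # learnable task-level scale

# Weighted sum: one combined feature vector for the word
elmo_vector = gamma * (a[:, None] * layers).sum(axis=0)
print(elmo_vector.shape)  # (4,)
```

The combined vector is what gets fed to the downstream network as the word's new feature.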

Because ELMo provides the downstream task with each word's features, this kind of pre-training method is called "Feature-based Pre-Training".

As for why this can distinguish polysemous senses: after ELMo is trained, during feature extraction each word gets a corresponding node on each of the two LSTM layers, and those nodes encode some of the word's syntactic and semantic features. Their Embeddings change dynamically, influenced by the context words: different surrounding contexts strengthen some senses and weaken others, which solves the polysemy problem.

# 8. Attention

All of the above paves the way for the explanation of BERT, and the same is true of the Attention and Transformer sections that follow: they are all just parts of BERT.

## 8.1 Human visual attention

As its name suggests, Attention borrows from the human attention mechanism, so we first introduce human visual attention.

The visual attention mechanism is a brain signal processing mechanism peculiar to human vision. Human vision quickly scans the whole image to find the target region worth focusing on, the so-called focus of attention, then devotes more attention resources to that region to obtain more detail about the target while suppressing other, useless information.

It is a means by which humans use limited attention resources to quickly pick high-value information out of a mass of information, a survival mechanism formed over long human evolution, and it greatly improves the efficiency and accuracy of visual information processing.

The image above shows how humans efficiently allocate limited attention resources when viewing a picture: the red regions indicate where the visual system attends more. Clearly, for the scene shown, people pay more attention to faces, the title of the text, and the first sentence of the article.

The attention mechanism in deep learning is essentially similar to human selective visual attention: the core goal is to pick out, from a mass of information, the pieces most relevant to the current task.

## 8.2 Attention The essence of thinking

As human visual attention suggests, the essential idea of the Attention model is: selectively filter a small amount of important information out of a large amount of information, focus on that important information, and ignore the unimportant rest.

Before explaining Attention in detail, let us mention its other benefits. When explaining the LSTM, we noted that although it addresses long-distance dependencies in sequences, it starts to fail beyond roughly 200 words. The Attention mechanism handles long-distance dependencies in sequences better and, moreover, supports parallel computation. If this is not clear yet, it will become so as we dig deeper into Attention.

First, one point must be made clear: the attention model filters a small amount of important information out of many Values, and that importance is always relative to another piece of information, the Query. In the baby picture above, for instance, the Query is the observer. In other words, to build an attention model we need a Query and a set of Values, and we use the Query to filter the important information out of the Values.

Using the Query to filter important information out of the Values means, simply put, computing the relevance between the Query and each piece of information in the Values.

More concretely, Attention can generally be described as mapping a Query (Q) and key-value pairs (the Values split into key-value pairs) to an output, where the query, each key, and each value are vectors. The output is a weighted sum over all the values in V, with the weights computed from the Query and each key in three steps:

1. Step 1: compute the similarity between Q and each K, denoted f: $$f(Q,K_i)\quad i=1,2,\cdots,m$$. Four common choices for this step:
    1. Dot product (used by the Transformer): $$f(Q,K_i) = Q^T K_i$$
    2. General weighting: $$f(Q,K_i) = Q^TWK_i$$
    3. Concatenation: $$f(Q,K_i) = W[Q^T;K_i]$$
    4. Perceptron: $$f(Q,K_i)=V^T \tanh(WQ+UK_i)$$
2. Step 2: apply a softmax to the similarities to normalize them: $$\alpha_i = softmax(\frac{f(Q,K_i)}{\sqrt{d_k}})$$
    1. A brief note on the role of $$\sqrt{d_k}$$: suppose the elements of $$Q$$ and $$K$$ have mean 0 and variance 1; then the elements of $$A^T=Q^TK$$ have mean 0 and variance d. When d becomes large, the variance of the elements of $$A$$ becomes very large as well; with a high-variance distribution (mass concentrated at large absolute values), softmax assigns almost all of the probability to the largest entry, which in turn makes the softmax gradient vanish. In short, the distribution of $$\operatorname{softmax}\left(A\right)$$ would otherwise depend on d. Multiplying every element of $$A$$ by $$\frac{1}{\sqrt{d_k}}$$ brings the variance back to 1 and keeps the magnitude of $$A$$ small.
3. Step 3: with the computed weights $$\alpha_i$$, take the weighted sum over all the values in $$V$$ to obtain the Attention vector: $$Attention = \sum_{i=1}^m \alpha_i V_i$$
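The three steps can be sketched as a small function (a minimal sketch; the query, keys, and values below are made-up toy vectors, and $$d_k$$ is just the key dimension):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())      # subtract max for numerical stability
    return e / e.sum()

def attention(q, K, V):
    d_k = K.shape[1]
    scores = K @ q / np.sqrt(d_k)   # step 1 + scale: f(Q, K_i) / sqrt(d_k)
    alpha = softmax(scores)         # step 2: normalized weights alpha_i
    return alpha @ V, alpha         # step 3: weighted sum over the values

q = np.array([1.0, 0.0])                            # the Query
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # m = 3 keys
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # m = 3 values

out, alpha = attention(q, K, V)
print(alpha)        # keys most similar to q get the largest weights
print(alpha.sum())  # 1.0: the weights form a probability distribution
```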

## 8.3 Self Attention Model

We have said that Attention filters the important information out of a pile of information; now we use the Self Attention model to explain in detail how that important information is found.

The architecture of the Self Attention model is shown in the figure below; we will walk through it in order.

First, note that Self Attention has three inputs Q, K, V. For Self Attention, Q, K, V are linear transformations of the word vectors x of the sentence X: given three learnable parameter matrices $$W_Q,W_K,W_V$$, the word vector x is right-multiplied by each of these matrices to obtain Q, K, V.

Next, for ease of exposition, we first describe the Self Attention computation on individual vectors, and then its matrix form.

1. Step 1: obtaining Q, K, V

The operation above: for the two words Thinking and Machines, the linear transformation multiplies the two vectors $$x_1$$ and $$x_2$$ by the three matrices $$W_Q,W_K,W_V$$ to obtain the six vectors $${q_1,q_2},{k_1,k_2},{v_1,v_2}$$. The matrix Q is the concatenation of the vectors $$q_1,q_2$$; K and V likewise.

2. Step 2: MatMul

The operation above: the dot product of the vectors $$q_1$$ and $$k_1$$ gives a score of 112, and that of $$q_1$$ and $$k_2$$ a score of 96. Note: this is how the information $$q_1$$ finds the important information in $$x_1,x_2$$.

3. Steps 3 and 4: Scale + Softmax

The operation above: normalize the scores by dividing by $$\sqrt {d_k} = 8$$.

4. Step 5: MatMul

Multiply the softmaxed proportions [0.88, 0.12] by the values $$[v_1,v_2]$$ to get the weighted values, and add them up to obtain $$z_1$$.

That is what the Self Attention model does. Feel it carefully: we use $$q_1$$ and $$K=[k_1,k_2]$$ to compute Thinking's weights relative to Thinking and Machines, multiply those weights by Thinking's and Machines' $$V=[v_1,v_2]$$ to get the weighted values, and finally sum them to obtain the output $$z_1$$ for the word Thinking.

Similarly, we can calculate Machine be relative to Thinking and Machine Weighted output of $$z_2$$, Splicing $$z_1$$ and $$z_2$$ You can get Attention value $$Z=[z_1,z_2]$$, This is it. Self Attention Matrix calculation of , As shown below .

The previous example operated on single vectors. This figure shows the matrix version: the input is a [2x4] matrix (the stacked word vectors of the words in the sentence), and multiplying it by each [4x3] weight matrix yields Q, K, V.

Q is dot-multiplied with the transpose of K, divided by $$\sqrt{d_k}$$, and passed through a softmax to get proportions summing to 1, which are then dot-multiplied with V to produce the output Z. So this Z is an output for Thinking that has already taken the surrounding word Machines into account.

Look at the formula: $$QK^T$$ in fact forms a word-to-word attention map (after the softmax, the weights in each row sum to 1). For example, if the input is the sentence "i have a dream", 4 words in total, it forms a 4x4 attention map:

Thus every word has a weight with respect to every other word. This is also the origin of the name Self Attention: the Attention is computed between the Source (source sentence) and the Source itself, or in plain terms, Q, K, V all come from the input X itself.
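The whole flow, projecting X into Q, K, V with three learned matrices and computing $$Z = softmax(QK^T/\sqrt{d_k})V$$, can be sketched as follows (the weights are random made-up values, not trained ones; the shapes mirror the [2x4] input and [4x3] projection matrices mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)   # made-up, untrained weights

X = rng.normal(size=(2, 4))      # two word vectors, d_model = 4

W_Q = rng.normal(size=(4, 3))    # the three learnable projections
W_K = rng.normal(size=(4, 3))
W_V = rng.normal(size=(4, 3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # right-multiply X by each matrix

scores = Q @ K.T / np.sqrt(K.shape[1])    # 2x2 word-to-word attention map
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
Z = weights @ V                           # Z = [z_1, z_2], one row per word

print(weights.sum(axis=1))  # each row of the attention map sums to 1
print(Z.shape)              # (2, 3)
```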

## 8.4 Self Attention vs. RNN and LSTM

What do we gain by introducing Self Attention? Or: what rules does Self Attention learn, what features does it extract? The following two figures help explain:

As the two figures show, Self Attention can capture certain syntactic features among the words of a sentence (the first figure shows a phrase structure spanning some distance) as well as semantic features (the second figure shows that the referent of "its" is "Law").

With the above explained, we can now look at the differences between Self Attention and RNN/LSTM:

• RNN and LSTM: these must compute step by step in sequence. For long-distance interdependent features, information has to accumulate over a number of time steps before the two positions are connected, and the farther apart they are, the less likely the dependency is captured effectively.
• Self Attention:
    • As the two figures above suggest, Self Attention captures long-distance interdependent features within a sentence far more easily, because its computation directly links any two words in the sentence through a single computation step. The distance between remote interdependent features is drastically shortened, which helps these features be used effectively;
    • Besides, Self Attention computes the Attention value of each word in the sentence independently, so it directly helps parallelize the computation, whereas an RNN, which must compute sequentially, cannot be parallelized.

The above is why Self Attention has gradually replaced RNN and LSTM in wide use.

## 8.5 Masked Self Attention Model

Striking while the iron is hot, let us discuss the Masked Self Attention model that the Transformer will use later. "Masked" here means that when building a language model (or a task like translation), we must not give the model information about the future. Its structure is shown in the figure below:

The parts in the figure that repeat ordinary Self Attention will not be repeated here; we focus on the Mask.

Suppose the preceding scale step has produced an attention map; masking then zeroes out the gray region above the diagonal so that the model receives no information about the future, as shown below:

In detail:

1. "i", as the first word, can attend only to "i" itself;
2. "have", as the second word, attends to the first two words "i, have";
3. "a", as the third word, attends to the first three words "i, have, a";
4. "dream", as the last word, attends to all 4 words of the sentence.

After the softmax, each row along the horizontal axis sums to 1, as shown below:
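The masking of the 4x4 map for "i have a dream" can be sketched like this (the scores are made-up constants; the point is what the mask and the softmax do):

```python
import numpy as np

n = 4                                   # "i have a dream": 4 words
scores = np.ones((n, n))                # pretend these are the scaled scores

# Mask: positions above the diagonal (the future) are set to -inf,
# so they receive zero weight after the softmax.
future = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[future] = -np.inf

weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = weights / weights.sum(axis=1, keepdims=True)

print(weights[0])  # [1, 0, 0, 0]: "i" attends only to itself
print(weights[3])  # [0.25, 0.25, 0.25, 0.25]: "dream" attends to all 4
```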

Exactly why we mask will be explained in detail later, when we discuss the Transformer.

## 8.6 Multi-head Self Attention Model

Because the Transformer uses Multi-head Self Attention, an upgraded version of Self Attention, throughout, we briefly describe the architecture of Multi-head Self Attention here and discuss its advantages at the end of the section.

Multi-Head Attention runs the Self Attention process H times and then merges the outputs Z. In the paper, its structure is as follows:

Let us walk through the figure above. First, we use 8 different groups $$W_Q^i,W_K^i,W_V^i\quad i=1,2,\cdots,8$$ and repeat an operation similar to Self Attention 8 times, obtaining 8 matrices $$Z_i$$:

To keep the output shaped like the input, the matrices $$Z_i$$ are concatenated and then multiplied by a linear map $$W_O$$ to get the final Z:

That is the architecture of Multi-head Self Attention. Its difference from plain Self Attention is simply that it uses $$n$$ groups $$W_Q^i,W_K^i,W_V^i\quad i=1,2,\cdots,n$$ to obtain $$n$$ groups $$Q_i,K_i,V_i \quad i=1,2,\cdots,n$$.

The figure below shows the whole process of multi-head attention:

What is the benefit of all this? Using several groups of parameters effectively maps the original information Source into several subspaces, i.e., captures several kinds of information. The short answer for why multi-head attention is used: multiple heads let the attention notice information in different subspaces and capture richer feature information. In fact, the paper's authors simply found that this works well.
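The process can be sketched as follows (a minimal sketch with random, untrained projections; the dimensions follow the convention $$d_k = d_{model}/H$$):

```python
import numpy as np

rng = np.random.default_rng(0)   # made-up, untrained projections

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, H=8):
    n, d_model = X.shape
    d_k = d_model // H               # each head works in a smaller subspace
    heads = []
    for _ in range(H):               # H independent groups of W_Q, W_K, W_V
        W_Q = rng.normal(size=(d_model, d_k))
        W_K = rng.normal(size=(d_model, d_k))
        W_V = rng.normal(size=(d_model, d_k))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        A = softmax_rows(Q @ K.T / np.sqrt(d_k))
        heads.append(A @ V)          # one Z_i per head
    Z_cat = np.concatenate(heads, axis=1)      # splice the Z_i: [n, d_model]
    W_O = rng.normal(size=(d_model, d_model))  # the final linear map
    return Z_cat @ W_O                         # Z, same shape as X

X = rng.normal(size=(2, 16))         # two words, d_model = 16
Z = multi_head_self_attention(X, H=8)
print(Z.shape)  # (2, 16): output width matches input width
```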

# 9. Position Embedding

When contrasting Attention with RNN and LSTM, we said that Attention solves long-distance dependency and supports parallelization. But is it really all gain and no loss?

It is not. Looking back: Self Attention's Q, K, V matrices are all linear transformations of the same input $$X_1=(x_1,x_2,\cdots,x_n)$$. For a reordered sequence $$X_2=(x_2,x_1,\cdots,x_n)$$, since the Attention values are ultimately weighted sums, the two inputs yield the same final Attention values, which shows that Attention has lost the order information of the sequence $$X_1$$.

As shown in the figure above, to fix Attention's loss of order information, the Transformer's authors proposed Position Embedding: before computing Attention on the input $$X$$, position information is added to $$X$$'s word vectors, i.e., $$X$$'s word vectors become $$X_{final\_embedding} = Embedding + Positional\, Embedding$$

But how do we obtain the position vectors of $$X$$?

The position-encoding formulas are shown in the figure below:

where pos denotes the position, i the dimension, $$d_{model}$$ the dimensionality of the position vector, and $$2i, 2i+1$$ the even and odd dimensions (the parity of the dimension). As the figure above shows, even dimensions use the $$\sin$$ function and odd dimensions use the $$\cos$$ function.

With the position encoding in hand, let us see how it is combined with the word encoding (512 is the encoding dimension): the word's word vector and position vector are simply superimposed (added). This approach is called position embedding, as shown below:

Position Embedding is in itself absolute position information, but in language models relative position also matters. So why does this position-embedding mechanism work?

Setting the trigonometric identities aside, look at the first line of formula (3) in the figure below. We explain it as follows: take a 5-word sentence "I love to eat apples", with the words numbered 1, 2, 3, 4, 5.

Suppose $$pos=1$$ is "I", $$k=2$$ is "love", and $$pos+k=3$$ is "eat". Then each dimension of the position vector at $$pos+k=3$$ can be expressed as a linear combination of the dimensions of the position vector at $$pos=1$$; from this linear expression we can see that the position encoding of "eat" contains position information relative to the words before it, such as "I".

In short, a word's position information is a linear combination of the position information of other words, and this linear combination means that the position vector contains relative position information.
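The sinusoidal formulas above can be sketched directly (512 is the encoding dimension used in the paper):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]        # positions 0..max_len-1
    i = np.arange(0, d_model, 2)[None, :]    # the even dimension indices 2i
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even dimensions use sin
    pe[:, 1::2] = np.cos(angle)              # odd dimensions use cos
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)            # (50, 512): one position vector per position
print(pe[0, 0], pe[0, 1])  # position 0: sin(0) = 0.0, cos(0) = 1.0

# The final model input is then word embedding + position embedding.
```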

# 10. Transformer

## 10.1 Transformer Structure

Everything is ready; all we lack is the east wind. Let us now come to one of our key topics, the Transformer. You can keep this sentence in mind first: the Transformer, simply put, is a stack of self-attention models. First, the overall framework of the Transformer.

The overall framework of the Transformer is shown in the figure below:

At first glance the framework above looks very complex. Since the Transformer was originally proposed as a translation model, let us take translation as the example and simplify the overall framework:

As the figure shows, the Transformer is equivalent to a black box: feed in "Je suis etudiant" on the left and you get the translation "I am a student" on the right.

Going a step deeper: the Transformer is also a Seq2Seq model (a model in the Encoder-Decoder framework), with the Encoders on the left reading the input and the Decoders on the right producing the output, as shown below:

Here, let us briefly describe how the Encoder-Decoder framework translates text:

1. The sequence $$(x_1,x_2,\cdots,x_n)$$ serves as the Encoders' input, producing the output sequence $$(z_1,z_2,\cdots,z_n)$$
2. The Encoders' output sequence $$(z_1,z_2,\cdots,z_n)$$ serves as the Decoders' input, generating the output sequence $$(y_1,y_2,\cdots,y_m)$$. Note: the Decoders output one result at a time

On first seeing the Encoders-Decoders diagram above, the natural question is how the output of the Encoders on the left combines with the Decoders on the right. Since the Decoders have N layers, another picture makes it intuitive:

In other words, the Encoders' output is combined with every Decoder layer.

Now let us take one of the layers and show it in detail:

From this analysis, to understand the Transformer we only need to understand its Encoder and Decoder units, which we elaborate on next.

## 10.2 Encoder

With all the knowledge above, we know the Encoders consist of N=6 layers, and as the figure shows each Encoder layer contains two sub-layers:

• the first sub-layer is multi-head self-attention, computing self-attention over the input;
• the second sub-layer is a simple feed-forward neural network layer, Feed Forward;

Note: around every sub-layer we also emulate a residual network (the data-flow diagram below details this); every sub-layer's output is $$LayerNorm(x+Sub\_layer(x))$$, where $$Sub\_layer(x)$$ denotes the output of the sub-layer itself applied to its input x

Now we give the Encoder's data-flow diagram and analyze it step by step:

1. The dark green $$x_1$$ is the output of the Embedding layer; adding the Positional Embedding vector to it yields the feature vector finally fed into the Encoder, the light green vector $$x_1$$;
2. The light green vector $$x_1$$ is the feature vector of the word "Thinking"; passing it through the Self-Attention layer turns it into the light pink vector $$z_1$$;
3. $$x_1$$, as the skip connection of the residual structure, is added directly to $$z_1$$; after the Layer Norm operation we obtain the pink vector $$z_1$$;
    1. Role of the residual structure: avoiding vanishing gradients
    2. Role of Layer Norm: keeping the distribution of the data features stable and accelerating the model's convergence
4. $$z_1$$ passes through the feed-forward neural network (Feed Forward) layer, is added to itself through the residual structure, and after the LN layer we obtain the output vector $$r_1$$;
    1. The feed-forward network consists of two linear transformations with a ReLU activation in between: $$FFN(x) = max(0,xW_1+b_1)W_2+b_2$$
5. Since the Transformer's Encoders contain 6 Encoder layers, $$r_1$$ in turn serves as the input of the next Encoder layer, taking over the role of $$x_1$$, and so on up to the last Encoder layer.

Note that the $$x、z、r$$ above all have the same dimension, 512 in the paper.
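The data flow of one Encoder layer can be sketched as follows (a minimal sketch: the self-attention output is replaced by a stand-in vector, the layer norm omits the learnable gain and bias, and all weights are random made-up values):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_ff = 2, 8, 32    # the paper uses d_model = 512, d_ff = 2048

def layer_norm(x, eps=1e-6):   # simplified: no learnable gain/bias
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2: two linear maps with a ReLU
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(n, d_model))   # input vectors x_1, x_2
z = rng.normal(size=(n, d_model))   # stand-in for the self-attention output

h = layer_norm(x + z)               # first residual connection + Layer Norm

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
r = layer_norm(h + ffn(h, W1, b1, W2, b2))   # second residual + Layer Norm

print(r.shape)  # (2, 8): x, z, r all share the same dimension
```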

## 10.3 Decoder

The Decoders also have N=6 layers, and as the figure shows each Decoder layer contains 3 sub-layers:

• the first sub-layer is Masked multi-head self-attention, again computing self-attention over the input;
    • we defer the explanation of why the mask is needed to the "Transformer Dynamic Process Demonstration" section;
• the second sub-layer is the Encoder-Decoder Attention computation, which computes attention between the Encoders' output and the output of the Decoder's Masked multi-head self-attention;
    • likewise, why attention is computed between the Encoders' output and the Decoder is explained in the "Transformer Dynamic Process Demonstration" section;
• the third sub-layer is the feed-forward neural network layer, identical to the Encoder's.

## 10.4 The Transformer's Output

That covers the Transformer's encoding and decoding modules. So let us return to the original question: when translating "机器学习" into "machine learning", the decoder's output is a vector of floats; how is it turned into the two words "machine learning"? Let us look at how the Encoders and Decoders interact to find the answer:

As the figure shows, the Transformer's final step passes the decoder's output through a linear layer (Linear) followed by a softmax.

• The linear layer is a simple fully connected neural network that projects the vector A produced by the decoder into a much higher-dimensional vector B. Suppose the model's vocabulary has 10000 words; then vector B has 10000 dimensions, each dimension holding the score of one unique word.
• The softmax layer then converts these scores into probabilities; the dimension with the highest probability is selected, and the word it corresponds to is generated as the output of this time step, which is the final output!

Supposing the vocabulary dimension is 6, the process of outputting the most probable word is as follows:
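With a made-up 6-word vocabulary and made-up weights, the Linear + softmax step looks like this:

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]   # made-up vocabulary

decoder_out = np.array([0.1, -0.4, 0.3, 0.8])   # vector A from the decoder

# Linear layer: project A onto a vector B with one score per vocab word
W = np.array([[ 0.2, 0.1, 0.0, -0.1, 0.3, 0.0],
              [ 0.0, 0.4, 0.1,  0.2, 0.0, 0.1],
              [ 0.1, 0.0, 0.2,  0.0, 0.1, 0.0],
              [-0.2, 0.1, 0.0,  0.3, 0.9, 0.1]])
logits = decoder_out @ W

probs = np.exp(logits) / np.exp(logits).sum()   # softmax: scores -> probabilities
predicted = vocab[int(np.argmax(probs))]        # pick the most probable word

print(predicted)  # "student" gets the highest score with these numbers
```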

# 11. Transformer Dynamic Process Demonstration

First, let us look at how the Transformer generates its translation result when translating:

Continuing:

Suppose the figure above shows one stage of training the model. Combining it with the Transformer's complete framework, we can describe this dynamic flow:

1. Feed "je suis etudiant" into the Encoders to obtain the matrices $$K_e$$ and $$V_e$$;
2. Feed "I am a student" into the Decoders; the Masked Multi-head Attention layer first produces the attention value $$Q_d$$ of "I am a student", then attention is computed between $$Q_d$$ and the Encoders' output matrices $$K_e$$, $$V_e$$, yielding the 1st output "I";
3. Feed "I am a student" into the Decoders again; the Masked Multi-head Attention layer produces $$Q_d$$ as before, attention is computed between $$Q_d$$ and $$K_e$$, $$V_e$$, and the 2nd output "am" is obtained;
4. ……

Now let us settle the two questions we set aside.

## 11.1 Why the Decoder Needs the Mask

• Training phase: we know that "je suis etudiant" translates to "I am a student", and we feed the Embedding of "I am a student" into the Decoders. When translating the first word "I":

    • if the attention computation over "I am a student" is not masked, "am, a, student" will contribute to the translation of "I";
    • if the attention computation over "I am a student" is masked, "am, a, student" contribute nothing to the translation of "I".
• Testing phase: we do not know in advance that the translation result is "I love China"; we can only feed a randomly initialized Embedding into the Decoders. When translating the first word "I":

    • masked or not, "love, China" cannot contribute to the translation of "I", since they have not been generated yet;
    • but once the first word "I" has been translated, the randomly initialized Embedding carries the Embedding of "I". That is, when translating the second word "love", the Embedding of "I" contributes, while "China" contributes nothing to the translation of "love". As translation proceeds, each translated result contributes to the next word to be translated, which matches exactly what masking does in the training phase.

In summary: the Decoder masks so that the behavior of the training phase and the testing phase is consistent, leaving no train/test gap and avoiding overfitting.

## 11.2 Why the Encoder Gives the Decoders the K and V Matrices

When explaining the Attention mechanism we said that the purpose of the Query is to find the important information in a pile of information.

Now the Encoder provides the matrices $$K_e、V_e$$ and the Decoder provides the matrix $$Q_d$$. Let us explain in detail with the example of translating the four-word source sentence into "I love China".

When we translate "I", the Decoder provides the matrix $$Q_d$$; by computing with $$K_e、V_e$$, the model can find, among the four source words, which are most useful for translating "I", and translate the word "I" on that basis. This embodies the purpose of the attention mechanism well: focus on the information that matters most to you.

• In fact, the above is the soft attention mechanism within Attention, which fixed a problem of the earlier Encoder-Decoder framework; we will not elaborate here, and interested readers can consult materials online.
• In the early Encoder-Decoder framework, the Encoder used an LSTM to extract a single feature vector C from the source sentence (Source); then, when translating, every word of the target sentence (Target) "I love China" was generated from the same feature vector C. This is quite unreasonable: when translating "I" the model should focus on the source word for "I", and when translating "China" it should focus on the source word for "China". The early approach could not reflect this, but the Transformer solves the problem through Attention.

# 12. GPT Model

## 12.1 GPT Model Pre-training

When explaining ELMo, we said that its kind of pre-training method is called "Feature-based Pre-Training". If we compare ELMo's pre-training method with pre-training in the image domain, the two look very different.

Except for ELMo For the representative of this feature fusion based pre training method ,NLP There is also a typical way of doing it , This approach is consistent with the image field , This method is generally called “ be based on Fine-tuning The pattern of ”, and GPT It is the first mock exam of this model , Let's take a look at GPT Network structure .

GPT is short for “Generative Pre-Training”; as the name says, it is generative pre-training.

GPT also uses a two-stage process:

1. Stage one: pre-train with a language-model objective;
2. Stage two: solve the downstream task by fine-tuning the model.

The figure above shows GPT's pre-training process. It is actually similar to ELMo's, with two main differences:

1. First, the feature extractor is not an RNN but a Transformer, whose feature-extraction ability is stronger than an RNN's; this choice was clearly wise.
2. Second, although GPT's pre-training objective is still a language model, it is a *unidirectional* one. “Unidirectional” means the training objective is to predict the word $$w_i$$ from its context, where the word sequence before $$w_i$$ (Context-before) is the preceding context and the sequence after it (Context-after) is the following context. ELMo uses both the preceding and the following context when predicting $$w_i$$, whereas GPT uses only Context-before and discards what follows.
3. Seen today, this was not a good choice, for a simple reason: by not incorporating the following words, GPT limits its effectiveness in many application scenarios, such as reading comprehension, where the model is allowed to look at both the preceding and following context when deciding. Not embedding the following context into the Word Embedding throws information away for nothing.
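The difference between the two objectives can be stated very concretely as a visibility rule: which positions may inform the prediction at position i. A tiny sketch (the function name and the example sentence length are illustrative):

```python
def visible_positions(n, i, bidirectional):
    """Token positions allowed to inform the prediction at position i
    in a sentence of n tokens.

    A one-way (GPT-style) language model sees only positions before i;
    a bidirectional (ELMo/BERT-style) objective sees every position except i."""
    if bidirectional:
        return [j for j in range(n) if j != i]
    return list(range(i))

# Predicting token 2 of a 5-token sentence:
print(visible_positions(5, 2, bidirectional=False))  # [0, 1]
print(visible_positions(5, 2, bidirectional=True))   # [0, 1, 3, 4]
```

The one-way model must commit to a prediction having seen only the prefix, which is exactly the limitation described above for tasks like reading comprehension.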

## 12.2 Fine-tuning the GPT Model

That covers GPT's first-stage pre-training. Assuming the network has been pre-trained, how is it then used for downstream tasks? Here GPT has its own style, quite different from ELMo's.

The figure above shows how GPT is used in the second stage:

1. First, for each downstream task, you can no longer design an arbitrary network structure as before; you must make the task's network conform to GPT's, i.e. transform the task network into the same structure as GPT.

2. Then, when tackling the downstream task, initialize this GPT-shaped network with the parameters pre-trained in stage one, which injects the linguistic knowledge learned during pre-training into the task at hand.

3. Finally, train the network on the task's own data, fine-tuning the parameters so the network better fits the problem. That's all there is to it.

Does this remind you of the image-field pre-training procedure mentioned at the beginning? Yes; it is exactly the same two-stage pattern.

For the many different shapes of NLP tasks, how do we reshape them to fit GPT's network structure? Because GPT's transformations of downstream tasks are very similar to BERT's, and our main goal is to explain BERT, this question will be answered in the BERT section.

# 13. The BERT Model

## 13.1 BERT: A Recognized Milestone

The BERT model can be regarded as a recognized milestone. Its biggest strength, however, is not innovation but integration, and this integration produced breakthroughs across the board. Let's see how BERT pulled it off.

• What BERT demonstrated: a deep model trained on large amounts of unlabeled data can significantly improve accuracy across a wide range of natural language processing tasks.
• It integrates the strengths of recent pre-trained language models: it draws on ELMo's bidirectional-encoding idea, borrows GPT's use of the Transformer as feature extractor, and adopts the CBOW method used by word2vec.
• The difference between BERT and GPT:
  • GPT uses the Transformer Decoder as its feature extractor and has strong text-generation ability, but the semantics of the current word can only be determined by its preceding context, so semantic understanding suffers.
  • BERT uses the Transformer Encoder as its feature extractor, together with a masking-based training method. Although bidirectional encoding means BERT no longer has text-generation ability, its semantic-extraction ability is stronger.
• The difference between unidirectional and bidirectional encoding, using the sentence “The weather today is so {} that we have to cancel outdoor sports” as an example. What word should fill the blank?
  • Unidirectional encoding considers only “The weather today is so”; drawing on human experience, the blank could plausibly be “nice”, “pretty good”, “bad”, or “terrible”, words that fall into two clearly opposite groups.
  • Bidirectional encoding also takes the following context into account: besides the preceding words, the model also sees “we have to cancel outdoor sports” to help it judge, so with high probability it will choose from the “bad”/“terrible” group.

## 13.2 BERT's Structure: Strong Feature Extraction Capability

• As shown in the figure below, let's look at the differences among ELMo, GPT, and BERT.

• ELMo uses a left-to-right LSTM and a right-to-left LSTM, trained independently with the objectives $$P(w_i|w_1,\cdots,w_{i-1})$$ and $$P(w_i|w_{i+1},\cdots,w_n)$$ respectively, and concatenates the trained feature vectors to realize bidirectional encoding. In essence this is still unidirectional encoding, just two unidirectional encoders in opposite directions stitched together.
• GPT uses the Transformer Decoder as its Transformer Block and trains with the objective $$P(w_i|w_1,\cdots,w_{i-1})$$. Replacing the LSTM with Transformer Blocks as the feature extractor, it realizes unidirectional encoding and is a standard pre-trained language model, i.e. one that solves downstream tasks by fine-tuning.
• BERT is also a standard pre-trained language model; it trains with the objective $$P(w_i|w_1,\cdots,w_{i-1},w_{i+1},\cdots,w_n)$$, and the encoder it uses is a genuinely bidirectional encoder.
• BERT differs from ELMo in using Transformer Blocks as the feature extractor, which strengthens semantic feature extraction;
• BERT differs from GPT in using the Transformer Encoder as its Transformer Block, turning GPT's unidirectional encoding into bidirectional encoding. In other words, BERT gives up text-generation ability in exchange for stronger semantic understanding.

The structure of the BERT model is shown in the figure below:

As the figure shows, BERT's model structure is simply a stack of Transformer Encoder modules. For model size, the paper presents two configurations:

$$BERT_{BASE}$$: L = 12, H = 768, A = 12, about 110 million parameters in total

$$BERT_{LARGE}$$: L = 24, H = 1024, A = 16, about 340 million parameters in total

where L is the number of Transformer Block layers, H is the dimensionality of the feature vectors (by convention the hidden layer inside the Feed Forward sublayer has dimension 4H), and A is the number of Self-Attention heads. These three parameters essentially define BERT's scale.

BERT's parameter count can be worked out as follows:

\begin{align*} & \text{word embedding parameters} + 12 \times (\text{multi-head attention parameters} + \text{fully connected layer parameters} + \text{LayerNorm parameters}) \\ &= (30522 + 512 + 2) \times 768 + 768 \times 2 \\ &\quad + 12 \times (768 \times 768 / 12 \times 3 \times 12 + 768 \times 768 + 768 \times 3072 \times 2 + 768 \times 2 \times 2) \\ &= 108808704 \\ &\approx 110\text{M} \end{align*}

The training process also consumed enormous compute and time; in short, without that level of compute, even a good idea can go nowhere.
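The arithmetic above can be checked in a few lines of Python. The constants (30522 WordPiece vocabulary entries, 512 positions, 2 segment types) follow the formula in the text, and, like that formula, this count omits the bias terms of the linear layers:

```python
V, P, S = 30522, 512, 2   # WordPiece vocab size, max positions, segment types
H, L = 768, 12            # hidden size, number of Transformer blocks

embeddings = (V + P + S) * H          # token + position + segment tables
embed_norm = 2 * H                    # embedding LayerNorm gain and bias
per_block = (
    3 * H * H        # Q, K, V projections (12 heads of size 64 each)
    + H * H          # attention output projection
    + 2 * H * 4 * H  # feed-forward: H -> 4H -> H
    + 2 * 2 * H      # two LayerNorms, each with gain and bias
)
total = embeddings + embed_norm + L * per_block
print(total)  # 108808704, i.e. roughly 110M
```

Each term maps one-to-one onto the formula above, which makes it easy to see where the roughly 85M Transformer-block parameters and 24M embedding parameters come from.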

## 13.3 BERT's Unsupervised Training

Like GPT, BERT uses a two-stage training recipe:

1. Stage one: train a base language model on large-scale unlabeled corpora, which are easy to obtain;
2. Stage two: fine-tune on the small amount of labeled training data for the target task.

Unlike GPT and other standard language models, which train with the objective $$P(w_i|w_1,\cdots,w_{i-1})$$, BERT can see global information and trains with the objective $$P(w_i|w_1,\cdots,w_{i-1},w_{i+1},\cdots,w_n)$$.

In addition, BERT uses the masked language model (MLM) method to train word-level semantic understanding, and next sentence prediction (NSP) to train inter-sentence understanding, so as to better support downstream tasks.

## 13.4 BERT's Masked Language Model (MLM)

BERT's authors argue that a bidirectional encoder stitched together from a left-to-right and a right-to-left unidirectional encoder is, in performance, parameter scale, and efficiency, weaker than a genuinely deep bidirectional encoder. That is why BERT uses the Transformer Encoder as its feature extractor, rather than two Transformer Decoders encoding left-to-right and right-to-left.

Since the standard language-model training objective is then unavailable, BERT borrows from cloze tasks and from CBOW, and trains the model with the masked language model (MLM) method.

MLM randomly removes some tokens (words) from a sentence and asks the model to predict what was removed. This is no longer a traditional neural language model (which resembles a generative model); it is simply a classification problem: predict what the token at this position should be from this position's hidden state, rather than predicting the probability distribution of the word at the next position.

The randomly removed tokens are called mask words. During training, each token is replaced by [MASK] with 15% probability; that is, 15% of the tokens in the corpus are randomly masked, an operation called masking. Note: in CBOW, by contrast, every word gets predicted.

But this MLM design introduces a drawback: during fine-tuning or inference (testing), the input text contains no [MASK], so the mismatch between training and prediction data causes a performance loss.

To mitigate this, BERT does not always replace a mask word with [MASK]; instead it chooses replacements by fixed proportions. After selecting 15% of the tokens as mask words, each mask word receives one of three treatments:

• In 80% of training samples: replace the selected word with [MASK], for example:
“The earth is [MASK] one of the eight planets”

• In 10% of training samples: leave the selected word unchanged, which mitigates the performance loss from the mismatch between training text and prediction text, for example:
“The earth is one of the eight planets in the solar system”

• In 10% of training samples: replace the selected word with a random word, which forces BERT to learn to correct errors from context, for example:
“The earth is one of the eight planets of apple”
The paper notes the benefit of this scheme: the encoder does not know which words it will be asked to predict or which words have been corrupted, so it is forced to learn a representation vector for every token. The authors also report that a bidirectional encoder trains more slowly than a unidirectional one, making BERT's training much less efficient; but experiments show that MLM gives BERT semantic understanding beyond every pre-trained language model of the same period, which is worth the sacrifice in training efficiency.
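The 15% selection with the 80/10/10 split can be sketched as follows. The function name, the toy vocabulary, and the seed handling are my own illustrative assumptions; this is a simplified sketch, not BERT's actual data pipeline:

```python
import random

def mlm_mask(tokens, vocab, mask_rate=0.15, seed=1):
    """BERT-style corruption sketch: each token is selected with probability
    mask_rate; a selected token becomes [MASK] 80% of the time, stays
    unchanged 10% of the time, and becomes a random word 10% of the time.
    Returns the corrupted sequence and {position: original token} targets."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue  # token not selected; the model is never asked about it
        targets[i] = tok  # the model must recover the original token here
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"
        elif r < 0.9:
            pass  # keep the original token (train/test mismatch mitigation)
        else:
            corrupted[i] = rng.choice(vocab)  # random replacement
    return corrupted, targets

tokens = "the earth is one of the eight planets".split()
corrupted, targets = mlm_mask(tokens, vocab=["sun", "moon", "apple"])
print(corrupted, targets)
```

Note that the loss is computed only at the selected positions (`targets`); the unselected 85% of tokens are reconstructed for free from the input.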

## 13.5 BERT's Next Sentence Prediction (NSP)

Many downstream NLP tasks, such as question answering and natural language inference, reason over pairs of sentences, and a language model cannot directly capture the semantic links between sentences: training at word-prediction granularity does not reach the level of sentence relationships. To learn to capture inter-sentence semantics, BERT uses next sentence prediction (NSP) as part of its unsupervised pre-training.

Concretely, each BERT input consists of two sentences: with 50% probability they are two consecutive, semantically coherent sentences (consecutive pairs are drawn from document-level corpora, to ensure the two sentences are strongly related), and with the other 50% probability the two sentences are chosen completely at random.

Consecutive sentence pair: [CLS] The weather is terrible today [SEP] The afternoon physical education class was cancelled [SEP]

Random sentence pair: [CLS] The weather is terrible today [SEP] The fish is almost burnt [SEP]

Here [SEP] is the separator token, and [CLS] is the token used for the class prediction: an output of 1 means the input is a consecutive sentence pair, and 0 means it is a random sentence pair.

By training the encoded output at the [CLS] position, BERT learns to capture the joint semantics of the two input sentences; on the consecutive-pair prediction task, BERT reaches 97%-98% accuracy.
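Constructing NSP training pairs can be sketched like this. This is a simplified illustration with made-up helper names; the real pipeline also handles sequence-length budgeting and document boundaries:

```python
import random

def make_nsp_example(doc_sentences, corpus_sentences, rng):
    """Build one NSP training pair: with probability 0.5 take two consecutive
    sentences from the same document (label 1, IsNext), otherwise pair the
    first sentence with a random corpus sentence (label 0, NotNext)."""
    i = rng.randrange(len(doc_sentences) - 1)
    first = doc_sentences[i]
    if rng.random() < 0.5:
        second, label = doc_sentences[i + 1], 1  # consecutive, coherent pair
    else:
        second, label = rng.choice(corpus_sentences), 0  # random pair
    tokens = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]
    return tokens, label

doc = [["the", "weather", "is", "terrible"], ["gym", "class", "was", "cancelled"]]
corpus = [["the", "fish", "is", "almost", "burnt"]]
tokens, label = make_nsp_example(doc, corpus, random.Random(0))
print(tokens, label)
```

The [CLS]/[SEP] layout here is exactly the input format shown in the two example pairs above.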

## 13.6 BERT's Input Representation

BERT uses both of the training methods above during pre-training; in real training the two are usually applied together.

Because BERT is a stack of Transformers, it needs two sets of input embeddings:

1. One is the one-hot vocabulary-lookup encoding (corresponding to Token Embeddings);
2. The other is the position encoding (corresponding to Position Embeddings); unlike the Transformer, whose position encoding is expressed with trigonometric functions, BERT learns its position encodings during pre-training (the training idea is similar to the Q matrix in Word Embedding).
3. In addition, because MLM training mixes single-sentence and sentence-pair inputs, BERT needs a third set of codes that distinguishes the input sentences: the segment encoding (corresponding to Segment Embeddings), which is also learned during pre-training.

For the segment encoding, the Segment Embeddings layer has only two vector representations: the first assigns 0 to every token of the first sentence, the second assigns 1 to every token of the second sentence. If the input is only one sentence, its segment embedding is all 0. A simple example:

[CLS] I like dogs [SEP] I like cats [SEP] → segment codes 0 0 0 0 0 1 1 1 1

[CLS] I like dogs and cats [SEP] → segment codes 0 0 0 0 0 0 0
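The segment-encoding rule is easy to state in code; this sketch (the helper name is my own) reproduces the two examples above:

```python
def segment_ids(tokens):
    """0 for [CLS], sentence A, and its [SEP]; 1 for sentence B and its [SEP]."""
    ids, seg = [], 0
    for tok in tokens:
        ids.append(seg)
        if tok == "[SEP]":
            seg = 1  # everything after the first [SEP] belongs to sentence B
    return ids

pair = ["[CLS]", "I", "like", "dogs", "[SEP]", "I", "like", "cats", "[SEP]"]
single = ["[CLS]", "I", "like", "dogs", "and", "cats", "[SEP]"]
print(segment_ids(pair))    # [0, 0, 0, 0, 0, 1, 1, 1, 1]
print(segment_ids(single))  # [0, 0, 0, 0, 0, 0, 0]
```

Each id then indexes one of the two learned Segment Embedding vectors, which is added to the token and position embeddings.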

# 14. Using BERT for Downstream Tasks

BERT keeps the inputs and outputs of downstream tasks in natural-language form. The tasks supported by fine-tuning fall into four categories: sentence-pair classification, single-sentence classification, text question answering, and single-sentence tagging. Below we briefly introduce how fine-tuning adapts BERT to each of these four task types.

## 14.1 Sentence-Pair Classification

Given two sentences, judging their relationship is called sentence-pair classification: for example, judging whether the two sentences are similar, or whether the second answers the first.

For the sentence-pair classification task, BERT already acquired, via NSP during pre-training, the ability to directly capture the semantic relationship between a sentence pair.

As shown in the figure below, the sentence pair is joined by the [SEP] separator into one text sequence, with the [CLS] tag added at the front. The output at the [CLS] position serves as the classification label; the cross entropy between the predicted and true labels is the optimization objective for fine-tuning on task data.

For binary classification, BERT needs no change to its input or output structure; the same input/output structure as NSP training is used directly.

For multi-class tasks, a fully connected layer and a Softmax layer are attached to the output feature vector at the [CLS] position so that the output dimension matches the number of classes; finally an arg max operation (the index of the maximum value) yields the class.

Here is an example of a sentence-pair similarity task:

Task: judge whether the sentence “I like you very much” and the sentence “I really like you” are similar

Rewritten input: “[CLS] I like you very much [SEP] I really like you [SEP]”

Output at the “[CLS]” position: [0.02, 0.98]

arg max gives similarity class 1 (class indices start at 0), i.e. the two sentences are similar
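The classification head just described — a fully connected layer over the [CLS] vector, softmax, then arg max — can be sketched in pure Python. The vectors and weights are made-up toy numbers:

```python
import math

def classify_cls(cls_vector, weight, bias):
    """Linear layer + softmax over the [CLS] output vector,
    then arg max as the predicted class."""
    logits = [sum(w_i * x for w_i, x in zip(row, cls_vector)) + b
              for row, b in zip(weight, bias)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    probs = [e / sum(exps) for e in exps]
    return probs, max(range(len(probs)), key=probs.__getitem__)

# Toy 2-class head over a 3-dimensional [CLS] vector.
probs, label = classify_cls([0.2, -0.1, 0.5],
                            weight=[[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]],
                            bias=[0.0, 0.0])
print(probs, label)  # class 1 wins
```

The same head, with the output dimension changed, covers both the binary and the multi-class cases described above.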

## 14.2 Single-Sentence Classification

Given one sentence, judging its class is called single-sentence classification: for example, judging its sentiment category, or judging whether it is a semantically coherent sentence.

For single-sentence binary classification, no change to BERT's input or output structure is needed.

As shown in the figure below, single-sentence classification adds the [CLS] tag at the front of the sentence and takes the output at that position as the classification label; the cross entropy between the predicted and true labels is the optimization objective for fine-tuning on task data.

Likewise, for multi-class tasks, a fully connected layer and a Softmax layer are attached to the [CLS] output feature vector so the output dimension matches the number of classes, and arg max yields the class.

Here is an example of a semantic-coherence judgment task:

Task: judge whether “Haida stars eat rice and tea” is a well-formed sentence

Rewritten input: “[CLS] Haida stars eat rice and tea”

Output at the “[CLS]” position: [0.99, 0.01]

arg max gives class 0, i.e. this is not a semantically coherent sentence

## 14.3 Text Question Answering

Given a question and a passage containing the answer, finding where the answer lies in the passage is called text question answering: for example, given a question (sentence A), mark the start and end positions of the answer within a given paragraph (sentence B).

Text question answering differs considerably from the earlier tasks: both the optimization objective and the input/output formats need special handling.

To mark the start and end of the answer, BERT introduces two auxiliary vectors: s (start, determining the answer's start position) and e (end, determining the answer's end position).

As shown in the figure below, to locate the answer within sentence B, the final feature vector $$T_i'$$ at each position of sentence B is passed through a fully connected layer (which converts the token's abstract semantic features into task-oriented features), then its inner products with the vectors s and e are computed; applying a softmax over all the inner products separately gives, for each token Tok m ($$m\in [1,M]$$), its probability of being the answer's start position and end position. Finally, the span with the highest probability is taken as the answer.

Fine-tuning for text question answering uses two techniques:

1. A fully connected layer converts the deep feature vectors extracted by BERT into feature vectors for judging the answer position;
2. The auxiliary vectors s and e serve as reference vectors for the answer's start and end positions, making the direction and metric of the optimization objective explicit.

Here is an example of a text question answering task:

Task: given the question “What is the highest temperature today”, mark the start and end positions of the answer in the text “The weather forecast shows the highest temperature today is 37 centigrade”

Rewritten input: “[CLS] What is the highest temperature today [SEP] The weather forecast shows the highest temperature today is 37 centigrade [SEP]”

BERT's softmax results:

| | weather | forecast | shows | today | highest temperature | 37 | centigrade |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Start-position probability | 0.01 | 0.01 | 0.01 | 0.04 | 0.10 | 0.80 | 0.03 |
| End-position probability | 0.01 | 0.01 | 0.01 | 0.03 | 0.04 | 0.10 | 0.80 |

Taking arg max over the softmax results, the answer starts at position 6 and ends at position 7, so the answer is “37 centigrade”
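Span selection from the start/end probabilities can be sketched as follows, using the probabilities from the worked example (here indexed from 0; the `max_len` constraint is a common practical addition, not something stated in the text):

```python
def best_span(start_probs, end_probs, max_len=10):
    """Pick the (start, end) pair with the highest joint probability,
    subject to start <= end (probabilities assumed already softmaxed)."""
    best, best_score = None, -1.0
    for s in range(len(start_probs)):
        for e in range(s, min(s + max_len, len(end_probs))):
            score = start_probs[s] * end_probs[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Start/end probabilities from the worked example (tokens indexed from 0).
start = [0.01, 0.01, 0.01, 0.04, 0.10, 0.80, 0.03]
end   = [0.01, 0.01, 0.01, 0.03, 0.04, 0.10, 0.80]
print(best_span(start, end))  # (5, 6) -> the span "37 centigrade"
```

Searching jointly over pairs rather than taking the two arg maxes independently guarantees the start never lands after the end.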

## 14.4 Single-Sentence Tagging

Given a sentence, labeling each position in it is called single-sentence tagging: for example, marking the person names, place names, and organization names in a sentence.

Single-sentence tagging differs considerably from BERT's pre-training tasks, but it resembles the text question answering task.

As shown in the figure below, in single-sentence tagging a fully connected layer is added after the final semantic feature vector of each token, converting the semantic features into the features the sequence-tagging task needs. Since every token must be labeled, no auxiliary vectors are needed: the output of the fully connected layer goes straight through a Softmax, giving the probability distribution over the label set.

Because BERT tokenizes its input text, an independent word may be split into several sub-words, so BERT predicts 5 label types (subdivided into 13 labels in total):

• O (not part of a person, place, or organization name; O stands for Other)
• B-PER/LOC/ORG (first token of a person/place/organization name; B stands for Begin)
• I-PER/LOC/ORG (middle token of a person/place/organization name; I stands for Intermediate)
• E-PER/LOC/ORG (last token of a person/place/organization name; E stands for End)
• S-PER/LOC/ORG (a person/place/organization name that is a single token; S stands for Single)

Combining the initials of the five types gives IOBES, the most commonly used tagging scheme for sequence labeling.

Here is a named entity recognition (NER) example:

Task: given the sentence “Einstein gave a speech in Berlin”, produce the NER result under the IOBES scheme

Rewritten input: “[CLS] 爱 因 斯坦 在 柏林 发表 演讲” (the name 爱因斯坦, “Einstein”, is split into sub-word pieces)

BERT's softmax results:

| IOBES | 爱 | 因 | 斯坦 | 在 (in) | 柏林 (Berlin) | 发表 (gave a) | 演讲 (speech) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| O | 0.01 | 0.01 | 0.01 | 0.90 | 0.01 | 0.90 | 0.90 |
| B-PER | 0.90 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| I-PER | 0.01 | 0.90 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| E-PER | 0.01 | 0.01 | 0.90 | 0.01 | 0.01 | 0.01 | 0.01 |
| S-LOC | 0.01 | 0.01 | 0.01 | 0.01 | 0.90 | 0.01 | 0.01 |

Taking arg max over each column gives the final NER result: “爱因斯坦” (Einstein) is a person name and “柏林” (Berlin) is a place name
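Decoding per-token IOBES labels back into entities can be sketched like this, using the labels from the example (the helper name is my own):

```python
def decode_iobes(tokens, labels):
    """Turn per-token IOBES labels into (entity_text, type) spans."""
    entities, buf, ent_type = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab == "O":
            buf, ent_type = [], None
        elif lab.startswith("S-"):
            entities.append((tok, lab[2:]))  # single-token entity
        elif lab.startswith("B-"):
            buf, ent_type = [tok], lab[2:]   # open a multi-token entity
        elif lab.startswith("I-") and ent_type == lab[2:]:
            buf.append(tok)                  # extend it
        elif lab.startswith("E-") and ent_type == lab[2:]:
            entities.append(("".join(buf + [tok]), ent_type))  # close it
            buf, ent_type = [], None
    return entities

tokens = ["爱", "因", "斯坦", "在", "柏林", "发表", "演讲"]
labels = ["B-PER", "I-PER", "E-PER", "O", "S-LOC", "O", "O"]
print(decode_iobes(tokens, labels))  # [('爱因斯坦', 'PER'), ('柏林', 'LOC')]
```

The B/I/E pieces of 爱因斯坦 are merged into one PER entity, and the single S-LOC token 柏林 becomes a LOC entity, matching the stated result.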

## 14.5 BERT's Results

In any case, as the explanations above show, all four types of NLP tasks can easily be converted into a form BERT accepts. In short, different task types require different modifications to the model, but the modifications are minimal: at most one extra layer of neural network. This is in fact a great advantage of BERT, because it means BERT can handle almost any downstream NLP task. That universality is a major strength.

But after all this talk, however novel a model is, results are what count. So how effective is BERT's two-stage approach across NLP tasks?

It achieved the best results to date on 11 NLP tasks of various types, and on some of them the performance improvement was substantial.

# 15. Summary of Pre-trained Language Models

Here we can sort out the evolutionary relationships among these models.

As the figure above shows, BERT is closely related to both ELMo and GPT: if we replace GPT's pre-training stage with a bidirectional language model, we get BERT; and if we replace ELMo's feature extractor with the Transformer, we also get BERT.

So you can see BERT's two most important ingredients: first, the feature extractor is the Transformer; second, pre-training uses a bidirectional language model.

Which raises a new question: for the Transformer, how can we run a bidirectional language-model task on this structure? At first glance it does not look easy. Actually, there is a very intuitive idea. What would you do? Look at ELMo's network diagram and simply replace the two LSTMs with two Transformers, one extracting forward features and one extracting backward features; that should in principle work.

Of course, that is my own adaptation; BERT did not do it that way. So how does BERT do it? Remember how we kept bringing up Word2Vec? I did not mention it aimlessly: it was to lead up to the CBOW training method, the literary device of planting a clue early and paying it off much later.

Recall the CBOW method mentioned earlier: its core idea is that, when doing the language-model task, you cut out the word to be predicted and then predict it from its preceding context (Context-before) and following context (Context-after).

And how does BERT actually do it? Exactly like that. From here you can see the lineage between the methods. Of course, BERT's authors never mention Word2Vec or CBOW; that is my own judgment. They say they were inspired by the cloze task, which is possible, but I suspect that if they had not thought of CBOW they would hardly have arrived here.

At this point you can see why I said at the start that BERT's model contains little innovation and reads more like an integrator of recent years' important NLP techniques. Of course, I cannot be sure what you think, or whether you agree with this view, and that is fine either way. In any case, BERT's strong results and strong universality are its biggest highlights.

Finally, let me give my own assessment of BERT. I believe BERT is milestone work in NLP that will have a long-term impact on subsequent NLP research and industrial applications; that is beyond doubt. But as the discussion above shows, from a model or methods perspective BERT borrows from ELMo, GPT, and CBOW. Its main proposals are the Masked Language Model and Next Sentence Prediction, yet Next Sentence Prediction barely affects the overall outcome, and Masked LM obviously borrows CBOW's idea. So the BERT model itself is not a big innovation; it is more like an aggregator of recent years' important NLP progress. If you read it this way, I have no strong objection; if you object strongly, well, I am prepared to wear that label anyway.

Summing up these developments: first, the two-stage model, where stage one is bidirectional language-model pre-training (note: bidirectional, not unidirectional) and stage two is task-specific fine-tuning or feature integration; second, the feature extractor should be the Transformer rather than an RNN or CNN; third, the bidirectional language model can be trained in the CBOW style (though I consider this a detail, not as critical as the first two factors). BERT's biggest highlights are its strong results and strong universality: almost every NLP task can adopt BERT's two-stage recipe, and the results should improve significantly. It is predictable that for some time to come the Transformer will dominate NLP applications, and this two-stage pre-training recipe will dominate across applications.

In addition, we should be clear about what pre-training is essentially doing. In essence, pre-training designs a network structure to perform the language-model task, then uses massive, effectively unlimited, unlabeled natural-language text so that the pre-training task distills a large amount of linguistic knowledge into the network. When the task at hand has limited annotated data, these prior linguistic features become a great supplement, because with limited data many linguistic phenomena go uncovered and generalization is weak; folding in as much general linguistic knowledge as possible naturally strengthens the model's generalization ability. How to introduce prior linguistic knowledge has always been a major goal of NLP, especially deep-learning-era NLP, and there had long been no good solution; the two-stage models of ELMo/GPT/BERT appear to solve this problem in a natural and clean way, which is the main value of these methods.

Looking at NLP's current direction, I personally think two things matter most:

1. First, we need stronger feature extractors. The Transformer looks set to take on that role, but it is surely not strong enough yet, and stronger feature extractors need to be developed;
2. Second, how to *gracefully* introduce the linguistic knowledge contained in vast amounts of unsupervised data. Note that I stress “gracefully” rather than merely “introduce”: much earlier work tried to graft or inject various kinds of linguistic knowledge, and many of those approaches are painful to look at; that is what I mean by not graceful.

For now, the two-stage pre-training approach remains very effective and very clean; of course, better models will come later.

And that's it: such is the past and present life of pre-trained language models.

Since I am only getting started in NLP and cannot yet write such a summary myself, the summary above comes from the Zhihu article: From Word Embedding to the BERT Model — the development history of pre-training techniques in natural language processing, by Zhang Junlin.

# 16. References

I am only a porter of knowledge; readers who want to dig deeper into any of the topics can consult the materials below.

• Reference books:

• 《Pre-trained Language Models》 - Pang Hao, Liu Yifeng

• 《Natural Language Processing in Practice Based on the BERT Model》 - Li Jinhong

• Reference papers:

• Reference blogs:

https://chowdera.com/2021/08/20210808153459129X.html