
Is BERT fine-tuning not working well for you? Try this new paradigm for large-scale pre-trained models

2020-12-08 14:29:28 NewBeeNLP

It has been two years since BERT was released, but the heat has not died down. When it first came out and caused a sensation, simply applying or fine-tuning it was enough to reach SOTA on a given domain or task; nowadays it gets 『thoroughly outclassed』 by all kinds of newer models. These new favorites standing on the giant's shoulders can be roughly divided into the following categories:

BERT & Beyond: a roundup of the BERT family is in the works, coming soon!

  • Bigger: more training data and more parameters for better results; the aesthetics of brute force. e.g. RoBERTa, Turing-NLG;
  • Smaller: faster inference and a smaller memory footprint; the beauty of simplicity. e.g. TinyBERT, ALBERT;
  • Smarter: incorporating richer external knowledge; the great integrators. e.g. ERNIE, K-BERT;
  • Trickier: improved model structures and pretraining objectives; the innovators. e.g. XLNet, Performer;
  • Cross-Domain: more comprehensive solutions that break out of the text-only circle. e.g. LXMERT, ViT;
  • Analyzer: studies of the models' underlying design / principles / effectiveness / interpretability, etc.

Looking at the pre-trained models to date, most still follow the two-stage 『Pretraining + Finetuning』 recipe. In the first stage, general knowledge is learned from text using massive training data (usually unsupervised samples) and massive parameter counts. In the second stage, the general model is fine-tuned on the available supervised data of a specific downstream domain or task, and the resulting domain/task-adapted model achieves better results.

Better results? Not necessarily. In practice, this two-stage recipe runs into plenty of problems. For example, a pre-trained BERT is too generic to accurately capture the knowledge of a specific task, while fine-tuning often lacks sufficient supervised data. So it is time to consider some alternatives:

  • Option one: get more supervised data for fine-tuning (if you have the budget, crowdsource it; if not...).
  • Option two: add domain/task-related unsupervised data to the first-stage pretraining. But this somewhat contradicts the original intent: are we learning a general model, or a domain-focused one?
  • Option three: tweak the two-stage recipe into three stages. The first stage is still large-scale pre-training, yielding a general model; the second stage becomes domain/task-specific pre-training (post-training), continuing the same kind of training on domain data on top of the general model; the third stage fine-tunes with a small amount of supervised data.

Of these three options, the third is clearly the most reasonable, and there is already quite a bit of research along this line. Let's look at two papers~
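Before diving into the papers, here is a minimal sketch of what this three-stage recipe can look like in practice, assuming the HuggingFace transformers and datasets libraries; the file names and hyperparameters are placeholders rather than settings from either paper:

```python
# A minimal sketch of the three-stage recipe (option three), assuming the
# HuggingFace transformers/datasets libraries; file names and hyperparameters
# are placeholders, not values from either paper.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoModelForSequenceClassification,
                          AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Stage 1 is the off-the-shelf general model (roberta-base).
# Stage 2: continue masked-LM training on unlabeled domain/task text (post-training).
mlm_model = AutoModelForMaskedLM.from_pretrained("roberta-base")
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
tokenized = raw.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
Trainer(model=mlm_model,
        args=TrainingArguments("post-trained-roberta", num_train_epochs=1,
                               per_device_train_batch_size=8),
        train_dataset=tokenized, data_collator=collator).train()
mlm_model.save_pretrained("post-trained-roberta")
tokenizer.save_pretrained("post-trained-roberta")

# Stage 3: fine-tune the post-trained encoder on the small supervised task set.
clf = AutoModelForSequenceClassification.from_pretrained("post-trained-roberta",
                                                         num_labels=2)
# ... build the labeled task dataset and call Trainer(...).train() as usual.
```

The key point is that stage two reuses exactly the same masked-LM objective as stage one, just on domain/task text instead of general text.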

Don't Stop Pretraining

This ACL 2020 paper reads more like an experiment report. It explores the feasibility of option three above: does 「re-training」 the pre-trained model on the target task's domain actually improve performance? It focuses on two pieces, domain-adaptive pretraining and task-adaptive pretraining, and the experiments find performance gains on all eight classification tasks across four domains (biomedical, CS, news, reviews).

  • Paper: Don't Stop Pretraining: Adapt Language Models to Domains and Tasks
  • Link: https://arxiv.org/abs/2004.10964
  • Code: https://github.com/allenai/dont-stop-pretraining

Let's take a closer look.

Domain-Adaptive PreTraining

Domain-Adaptive PreTraining (DAPT) takes the general pre-trained model from the first stage and continues training it on unlabeled domain text.

To characterize how similar the domains are, the authors analyze the overlap of the top 10K high-frequency words (excluding stopwords) across all domains, as shown below. This gives a rough prediction of how much RoBERTa can benefit from adapting to each domain: the less similar a domain is to RoBERTa's original pretraining data, the higher the potential gain from DAPT.
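To make the overlap analysis concrete, here is a rough sketch (not the paper's exact preprocessing) that counts each corpus's top-10K non-stopword tokens and measures how much they intersect; the corpus file names and the tiny stopword list are placeholders:

```python
# A rough sketch of the vocabulary-overlap analysis, not the paper's exact
# preprocessing; "reviews.txt" and "biomed.txt" are placeholder corpus files
# and the stopword list is deliberately tiny.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that"}

def top_k_vocab(path, k=10_000):
    """Return the k most frequent non-stopword tokens of a text file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(w for w in re.findall(r"[a-z]+", line.lower())
                          if w not in STOPWORDS)
    return {w for w, _ in counts.most_common(k)}

vocab_a, vocab_b = top_k_vocab("reviews.txt"), top_k_vocab("biomed.txt")
# Fraction of corpus A's top-10K vocabulary that also appears in corpus B's.
overlap = len(vocab_a & vocab_b) / len(vocab_a)
print(f"top-10K vocabulary overlap: {overlap:.1%}")
```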

The experimental results are shown below: DAPT improves over the original model in every domain, and quite noticeably.

Task-Adaptive PreTraining

Task-Adaptive PreTraining (TAPT) also starts from the general first-stage model, but continues training on unlabeled text related to the task itself. Compared with the DAPT corpus, the TAPT corpus is much smaller.
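In practice, the TAPT corpus can be nothing more than the task's own training texts with the labels thrown away. A minimal sketch, assuming a HuggingFace dataset with a "text" column (IMDB is used here purely as an example):

```python
# A minimal sketch of building a TAPT corpus: reuse the task's own training
# texts and discard the labels. "imdb" is just an example task dataset.
from datasets import load_dataset

task = load_dataset("imdb", split="train")
with open("tapt_corpus.txt", "w", encoding="utf-8") as f:
    for text in task["text"]:
        f.write(text.replace("\n", " ") + "\n")
# "tapt_corpus.txt" can now be fed into the same masked-LM loop sketched above.
```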

The experiments above show that both DAPT and TAPT improve over the original RoBERTa, and they work even better when combined. This confirms that option three from the beginning of the article is feasible: continuing to train a pre-trained model on domain-specific corpora brings clear gains.

Train No Evil

This EMNLP 2020 work from Liu Zhiyuan's group at Tsinghua University also explores a new paradigm for NLP pre-trained models.

  • Paper: Train No Evil: Selective Masking for Task-Guided Pre-Training
  • Link: https://arxiv.org/abs/2004.09733
  • Code: https://github.com/thunlp/SelectiveMasking

The overall idea is the same as in the "Don't Stop Pretraining" paper; the difference is that it cleverly uses 「Selective Masking」 to tie DAPT and TAPT together.

It is again a three-stage process:

  • 「General pre-training (GenePT)」: the traditional first stage, preliminary training on large-scale unsupervised general-purpose data with random MASK; the dataset D_{General} contains roughly 「1000M」 words.
  • 「Task-guided pre-training (TaskPT)」: this stage injects task- and domain-related information, using a mid-scale corpus of domain-related unsupervised data; the dataset D_{Domain} contains roughly 「10M」 words.
  • 「Fine-tuning」: needs no further explanation; the dataset D_{Task} contains roughly 「10K」 words.

Task-Guided Pre-Training (TaskPT)

What really needs a closer look is the second stage above: how are task and domain information injected? As mentioned earlier, the domain information comes from the dataset itself, so what about the task-related information?

In a piece of text, 「different words contribute differently to different tasks」, and the key words differ from task to task. For example, words like 『like』 and 『hate』 are crucial for sentiment classification, whereas in relation extraction the predicates and verbs matter relatively more. Therefore, in the random MASK step of pre-training, we can instead selectively mask these task-related words (Selective Masking), so that the model learns task-specific information.

The next question is how to find these 「task-critical words」. This part can be a bit convoluted, so let's go through it in Q&A form.

1、 How do we judge which words are more important?

Suppose the sequence consists of n words, s = (w_1, w_2, ..., w_n), and the importance of word w_i is denoted S(w_i). Build an auxiliary sequence s' by appending one word at a time in order; at the i-th step the auxiliary sequence is s'_{i-1} w_i. The score of the current word w_i is then computed as

\mathrm{S}\left(w_{i}\right)=P\left(y_{\mathrm{t}} \mid s\right)-P\left(y_{\mathrm{t}} \mid s_{i-1}^{\prime} w_{i}\right)

where y_t is the label of task t, P(y_t | s) is the confidence of predicting the correct label given the original sequence, and P(y_t | s'_{i-1} w_i) is the confidence of predicting the correct label given the partial sequence ending with w_i; the larger the latter, the more important w_i. In other words, the smaller S(w_i), the more important w_i is, because adding w_i brings the partial sequence's contribution to the task closer to that of the original sequence.

2、 Where does the confidence P(y_t | s) come from?

From a BERT model fine-tuned on the specific task: its output probability for the correct label is used as the confidence.
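Below is a rough sketch of this scoring loop, assuming a classifier already fine-tuned on the task with HuggingFace transformers; the checkpoint name, the whitespace word splitting, the importance threshold, and the choice to grow the auxiliary sequence as a simple prefix all follow the post's description loosely and are illustrative assumptions rather than the paper's exact implementation:

```python
# A rough sketch of the importance score S(w_i) = P(y_t|s) - P(y_t|s'_{i-1} w_i),
# using a classifier fine-tuned on the task to provide the confidences.
# The checkpoint name, whitespace splitting, and 0.05 threshold are illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_DIR = "finetuned-task-bert"   # placeholder: a checkpoint fine-tuned on the task
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR).eval()

@torch.no_grad()
def label_confidence(text: str, label_id: int) -> float:
    """P(y_t | text): the classifier's probability for the correct label."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    return model(**inputs).logits.softmax(dim=-1)[0, label_id].item()

def important_words(words, label_id, threshold=0.05):
    """Flag w_i as important when S(w_i) = P(y|s) - P(y|s'_{i-1} w_i) is small."""
    full_conf = label_confidence(" ".join(words), label_id)
    prefix, flags = [], []
    for w in words:                      # feed one word at a time, in order
        prefix.append(w)
        score = full_conf - label_confidence(" ".join(prefix), label_id)
        flags.append(score < threshold)  # small gap => w_i carries task signal
    return flags

words = "the food was great but the service was awful".split()
print(important_words(words, label_id=0))   # label_id = index of the correct label
```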

3、 How do we label task keywords on the unsupervised domain data?

As you can see, the scoring procedure above requires labeled data, yet D_{Domain} is unsupervised. What to do? The answer is to first apply the method above to the task dataset D_{Task}, tagging whether each word is task-critical; then train a model M on this batch of tagged data so that it learns to recognize task keywords; and finally use M to tag and MASK the task keywords in D_{Domain}. A D_{Domain} processed this way can then be used for 『Task-guided pre-training (TaskPT)』.
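Once M has flagged which tokens are task-critical, the MLM step simply masks those positions instead of sampling random ones. Here is a minimal sketch of such a selective-masking step, assuming word-level importance flags are already available (the flag format and the example sentence are assumptions, not the paper's implementation):

```python
# A minimal sketch of selective masking: replace only the tokens flagged as
# task-critical with [MASK], instead of masking a random 15%. The `important`
# flags are assumed to come from the keyword model M.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def selective_mask(words, important):
    """Build MLM inputs/labels, masking only the words flagged as important."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt",
                    truncation=True)
    input_ids = enc["input_ids"].clone()
    labels = torch.full_like(input_ids, -100)          # -100 = ignored by the MLM loss
    for pos, wid in enumerate(enc.word_ids(batch_index=0)):
        if wid is not None and important[wid]:
            labels[0, pos] = input_ids[0, pos]          # predict the original token
            input_ids[0, pos] = tokenizer.mask_token_id
    return input_ids, enc["attention_mask"], labels

words = "the pasta was absolutely delicious".split()
flags = [False, False, False, False, True]              # e.g. produced by model M
ids, attn, labels = selective_mask(words, flags)
print(tokenizer.decode(ids[0]))
```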

Experiments

Sem14-Rest and MR are the two tasks, and Yelp and Amazon are the two domain corpora. In all four experimental combinations, the additional pre-training improves the model, and comparing 『Random Mask』 with 『Selective Mask』 shows that 『Selective Mask』 works better. For a more detailed analysis of the experiments, please refer to the original paper~

Closing

Thanks to League of Legends for its 「lousy patch」 and to my fat cat overlord for 「staying off the keyboard」, which gave me the time to put these notes together.

This article is from the WeChat official account NewBeeNLP. Author: kaiyuan.

Originally published: 2020-11-23
