Is BERT fine-tuning underperforming? Try this new paradigm for large-scale pre-trained models
2020-12-08 14:29:28 【NewBeeNLP】
It's been two years since BERT was released, but the heat hasn't died down. It caused a sensation on arrival, when simply applying or fine-tuning it achieved SOTA results across domains and tasks; now it gets 『beaten』 in all kinds of comparisons. The new favorites standing on this giant's shoulders can be roughly divided into the following categories:
BERT & Beyond: a roundup of the BERT family is in the works, coming soon!
- Bigger: more training data and more parameters, better results; the aesthetics of brute force. E.g. RoBERTa, Turing-NLG;
- Smaller: faster speed, less memory; the beauty of simplicity. E.g. TinyBERT, ALBERT;
- Smarter: incorporating more and richer external knowledge; great integrators. E.g. ERNIE, K-BERT;
- Tricker: optimized model structures and pre-training tasks; creative designs. E.g. XLNet, Performer;
- Cross Domain: more comprehensive solutions that cross modality boundaries. E.g. LXMERT, ViT;
- Analyzer: analysis of model design, principles, effects, interpretability, and so on.
Looking at most pre-trained models to date, they generally still follow the two-stage 『Pretraining + Finetuning』 recipe. In the first stage, general knowledge is learned from text using massive training data (usually unsupervised samples) and huge parameter counts; in the second stage, for a specific downstream domain and task, the general model from the previous step is fine-tuned on the available supervised data, yielding a domain/task-adapted model with better results.
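To make stage two concrete, here is a minimal fine-tuning sketch using the Hugging Face transformers library; the model, dataset, and hyperparameters are illustrative stand-ins, not anything from the papers discussed below.

```python
# Minimal sketch of stage two: fine-tune a pre-trained BERT on a
# downstream classification task. Model/dataset choices are illustrative.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)      # stage one: general knowledge

dataset = load_dataset("imdb")              # assumed supervised task data

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

encoded = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    tokenizer=tokenizer,                    # enables dynamic padding
)
trainer.train()                             # stage two: task adaptation
```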
Better results? Not necessarily. In practice, applying these two stages runs into many problems. For example, a directly pre-trained BERT is too general and cannot accurately learn the knowledge of a specific task, while fine-tuning often lacks sufficient supervised data. Time to consider alternatives:
- Option one: label more supervised data for fine-tuning (if you have money, crowdsource; if not...).
- Option two: add domain/task-related unsupervised data to the first-stage pre-training. But this somewhat contradicts the original intent: are we learning a general-purpose model, or a domain-focused one?
- Option three: tweak the two-stage recipe slightly, turning it into three stages. The first stage is still large-scale pre-training (pre-training), yielding a general model; the second stage becomes domain/task-specific pre-training (post-training), continuing the same training on domain data on top of the general model; the third stage fine-tunes with a small amount of supervised data.
Of the three options above, the third is clearly the most reasonable, and there is already plenty of research along these lines. Let's look at two papers~
Don't Stop Pretraining
This ACL 2020 paper reads more like an experimental report: it explores the feasibility of the third option above, namely whether 「retraining」 an already pre-trained model on the target task's domain improves performance. It focuses on two pieces, domain-adaptive pretraining and task-adaptive pretraining, with experiments on eight classification tasks across four domains (biomedical, CS, news, reviews), all showing performance gains.
- Paper: Don't Stop Pretraining: Adapt Language Models to Domains and Tasks
- Link: https://arxiv.org/abs/2004.10964
- Code: https://github.com/allenai/dont-stop-pretraining
Let's take a closer look.
Domain-Adaptive Pretraining (DAPT): on top of the general pre-trained model from stage one, continue training on unlabeled in-domain text.
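In code, DAPT is essentially "keep doing MLM on domain text". Here is a minimal sketch with Hugging Face transformers; the corpus file, model choice, and hyperparameters are assumptions (the paper's actual setup uses RoBERTa with far more data and compute):

```python
# Sketch of DAPT: continue masked-language-model pre-training on
# unlabeled in-domain text. Paths and hyperparameters are illustrative.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Unlabeled in-domain corpus, one document per line (hypothetical file).
domain = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = domain.map(tokenize, batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="dapt-out", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
model.save_pretrained("roberta-dapt")   # then fine-tune this checkpoint
```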
To characterize domain similarity, the authors compute the overlap of each domain's top-10K high-frequency words (with stopwords removed); see the overlap figure in the paper. From this one can anticipate how much RoBERTa will benefit from adapting to each domain: the less similar a domain is to RoBERTa's pretraining corpus, the higher DAPT's potential gain.
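The overlap statistic itself is easy to reproduce. A rough sketch, where the corpora and the stopword list are stand-ins:

```python
# Rough sketch of the domain-similarity check: overlap between the
# top-10K frequent words (stopwords removed) of two corpora.
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}  # stand-in list

def top_k_vocab(texts, k=10_000):
    counts = Counter(
        w for t in texts for w in t.lower().split() if w not in STOPWORDS
    )
    return {w for w, _ in counts.most_common(k)}

def vocab_overlap(texts_a, texts_b, k=10_000):
    a, b = top_k_vocab(texts_a, k), top_k_vocab(texts_b, k)
    return len(a & b) / k   # fraction of shared high-frequency words

# e.g. vocab_overlap(news_docs, biomed_docs): a low value suggests
# larger potential gains from DAPT on the biomedical domain.
```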
The experimental results show that DAPT improves over the original model in every domain, and markedly so!
Task-Adaptive Pretraining (TAPT): on top of the general pre-trained model from stage one, continue training on the task's own unlabeled text. Compared with DAPT, TAPT's corpus is far smaller.
The experiments show that both DAPT and TAPT improve over the original RoBERTa, and they work even better combined. This confirms that the third scheme proposed at the start of this article is feasible: continuing to train a pre-trained model on domain-specific corpora brings clear gains.
Train No Evil
EMNLP 2020 work from Liu Zhiyuan's group at Tsinghua University, also exploring the new paradigm for NLP pre-trained models.
- Paper: Train No Evil: Selective Masking for Task-Guided Pre-Training
- Link: https://arxiv.org/abs/2004.09733
- Code: https://github.com/thunlp/SelectiveMasking
The overall idea is the same as in "Don't Stop Pretraining"; the difference is that it cleverly uses 「Selective Masking」 to unify DAPT and TAPT.
It also has three stages:
- 「General pre-training (GenePT)」: the traditional first stage, large-scale unsupervised general data plus random MASKing; the dataset contains roughly 「1000M」 words.
- 「Task-guided pre-training (TaskPT)」: this stage injects task- and domain-related information, using a mid-scale domain-related unsupervised corpus; the dataset contains roughly 「10M」 words.
- 「Fine-tuning」: the usual supervised fine-tuning, no more to say; the dataset contains roughly 「10K」 words.
Task-guided pre-training (TaskPT)
What we really need to look at is the second stage above: how are task and domain information injected? As mentioned earlier, domain information can be introduced through the dataset itself; but what about task-related information?
In fact, in text, 「different words contribute differently to different tasks」, and these words usually differ across tasks. For example, words like 『like』 and 『hate』 are key for sentiment classification, while in relation extraction, predicates and verbs matter relatively more. So, in the random MASK step of pre-training, we can instead choose to mask these task-related words (Selective Masking), forcing the model to learn task information.
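Here is a toy sketch of the idea; the keyword test is a hand-written stand-in for the learned importance model described below:

```python
# Toy sketch of Selective Masking: prefer masking task-relevant tokens.
# `is_task_keyword` is a stand-in for the learned importance model.
import random

SENTIMENT_WORDS = {"like", "hate", "love", "terrible", "great"}  # toy lexicon

def is_task_keyword(token: str) -> bool:
    return token.lower() in SENTIMENT_WORDS

def selective_mask(tokens, mask_token="[MASK]", random_rate=0.05):
    masked = []
    for tok in tokens:
        if is_task_keyword(tok) or random.random() < random_rate:
            masked.append(mask_token)   # learn to recover task keywords
        else:
            masked.append(tok)
    return masked

print(selective_mask("I really hate the battery life".split()))
# e.g. ['I', 'really', '[MASK]', 'the', 'battery', 'life']
```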
The next question is how to find these 「task keywords」. It can get a bit convoluted here, so let's walk through it in Q&A form.
1. How do we judge which words are more important?
Suppose a sequence $s$ is composed of $n$ words, $s = (w_1, w_2, \ldots, w_n)$, and denote the importance of word $w_i$ by $S(w_i)$. Design an auxiliary sequence $s'$ into which the words are thrown one at a time, in order; at the current step $i$, the auxiliary sequence is $s'_i = (w_1, \ldots, w_i)$. The score of the current word $w_i$ can then be computed by the following formula:

$$S(w_i) = P(y_t \mid s) - P(y_t \mid s'_i)$$

Here $y_t$ is the label of task $t$; $P(y_t \mid s)$ is the confidence of obtaining the correct label given the original sequence; and $P(y_t \mid s'_i)$ is the confidence of obtaining the correct label given the sequence containing $w_1, \ldots, w_i$: the larger it is, the more important $w_i$ is. Therefore, the smaller $S(w_i)$, the more important $w_i$, i.e. adding $w_i$ brings the partial sequence's contribution to the task closer to that of the full original sequence.
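A sketch of this scoring loop, assuming a classifier already fine-tuned on the task that exposes the probability of the correct label (the checkpoint name and wrapper are assumptions, not the paper's code):

```python
# Sketch of the token-importance score S(w_i) = P(y|s) - P(y|s'_i),
# where s'_i = (w_1, ..., w_i). `confidence` wraps a fine-tuned classifier.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Assumed: a BERT already fine-tuned on the downstream task (see Q2);
# "bert-finetuned-task" is a hypothetical checkpoint name.
clf = AutoModelForSequenceClassification.from_pretrained("bert-finetuned-task")
clf.eval()

@torch.no_grad()
def confidence(words, label):
    """P(correct label | sequence) from the fine-tuned model."""
    inputs = tokenizer(" ".join(words), return_tensors="pt", truncation=True)
    probs = clf(**inputs).logits.softmax(dim=-1)
    return probs[0, label].item()

def importance_scores(words, label):
    full = confidence(words, label)             # P(y_t | s)
    scores = []
    for i in range(1, len(words) + 1):
        partial = confidence(words[:i], label)  # P(y_t | s'_i)
        scores.append(full - partial)           # smaller => more important
    return scores
```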
2. How is the confidence $P(y_t \mid \cdot)$ obtained?
Use the output confidence of a BERT model fine-tuned on the specific task.
3. How do we label task keywords on unsupervised in-domain data?
As you can see, both steps above for computing word importance require labeled data, yet the in-domain corpus is unsupervised. What to do? First, apply the method above to the labeled downstream-task dataset, tagging each word as task-critical or not; then use this batch of labeled words to train a model that learns to recognize task keywords; finally, use that model to tag and MASK task keywords in the in-domain dataset. The result can then be used for 『task-guided pre-training (TaskPT)』.
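A compact sketch of this transfer pipeline; the paper trains a neural model, while a character-n-gram logistic regression stands in here, and all data is toy:

```python
# Sketch of Q3's pipeline: transfer token importance from the labeled
# task data to the unlabeled in-domain corpus. All names/data are toy.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Step 1 (assumed done): score words on the labeled task data with
# importance_scores(...) above and binarize with a threshold.
task_words = ["hate", "battery", "love", "the", "screen", "terrible"]
is_keyword = [1,      0,         1,      0,     0,        1]

# Step 2: train a model to recognize task keywords from the word itself
# (character n-grams as a cheap stand-in for contextual features).
vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vec.fit_transform(task_words)
tagger = LogisticRegression().fit(X, is_keyword)

# Step 3: tag and mask predicted keywords in the in-domain corpus.
def mask_domain_sentence(tokens, mask_token="[MASK]"):
    preds = tagger.predict(vec.transform(tokens))
    return [mask_token if p else tok for tok, p in zip(tokens, preds)]

print(mask_domain_sentence("I hated the checkout experience".split()))
```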
Sem14-Rest and MR denote two tasks; Yelp and Amazon denote two domain datasets. In all four groups of experiments, the additional pre-training improves the model, and comparing 『Random Mask』 against 『Selective Mask』, 『Selective Mask』 comes out ahead. For a more detailed walkthrough of the experiments, see the original paper~
Thanks to League of Legends' 「trash patch」 and my fat cat overlord 「stop stepping on the keyboard」 for giving me the time to organize these notes.
This article is from the WeChat official account NewBeeNLP, author: kaiyuan.
Originally published: 2020-11-23