UIUC | Curriculum Learning for Language Models
2021-08-08 15:44:42 【Author: Ke】
【Paper title】Curriculum Learning for Language Modeling
【Author team】Daniel Campos
【Publication date】2021/08/04
【Institution】UIUC
【Paper link】https://arxiv.org/pdf/2108.02170v1.pdf
【Code link】https://github.com/spacemanidol/CurriculumLearningForLanguageModels
【Reason for recommendation】A negative result for curriculum learning in language model pre-training
Language models such as ELMo and BERT provide powerful natural language representations and serve as language-understanding components for a wide range of downstream tasks. Curriculum learning is a method that employs a structured training regime, and it has been used in computer vision and machine translation to improve training speed and model performance. Although language models have proven transformative for natural language processing, they have also proven expensive, energy-intensive, and challenging to train. In this work, the authors explore the effect of curriculum learning on language model pre-training using a variety of linguistically motivated curricula, and evaluate transfer performance on the GLUE benchmark. Despite a broad range of training regimes and experiments, they find no compelling evidence that curriculum learning methods improve language model training.
The figure above shows the competence-based curriculum (CBC) algorithm. A corpus X is a set of samples S, where each sample si is a sequence of words. Samples are assigned a difficulty by a heuristic, such as sentence length or word rarity, and sorted by that difficulty. The model is given an initial competence λ0 and a competence increment λincrement; the model's competence score represents its progress through training. At each training step, the model samples from the data whose difficulty lies below its current competence λt, updates its weights, and then increases its competence λt.
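The CBC loop described above can be sketched in Python. This is a minimal illustration of the sampling schedule only, under assumed names (`lambda0` for λ0, `lambda_inc` for λincrement), not the authors' implementation; the actual model update is left as a comment.

```python
def cbc_schedule(samples, difficulty_fn, lambda0=0.25, lambda_inc=0.25):
    """Competence-based curriculum (CBC) sketch: at step t the model may
    only draw training data from the easiest lambda_t fraction of the
    corpus; lambda_t grows by lambda_inc each step until it reaches 1.0.
    Yields (step, eligible_pool); a real trainer would sample a minibatch
    from each pool and update the model weights."""
    ranked = sorted(samples, key=difficulty_fn)  # easiest samples first
    competence, step = lambda0, 0
    while True:
        # Samples below the current competence percentile are eligible.
        cutoff = max(1, round(competence * len(ranked)))
        yield step, ranked[:cutoff]
        if competence >= 1.0:
            break
        competence = min(1.0, competence + lambda_inc)
        step += 1

# Toy corpus with sentence length as the difficulty heuristic.
corpus = ["a b", "a b c d", "a", "a b c"]
pools = list(cbc_schedule(corpus, difficulty_fn=lambda s: len(s.split())))
# The first pool holds only the easiest sample; the final pool is the
# whole corpus, at which point training proceeds on all data.
```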
The paper explores eight proxy measures of sample difficulty: no curriculum, random, sample length, unigram sample probability, bigram sample probability, trigram sample probability, part-of-speech diversity (POS), and sample dependency-parse complexity (DEP).
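Two of the heuristics above, sample length and unigram sample probability, can be sketched as follows. This is an illustrative approximation rather than the paper's code; the POS and DEP heuristics would additionally require a tagger or dependency parser (e.g. spaCy) and are omitted here.

```python
import math
from collections import Counter

def length_difficulty(sample):
    """Sentence-length heuristic: longer samples are harder."""
    return len(sample.split())

def unigram_difficulty(sample, corpus):
    """Unigram sample-probability heuristic: the negative log-likelihood
    of the sample under a unigram model of the corpus, so samples made
    of rarer words score as harder."""
    counts = Counter(word for s in corpus for word in s.split())
    total = sum(counts.values())
    return -sum(math.log(counts[w] / total) for w in sample.split())

corpus = ["the cat sat", "the dog sat", "a rare word"]
# A sentence built from corpus-rare words scores harder than one built
# from frequent words, even when both have the same length.
```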
The figure above shows the results on wikitext-2 (small). The authors find no strong evidence that curriculum structure matters, since the no-curriculum setting (λ0 = 1) performs better than the other four curricula and the baseline. Perhaps most surprisingly, despite having no formal structure in its training regime, randomization outperforms the baseline as measured by overall GLUE score. Looking at per-task variability, only CoLA, STS-B, and SST show a wide range of performance, which the authors attribute to these tasks being small and more linguistically challenging.
The figure above shows the results on wikitext-103 (large). The trend found on wikitext-2 does not hold here, as the highest performance is achieved by the baseline model. The ranking of system performance also fails to carry over across datasets, and as the pre-training dataset grows, the differences between models shrink. As with the smaller corpus, CoLA shows the highest sensitivity, while the variability of SST and STS-B becomes milder.
- In this work, the authors find no strong evidence that curriculum learning improves language model pre-training. While CBC-based training regimes do not learn better representations of the training corpus, their representations nevertheless transfer well to downstream NLP tasks. When the pre-training corpus is small, the CBC method can outperform random sampling, but this advantage disappears as the corpus grows. In addition, there is no evidence that any particular type of difficulty heuristic is better suited to CBC.