
The first open-source Chinese BERT pre-trained model for the financial domain

2020-11-08 12:56:55 osc_j3111wl4

Produced by | AI Technology Base


To promote the application and development of natural language processing (NLP) technology in financial technology, the Value Simplex AI Lab recently open-sourced FinBERT 1.0, a pre-trained language model for the financial domain based on the BERT architecture.

Reportedly, this is the first Chinese BERT pre-trained model trained on a large-scale financial corpus to be open-sourced in China. Compared with Google's native Chinese BERT and the BERT-wwm and RoBERTa-wwm-ext models open-sourced by the HIT & iFLYTEK Joint Lab, the open-sourced FinBERT 1.0 achieves significant performance improvements on a number of downstream tasks in the financial domain, with F1-score gains of at least 2 to 5.7 percentage points without any additional tuning.

In the deep learning era of natural language processing, two pieces of work are generally regarded as major milestones. The first, around 2013, was the rise of word vector techniques represented by Word2Vec; the second, in 2018, was the emergence of deep pre-trained language models represented by BERT.

On the one hand, deep pre-trained models represented by BERT have reached new state-of-the-art results in almost every NLP sub-field, including text classification, named entity recognition, and question answering. On the other hand, as a general-purpose pre-trained model, BERT has greatly reduced the heavy engineering work NLP algorithm engineers face in concrete applications: instead of hand-crafting custom networks, one can fine-tune BERT and quickly obtain a strong baseline model. Deep pre-trained models have therefore become a basic, must-have skill for any AI team.

However, most currently open-sourced Chinese deep pre-trained models are oriented toward general-domain applications, and many vertical domains, including finance, have no open-source model at all. Value Simplex hopes this release will push forward the application and development of NLP technology in the financial domain, and plans to release better-performing FinBERT 2.0 and 3.0 models when the time is right.

Project address:

https://github.com/valuesimplex/FinBERT

Model and pre-training methods

2.1 Network structure

For its network structure, Value Simplex's FinBERT adopts the same architecture as Google's native BERT and comes in two versions, FinBERT-Base and FinBERT-Large, with 12-layer and 24-layer Transformer encoders respectively. For convenience and generality of practical use, this release is the FinBERT-Base version, and "FinBERT" refers to FinBERT-Base in the remainder of this article.
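As a rough sketch, the two configurations can be written down as plain hyperparameter dictionaries. Only the layer counts come from the text; the hidden sizes and attention-head counts below are the standard values of Google's BERT-Base and BERT-Large, which FinBERT is said to mirror, so treat them as assumptions rather than confirmed release details.

```python
# Sketch of the two FinBERT configurations described above.
# Layer counts are from the post; hidden sizes and head counts are
# the standard Google BERT values (assumed, since the post states the
# architecture is identical to native BERT).
FINBERT_CONFIGS = {
    "FinBERT-Base": {
        "num_hidden_layers": 12,
        "hidden_size": 768,
        "num_attention_heads": 12,
    },
    "FinBERT-Large": {
        "num_hidden_layers": 24,
        "hidden_size": 1024,
        "num_attention_heads": 16,
    },
}
```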

2.2 Training corpus

The FinBERT 1.0 pre-training corpus mainly consists of three types of financial-domain text:

  • Financial news: finance and economics news collected from public channels over the last ten years, about 1 million articles;

  • Research reports / listed-company announcements: research reports and company announcements of various kinds collected from public channels, from more than 500 domestic and foreign research institutions, covering 9,000 listed companies and more than 150 different report types, about 2 million documents in total;

  • Financial encyclopedia entries: financial encyclopedia entries collected from Wiki and other channels, about 1 million.

For these three corpus types, under the guidance of financial business experts, we screened the important parts of each and preprocessed them to obtain the final training corpus, which contains 3 billion tokens — a scale exceeding that of Google's native Chinese BERT training data.

2.3 Pre-training methods

FinBERT pre-training framework

As shown in the figure above, FinBERT uses two major categories of pre-training tasks: word-level pre-training and task-level pre-training. The details of each are as follows.

(1) Word-level pre-training

Word-level pre-training includes two sub-tasks: Financial Whole Word MASK (FWWM) and Next Sentence Prediction (NSP). To save resources during training, we adopted a two-stage pre-training scheme similar to Google's: the maximum sequence length is 128 in the first stage and 512 in the second. The two tasks take the following forms:

  • Financial Whole Word MASK (FWWM)

Whole Word Masking (wwm) was introduced by Google in an upgraded version of BERT released in May 2019, and mainly changes how training samples are generated in the pre-training stage. In brief, WordPiece tokenization can split a complete word into several sub-words, and in the original BERT these sub-words are masked independently at random. With whole word masking, if any WordPiece sub-word of a complete word is masked, all other pieces of the same word are masked as well — the whole word is masked.

In Google's native Chinese BERT, the input is segmented at character granularity, which ignores co-occurring words and phrases in the domain; the model therefore cannot learn the prior knowledge implicit in the domain, which weakens its learning effect. We applied whole word masking to financial-domain pre-training, i.e. all the Chinese characters composing the same word are masked together. We first built a financial-domain lexicon of about 100,000 words from financial dictionaries and financial academic articles, through automatic mining combined with manual verification. We then matched words and phrases in the pre-training corpus against this lexicon to perform whole-word-masked pre-training, so the model can learn domain priors such as financial concepts and the correlations between them, enhancing its learning effect.
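The all-or-nothing behavior of whole-word masking can be sketched in a few lines of Python. This is a toy illustration with a hypothetical helper, not code from the FinBERT repository: continuation WordPieces start with "##", and each word is either masked in full or left intact. In FinBERT the word boundaries would come from the financial lexicon described above.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, rng=None):
    """Toy whole-word masking over a WordPiece sequence: if a word is
    selected, every one of its pieces is replaced by [MASK]. Pieces
    starting with '##' continue the previous word."""
    rng = rng or random.Random(0)
    # Group token indices into whole words.
    words, current = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and current:
            current.append(i)
        else:
            if current:
                words.append(current)
            current = [i]
    if current:
        words.append(current)
    # Mask each word as a unit, never piece by piece.
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = "[MASK]"
    return masked
```

For example, a financial term split into two WordPieces (e.g. "应收" / "##账款") is always masked or kept as one unit, which is exactly the property FWWM relies on.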

  • Next Sentence Prediction (NSP)

The next sentence prediction task is introduced so that the model learns relationships between sentences. According to the original BERT paper, this simple task is very helpful for question answering and natural language inference. In our own pre-training experiments we likewise found that removing NSP slightly hurt the model, so we kept the NSP pre-training task, with the learning rate set to Google's officially recommended 2e-5 and warmup-steps set to 10,000.
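The schedule described here (peak rate 2e-5 reached after 10,000 warmup steps) matches BERT's usual linear warmup. Whether FinBERT then decays linearly is not stated, so the decay branch below is an assumption carried over from standard BERT training, sketched as a plain function:

```python
def bert_lr(step, total_steps, peak_lr=2e-5, warmup_steps=10000):
    """Linear warmup to peak_lr over warmup_steps, then linear decay
    to zero at total_steps. The peak rate and warmup length are from
    the post; the linear decay is the standard BERT assumption."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```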

(2) Task-level pre-training

To help the model better learn semantic-level financial-domain knowledge and more comprehensively capture the distribution of word and sentence features in the financial domain, we also introduced two supervised learning tasks: industry classification of research reports, and financial entity recognition on financial news. Details follow.

  • Industry classification of research reports

Company-review and industry-review research reports naturally carry industry attributes, so we used them to automatically generate a large corpus with industry labels, and built a document-level supervised industry-classification task on top of it. Each industry category contains between 5k and 20k documents, about 400,000 document-level samples in total.

  • Financial entity recognition on financial news

Similar to the research-report industry-classification task, we used an existing corporate registration information base plus publicly available information on directors, supervisors, and senior executives of listed companies to build a named entity recognition corpus on top of financial news, containing 500,000 supervised samples.

Overall, to let the FinBERT 1.0 model fully learn semantic knowledge of the financial domain, we made the following improvements on top of the original BERT pre-training:

1. Longer, more complete training. To achieve a better learning effect, we extended the second pre-training stage until its total token count matched that of the first stage;

2. Integration of financial-domain knowledge. We introduced phrase- and semantic-level tasks, extracted domain proper nouns and phrases, and pre-trained with whole word masking plus the two supervised tasks;

3. Full use of the pre-training corpus. We adopted a dynamic masking mechanism similar to the RoBERTa model's, setting the dupe-factor parameter to 10.
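The dupe-factor trick in point 3 can be sketched as follows: each sequence is duplicated dupe_factor times at data-creation time, each copy with an independently sampled mask, which approximates RoBERTa's on-the-fly dynamic masking. This is an illustrative toy (uniform 15% token masking, no whole-word grouping), not the actual data pipeline.

```python
import random

def duplicate_with_masks(tokens, dupe_factor=10, mask_prob=0.15, seed=0):
    """Emulate the dupe-factor trick: emit `dupe_factor` copies of a
    token sequence, each with an independently sampled mask, so the
    model sees a different masking of the same text on each pass."""
    rng = random.Random(seed)
    copies = []
    for _ in range(dupe_factor):
        copies.append([
            "[MASK]" if rng.random() < mask_prob else tok
            for tok in tokens
        ])
    return copies
```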

2.4 Pre-training acceleration

NVIDIA currently provides rich technical support and framework optimizations for the full deep-learning software and hardware stack, one of the most important aspects being training acceleration. For FinBERT, we mainly used two techniques to speed up pre-training: TensorFlow XLA and Automatic Mixed Precision.

  • TensorFlow XLA training acceleration

XLA stands for Accelerated Linear Algebra. With XLA enabled, the compiler optimizes the TensorFlow computation graph at execution time, generating fused GPU kernel sequences that reduce the hardware resources the computation consumes. In general, XLA can provide roughly a 40% speedup.
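For reference, here is a minimal sketch of how XLA is commonly switched on in TensorFlow 2.x. The post does not show the FinBERT training script, and flag names have shifted across TF versions, so treat this as a config illustration rather than the project's actual code.

```python
import tensorflow as tf

# Global hint: let the graph optimizer JIT-compile eligible subgraphs.
tf.config.optimizer.set_jit(True)

# Or opt in per function (TF 2.5+; earlier releases used
# experimental_compile=True instead of jit_compile=True).
@tf.function(jit_compile=True)
def dense_step(x, w):
    return tf.nn.relu(tf.matmul(x, w))
```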

  • Automatic Mixed Precision

Deep learning models are typically trained with single-precision (Float32) or double-precision (Double) data types, which gives pre-training high memory requirements. To further reduce memory overhead and accelerate FinBERT pre-training and inference, we used mixed-precision training on the latest Tesla V100 GPUs. Mixed-precision training mixes FP32 and FP16, speeding up training and reducing memory cost by combining the stability of FP32 with the speed of FP16. With no loss in model accuracy, it roughly halves the model's memory consumption and improves training speed by about 3x.
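The main numerical hazard behind this — small gradients underflowing to zero in FP16 — and the loss-scaling fix that mixed-precision frameworks apply can be demonstrated in a few lines of NumPy. This illustrates the principle only; it is not FinBERT's training code.

```python
import numpy as np

# A small FP32 gradient underflows to zero when cast to float16...
grad = np.float32(1e-8)
naive = np.float16(grad)            # becomes 0.0 in FP16

# ...but survives if the value is scaled up before the cast and
# scaled back down afterwards in FP32 (loss scaling).
scale = np.float32(2.0 ** 15)
scaled = np.float16(grad * scale)   # representable in FP16
recovered = np.float32(scaled) / scale
```

Frameworks automate exactly this scale/unscale dance (often with a dynamically adjusted scale factor), while keeping a master copy of the weights in FP32.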

Downstream task experiment results

To establish baselines, we extracted four typical financial-domain datasets from Value Simplex's actual business, covering both sentence-level and document-level tasks. On these, we ran downstream-task comparisons between FinBERT and three models widely used for Chinese: Google's native Chinese BERT, and the BERT-wwm and RoBERTa-wwm-ext models open-sourced by the HIT & iFLYTEK Joint Lab. To keep the comparison fair, we did not search for a per-model optimal learning rate; all four models use the best learning rate reported for BERT-wwm, 2e-5.

All experimental results are averages over five test runs, with the maximum of the five runs in parentheses; the evaluation metric is F1-score.
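Concretely, the reporting convention used in the tables below — mean of five runs with the best run in parentheses — amounts to the following hypothetical helper (the three-decimal rounding is our assumption, inferred from the tables):

```python
def summarize_runs(f1_scores):
    """Format several test-run F1 scores as 'mean(best)', e.g.
    '0.867(0.870)'. Hypothetical helper mirroring the tables'
    reporting convention; not code from the FinBERT repo."""
    mean = sum(f1_scores) / len(f1_scores)
    return f"{mean:.3f}({max(f1_scores):.3f})"
```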

3.1 Experiment 1: financial short-text type classification

(1) Task

This task comes from Value Simplex's information-flow products. Its core goal is to classify and tag short financial texts by content, so that users can access content of interest more promptly and accurately.

We simplified the original task: from the original 15 categories, we selected the 6 hardest to distinguish for the experiments.

(2) Dataset

The dataset for this task contains 3,000 samples, with about 1,100 training samples and about 1,900 test samples.

(3) Experimental results

| Task \ Model | Google-BERT | BERT-wwm | RoBERTa-wwm-ext | FinBERT |
| --- | --- | --- | --- | --- |
| Financial short-text type classification | 0.867(0.874) | 0.867(0.877) | 0.877(0.885) | 0.895(0.897) |

3.2 Experiment 2: financial short-text industry classification

(1) Task

The core goal of this task is to classify short financial texts into industries by content, using CITIC's level-1 industry classification as the benchmark: 28 industry categories, including catering and tourism, commercial retail, textiles and apparel, agriculture/forestry/fishing, construction, petroleum and petrochemicals, telecommunications, computers, and so on. It can serve downstream applications such as financial public-opinion monitoring and intelligent search over research reports and announcements.

(2) Dataset

The dataset for this task contains 1,200 samples, with about 400 training samples and about 800 test samples. Each category has between 5 and 15 training samples, making this a typical few-shot task.

(3) Experimental results

| Task \ Model | Google-BERT | BERT-wwm | RoBERTa-wwm-ext | FinBERT |
| --- | --- | --- | --- | --- |
| Financial short-text industry classification | 0.939(0.942) | 0.932(0.942) | 0.938(0.942) | 0.951(0.952) |

3.3 Experiment 3: financial sentiment classification

(1) Task

This task comes from Value Simplex's financial quality-control products. Its core goal is to classify the financial sentiment of a text by content, for use in downstream market-sentiment observation and stock-correlation analysis.

The task has 4 categories, corresponding to different sentiment polarities and intensities.

(2) Dataset

The dataset for this task contains 2,000 samples, with about 1,300 training samples and about 700 test samples.

(3) Experimental results

| Task \ Model | Google-BERT | BERT-wwm | RoBERTa-wwm-ext | FinBERT |
| --- | --- | --- | --- | --- |
| Financial sentiment classification | 0.862(0.866) | 0.850(0.860) | 0.867(0.867) | 0.895(0.896) |

3.4 Experiment 4: named entity recognition in the financial domain

(1) Task

This task comes from Value Simplex's knowledge-graph products. Its core goal is to recognize and extract entities (company names or person names) from financial texts, mainly for entity extraction and entity linking in the knowledge graph.

(2) Dataset

The dataset contains 24,000 samples, of which 3,000 are training samples and 21,000 are test samples.

(3) Experimental results

| Task \ Model | Google-BERT | BERT-wwm | RoBERTa-wwm-ext | FinBERT |
| --- | --- | --- | --- | --- |
| Company name recognition | 0.865 | 0.879 | 0.894 | 0.922 |
| Person name recognition | 0.887 | 0.887 | 0.891 | 0.917 |

3.5 Summary

In this baseline test, we ran comparative experiments on four kinds of real business problems and data from financial scenarios: financial short-text type classification, financial text industry classification, financial sentiment analysis, and financial entity recognition. Compared with the three general-domain pre-trained models — Google's native Chinese BERT, BERT-wwm, and RoBERTa-wwm-ext — FinBERT improves results markedly, with average F1-score gains of 2 to 5.7 percentage points.

Conclusion

This article has introduced the background of the FinBERT open-source release, its training details, and the results of four comparative experiments. Going forward, the Value Simplex AI team will innovate and explore further in corpus size, training time, and pre-training methods, in order to build pre-trained models with a better understanding of the financial domain, and will release FinBERT 2.0 and FinBERT 3.0 when appropriate.

References

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805, 2018.

[2] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 2019.

[3] Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. ClinicalBERT: Modeling clinical notes and predicting hospital readmission. arXiv:1904.05342, 2019.

[4] Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: Pretrained language model for scientific text. In Proceedings of EMNLP, 2019.

[5] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. Pre-training with whole word masking for Chinese BERT. arXiv:1906.08101, 2019.

[6] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pre-training approach. arXiv:1907.11692, 2019.

[7] Paulius Micikevicius, et al. Mixed precision training. arXiv:1710.03740, 2017.

[8]https://github.com/ymcui/Chinese-BERT-wwm/

[9]https://github.com/huggingface/transformers

 

