
ICDAR 2021 Competition on Scientific Literature Parsing: Table Recognition Summary (the rest is document layout analysis)

2022-05-14 14:00:25 · Zheng Jianyu JY

Task B is the table recognition part; this article covers only table recognition.

Abstract (not essential; skip ahead if you only want the table recognition part).

The scientific literature contains important information about cutting-edge innovations in different fields. The development of automatic document processing has driven rapid progress in natural language information processing. However, scientific literature is usually provided in unstructured PDF format. Although PDF is well suited to preserving basic visual elements on a canvas, such as characters, lines, and shapes, for presentation to humans, automatic processing of the PDF format poses many challenges for machines. With more than 2.5 trillion PDF files in existence, these problems are also common in many other important applications.

A key challenge in automatically extracting information from scientific literature is that documents usually contain non-natural-language content, such as figures and tables. However, this content often conveys the key results, information, or summary of the research. To fully understand the scientific literature, automated systems must be able to recognize document layout and parse the non-natural-language content into a machine-readable format. Our ICDAR 2021 Competition on Scientific Literature Parsing (ICDAR2021-SLP) aims to drive progress in document understanding. ICDAR2021-SLP uses the PubLayNet and PubTabNet datasets, providing hundreds of thousands of training and evaluation examples. In Task A (document layout recognition), the highest-performing submissions combined object detection with specialized solutions for the different categories. In Task B (table recognition), the top submissions relied on methods that identify table components and on post-processing methods that generate the table structure and content. The results of both tasks show impressive performance and open the possibility of practical applications with high performance.

1. Introduction (not essential; can be skipped, to revisit later)

Portable Document Format (PDF) documents are everywhere; the number of documents across multiple industries exceeds 2.5 trillion [12], including insurance documents, medical documents, and peer-reviewed scientific articles. PDF is one of the main sources of knowledge both online and offline. Although PDF is well suited to preserving basic elements on a canvas (characters, lines, shapes, images, etc.) for human consumption across different operating systems and devices, it is not a format that machines can understand.

Most current document understanding methods rely on deep learning, which requires large numbers of training examples. We use PubMed Central to automatically generate large datasets. PubMed Central is a large collection of full-text articles in the biomedical field provided by the National Institutes of Health / National Library of Medicine.

As of today, PubMed Central holds nearly 7 million full-text articles from 2,476 journals, making it possible to study a wide range of document understanding problems across diverse article styles. Our datasets are generated from the subset of PubMed Central released under a Creative Commons license that permits commercial use.

The competition is divided into two tasks: one targets document layout understanding by asking participants to identify several types of information on document pages (Task A), and the other targets table understanding by asking participants to generate the HTML version of table images (Task B). IBM Research AI used a leaderboard system, based on EvalAI, to collect and evaluate participants' submissions.

In Task A, participants had access to all data except the ground truth of the final evaluation test set, which will be published in PubLayNet when available. In Task B, we released the final evaluation test set three days before participants submitted their final results. In the evaluation phase of Task A, we received 281 submissions from 78 different teams. The results of both tasks show that state-of-the-art algorithms achieve impressive performance, a significant improvement over previously reported results, which opens the possibility of practical applications with high performance.

3. Task B: Table Recognition

Tabular information is common in all kinds of documents. Compared with natural language, tables provide a way to summarize large amounts of data in a more compact and structured format. Tables also provide a format that helps readers find and compare information. This competition aims to advance research on the automatic recognition of unstructured tables.

Participants in this task need to develop a model that can convert images of tabular data into the corresponding HTML code, following the HTML representation of tables used in PubMed Central. The HTML code generated by participants should correctly represent the structure of the table and the content of each cell. The cell content should contain the HTML tags that define text styles (including bold, italic, strikethrough, superscript, and subscript). The HTML code does not need to reconstruct the appearance of the table, such as border lines, background color, or font face, size, or color.
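For illustration, here is a hypothetical example (my own, not taken from PubTabNet) of what such a target might look like, held in a Python string: the output carries structure tags and in-cell style tags, but no appearance styling.

```python
# Hypothetical target HTML for a simple 2x2 table with a bold header row:
# structure tags plus in-cell style tags (<b>, <i>, <sup>, ...), but no
# visual styling such as borders, colors, or fonts.
expected_html = (
    "<table>"
    "<thead><tr><td><b>Group</b></td><td><b>n</b></td></tr></thead>"
    "<tbody><tr><td>Control</td><td>42</td></tr>"
    "<tr><td>Treated</td><td>38</td></tr></tbody>"
    "</table>"
)

# The target contains only structure and text-style markup:
assert "<thead>" in expected_html and "<b>" in expected_html
assert "border" not in expected_html  # appearance is not reconstructed
```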

3.1 Related Work

There have been other table recognition challenges, mainly organized at the International Conference on Document Analysis and Recognition (ICDAR). The ICDAR 2013 table competition was the first competition on table detection and recognition [5]. It included 156 tables used to evaluate table detection and table recognition methods; however, no training data was provided. The ICDAR 2019 competition on table detection and recognition provided training, validation, and test samples (3,600 in total) for table detection and recognition [4]. Two types of documents, historical handwritten and modern printed, were provided in image format. The ICDAR 2019 competition consisted of three tasks: 1) locating table regions; 2) recognizing the table structure given the table region; 3) recognizing the table structure without a given table region. The ground truth includes only the bounding boxes of table cells, excluding cell content.

Our Task B competition presents a more challenging task: the model must recognize the table structure and the cell content relying only on the table image. In other words, the model needs to infer the tree structure of the table and the attributes (content, row span, column span) of each leaf node (header/body cell). In addition, we do not provide intermediate annotations of cell location, adjacency, or row/column segmentation, which are needed to train most existing table recognition models. We only provide the final tree representation for supervision. We believe this will motivate participants to develop novel image-to-structure mapping models.

3.2 Data

This task uses the PubTabNet dataset (v2.0.0) [16]. PubTabNet contains over 500k training samples and 9k validation samples, providing the ground-truth HTML code and the locations of non-empty table cells. Participants can use the training data to train their models and the validation data for model selection and hyperparameter tuning. The 9k+ final evaluation set (images only, no annotations) was released 3 days before the end of the final evaluation stage, during which participants submitted their results on this set.

Submissions are evaluated using the TEDS (Tree-Edit-Distance-based Similarity) metric [16]. TEDS measures the similarity between two tables using the tree edit distance proposed in [11]. The cost of insertion and deletion operations is 1. When a node n_o is substituted with a node n_s, the cost is 1 if either n_o or n_s is not a td. When both n_o and n_s are td, the substitution cost is 1 if their column spans or row spans differ; otherwise, the substitution cost is the normalized Levenshtein similarity [9] (in [0, 1]) between the contents of n_o and n_s. Finally, TEDS between two trees is computed as

TEDS(T_a, T_b) = 1 − EditDist(T_a, T_b) / max(|T_a|, |T_b|)

where EditDist denotes the tree edit distance and |T| is the number of nodes in T. The table recognition performance of a method on a set of test samples is defined as the mean TEDS score between the recognition result and the ground truth over all samples.
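A minimal sketch of two ingredients of this metric, assuming plain Python (the official evaluation uses the full tree edit distance of [11]; this is only an illustration): the normalized Levenshtein similarity used as the substitution cost between td contents, and the final TEDS formula given a precomputed tree edit distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def normalized_levenshtein_similarity(a: str, b: str) -> float:
    """In [0, 1]; 1.0 means identical cell contents."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def teds(edit_dist: float, n_nodes_a: int, n_nodes_b: int) -> float:
    """TEDS(Ta, Tb) = 1 - EditDist(Ta, Tb) / max(|Ta|, |Tb|)."""
    return 1.0 - edit_dist / max(n_nodes_a, n_nodes_b)

print(normalized_levenshtein_similarity("p < 0.05", "p < 0.5"))  # -> 0.875
print(teds(edit_dist=3, n_nodes_a=20, n_nodes_b=22))             # 1 - 3/22
```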

The competition was divided into three phases. The format verification phase ran through the whole competition; participants could use the mini development set we provided to verify that their result files met our submission requirements. The development phase ran from the start of the competition until 3 days before its end; in this phase, participants could submit results on the test samples to validate their models. The final evaluation phase took place over the last 3 days of the competition; in this phase, participants could submit inference results on the final evaluation set. The final ranking and winning teams were determined by performance in the final evaluation phase. Table 3.2 shows the sizes of the different datasets used in the different phases of Task B.

Split               Size      Phase
Training            500,777   N/A
Development         9,115     N/A
Mini development    20        Format verification
Test                9,138     Development
Final evaluation    9,064     Final evaluation

Table 3.2: Task B dataset statistics

3.3 Results

For Task B, we received 30 submissions from 30 teams for the final evaluation stage. The top 10 systems by TEDS in the final evaluation are shown in Table 4. Entries affected by problems with the final evaluation dataset, and therefore not considered in the evaluation, are marked in bold.

The first four systems achieve similar performance, after which we see a more significant gap. As shown in the system descriptions, they rely on a combination of several components that identify the relevant elements from the table image and then combine them. The performance on the TEDS metric is better than previously reported results using image-to-sequence methods [17]. In [17], the dataset is comparable to the test set of this competition and is also derived from PubMed Central.
Table 4: Overall results (TEDS all), broken down into simple and complex tables [16]

3.4 System Descriptions (only some systems are described)

Team: Davar-Lab-OCR, Hikvision Research Institute
Davar-Lab-OCR paper and source code
The table recognition framework consists of two main processes: table cell generation and structure inference.
(1) Table cell generation is built on a Mask R-CNN detection model. Specifically, the model is trained to learn row/column-aligned cell-level bounding boxes and the corresponding text content region masks. We introduce pyramid mask supervision and adopt a large HRNet-W48 Cascade Mask R-CNN backbone to obtain reliable aligned bounding boxes. In addition, we train a single-line text detection model and an attention-based text recognition model to provide OCR information; this is achieved by selecting instances that contain only one line of text. We also use a multi-scale ensemble to further improve the performance of the cell and single-line text detection models.
(2) In the structure inference stage, the aligned bounding boxes of the cells are connected horizontally/vertically according to their alignment overlap. Row/column information is then generated through a Maximum Clique Search process, during which empty cells are easily found.
To handle some special cases, we train an additional table detection model to filter out text that does not belong to the table.
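As an illustrative simplification of the structure-inference idea (my own sketch, not the team's code, which uses a Maximum Clique Search over alignment connections), vertically overlapping aligned boxes can be chained into rows:

```python
# Illustrative simplification (my own, not Davar-Lab-OCR's code): aligned
# cell boxes are connected when their y-ranges overlap, and connected
# groups become rows. The actual system runs a Maximum Clique Search
# over such connections and discovers empty cells along the way.

def v_overlap(box_a, box_b) -> bool:
    """Boxes are (x0, y0, x1, y1); True if their y-ranges intersect."""
    return min(box_a[3], box_b[3]) > max(box_a[1], box_b[1])

def group_rows(boxes):
    """Greedy row grouping: sort by top edge, then chain each box onto
    the last row if it vertically overlaps any box already in it."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if rows and any(v_overlap(box, other) for other in rows[-1]):
            rows[-1].append(box)
        else:
            rows.append([box])
    return rows

cells = [(0, 0, 50, 10), (60, 1, 110, 11),    # roughly aligned: row 1
         (0, 20, 50, 30), (60, 21, 110, 31)]  # roughly aligned: row 2
rows = group_rows(cells)
print(len(rows), len(rows[0]))  # -> 2 2
```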

Team: VCGroup
VCGroup GitHub repo:
In our method [7,10,14], we divide the table content recognition task into four sub-tasks: table structure recognition, text line detection, text line recognition, and box assignment. Our table structure recognition algorithm is customized from MASTER, a robust image text recognition algorithm. PSENet is used to detect each text line in the table image. For text line recognition, our model is also based on MASTER. Finally, in the box assignment stage, we associate the text boxes detected by PSENet with the structure items of the reconstructed table structure prediction and fill the recognized text line content into the corresponding items. Our proposed method achieves a TEDS score of 96.84% on the 9,115 validation samples and a TEDS score of 96.32% on the 9,064 samples of the final evaluation stage.

Team: Tomorrow Advancing Life (TAL)
The TAL system consists of two schemes:
(1) The table structure is reconstructed with 5 detection models: table detection, row detection, column detection, cell detection, and text line detection. Mask R-CNN is chosen as the baseline for these 5 detection models, with targeted optimizations for the different detection tasks. In the recognition part, the results of cell detection and text line detection are fed into a CRNN model to obtain the recognition result for each cell.
(2) The recovery of the table structure is treated as an img2seq problem. To shorten the decoding length, we replace the content of each cell with a distinct number; these numbers come from the text line detection results. We then use a CNN to encode the image and a transformer model to decode the table structure. The corresponding text line content can then be obtained with the CRNN model.
Both schemes produce complete table structure and content recognition results. We have a set of selection rules that combines the advantages of the two schemes to output the best final result.
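The decoding-length trick in scheme (2) can be sketched as follows (my own illustration, not TAL's code): each cell's content is replaced by an index placeholder, so the sequence model decodes only the structure, and recognized text is filled back in afterwards.

```python
# Sketch (my own illustration, not TAL's code) of the decoding-length
# trick: cell text is swapped for index tokens taken from text-line
# detection, so the transformer decodes structure plus short placeholders.
import re

def to_structure_sequence(cells):
    """cells: list of rows, each row a list of cell strings.
    Returns (structure tokens with placeholders, index -> text map)."""
    mapping, tokens = {}, ["<table>"]
    for row in cells:
        tokens.append("<tr>")
        for text in row:
            idx = len(mapping)
            mapping[idx] = text
            tokens.extend(["<td>", f"#{idx}", "</td>"])
        tokens.append("</tr>")
    tokens.append("</table>")
    return tokens, mapping

def fill_back(tokens, mapping):
    """Replace placeholder tokens with recognized text-line content."""
    return "".join(mapping[int(t[1:])] if re.fullmatch(r"#\d+", t) else t
                   for t in tokens)

tokens, mapping = to_structure_sequence([["Group", "n"], ["Control", "42"]])
print(tokens[:6])  # -> ['<table>', '<tr>', '<td>', '#0', '</td>', '<td>']
print(fill_back(tokens, mapping))
# -> <table><tr><td>Group</td><td>n</td></tr><tr><td>Control</td><td>42</td></tr></table>
```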

Team: PaodingAI, Beijing Paoding Technology Co., Ltd
The PaodingAI system is divided into three main parts: text block detection, text block recognition, and table structure recognition. The text block detector is trained from the Cascade R-CNN R50 2x model provided by MMDetection. The text block recognizer is trained from the SAR-TF model. The table structure recognizer is our own implementation of the model proposed in [13]. In addition to the above models, we also use models and rules to handle simple classification, <b> tags, and whitespace characters. Our system is not an end-to-end model and does not use ensemble methods.

Team: Kaen Context, Kakao Enterprise
The company is based in Gyeonggi-do, South Korea.
To solve the table recognition problem effectively, we used a 12-layer decoder-only transformer architecture with linear attention [8].
Data preparation: We use the RGB image (without rescaling) as the input condition and the merged HTML code as the target text sequence. We reshape a table image into a sequence of flattened patches (N, 8×8×3), where 8 is the width and height of each image patch and N is the number of patches. We then use a linear projection layer to map the image sequence into 512 dimensions. The target text sequence is converted into 512-dimensional embeddings and appended to the end of the projected image sequence. Finally, we add different positional encodings to the text and image sequences so that our model can distinguish between them.
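The patch flattening step can be sketched in pure Python (my own illustration of the described reshaping, assuming 8×8 RGB patches; a real system would do this with tensors):

```python
# Sketch (my own) of turning an H x W x 3 image into a sequence of
# flattened patches of length 8*8*3 = 192, i.e. the (N, 192) sequence
# that the linear projection layer consumes.
PATCH = 8

def to_patches(image):
    """image: H x W nested lists of (r, g, b) tuples; H, W divisible
    by PATCH. Returns a list of N flat patches of 192 numbers each."""
    h, w = len(image), len(image[0])
    patches = []
    for py in range(0, h, PATCH):
        for px in range(0, w, PATCH):
            flat = [c
                    for y in range(py, py + PATCH)
                    for x in range(px, px + PATCH)
                    for c in image[y][x]]
            patches.append(flat)
    return patches

# A dummy 16x24 "image": N should be (16/8) * (24/8) = 2 * 3 = 6 patches.
img = [[(0, 0, 0) for _ in range(24)] for _ in range(16)]
seq = to_patches(img)
print(len(seq), len(seq[0]))  # -> 6 192
```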
Training: The concatenated image-text sequence is used as the model input, and the model is trained with cross-entropy loss under teacher forcing.
Inference: The output of our model is sampled via beam search (beam = 32).
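The beam-search decoding can be illustrated with a toy example (my own sketch with a made-up scoring function; the team ran beam = 32 over their transformer's output distribution):

```python
# Toy beam-search decoder (my own sketch). `step_scores` stands in for
# the model: it returns log-probabilities for each next token given a
# prefix. At every step the `beam` highest-scoring prefixes are kept.
import math

def beam_search(step_scores, vocab, max_len, beam=3, eos="</s>"):
    beams = [([], 0.0)]  # (token prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # finished beam
                continue
            for tok, logp in step_scores(prefix, vocab).items():
                candidates.append((prefix + [tok], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    return beams[0][0]

# Made-up "model": always prefers "<td>" first, then end-of-sequence.
def toy_scores(prefix, vocab):
    best = "</s>" if "<td>" in prefix else "<td>"
    return {t: (math.log(0.9) if t == best else math.log(0.05))
            for t in vocab}

out = beam_search(toy_scores, ["<td>", "</td>", "</s>"], max_len=4)
print(out)  # -> ['<td>', '</s>']
```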

References:

1. Antonacopoulos, A., Bridson, D., Papadopoulos, C., Pletschacher, S.: A realistic dataset for performance evaluation of document layout analysis. In: 2009 10th International Conference on Document Analysis and Recognition. pp. 296–300. IEEE (2009)

2. Clausner, C., Antonacopoulos, A., Pletschacher, S.: ICDAR2017 competition on recognition of documents with complex layouts - RDCL2017. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). vol. 1, pp. 1404–1410. IEEE (2017)

3. Clausner, C., Papadopoulos, C., Pletschacher, S., Antonacopoulos, A.: The ENP image and ground truth dataset of historical newspapers. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR). pp. 931–935. IEEE (2015)

4. Gao, L., Huang, Y., Li, Y., Yan, Q., Fang, Y., Dejean, H., Kleber, F., Lang, E.M.: ICDAR 2019 competition on table detection and recognition. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 1510–1515. IEEE (Sep 2019). https://doi.org/10.1109/ICDAR.2019.00166

5. Göbel, M., Hassan, T., Oro, E., Orsi, G.: ICDAR 2013 table competition. In: 2013 12th International Conference on Document Analysis and Recognition. pp. 1449–1453. IEEE (2013)

6. Grygoriev, A., Degtyarenko, I., Deriuga, I., Polotskyi, S., Melnyk, V., Zakharchuk, D., Radyvonenko, O.: HCRNN: A novel architecture for fast online handwritten stroke classification. In: Proc. of Int. Conf. on Document Analysis and Recognition (2021)

7. He, Y., Qi, X., Ye, J., Gao, P., Chen, Y., Li, B., Tang, X., Xiao, R.: PingAn-VCGroup's solution for ICDAR 2021 competition on scientific table image recognition to LaTeX. arXiv (2021)

8. Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F.: Transformers are RNNs: Fast autoregressive transformers with linear attention. In: International Conference on Machine Learning. pp. 5156–5165. PMLR (2020)

9. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet Physics Doklady. vol. 10, pp. 707–710. Soviet Union (1966)

10. Lu, N., Yu, W., Qi, X., Chen, Y., Gong, P., Xiao, R., Bai, X.: MASTER: Multi-aspect non-local network for scene text recognition. Pattern Recognition (2021)

11. Pawlik, M., Augsten, N.: Tree edit distance: Robust and memory-efficient. Information Systems 56, 157–173 (2016)

12. Staar, P.W., Dolfi, M., Auer, C., Bekas, C.: Corpus conversion service: A machine learning platform to ingest documents at scale. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 774–782 (2018)

13. Tensmeyer, C., Morariu, V.I., Price, B., Cohen, S., Martinez, T.: Deep splitting and merging for table structure decomposition. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 114–121. IEEE (2019)

14. Ye, J., Qi, X., He, Y., Chen, Y., Gu, D., Gao, P., Xiao, R.: PingAn-VCGroup's solution for ICDAR 2021 competition on scientific literature parsing task B: Table recognition to HTML. arXiv (2021)

15. Zheng, X., Burdick, D., Popa, L., Zhong, X., Wang, N.X.R.: Global Table Extractor (GTE): A framework for joint table identification and cell structure recognition using visual context. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 697–706 (2021)

16. Zhong, X., ShafieiBavani, E., Yepes, A.J.: Image-based table recognition: data, model, and evaluation. arXiv preprint arXiv:1911.10683 (2019)

17. Zhong, X., Tang, J., Yepes, A.J.: PubLayNet: largest dataset ever for document layout analysis. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 1015–1022. IEEE (2019)

Copyright notice: This article was written by Zheng Jianyu JY. Please include a link to the original when reposting: https://chowdera.com/2022/134/202205141344315865.html
