当前位置:网站首页>Multi classification of unbalanced text using AWS sagemaker blazingtext

Multi classification of unbalanced text using AWS sagemaker blazingtext

2020-11-06 01:22:39 InfoQ

background

Text classification (Text Classification) It belongs to the field of natural language processing , It refers to the process that the computer maps a text containing information to a given category or several categories of topics in advance . But in reality , We often encounter imbalances in the categories of data samples (class imbalance) The phenomenon , It seriously affects the final result of text classification . The so-called sample imbalance refers to a given data set, some categories of data more , Some data categories are few , And the data category samples with more data proportion and data category samples with small proportion reach a large proportion .

BlazingText yes AWS SageMaker A built-in algorithm for , Provides Word2vec And text classification algorithm highly optimized implementation . This article uses Sagemaker BlazingText It realizes the text multi classification . On the problem of sample imbalance , Back translation and EDA Two methods are used to over sample a small number of samples , The back translation method calls AWS Translate The service was translated and retranslated , and EDA Methods mainly use synonyms to replace 、 Insert randomly 、 Random exchange 、 Random deletion deals with text data . This article also uses AWS SageMaker Automatic parametric optimization for BlazingText The text classification algorithm based on the algorithm finds the optimal hyperparameter .

This article is based on DBpedia The public dataset generated by processing contains 14 Unbalanced text data of categories , And did not do any sample imbalance processing Baseline Experiment and include back translation and EDA Oversampling experiments of two methods .

Link to the original text :【https://www.infoq.cn/article/xbSAYuJcQrm048GHl5dJ】. Without the permission of the author , Prohibited reproduced .

版权声明
本文为[InfoQ]所创,转载请带上原文链接,感谢