当前位置:网站首页>We made a medical version of the MNIST dataset, and found that the common automl algorithm is not so easy to use

We made a medical version of the MNIST dataset, and found that the common automl algorithm is not so easy to use

2020-11-08 13:02:01 U4u5y4 assault rifle

author | Devil 、 Zhang Qian

  source | Almost Human

Shanghai Jiaotong University researchers create a new open medical image data set MedMNIST, And Design 「MedMNIST Categorical decathlon 」, To promote AutoML Algorithm in the field of medical image analysis research .

stay AI In the development of Technology , Data sets play an important role . However , There are many difficulties in the creation of medical data sets , Such as data acquisition 、 Data tagging, etc .

In the near future , Researchers at Shanghai Jiaotong University created a medical image dataset MedMNIST, common contain 10 Preprocessing open medical image datasets ( Its data comes from many different data sources , And after pretreatment ).

Project address :

https://medmnist.github.io/

Address of thesis :

https://arxiv.org/pdf/2010.14925v1.pdf

GitHub Address :

https://github.com/MedMNIST/MedMNIST

Dataset download address :

https://www.dropbox.com/sh/upxrsyb5v8jxbso/AADOV0_6pC9Tb3cIACro1uUPa?dl=0

and MNIST The dataset is the same ,MedMNIST Data sets In lightweight 28 × 28 Performing classification tasks on images , The tasks involved cover the main medical image modes and diverse data scales . According to the researchers' design ,MedMNIST Data sets have the following features :

  • educative nature : The multimodal data in this dataset comes from multiple open medical image datasets with knowledge sharing license , It can be used for educational purposes .

  • Standardization : The researchers preprocessed the data , Convert it to the same format , therefore Users do not need to have background knowledge to use .

  • diversity : Multimodal datasets cover multiple data scales ( from 100 To 100,000) And tasks ( Two classification / Many classification 、 Ordered regression and multi label ).

  • Lightweight : The image size is 28 × 28, It is convenient for rapid prototyping and testing, and multimodal machine learning and AutoML Algorithm .

suffer Medical Segmentation Decathlon( Medical split decathlon ) Inspired by the , The study also designed MedMNIST Classification Decathlon(MedMNIST Categorical decathlon ), As AutoML Benchmark in the field of medical image classification .

It's all about 10 Evaluation on data sets AutoML Performance of the algorithm , The algorithm is not adjusted manually . The researchers compared the performance of several baseline methods , Including early stop ResNet [6]、 Open source AutoML Tools (auto-sklearn [7] and AutoKeras [8]), And commercialization AutoML Tools (Google AutoML Vision). The researchers hope that MedMNIST Classification Decathlon Can promote AutoML Research in the field of medical image analysis .

Ten preprocessed datasets

MedMNIST Data set containing 10 Preprocessing data sets , Covering the main data modes ( Such as X Photo chip 、OCT、 ultrasonic 、CT)、 Diverse classification tasks ( Two classification / Many classification 、 Ordered regression and multi label ) And data scale . As shown in the table 1 Shown , The diversity of data set design leads to the diversity of task difficulty , And that's what AutoML What benchmarks need . The researchers preprocessed each data set , Divide it into training - verification - Test subsets .

surface 1:MedMNIST Data set Overview , Covers the name of the dataset 、 source 、 Data mode 、 Task and dataset segmentation .

The data sets of these modes cover X Photo chip 、OCT、 ultrasonic 、CT、 Pathological section 、 Dermoscopy, etc , It's about colorectal cancer 、 Retinal diseases 、 Breast disease 、 Liver tumor and many other medical fields .

new type AutoML Medical image benchmark

As mentioned earlier , The researchers were inspired by the medical split decathlon , Designed 「MedMNIST Categorical decathlon 」, Designed to create lightweight... For medical image analysis AutoML The benchmark . It's all about 10 Evaluation on data sets AutoML Performance of the algorithm , The algorithm is not adjusted manually . The researchers compared the performance of several baseline methods , See the table below 2:

From the table 2 It can be seen that ,Google AutoML Vision The overall performance is good , But it's not always the best , Sometimes even lose to ResNet-18 and ResNet-50.auto-sklearn It doesn't perform well on most datasets , This shows that the performance of the typical statistical machine learning algorithm on the medical image data set is poor .AutoKeras Good performance on large data sets , Relatively poor performance on small data sets . No algorithm can achieve good generalization performance on these ten datasets , It helps to explore AutoML The algorithm is in different data modes 、 Generalization effects on task and scale datasets .

Next , Let's look at different methods in the training set 、 Performance on verification set and test set . Here's the picture 2 Shown , The algorithm is easy to over fit on small data sets .

Google AutoML Vision It can better control the over fitting problem , and auto-sklearn There is a serious over fitting . It can be inferred from this that , For learning algorithms , Appropriate reductive bias It's very important . We can still do that MedMNIST Explore different regularization techniques on datasets , Such as data enhancement 、 Model integration 、 Optimization algorithm, etc .

How to find data sets ?

Besides the medical field , Data sets from other fields are sometimes difficult to access , This requires us to master some common data collection methods and common resources . lately ,Medium A blogger on introduced several commonly used data collection sources :

1. Awesome Data

This is a GitHub The repository , Contains multiple different categories of datasets .

link :

https://github.com/awesomedata/awesome-public-datasets

2. Data Is Plural

This is a dataset resource presented in spreadsheet form , from 2015 It's been updated regularly since , The latest issue is 2020 year 10 month 28 The resources of the day , So some of the resources are very new .

link :https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0

3. Kaggle Datasets

Kaggle Datasets Provides preview and summary information about many datasets , Very suitable for retrieving data sets for specific topics .

link :

https://www.kaggle.com/datasets

4. Data.world

and Kaggle equally ,Data.world Provides a series of user contributed datasets , It also provides a platform for companies to store and organize their own data .

link :

https://data.world/

5. Google Dataset Search

Dataset search It's Google 2018 A new search function launched in . If you're looking for data from a particular topic or source , This tool is worth trying .

link :

https://datasetsearch.research.google.com/

6. OpenDaL

OpenDal It's also a dataset search tool , You can search in many ways , For example, according to the creation time or frame a certain area on the map .

link :

https://opendatalibrary.com/

7. Pandas Data Reader

Pandas Data Reader It can help you pull data from online resources , And then apply it to Python pandas DataFrame in . Most of this is financial data .

link :

https://pandas-datareader.readthedocs.io/en/latest/remote_data.html

8. from API get data

utilize Python from API Data acquisition is also a common method used by data scientists , Please refer to the following tutorial for specific operation steps .

link :

https://towardsdatascience.com/how-to-get-data-from-apis-with-python-dfb83fdc5b5b

Reference link :https://towardsdatascience.com/the-top-10-best-places-to-find-datasets-8d3b4e31c442

????

Now? , stay 「 You know 」 We can also be found

Go to Zhihu home page and search 「PaperWeekly」

Click on 「 Focus on 」 Subscribe to our column

About PaperWeekly

PaperWeekly It's a recommendation 、 Reading 、 Discuss 、 An academic platform for reporting the achievements of the frontier papers on artificial intelligence . If you study or engage in AI field , Welcome to clicking on the official account 「 Communication group 」, The little assistant will take you into PaperWeekly In the communication group .

版权声明
本文为[U4u5y4 assault rifle]所创,转载请带上原文链接,感谢