# Paper reading (47): DTFD-MIL: Double-tier feature distillation multiple instance learning for histopathology

2022-06-23 18:03:22 Inge

# 0 Overview

## 0.2 Background

Multiple instance learning (MIL) is increasingly applied to the classification of histopathology whole slide images (WSIs). However, this line of research still faces difficulties such as small sample cohorts. In this setting, the number of WSIs (bags) is limited, while the resolution of a single WSI is huge, which leads to a very large number of cropped patches (instances).

Tip: I am not sure whether "small sample cohorts" is the right wording here. I have downloaded WSIs before, and a single one can be more than 1 GB, which is really scary.

## 0.3 Method

Pseudo-bags are introduced to virtually increase the number of bags, and on this basis a double-tier MIL framework is built to make effective use of the intrinsic features. In addition, the instance probability is derived under the attention-based MIL (AB-MIL) framework, and this derivation is used to help build and analyze the proposed framework.

## 0.4 Bib

@inproceedings{Zhang:2022:double,
author		=	{Hongrun Zhang and Yanda Meng and Yitian Zhao and Yihong Qiao and Xiaoyun Yang and Sarah E Coupland and Yalin Zheng},
title		=	{{DTFD}-{MIL}: {D}ouble-tier feature distillation multiple instance learning for histopathology whole slide image classification},
booktitle	=	{Computer Vision and Pattern Recognition},
year		=	{2022}
}


# 1 Introduction

Whole slide image (WSI) annotation is one of the major challenges in computer vision. WSIs are widely used in histopathology: they improve pathologists' workflows and diagnostic decision-making in digital pathology, and they also drive the demand for intelligent or automatic WSI analysis tools. A single WSI is very large, ranging from 100 MB to 10 GB. Because of this, it is unrealistic to directly apply existing machine learning methods, such as models designed for natural images or other medical images; deep learning models require large-scale data and high-quality annotations. But pixel-level annotation of a WSI is hardly feasible (￣▽￣)". This limited-annotation problem has therefore attracted great interest from deep learning researchers, for example in weakly supervised and semi-supervised learning, and most weakly supervised WSI research can be cast as MIL. Under the MIL framework, a WSI is treated as a bag that can contain thousands of patches (instances). As long as at least one instance is positive, the WSI is positive.

In computer vision, there have been many attempts at the MIL problem. However, the inherent nature of WSIs makes WSI classification under MIL less straightforward than in other computer vision subfields, because the only direct supervision for training is the labels of a few hundred WSIs. This easily leads to overfitting: the model tends to fall into a local minimum during optimization, the learned features are only weakly correlated with the target disease, and the generalization ability of the model is reduced.

To address overfitting, the guiding idea of WSI research under MIL is to learn more information from fewer labels. Exploiting the mutual-instance relation is one effective approach; the relation can be specified as a spatial or feature distance, or learned by neural modules such as recurrent neural networks, transformers, and graph neural networks.

Most existing methods can be classified as attention-based (AB-MIL); they mainly differ in how the attention scores are computed. However, explicitly inferring instance probabilities under the AB-MIL framework has been considered infeasible, and as an alternative, attention scores are often used as an indicator of positive activation. In this paper, we argue that the attention score is not a rigorous measure for this purpose, and instead derive the instance probability under the AB-MIL framework.

Given a huge WSI, the unit that is directly processed is a much smaller patch cropped from it. The purpose of a MIL model for WSIs is to identify the most distinctive patches, because they are the most likely to trigger the bag label. However, there are few WSIs, countless patches, and the label information is only available at the WSI level. In addition, in pathology WSIs the positive instances corresponding to lesion regions often occupy only a small part of the tissue, which makes the proportion of positive instances very small. Identifying these positive instances therefore remains challenging in a setting where overfitting is highly likely.

In recent years, although many methods exploit mutual-instance information to improve MIL performance, they do not explicitly address the problems caused by the intrinsic characteristics of WSIs described above. To alleviate the negative effects of these problems, we introduce the concept of pseudo-bags into the framework: the instances of a WSI are randomly divided into subsets, and each subset is a pseudo-bag. Each pseudo-bag is assigned the label of its parent bag, i.e., the original bag. This virtually increases the number of bags while ensuring that each pseudo-bag holds fewer instances, and it is the main idea behind our double-tier feature distillation MIL model, as shown in Figure 1. Specifically, a Tier-1 AB-MIL model is applied to the pseudo-bags of all WSIs. One risk, however, is that a pseudo-bag from a positive bag may in fact contain no positive instance, in which case it is assigned a wrong label.

Figure 1: Difference between the proposed method and traditional MIL.

To address this issue, we distill a feature vector from each pseudo-bag and build a Tier-2 AB-MIL model on these vectors, as shown in Figure 3. Through this distillation, the Tier-1 model provides distinct features from which the Tier-2 model obtains a better representation of the parent bag. In addition, for feature distillation, we use the basic idea of Grad-CAM (gradient-based class activation map), a method for visualizing deep learning features, to derive the instance probability within the AB-MIL framework.

Figure 3: Overall framework of DTFD-MIL. A set of instances is first cropped from the tissue region of a WSI (only nine are shown here). These instances are then randomly divided into M (e.g., 3) pseudo-bags. The Tier-1 AB-MIL model produces a feature vector for each pseudo-bag, and these vectors are the input of the Tier-2 AB-MIL model. The true bag label supervises the predictions of both tiers.

In essence, we approach the WSI problem from a novel perspective, namely a double-tier MIL framework. The main contributions are as follows:

1) The concept of pseudo-bags is introduced to cope with the shortage of WSIs;
2) Using the basic idea of Grad-CAM, the instance probability is derived directly under the AB-MIL framework, which could serve as an extension for many future MIL methods;
3) Based on the derived instance probability, a double-tier MIL framework is developed, and its advantages are demonstrated on two large public WSI datasets.

# 2 Method

## 2.1 Review of Grad-CAM and AB-MIL

### 2.1.1 Grad-CAM

An end-to-end deep learning image classification model usually consists of two modules: a deep convolutional neural network (DCNN) for high-level feature extraction and a multi-layer perceptron (MLP) for classification. Feeding an image to the DCNN yields multiple feature maps, from which a feature vector is obtained through a pooling function. Passing this feature vector to the MLP gives the class probabilities 🤩, as shown in Figure 2 (a).

Figure 2: (a) Illustration of a deep learning image classification model. Global average pooling is applied to the feature maps of the whole image to obtain a feature vector, which is passed to the MLP to get the class probabilities. (b) Illustration of AB-MIL. The extracted instance features are weighted by the attention scores; the weighted sum over all instances serves as the bag representation and is passed to the MLP to output the bag prediction.

Suppose the feature maps output by the DCNN are $U\in\mathbb{R}^{D\times W\times H}$, where $D$ is the number of channels and $W$ and $H$ are the spatial dimensions. Applying global average pooling to $U$ gives the feature vector that represents the image:

$$\boldsymbol{f}=\text{GAP}_{W,H}(U)\in\mathbb{R}^D \tag{1}$$

where $\text{GAP}_{W,H}(U)$ is average pooling over $W,H$, i.e., the $d$-th element of $\boldsymbol{f}$ is $f_d=\frac{1}{WH}\sum_{w=1,h=1}^{W,H}U_{w,h}^d$. With $\boldsymbol{f}$ as input, the MLP outputs the logit $s^c$ for each class $c\in\{1,2,\dots,C\}$, which indicates the signal strength of the image belonging to class $c$; the predicted class probabilities are obtained with a softmax operation. Based on Grad-CAM, the class activation map of class $c$ is defined as the weighted sum of the feature maps:

$$\boldsymbol{L}^c=\sum_{d}^D\beta_d^cU^d,\qquad\beta_d^c=\frac{1}{WH}\sum_{w,h}^{W,H}\left(\frac{\partial s^c}{\partial U_{w,h}^d}\right) \tag{2}$$

where $\boldsymbol{L}^c\in\mathbb{R}^{W\times H}$, and $L_{w,h}^c$ is the magnitude of $\boldsymbol{L}^c$ at position $(w,h)$, indicating how strongly this position tends toward class $c$:

$$L_{w,h}^c=\sum_{d=1}^D\beta_d^cU_{w,h}^d \tag{3}$$
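
As a minimal numerical sketch of Eqs. 1 to 3, the snippet below assumes that the MLP is a single linear layer without bias (the weight matrix `W_cls` is a hypothetical name, not from the paper), so the gradient in Eq. 2 has a closed form. Shapes and random values are purely illustrative.

```python
import numpy as np

def gap(U):
    """Eq. 1: global average pooling over the spatial axes of U (D, W, H)."""
    return U.mean(axis=(1, 2))

def grad_cam(U, W_cls, c):
    """Eqs. 2-3 for a linear classifier s = W_cls @ gap(U) (an assumption).

    For this classifier, d s^c / d U[d, w, h] = W_cls[c, d] / (W * H) at every
    position, so beta_d^c (the spatial mean of that gradient) has the same value.
    """
    D, W, H = U.shape
    grad = np.broadcast_to(W_cls[c][:, None, None] / (W * H), U.shape)
    beta = grad.mean(axis=(1, 2))           # beta_d^c, shape (D,)
    return np.tensordot(beta, U, axes=1)    # L^c, shape (W, H)

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 3, 3))      # D=4 channels, 3x3 feature maps
W_cls = rng.normal(size=(2, 4))     # C=2 classes
f = gap(U)
s = W_cls @ f                       # logits s^c
L0 = grad_cam(U, W_cls, c=0)        # activation map for class 0
```

Under this linear assumption the entries of `L0` sum to the logit `s[0]`, so the map can be read as a spatial decomposition of the logit.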

### 2.1.2 AB-MIL

Consider a bag with $K$ instances, $X=\{x_1,x_2,\dots,x_K\}$. Each instance $x_k$, $k\in\{1,2,\dots,K\}$, holds a latent label $y_k$ (unknown), where $y_k=1$ denotes positive and $y_k=0$ negative. The goal of MIL is to detect whether the bag contains at least one positive instance. The only information available during training is the bag label, defined as:

$$Y=\begin{cases}1, & \text{if}\ \sum_{k=1}^Ky_k>0\\ 0, & \text{otherwise}\end{cases} \tag{4}$$

A simple way to solve this problem is to assign the bag label to each of its instances and train a classifier, then aggregate the instance predictions by average or max pooling to obtain the bag label. Another strategy is to learn a bag representation $\boldsymbol{F}$, which reduces the problem to a conventional classification task. This strategy is more effective and can be seen as a form of embedding-based MIL. The bag embedding is defined as:

$$\boldsymbol{F}=\text{G}(\{\boldsymbol{h}_k\mid k=1,2,\dots,K\}) \tag{5}$$

where $\text{G}$ is the aggregation function and $\boldsymbol{h}_k\in\mathbb{R}^D$ is the feature extracted from instance $k$. A typical aggregation function is the attention mechanism:

$$\boldsymbol{F}=\sum_{k=1}^K\alpha_k\boldsymbol{h}_k\in\mathbb{R}^D \tag{6}$$

where $\alpha_k$ is the learnable weight of instance feature $\boldsymbol{h}_k$, and $D$ is the dimension of the vectors $\boldsymbol{F}$ and $\boldsymbol{h}_k$. This mechanism is shown in Figure 2 (b). There are many ways to compute the attention scores; for example, in classic AB-MIL the weight is computed as:

$$\alpha_k=\frac{\exp\{\boldsymbol{w}^T(\tanh(\boldsymbol{V}_1\boldsymbol{h}_k)\odot\text{sigm}(\boldsymbol{V}_2\boldsymbol{h}_k))\}}{\sum_{j=1}^K\exp\{\boldsymbol{w}^T(\tanh(\boldsymbol{V}_1\boldsymbol{h}_j)\odot\text{sigm}(\boldsymbol{V}_2\boldsymbol{h}_j))\}} \tag{7}$$

where $\boldsymbol{w}$, $\boldsymbol{V}_1$, and $\boldsymbol{V}_2$ are learnable parameters.
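
The bag label of Eq. 4 and the gated attention pooling of Eqs. 6 and 7 can be sketched in a few lines of NumPy. This is only an illustration: the hidden attention size `A`, the random parameters, and the function names are assumptions, and in practice the parameters would be learned.

```python
import numpy as np

def bag_label(instance_labels):
    """Eq. 4: a bag is positive if at least one instance is positive."""
    return int(np.sum(instance_labels) > 0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention_pool(h, V1, V2, w):
    """Eqs. 6-7: gated attention scores and the resulting bag embedding.

    h: (K, D) instance features; V1, V2: (A, D); w: (A,).
    """
    gate = np.tanh(h @ V1.T) * sigmoid(h @ V2.T)    # (K, A)
    scores = gate @ w                               # (K,)
    scores = scores - scores.max()                  # stabilise the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()   # (K,), sums to 1
    F = alpha @ h                                   # (D,) bag embedding
    return alpha, F

rng = np.random.default_rng(0)
K, D, A = 5, 8, 4                                   # A is an assumed hidden size
h = rng.normal(size=(K, D))
V1 = rng.normal(size=(A, D))
V2 = rng.normal(size=(A, D))
w = rng.normal(size=A)
alpha, F = gated_attention_pool(h, V1, V2, w)
```

Subtracting the maximum score before exponentiation does not change `alpha`; it only avoids overflow for large scores.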

## 2.2 Derivation of instance probability in AB-MIL

Although embedding-based MIL methods perform very well, computing instance-level class probabilities under them has seemed infeasible. The paper proves that the predicted probability of a single instance can be obtained under AB-MIL (proof omitted here). It is therefore feasible to apply Grad-CAM to AB-MIL and directly infer the signal strength of an instance belonging to a given class. Analogous to Eq. 2, the signal strength of instance $k$ for class $c$ can be written as:

$$L_k^c=\sum_{d=1}^D\beta_d^c\hat{h}_{k,d},\qquad\beta_{d}^c=\frac{1}{K}\sum_{i=1}^K\frac{\partial s^c}{\partial\hat{h}_{i,d}} \tag{8}$$

where $s^c$ is the logit output by the MIL classifier for class $c$, $\hat{h}_{k,d}$ is the $d$-th element of $\hat{\boldsymbol{h}}_k$, and $\hat{\boldsymbol{h}}_k=\alpha_kK\boldsymbol{h}_k$. With this scaling, Eq. 6 becomes $\boldsymbol{F}=\frac{1}{K}\sum_{k=1}^K\hat{\boldsymbol{h}}_k$, which has the same form as the global average pooling in Eq. 1. Applying the softmax function, the predicted probability that instance $k$ belongs to class $c$ is:

$$p_k^c=\frac{\exp(L_k^c)}{\sum_{t=1}^C\exp(L_k^t)} \tag{9}$$
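
A small sketch of Eqs. 8 and 9 follows. It assumes, for illustration only, that the bag classifier is a single linear layer without bias (`W_cls`, a hypothetical name) applied to the bag embedding, so the gradient in Eq. 8 can be written in closed form instead of being obtained by automatic differentiation.

```python
import numpy as np

def instance_probabilities(h, alpha, W_cls):
    """Eqs. 8-9 for a linear bag classifier s = W_cls @ F (an assumption).

    h: (K, D) instance features, alpha: (K,) attention weights, W_cls: (C, D).
    """
    K = h.shape[0]
    h_hat = alpha[:, None] * K * h           # h_hat_k = alpha_k * K * h_k
    # F = mean_k h_hat_k, so d s^c / d h_hat[i, d] = W_cls[c, d] / K for all i;
    # averaging over i leaves beta[c, d] = W_cls[c, d] / K.
    beta = W_cls / K                         # (C, D)
    L = h_hat @ beta.T                       # (K, C) signal strengths L_k^c
    L_shift = L - L.max(axis=1, keepdims=True)
    p = np.exp(L_shift) / np.exp(L_shift).sum(axis=1, keepdims=True)
    return L, p

rng = np.random.default_rng(1)
K, D, C = 6, 8, 2
h = rng.normal(size=(K, D))
alpha = rng.random(K)
alpha = alpha / alpha.sum()                  # stand-in for learned attention
W_cls = rng.normal(size=(C, D))
L, p = instance_probabilities(h, alpha, W_cls)
s = W_cls @ (alpha @ h)                      # bag logits via Eq. 6
```

In this linear setting the signal strengths of all instances add up to the bag logit (`L.sum(axis=0)` equals `s`), which gives some intuition for why $L_k^c$ can be read as an instance-level contribution.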

## 2.3 Double-tier feature distillation MIL

Given $N$ bags (WSIs), each with $K_n$ instances, i.e., $\boldsymbol{X}_n=\{x_{n,k}\mid k=1,2,\dots,K_n\}$, $n\in\{1,2,\dots,N\}$, let $Y_n$ denote the true label of bag $n$. The feature of each instance is denoted $\boldsymbol{h}_{n,k}$ and is extracted by a neural network $\mathbf{H}$, i.e., $\boldsymbol{h}_{n,k}=\mathbf{H}(x_{n,k})$. The instances in each bag are randomly divided into $M$ pseudo-bags with roughly equal numbers of instances, $\boldsymbol{X}_n=\{\boldsymbol{X}_n^m\mid m=1,2,\dots,M\}$. Each pseudo-bag takes the label of its parent bag, i.e., $Y_n^m=Y_n$. The Tier-1 AB-MIL model, denoted $\text{T}_1$, processes each pseudo-bag, and the bag probability it produces for a pseudo-bag is:

$$y_n^m=\text{T}_1(\{\boldsymbol{h}_k=\mathbf{H}(x_k)\mid x_k\in\boldsymbol{X}_n^m\}) \tag{10}$$

The loss function of $\text{T}_1$ is defined as the cross-entropy:

$$\mathcal{L}_1=-\frac{1}{MN}\sum_{n=1,m=1}^{N,M}\left[Y_n^m\log y_n^m+(1-Y_n^m)\log(1-y_n^m)\right] \tag{11}$$

The probability of each instance in a pseudo-bag is then obtained with Eqs. 8–9. Based on the instance probabilities, a feature vector is distilled from each pseudo-bag; the distilled feature of the $m$-th pseudo-bag of the $n$-th bag is denoted $\hat{\boldsymbol{f}}_n^m$. All distilled features are passed to the Tier-2 AB-MIL model $\text{T}_2$, which infers the label of each bag:

$$\hat{y}_n=\text{T}_2\left(\left\{\hat{\boldsymbol{f}}_n^m\mid m\in\{1,2,\dots,M\}\right\}\right) \tag{12}$$

The loss of $\text{T}_2$ is defined as:

$$\mathcal{L}_2=-\frac{1}{N}\sum_{n=1}^N\left[Y_n\log\hat{y}_n+(1-Y_n)\log(1-\hat{y}_n)\right] \tag{13}$$

The total classification loss is:

$$\mathcal{L}=\operatorname*{arg\,min}_{\boldsymbol{\theta}_1}\mathcal{L}_1+\operatorname*{arg\,min}_{\boldsymbol{\theta}_2}\mathcal{L}_2 \tag{14}$$

where $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$ are the network parameters.
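
The pseudo-bag construction and the cross-entropy used for the two tiers can be sketched as follows. The function names are hypothetical, and the three pseudo-bag predictions fed to the loss are made-up placeholder numbers, not results from the paper.

```python
import numpy as np

def split_pseudo_bags(num_instances, M, rng):
    """Randomly split instance indices into M pseudo-bags of near-equal size."""
    perm = rng.permutation(num_instances)
    return np.array_split(perm, M)

def binary_cross_entropy(Y, y, eps=1e-12):
    """Mean binary cross-entropy between labels Y and predicted probabilities y."""
    Y = np.asarray(Y, dtype=float)
    y = np.clip(np.asarray(y, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(Y * np.log(y) + (1.0 - Y) * np.log(1.0 - y)))

rng = np.random.default_rng(0)
pseudo_bags = split_pseudo_bags(num_instances=10, M=3, rng=rng)
parent_label = 1
pseudo_labels = [parent_label] * len(pseudo_bags)    # Y_n^m = Y_n
# Placeholder Tier-1 predictions y_n^m for the three pseudo-bags.
loss = binary_cross_entropy(pseudo_labels, [0.9, 0.6, 0.2])
```

`np.array_split` is used instead of `np.split` because the number of instances is generally not divisible by `M`.
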
It should be noted that the pseudo-bag labels contain a considerable amount of noise, because random partitioning does not guarantee that every pseudo-bag of a positive bag contains at least one positive instance. Deep learning has some tolerance to noisy labels. In addition, the noise level is roughly tied to $M$, and ablation experiments are later used to evaluate the impact of $M$ on the final performance.
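
To make the link between $M$ and label noise concrete, here is a small worked calculation under a simplifying assumption that is not stated in the source: a positive bag has `K` instances of which `P` are positive, and a pseudo-bag is a uniformly random subset of size `K // M`. The probability that such a pseudo-bag contains no positive instance, and therefore inherits a wrong positive label, is then hypergeometric.

```python
from math import comb

def prob_mislabelled(K, P, M):
    """P(a pseudo-bag of size K // M from a positive bag has no positive instance)."""
    size = K // M
    return comb(K - P, size) / comb(K, size)

# Illustrative numbers only: 1000 patches, 20 of them positive.
for M in (2, 5, 10):
    print(M, prob_mislabelled(K=1000, P=20, M=M))
```

Under this assumption, a larger `M` gives smaller pseudo-bags and so a higher chance of a mislabelled pseudo-bag, which is the trade-off the ablation on $M$ examines.
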
Four feature distillation strategies are considered:

1) MaxS (maximum selection): after processing by $\text{T}_1$, the feature of the instance with the maximum positive probability in the pseudo-bag is passed to $\text{T}_2$;
2) MaxMinS (MaxMin selection): two instances are selected, those with the maximum and the minimum positive probability;
3) MAS (maximum attention score selection): the instance with the largest attention score is selected;
4) AFS (aggregated feature selection): the instance features are aggregated through Eq. 6.
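
The four strategies can be sketched as a single selection function. This is an illustrative reading rather than the authors' code: MaxMinS is taken here to mean the instances with the highest and lowest positive probability, and the toy inputs are invented.

```python
import numpy as np

def distill(h, p_pos, alpha, strategy):
    """Return the feature(s) a pseudo-bag passes on to the Tier-2 model.

    h: (K, D) instance features, p_pos: (K,) positive probabilities (Eq. 9),
    alpha: (K,) attention scores (Eq. 7). Output shape is (n_selected, D).
    """
    if strategy == "MaxS":
        return h[[np.argmax(p_pos)]]
    if strategy == "MaxMinS":
        return h[[np.argmax(p_pos), np.argmin(p_pos)]]
    if strategy == "MAS":
        return h[[np.argmax(alpha)]]
    if strategy == "AFS":
        return (alpha @ h)[None, :]          # Eq. 6 aggregation
    raise ValueError(f"unknown strategy: {strategy}")

# Toy pseudo-bag with K=4 instances and D=3 features.
h = np.arange(12, dtype=float).reshape(4, 3)
p_pos = np.array([0.1, 0.7, 0.4, 0.2])
alpha = np.array([0.4, 0.3, 0.2, 0.1])
selected = {name: distill(h, p_pos, alpha, name)
            for name in ("MaxS", "MaxMinS", "MAS", "AFS")}
```

Note that MaxS and MAS can pick different instances (index 1 versus index 0 in the toy data), which mirrors the earlier point that the attention score is not the same thing as the instance probability.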
