

CLIP Learning notes

Paper: Learning Transferable Visual Models From Natural Language Supervision

CLIP is a very influential piece of work in both NLP and CV, published by OpenAI in early 2021.

It is pre-trained with contrastive learning on 400 million image-text pairs, producing aligned text and image embedding encoders. It achieves strong zero-shot results on image classification and has since become the basis for a lot of follow-up work.

CLIP on GitHub: openai/CLIP: Contrastive Language-Image Pretraining (github.com)

Paper: [2103.00020] Learning Transferable Visual Models From Natural Language Supervision (arxiv.org)

CLIP official page: CLIP: Connecting Text and Images (openai.com)

Playground

Before studying CLIP in depth, let's play with it first. You can take the interaction code from the CLIP GitHub repo and try loading the model in a Colab notebook.

I rewrote the official interaction code into a Colab notebook; the code is here:

clip_playground.ipynb

Official classification examples
Below are a few images classified against the CIFAR-100 labels.

My own results

My test results
It recognized Kobe at a glance, so it does have some celebrity-recognition ability. It also recognized Luffy, assigning some probability to "one piece"; without Luffy in the label set, the probability for "one piece" would presumably have been relatively high.
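
For reference, here is a minimal zero-shot classification sketch along the lines of the official README, assuming the openai/CLIP package is installed; the image file and the candidate labels are placeholders of my own, not the official demo's.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate labels (CIFAR-100 class names could be used instead).
image = preprocess(Image.open("kobe.jpg")).unsqueeze(0).to(device)
labels = ["a photo of Kobe Bryant", "a photo of Luffy", "a photo of One Piece"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    # The model encodes both modalities and returns similarity logits
    # already scaled by the learned temperature.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

for label, p in zip(labels, probs[0]):
    print(f"{label}: {p:.3f}")
```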

Model principle

I recommend the Bilibili video that walks through the CLIP paper paragraph by paragraph; it is explained very well (CLIP 论文逐段精读【论文精读】, bilibili).

[Figure: overall CLIP architecture, contrastive pre-training of an image encoder and a text encoder]

The figure above shows the overall structure of the model. The text goes through a text encoder to get a sentence vector (512-dimensional here), and the image goes through an image encoder to get a 512-dimensional image vector. The model computes the pairwise similarities and a symmetric loss: matched image-text pairs are treated as positives and all other pairings in the batch as negatives. The figure below shows pseudocode for the loss computation.

[Figure: numpy-style pseudocode for the symmetric contrastive loss, from the paper]
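
The gist of that pseudocode, written as a minimal PyTorch sketch (batch size n, embedding dimension d; the tensor and function names are mine, not the paper's):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    """Symmetric contrastive loss over a batch of n matched image-text pairs.

    image_features, text_features: [n, d] embeddings in the joint space.
    logit_scale: scalar, exp of the learned temperature parameter.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [n, n] similarity matrix; the diagonal holds the positive pairs.
    logits = logit_scale * image_features @ text_features.t()
    labels = torch.arange(logits.shape[0], device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image.
    loss_i = F.cross_entropy(logits, labels)
    loss_t = F.cross_entropy(logits.t(), labels)
    return (loss_i + loss_t) / 2
```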

Data sets

Previous work mainly used three datasets:

  • MS-COCO (Lin et al., 2014)
  • Visual Genome (Krishna et al., 2017)
  • YFCC100M (Thomee et al., 2016)

The first two datasets are high quality but only contain roughly 100k images each, and while YFCC100M has about 100 million images, its quality is poor: after filtering only about 15 million remain. OpenAI looked at those 15 million images, decided that was nowhere near enough, and so the following dataset was born.

WIT (WebImageText): OpenAI built a dataset of 400 million (image, text) pairs from the internet, using about 500,000 queries and collecting up to 20,000 image-text pairs per query.

Training details

  1. For images, only random square crops from resized images are used as data augmentation. The temperature parameter that scales the logits before the softmax is learned as a training parameter, avoiding the cost of tuning it by hand (a sketch of this learnable temperature follows this list).
  2. Image encoder: ResNet-50 variants or ViT; input image size [3, 224, 224].
  3. Text encoder: a 12-layer Transformer with hidden size 512, using BPE encoding with a vocabulary of 49,152 and a maximum sequence length of 76. Masked self-attention is used so the text encoder could be initialized from a pre-trained language model, or an auxiliary language modeling objective could be added; the authors leave this part as future work.
  4. To save memory and speed up training: gradient checkpointing, half-precision Adam statistics, and fp16 weights.
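
A minimal sketch of the learnable temperature mentioned in point 1; the initial value 1/0.07 and the cap of 100 on the effective scale follow the paper, while the module name and where the clamp is applied are my own assumptions:

```python
import math
import torch
import torch.nn as nn

class CLIPHead(nn.Module):
    def __init__(self):
        super().__init__()
        # Learned log-temperature, initialized to log(1/0.07) as in the paper.
        self.logit_scale = nn.Parameter(torch.ones([]) * math.log(1 / 0.07))

    def similarity_logits(self, image_features, text_features):
        # Clamp so the effective scale never exceeds 100, preventing instability.
        logit_scale = self.logit_scale.clamp(max=math.log(100)).exp()
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        return logit_scale * image_features @ text_features.t()
```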

Limitations

  • The model's few-shot performance can be worse than its zero-shot performance, which raises an open problem: how to adapt this model to a domain using only a small number of samples.
  • Zero-shot performance is still below the current state of the art. Of course, if un-tuned zero-shot CLIP could already beat the best model in every field, nobody else would need to play; it already feels good enough. The rest is mostly about scaling up data and model size, which is not something poor people like us get to think about, so let's look for improvements on the method side instead.

Other

Contrastive learning was chosen over predictive learning because the contrastive objective is roughly 4x more data-efficient than predicting the text.

Practical applications of the model

  • Given a video, find the frame described by a text query. When reviewing surveillance footage you could quickly locate a moment via a text description, e.g. in traffic video, find the frame where a vehicle commits a violation. The current model may not really understand the meaning of "violation", but it should be able to find frames of car collisions.
  • Image retrieval: given a very large image library with no titles or descriptions, CLIP should currently be the best choice for retrieving the image you want from a text query (a minimal retrieval sketch follows this list).
  • VQA tasks: turn the question into a prompt, then score the similarity against candidate labels. Because the prompt is fixed, performance may not be great, but there are ways to improve it.
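
For the image-retrieval use case above, a minimal sketch assuming the openai/CLIP package; the file paths and query are placeholders, and in practice the gallery embeddings would be precomputed and indexed rather than encoded per query:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical gallery of untitled images.
image_paths = ["photos/0001.jpg", "photos/0002.jpg", "photos/0003.jpg"]

with torch.no_grad():
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    gallery = model.encode_image(images)
    gallery = gallery / gallery.norm(dim=-1, keepdim=True)

    query = clip.tokenize(["a car accident at an intersection"]).to(device)
    q = model.encode_text(query)
    q = q / q.norm(dim=-1, keepdim=True)

    # Cosine similarity between the text query and every gallery image.
    scores = (q @ gallery.t()).squeeze(0)

for idx in scores.argsort(descending=True).tolist():
    print(image_paths[idx], float(scores[idx]))
```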

Of course, there are many more applications. With an embedding generator that directly relates images and text, you can hook it up to various generative and other models and create as you want.

Summary

I think this paper is another classic "rich people" work; the happiness of the rich really is that simple. The results are genuinely very good, and it has become the foundation of a lot of follow-up work. Next I will look at some of that follow-up work and run my own experiments. I haven't played with DALL·E 2 yet, but its results look really good.

Copyright notice: this article was written by be_humble; please include the original link when reposting: https://chowdera.com/2022/134/202205141349446051.html
