CLIP learning notes
2022-05-14 13:58:00【be_ humble】
Paper: Learning Transferable Visual Models From Natural Language Supervision
Published by OpenAI in February 2021, this is a very important piece of work for both the NLP and CV communities.
It mainly uses 400 million text-image pairs for contrastive pre-training, which yields encoders that map text and images to embeddings. It achieves good zero-shot results on image classification and can serve as the basis for a lot of follow-up work.
CLIP GitHub repository: openai/CLIP: Contrastive Language-Image Pretraining (github.com)
CLIP official page: CLIP: Connecting Text and Images (openai.com)
Before studying CLIP, let's play with it first. The CLIP GitHub repository provides interactive code, so you can try loading the model in a Colab notebook.
I rewrote the official CLIP interactive code into my own Colab notebook: code address
Official classification results
My own results
The model recognized Kobe at a glance, so it has some ability to recognize celebrities. Luffy was also recognized, and "one piece" received some probability as well; without the "Luffy" label, "one piece" would probably have received a fairly high probability.
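The guess about "Luffy" and "one piece" follows from how zero-shot classification works: the image is scored against every candidate label, and a softmax spreads the probability over those labels, so dropping one label hands its share to the rest. Below is a minimal sketch in plain Python. The similarity values are made up purely for illustration, the function names are hypothetical, and the scale of 100 is taken from the example in the official repository, which multiplies cosine similarities by 100 before the softmax.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_probs(similarities, scale=100.0):
    """Turn {label: cosine similarity} into {label: probability}."""
    labels = list(similarities)
    probs = softmax([scale * similarities[label] for label in labels])
    return dict(zip(labels, probs))

# Hypothetical cosine similarities between one image and each label prompt.
sims = {"Luffy": 0.30, "one piece": 0.28, "a dog": 0.15}

with_luffy = zero_shot_probs(sims)
without_luffy = zero_shot_probs({k: v for k, v in sims.items() if k != "Luffy"})

print(with_luffy)     # "Luffy" takes most of the probability
print(without_luffy)  # "one piece" takes over once "Luffy" is removed
```

The probabilities are only relative to the label set you supply, which is why the choice of candidate labels matters so much when playing with the model.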
I recommend watching the explanation video on Bilibili, which explains the paper very well: CLIP paper, read paragraph by paragraph 【Paper close reading】_bilibili
The figure above shows the overall structure of the model. The text goes through a text encoder to get a sentence vector, here 512-dimensional, and the image goes through an image encoder to get a 512-dimensional image vector. The similarity between the two is computed and a symmetric loss is applied: matched text-image pairs are treated as positive samples and all other pairings as negative samples. The figure below gives pseudocode for the loss computation.
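The symmetric loss described above can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: the function names and the default `logit_scale=10.0` are assumptions, and a real implementation would use batched tensor operations. Row i of the similarity matrix should put its mass on column i (image to text), and the same holds for the transposed matrix (text to image).

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy(logit_rows, targets):
    """Mean cross-entropy; row i should put its mass on column targets[i]."""
    total = 0.0
    for row, target in zip(logit_rows, targets):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[target]
    return total / len(logit_rows)

def clip_symmetric_loss(image_embs, text_embs, logit_scale=10.0):
    """Average of image-to-text and text-to-image cross-entropy losses."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity of image i and text j
    logits = [[logit_scale * sum(a * b for a, b in zip(img, txt))
               for txt in texts] for img in images]
    labels = list(range(len(images)))  # the i-th image matches the i-th text
    loss_image_to_text = cross_entropy(logits, labels)
    loss_text_to_image = cross_entropy([list(col) for col in zip(*logits)], labels)
    return (loss_image_to_text + loss_text_to_image) / 2.0

# Toy 2-dimensional embeddings (hypothetical): aligned pairs vs. swapped pairs.
aligned = clip_symmetric_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = clip_symmetric_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, mismatched)  # the aligned batch gives a much smaller loss
```

Every other item in the batch acts as a negative sample, which is why large batches help this kind of training.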
Previous work mainly relied on three datasets:
- MS-COCO (Lin et al., 2014)
- Visual Genome (Krishna et al., 2017)
- YFCC100M (Thomee et al., 2016)
The first two datasets are high quality, but each has only about 100,000 images. YFCC100M has 100 million images, but the quality is poor, and only 15 million images remain after filtering. OpenAI looked at those 15 million images and found them nowhere near enough to chew on, so the following dataset was born.
WIT (WebImageText): OpenAI built a dataset of 400 million (image, text) pairs from the Internet, using 500,000 queries with up to 20,000 image-text pairs per query.
- When processing images, only random cropping is used for data augmentation. The temperature parameter, which controls the range of the logits in the softmax, is treated as a trainable parameter, which avoids the heavy cost of tuning it as a hyperparameter.
- Image Encoder: ResNet-50 or ViT, with input image size [3, 224, 224].
- Text Encoder: a 12-layer Transformer with hidden size 512, using BPE encoding with a vocabulary size of 49,152 and a maximum sequence length of 76. Masked self-attention is used to preserve the ability to initialize from a pre-trained language model or to add language modeling as an auxiliary objective; the authors leave this part as future work.
- To save memory and speed up training, gradient checkpointing, half-precision Adam statistics, and fp16 (mixed-precision) training are used.
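To make the trainable temperature above more concrete, here is a minimal sketch in plain Python. It assumes the scheme described in the paper: the scale is stored in log space, initialized to the equivalent of a temperature of 0.07, and clipped so the logits are never scaled by more than 100. The names `log_scale` and `logit_scale_from_param` are hypothetical, the cosine similarities are made up, and no actual gradient update is shown.

```python
import math

INIT_TEMPERATURE = 0.07   # initial temperature reported in the paper
MAX_LOGIT_SCALE = 100.0   # upper bound on the multiplier reported in the paper

def logit_scale_from_param(log_scale):
    """Map the trainable log-space parameter to the clipped multiplier."""
    return min(math.exp(log_scale), MAX_LOGIT_SCALE)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Value of the trainable parameter at initialization.
log_scale = math.log(1.0 / INIT_TEMPERATURE)

# Hypothetical cosine similarities between one image and three texts.
cosine_sims = [0.30, 0.25, 0.10]

# Compare an untrained-like scale of 1, the initial scale, and the clipped maximum.
for param in (0.0, log_scale, 10.0):
    scale = logit_scale_from_param(param)
    probs = softmax([scale * s for s in cosine_sims])
    print(round(scale, 2), [round(p, 3) for p in probs])
```

Cosine similarities live in a narrow range, so without a multiplier the softmax stays nearly flat; a larger scale sharpens the distribution, and learning it removes one hyperparameter from the search.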
- In the few-shot setting the model does not do as well as it does zero-shot; how to use a small number of samples for domain-specific optimization on top of this model is raised as an open problem.
- Performance is still below the current state of the art. Of course, if plain zero-shot inference, with no optimization for any particular domain, could beat the best current models, nobody else would have anything left to do. Still, it feels good enough. What remains is scaling up the data and the model size, which is not something poor people like us get to think about, so let's look for improvements on the method side instead.
The reason for choosing contrastive learning over predictive learning is that contrastive learning is about 4 times more efficient than prediction.
- Video frame retrieval: given a video, find the frames that match a text description. When reviewing surveillance footage, you can quickly locate a moment from a text description. For traffic video, for example, you might look for frames showing a violation; the current model probably does not really understand what "violation" means, but it should be able to find frames showing different kinds of car collisions.
- Image retrieval: given a very large image library, you want to retrieve a particular picture. If the pictures have no titles or descriptions, CLIP is probably the best choice at present.
- VQA tasks: turn the question into a prompt, then append the candidate labels and compare similarities. Because the prompt format is fixed, performance may not be very good, but there should be ways to improve it.
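The retrieval applications above share one core step: embed the text query, embed every image or video frame, and rank by cosine similarity. Below is a minimal sketch in plain Python, assuming the embeddings have already been produced by the text and image encoders. The toy 3-dimensional vectors and the function names are hypothetical; real CLIP embeddings would be 512-dimensional as noted earlier.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_emb, gallery_embs, k=3):
    """Return the k best (index, score) pairs, highest similarity first."""
    scored = [(cosine_similarity(query_emb, emb), idx)
              for idx, emb in enumerate(gallery_embs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]

# Hypothetical embeddings: one text query and four images (or video frames).
query = [1.0, 0.0, 0.0]
gallery = [
    [0.0, 1.0, 0.0],   # unrelated
    [0.9, 0.1, 0.0],   # close match
    [0.5, 0.5, 0.0],   # partial match
    [-1.0, 0.0, 0.0],  # opposite direction
]

top = retrieve_top_k(query, gallery, k=2)
print(top)  # indices of the two gallery items most similar to the query
```

For video, the gallery would hold one embedding per sampled frame, so the returned indices map back to timestamps. The gallery embeddings only need to be computed once and can be reused for every new text query.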
Of course, there are many more applications. With ready-made embedding generators for both images and text, you can hook them up to all kinds of generative models and other models and create as you want.
I think this paper is another classic piece of work from the rich, but the happiness of the rich really is that simple: the results are very good, and it has become the basis for a lot of follow-up work. Next, I will go through some of that follow-up work and then run my own experiments. I haven't played with DALL·E 2 yet, but its results look really good.