
Optimal pool strategy for learning visual semantic embedding (CS CV)

2020-12-07 19:23:45 Ling Qian

Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval. It aims to learn a deep embedding space in which visual data are embedded close to their semantic text labels or descriptions. Recent VSE models use complex methods to better contextualize and aggregate multi-modal features into holistic embeddings. However, we discover that surprisingly simple (but carefully selected) global pooling functions (e.g., max pooling) outperform those complex models across different feature extractors. Despite this simplicity and effectiveness, finding the best pooling function for each data modality and feature extractor is costly and tedious, especially when the number of features varies (e.g., for text and video). Therefore, we propose a Generalized Pooling Operator (GPO), which learns to automatically adapt to the best pooling strategy for different features, requiring no manual tuning while remaining effective and efficient. We extend the VSE model with the proposed GPO and denote it as VSE∞. Without bells and whistles, VSE∞ significantly outperforms previous VSE methods on image-text retrieval benchmarks across popular feature extractors. With a simple adaptation, variants of VSE∞ further demonstrate its strength by achieving new state-of-the-art results on two video-text retrieval datasets. Comprehensive experiments and visualizations confirm that GPO always discovers the best pooling strategy and can serve as a plug-and-play feature aggregation module for standard VSE models.
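To make the idea of a "generalized" pooling operator concrete, the sketch below shows a simplified view of the core mechanism: each feature dimension is sorted, and the pooled value is a weighted sum over sorted positions. With the right position weights this family subsumes max pooling, mean pooling, and k-max pooling. Note this is a minimal illustration, not the paper's implementation: in the actual GPO the position weights are generated by a small learned sequence model, whereas here they are supplied explicitly.

```python
import numpy as np

def generalized_pool(features, weights):
    """Pool N feature vectors (shape N x d) into one d-dim vector.

    Each dimension is sorted in descending order, then combined with
    per-position weights. This is a simplified sketch of GPO; the paper
    learns these weights rather than fixing them by hand.
    """
    sorted_feats = -np.sort(-features, axis=0)  # descending sort per dimension
    return weights @ sorted_feats               # weighted sum over positions

# Example: 5 feature vectors of dimension 4.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))

# A one-hot weight on the top position recovers max pooling.
w_max = np.eye(5)[0]
assert np.allclose(generalized_pool(x, w_max), x.max(axis=0))

# Uniform weights recover mean pooling.
w_mean = np.full(5, 1.0 / 5)
assert np.allclose(generalized_pool(x, w_mean), x.mean(axis=0))
```

Because max and mean are just two points in this weight space, a model that learns the weights can interpolate between (or go beyond) the standard pooling functions without manual tuning.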

Original title: Learning the Best Pooling Strategy for Visual Semantic Embedding

Original text: Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions. Recent VSE models use complex methods to better contextualize and aggregate multi-modal features into holistic embeddings. However, we discover that surprisingly simple (but carefully selected) global pooling functions (e.g., max pooling) outperform those complex models, across different feature extractors. Despite its simplicity and effectiveness, seeking the best pooling function for different data modality and feature extractor is costly and tedious, especially when the size of features varies (e.g., text, video). Therefore, we propose a Generalized Pooling Operator (GPO), which learns to automatically adapt itself to the best pooling strategy for different features, requiring no manual tuning while staying effective and efficient. We extend the VSE model using this proposed GPO and denote it as VSE∞. Without bells and whistles, VSE∞ outperforms previous VSE methods significantly on image-text retrieval benchmarks across popular feature extractors. With a simple adaptation, variants of VSE∞ further demonstrate its strength by achieving the new state of the art on two video-text retrieval datasets. Comprehensive experiments and visualizations confirm that GPO always discovers the best pooling strategy and can be a plug-and-play feature aggregation module for standard VSE models.

Original authors: Jiacheng Chen, Hexiang Hu, Hao Wu, Yuning Jiang, Changhu Wang

Original address: https://arxiv.org/abs/2011.04305

Original statement: this article is published with the author's authorization to the community and may not be reproduced without permission.

If there is any infringement, please contact yunjia_community@tencent.com for removal.

Copyright notice
This article was created by [Ling Qian]; please include a link to the original when reposting. Thanks.
https://chowdera.com/2020/11/20201119031742510c.html