Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e., words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a “transposed” version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) is built upon XCA. It combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including image classification and self-supervised feature learning on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k. We will open-source our code and trained models to reproduce the reported results.
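To make the contrast between token attention and the "transposed" variant concrete, here is a minimal single-head NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract only says that interactions are based on the cross-covariance matrix between keys and queries, so the per-channel L2 normalisation, the temperature `tau`, the softmax axis, and the function names are all choices made for this example.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def token_attention(q, k, v):
    # Conventional self-attention: q, k, v have shape (N, d).
    # The attention map is (N, N), so cost grows quadratically with N.
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (N, N)
    return attn @ v                                # (N, d)


def cross_covariance_attention(q, k, v, tau=1.0):
    # "Transposed" attention across feature channels: q, k, v are (N, d).
    # Assumption: each channel (column) is L2-normalised over the tokens,
    # and the scores are scaled by a temperature tau.
    eps = 1e-6
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=0, keepdims=True) + eps)
    # The attention map is (d, d): its size does not depend on N,
    # so the cost grows linearly with the number of tokens.
    attn = softmax(k.T @ q / tau, axis=0)          # (d, d)
    return v @ attn                                # (N, d)


rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 50, 8))          # N = 50 tokens, d = 8 channels
out_tokens = token_attention(q, k, v)
out_xca = cross_covariance_attention(q, k, v)
```

Both functions return an array of shape `(N, d)`, but the intermediate map is `(N, N)` in the first case and `(d, d)` in the second, which is where the difference in scaling with the number of tokens comes from.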
Author Information
Alaaeldin Ali (Inria Paris)
Hugo Touvron (Facebook AI Research / Sorbonne University)
Mathilde Caron (INRIA)
Piotr Bojanowski (Facebook)
Matthijs Douze (Facebook AI Research)
Armand Joulin (Facebook AI Research)
Ivan Laptev (INRIA)
Natalia Neverova (Facebook AI Research)
Gabriel Synnaeve (Facebook)
Jakob Verbeek (INRIA)
Herve Jegou (Facebook AI Research)
More from the Same Authors
-
2021 Spotlight: Instance-Conditioned GAN »
Arantxa Casanova · Marlene Careil · Jakob Verbeek · Michal Drozdzal · Adriana Romero Soriano -
2021 : ResNet strikes back: An improved training procedure in timm »
Ross Wightman · Hugo Touvron · Herve Jegou -
2022 Poster: Language Conditioned Spatial Relation Reasoning for 3D Object Grounding »
Shizhe Chen · Pierre-Louis Guhur · Makarand Tapaswi · Cordelia Schmid · Ivan Laptev -
2022 Spotlight: Lightning Talks 4B-3 »
Zicheng Zhang · Mancheng Meng · Antoine Guedon · Yue Wu · Wei Mao · Zaiyu Huang · Peihao Chen · Shizhe Chen · Yongwei Chen · Keqiang Sun · Yi Zhu · chen rui · Hanhui Li · Dongyu Ji · Ziyan Wu · miaomiao Liu · Pascal Monasse · Yu Deng · Shangzhe Wu · Pierre-Louis Guhur · Jiaolong Yang · Kunyang Lin · Makarand Tapaswi · Zhaoyang Huang · Terrence Chen · Jiabao Lei · Jianzhuang Liu · Vincent Lepetit · Zhenyu Xie · Richard I Hartley · Dinggang Shen · Xiaodan Liang · Runhao Zeng · Cordelia Schmid · Michael Kampffmeyer · Mathieu Salzmann · Ning Zhang · Fangyun Wei · Yabin Zhang · Fan Yang · Qifeng Chen · Wei Ke · Quan Wang · Thomas Li · qingling Cai · Kui Jia · Ivan Laptev · Mingkui Tan · Xin Tong · Hongsheng Li · Xiaodan Liang · Chuang Gan -
2022 Spotlight: Language Conditioned Spatial Relation Reasoning for 3D Object Grounding »
Shizhe Chen · Pierre-Louis Guhur · Makarand Tapaswi · Cordelia Schmid · Ivan Laptev -
2022 Poster: Star Temporal Classification: Sequence Modeling with Partially Labeled Data »
Vineel Pratap · Awni Hannun · Gabriel Synnaeve · Ronan Collobert -
2022 Poster: Zero-Shot Video Question Answering via Frozen Bidirectional Language Models »
Antoine Yang · Antoine Miech · Josef Sivic · Ivan Laptev · Cordelia Schmid -
2021 : Spotlight talk: ResNet strikes back: An improved training procedure in timm. »
Hugo Touvron -
2021 Poster: Hierarchical Skills for Efficient Exploration »
Jonas Gehring · Gabriel Synnaeve · Andreas Krause · Nicolas Usunier -
2021 : Image Similarity Challenge + Q&A »
Matthijs Douze · Zoe Papakipos · Cristian Canton · Lowik Chanussot · Giorgos Tolias · Filip Radenovic · Ondrej Chum -
2021 Poster: CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings »
Tatiana Likhomanenko · Qiantong Xu · Gabriel Synnaeve · Ronan Collobert · Alex Rogozhnikov -
2021 Poster: Instance-Conditioned GAN »
Arantxa Casanova · Marlene Careil · Jakob Verbeek · Michal Drozdzal · Adriana Romero Soriano -
2021 Poster: History Aware Multimodal Transformer for Vision-and-Language Navigation »
Shizhe Chen · Pierre-Louis Guhur · Cordelia Schmid · Ivan Laptev -
2021 : Billion-Scale Approximate Nearest Neighbor Search Challenge + Q&A »
Harsha Vardhan Simhadri · George Williams · Martin Aumüller · Artem Babenko · Dmitry Baranchuk · Qi Chen · Matthijs Douze · Ravishankar Krishnawamy · Gopal Srinivasa · Suhas Jayaram Subramanya · Jingdong Wang -
2021 Poster: Differentiable rendering with perturbed optimizers »
Quentin Le Lidec · Ivan Laptev · Cordelia Schmid · Justin Carpentier -
2020 Poster: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments »
Mathilde Caron · Ishan Misra · Julien Mairal · Priya Goyal · Piotr Bojanowski · Armand Joulin -
2020 Poster: Continuous Surface Embeddings »
Natalia Neverova · David Novotny · Marc Szafraniec · Vasil Khalidov · Patrick Labatut · Andrea Vedaldi -
2020 Session: Orals & Spotlights Track 07: Vision Applications »
Ce Liu · Natalia Neverova -
2019 : Carl Doersch, Raquel Urtasun, Sanja Fidler moderated by Natalia Neverova »
Raquel Urtasun · Sanja Fidler · Natalia Neverova · Ilija Radosavovic · Carl Doersch -
2019 Poster: Large Memory Layers with Product Keys »
Guillaume Lample · Alexandre Sablayrolles · Marc'Aurelio Ranzato · Ludovic Denoyer · Herve Jegou -
2019 Spotlight: Large Memory Layers with Product Keys »
Guillaume Lample · Alexandre Sablayrolles · Marc'Aurelio Ranzato · Ludovic Denoyer · Herve Jegou -
2019 Poster: Fixing the train-test resolution discrepancy »
Hugo Touvron · Andrea Vedaldi · Matthijs Douze · Herve Jegou -
2019 Poster: Correlated Uncertainty for Learning Dense Correspondences from Noisy Labels »
Natalia Neverova · David Novotny · Andrea Vedaldi -
2019 Poster: A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning »
Nicolas Carion · Nicolas Usunier · Gabriel Synnaeve · Alessandro Lazaric -
2019 Spotlight: Adaptive Density Estimation for Generative Models »
Thomas Lucas · Konstantin Shmelkov · Karteek Alahari · Cordelia Schmid · Jakob Verbeek -
2019 Spotlight: A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning »
Nicolas Carion · Nicolas Usunier · Gabriel Synnaeve · Alessandro Lazaric -
2018 Poster: Forward Modeling for Partial Observation Strategy Games - A StarCraft Defogger »
Gabriel Synnaeve · Zeming Lin · Jonas Gehring · Dan Gant · Vegard Mella · Vasil Khalidov · Nicolas Carion · Nicolas Usunier -
2018 Poster: A flexible model for training action localization with varying levels of supervision »
Guilhem Chéron · Jean-Baptiste Alayrac · Ivan Laptev · Cordelia Schmid -
2017 : Poster Session - Session 2 »
Ambrish Rawat · Armand Joulin · Peter A Jansen · Jay Yoon Lee · Muhao Chen · Frank F. Xu · Patrick Verga · Brendan Juba · Anca Dumitrache · Sharmistha Jat · Robert Logan · Dhanya Sridhar · Fan Yang · Rajarshi Das · Pouya Pezeshkpour · Nicholas Monath -
2017 Poster: Houdini: Fooling Deep Structured Visual and Speech Recognition Models with Adversarial Examples »
Moustapha Cisse · Yossi Adi · Natalia Neverova · Joseph Keshet -
2017 Poster: Unbounded cache model for online language modeling with open vocabulary »
Edouard Grave · Moustapha Cisse · Armand Joulin -
2016 Workshop: Machine Intelligence @ NIPS »
Tomas Mikolov · Baroni Marco · Armand Joulin · Germán Kruszewski · Angeliki Lazaridou · Klemen Simonic -
2016 Poster: Convolutional Neural Fabrics »
Shreyas Saxena · Jakob Verbeek -
2015 Poster: Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets »
Armand Joulin · Tomas Mikolov -
2015 Spotlight: Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets »
Armand Joulin · Tomas Mikolov -
2011 Poster: Learning person-object interactions for action recognition in still images »
Vincent Delaitre · Josef Sivic · Ivan Laptev -
2010 Workshop: Beyond classification: Machine Learning for next generation Computer Vision challenges »
Craig Saunders · Jakob Verbeek · Svetlana Lazebnik -
2007 Oral: Scene Segmentation with CRFs Learned from Partially Labeled Images »
Jakob Verbeek · Bill Triggs -
2007 Poster: Scene Segmentation with CRFs Learned from Partially Labeled Images »
Jakob Verbeek · Bill Triggs