Transformer models have demonstrated excellent performance on a diverse set of computer vision applications ranging from classification to segmentation on various data modalities such as images, videos, and 3D data. The goal of this workshop is to bring together computer vision and machine learning researchers working towards advancing the theory, architecture, and algorithmic design for vision transformer models, as well as the practitioners utilizing transformer models for novel applications and use cases.
The workshop’s motivation is to narrow the gap between research advances in transformer design and the applications that use transformers for computer vision, and to widen the adoption of transformer models in vision-related industrial applications. We are interested in papers reporting experimental results on the use of transformers for any computer vision application, the challenges encountered, and the corresponding mitigation strategies, on topics including, but not limited to, image classification, object detection, segmentation, human-object interaction detection, and scene understanding from 3D, video, and multimodal inputs.

Thu 11:00 p.m. - 11:10 p.m.
Opening Remarks

Thu 11:10 p.m. - 11:40 p.m.
[1st Invited Talk] Ming-Hsuan Yang

Thu 11:40 p.m. - 11:55 p.m.
CLUDA: Contrastive Learning in Unsupervised Domain Adaptation for Semantic Segmentation ([1st] Oral Presentation)
In this work, we propose CLUDA, a simple, yet novel method for performing unsupervised domain adaptation (UDA) for semantic segmentation by incorporating contrastive losses into a student-teacher learning paradigm that makes use of pseudo-labels generated from the target domain by the teacher network. More specifically, we extract a multi-level fused-feature map from the encoder, and apply contrastive loss across different classes and different domains, via source-target mixing of images. We consistently improve performance on various feature encoder architectures and for different domain adaptation datasets in semantic segmentation. Furthermore, we introduce a learned-weighted contrastive loss to improve upon a state-of-the-art multi-resolution training approach in UDA. We produce state-of-the-art results on the GTA → Cityscapes (74.4 mIOU, +0.6) and Synthia → Cityscapes (67.2 mIOU, +1.4) datasets. CLUDA effectively demonstrates contrastive learning in UDA as a generic method, which can be easily integrated into any existing UDA method for semantic segmentation tasks. Please refer to the supplementary material for the details on implementation.
Midhun Vayyat · Kasi Jaswin · Anuraag Bhattacharya · Shuaib Ahmed · Rahul Tallamraju

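For readers who want a concrete picture of the class-wise, cross-domain contrastive objective described in the CLUDA abstract, here is a minimal PyTorch sketch. It assumes per-pixel features and (pseudo-)labels have already been flattened into vectors; the prototype construction, temperature, and InfoNCE-style formulation are illustrative assumptions rather than the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def class_prototypes(feats, labels, num_classes):
    """Mean feature per class. feats: (N, D) pixel features, labels: (N,) class ids."""
    protos = torch.zeros(num_classes, feats.size(1), device=feats.device)
    counts = torch.zeros(num_classes, device=feats.device)
    protos.index_add_(0, labels, feats)
    counts.index_add_(0, labels, torch.ones(labels.size(0), device=feats.device))
    return protos / counts.clamp(min=1).unsqueeze(1)

def cross_domain_contrastive_loss(src_feats, src_labels, tgt_feats, tgt_pseudo,
                                  num_classes, temperature=0.1):
    """Pull same-class prototypes of the two domains together and push
    different classes apart (illustrative InfoNCE-style formulation)."""
    p_src = F.normalize(class_prototypes(src_feats, src_labels, num_classes), dim=1)
    p_tgt = F.normalize(class_prototypes(tgt_feats, tgt_pseudo, num_classes), dim=1)
    logits = p_src @ p_tgt.t() / temperature       # (C, C) cross-domain similarities
    targets = torch.arange(num_classes, device=logits.device)
    return F.cross_entropy(logits, targets)        # diagonal entries are the positives

# Toy usage with 19 Cityscapes-style classes and 4096 sampled pixels per domain.
src_f, tgt_f = torch.randn(4096, 256), torch.randn(4096, 256)
src_y, tgt_y = torch.randint(0, 19, (4096,)), torch.randint(0, 19, (4096,))
print(cross_domain_contrastive_loss(src_f, src_y, tgt_f, tgt_y, num_classes=19))
```
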
Thu 11:40 p.m. - 1:10 a.m.
[1st] Oral Presentation

Thu 11:55 p.m. - 12:10 a.m.
PatchBlender: A Motion Prior for Video Transformers ([1st] Oral Presentation)
Transformers have become one of the dominant architectures in the field of computer vision. However, there remain several challenges when applying such architectures to video data. Most notably, these models struggle to model the temporal patterns of video data effectively. Directly targeting this issue, we introduce PatchBlender, a learnable blending function that operates over patch embeddings across the temporal dimension of the latent space. We show that our method is successful at enabling vision transformers to encode the temporal component of video data. On Something-Something v2 and MOVi-A, we show that our method improves the performance of a ViT-B. PatchBlender has the advantage of being compatible with almost any Transformer architecture, and since it is learnable, the model can adaptively turn the prior on or off. It is also extremely lightweight compute-wise, at 0.005% of the GFLOPs of a ViT-B.
Gabriele Prato · Yale Song · Janarthanan Rajendran · R Devon Hjelm · Neel Joshi · Sarath Chandar

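As a rough illustration of the idea in the PatchBlender abstract, the sketch below implements a learnable blending of patch embeddings across the temporal dimension. The softmax-normalized frame-mixing matrix and identity initialization are assumptions made for the sketch, not necessarily the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class TemporalPatchBlend(nn.Module):
    """Blend each patch embedding with the same patch at other frames
    via a learned frame-by-frame mixing matrix (illustrative sketch)."""
    def __init__(self, num_frames):
        super().__init__()
        # Initialize close to identity so the layer starts as a near no-op.
        self.mix = nn.Parameter(torch.eye(num_frames))

    def forward(self, x):                   # x: (B, T, N, D) video patch embeddings
        weights = self.mix.softmax(dim=-1)  # each output frame is a convex blend of input frames
        return torch.einsum("st,btnd->bsnd", weights, x)

x = torch.randn(2, 8, 196, 768)             # e.g. 8 frames of 14x14 ViT-B patches
blended = TemporalPatchBlend(num_frames=8)(x)
print(blended.shape)                        # torch.Size([2, 8, 196, 768])
```
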
Fri 12:10 a.m. - 12:25 a.m.
Bi-Directional Self-Attention for Vision Transformers ([1st] Oral Presentation)
Self-Attention (SA) maps a set of key-value pairs to an output by aggregating information from each pair according to its compatibility with a query. This allows SA to aggregate surrounding context (represented by key-value pairs) around a specific source (e.g. a query). Critically, however, this process cannot also refine a source (e.g. a query) based on the surrounding context (e.g. key-value pairs). We address this limitation by inverting the way key-value pairs and queries are processed. We propose Inverse Self-Attention (ISA), which instead maps a query (source) to an output based on its compatibility with a set of key-value pairs (scene). Leveraging the inherent complementary nature of ISA and SA, we further propose Bi-directional Self-Attention (BiSA), an attention layer that couples SA and ISA by convexly combining their outputs. BiSA can be easily adapted into any existing transformer architecture to improve the expressibility of attention layers. We showcase this flexibility by extensively studying the effects of BiSA on CIFAR100 [1], ImageNet1K [2], and ADE20K [3]: we extend the Swin Transformer [4] and LeViT [5] with BiSA and observe substantial improvements.
George Stoica · Taylor Hearn · Bhavika Devnani · Judy Hoffman

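The abstract specifies that BiSA convexly combines the outputs of standard self-attention (SA) and Inverse Self-Attention (ISA). The sketch below illustrates only that convex combination; the particular "inverse" form used here (normalizing the attention logits over the query axis instead of the key axis) is an illustrative guess, not the authors' definition of ISA.

```python
import torch
import torch.nn as nn

class BiAttention(nn.Module):
    """Convexly combine standard self-attention with an 'inverse' variant.
    Only the convex SA/ISA combination follows the abstract directly;
    the query-axis-normalized ISA below is an assumed stand-in."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.num_heads = num_heads
        self.alpha = nn.Parameter(torch.zeros(1))   # sigmoid(0) = 0.5 at init

    def forward(self, x):                           # x: (B, N, D)
        B, N, D = x.shape
        h = self.num_heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = [t.view(B, N, h, D // h).transpose(1, 2) for t in (q, k, v)]
        logits = q @ k.transpose(-2, -1) / (D // h) ** 0.5
        sa = logits.softmax(dim=-1) @ v             # usual key-axis normalization
        isa = logits.softmax(dim=-2) @ v            # query-axis normalization (assumed ISA form)
        a = torch.sigmoid(self.alpha)
        out = a * sa + (1 - a) * isa                # convex combination of the two outputs
        return self.proj(out.transpose(1, 2).reshape(B, N, D))

print(BiAttention(dim=256)(torch.randn(2, 197, 256)).shape)  # torch.Size([2, 197, 256])
```
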
Fri 12:25 a.m. - 12:40 a.m.
Video based Object 6D Pose Estimation using Transformers ([1st] Oral Presentation)
We introduce a Transformer-based 6D object pose estimation framework, VideoPose, comprising an end-to-end attention-based modelling architecture that attends to previous frames in order to estimate accurate 6D object poses in videos. Our approach leverages the temporal information from a video sequence for pose refinement, while being computationally efficient and robust. Compared to existing methods, our architecture is able to capture and reason over long-range dependencies efficiently, thus iteratively refining over video sequences. Experimental evaluation on the YCB-Video dataset shows that our approach is on par with the state-of-the-art Transformer methods, and performs significantly better relative to CNN based approaches. Further, with a speed of 33 fps, it is also more efficient and therefore applicable to a variety of applications that require real-time object pose estimation. Training code and pretrained models are available at https://anonymous.4open.science/r/VideoPose-3C8C.
Apoorva Beedu · Huda Alamri · Irfan Essa

Fri 12:40 a.m. - 12:55 a.m.
End-to-end Multimodal Representation Learning for Video Dialog ([1st] Oral Presentation)
The video-based dialog task is a challenging multimodal learning task that has received increasing attention over the past few years, with state-of-the-art models setting new performance records. This progress is largely powered by the adaptation of more powerful transformer-based language encoders. Despite this progress, existing approaches do not effectively utilize visual features to help solve tasks. Recent studies show that state-of-the-art models are biased towards textual information rather than visual cues. In order to better leverage the available visual information, this study proposes a new framework that combines a 3D-CNN network and transformer-based networks into a single visual encoder to extract more robust semantic representations from videos. The visual encoder is jointly trained end-to-end with other input modalities such as text and audio. Experiments on the AVSD task show significant improvement over baselines in both generative and retrieval tasks.
Huda Alamri · Apoorva Beedu · Irfan Essa · Anthony Bilic · Michael Hu

Fri 12:55 a.m. - 1:10 a.m.
Continual Transformers: Redundancy-Free Attention for Online Inference ([1st] Oral Presentation)
Transformers in their common form are inherently limited to operate on whole token sequences rather than on one token at a time. Consequently, their use during online inference on time-series data entails considerable redundancy due to the overlap in successive token sequences. In this work, we propose novel formulations of the Scaled Dot-Product Attention, which enable Transformers to perform efficient online token-by-token inference on a continual input stream. Importantly, our modifications are purely to the order of computations, while the outputs and learned weights are identical to those of the original Transformer Encoder. We validate our Continual Transformer Encoder with experiments on the THUMOS14, TVSeries, and GTZAN datasets with remarkable results: our Continual one- and two-block architectures reduce the floating point operations per prediction by up to 63x and 2.6x, respectively, while retaining predictive performance.
Lukas Hedegaard · Arian Bakhtiarnia · Alexandros Iosifidis

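To make the token-by-token inference setting concrete, here is a minimal sketch of single-query attention over a sliding window of cached keys and values. This only illustrates continual operation; the actual Continual Transformer reorders the computation so that outputs and weights stay identical to the original Transformer Encoder, which this simple cache does not guarantee. The window size and dimensions are arbitrary.

```python
import torch
import torch.nn.functional as F

class SlidingAttention:
    """Token-by-token attention over a sliding window of cached keys/values
    (illustrative sketch of continual, streaming inference)."""
    def __init__(self, window):
        self.window = window
        self.keys, self.values = [], []

    def step(self, q, k, v):                # q, k, v: (dim,) for the newest token
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.window:    # drop the oldest token from the cache
            self.keys.pop(0)
            self.values.pop(0)
        K = torch.stack(self.keys)          # (t, dim)
        V = torch.stack(self.values)
        attn = F.softmax(K @ q / q.numel() ** 0.5, dim=0)
        return attn @ V                     # (dim,) output for the newest token only

attn = SlidingAttention(window=64)
for _ in range(100):                        # a stream of incoming tokens
    q = k = v = torch.randn(256)
    out = attn.step(q, k, v)
print(out.shape)                            # torch.Size([256])
```
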
Fri 1:10 a.m. - 1:40 a.m.
Break (1st Break)

Fri 1:40 a.m. - 2:30 a.m.
On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition ([1st] Poster session)
Recently, vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks. The less restrictive inductive bias of transformers endows greater representational capacity in comparison with CNNs. However, in the image classification setting this flexibility comes with a trade-off with respect to sample efficiency, where transformers require ImageNet-scale training. This notion has carried over to video, where transformers have not yet been explored for video classification in the low-labeled or semi-supervised settings. Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting compared to CNNs. We specifically evaluate video vision transformers across two contrasting video datasets (Kinetics-400 and SomethingSomething-V2) and perform thorough analysis and ablation studies to explain this observation using the predominant features of video transformer architectures. We even show that using just the labeled data, transformers significantly outperform complex semi-supervised CNN methods that also leverage large-scale unlabeled data. Our experiments inform our recommendation that future work on semi-supervised video learning should consider the use of video transformers.
Farrukh Rahman · Ömer Mubarek · Zsolt Kira

Fri 1:40 a.m. - 2:30 a.m.
Fully-attentive and interpretable: vision and video vision transformers for pain detection ([1st] Poster session)
Pain is a serious and costly issue globally, but to be treated, it must first be detected. Vision transformers are a top-performing architecture in computer vision, with little research on their use for pain detection. In this paper, we propose the first fully-attentive automated pain detection pipeline that achieves state-of-the-art performance on binary pain detection from facial expressions. The model is trained on the UNBC-McMaster dataset, after faces are 3D-registered and rotated to the canonical frontal view. In our experiments we identify important areas of the hyperparameter space and their interaction with vision and video vision transformers, obtaining 3 noteworthy models. We analyse the attention maps of one of our models, finding reasonable interpretations for its predictions. We also evaluate Mixup, an augmentation technique, and Sharpness-Aware Minimization, an optimizer, with no success. Our presented models, ViT-1 (F1 score 0.55 ± 0.15), ViViT-1 (F1 score 0.55 ± 0.13), and ViViT-2 (F1 score 0.49 ± 0.04), all outperform earlier works, showing the potential of vision transformers for pain detection. The code will be available upon acceptance.
Giacomo Fiorentini · Itir Onal Ertugrul · Albert Ali Salah

Fri 1:40 a.m. - 2:30 a.m.
DynamicViT: Making Vision Transformer faster through layer skipping ([1st] Poster session)
The recent deep learning breakthroughs in language and vision tasks can be mainly attributed to large-scale transformers. Unfortunately, their massive size and high compute requirement have limited their use in resource-constrained environments. Dynamic neural networks could potentially reduce the amount of compute required by dynamically adjusting the computational path based on the input. We propose a layer-skipping dynamic transformer network that skips layers for each sample based on decisions given by a reinforcement learning agent. Extensive experiments on CIFAR-10 and CIFAR-100 show that this dynamic ViT model gains an average 40% speed increase when evaluated on batch sizes ranging from 1 to 1024.
Amanuel Mersha · Samuel Assefa

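Below is a minimal sketch of per-sample layer skipping in a ViT-style encoder. The paper makes the skip decisions with a reinforcement learning agent; here a tiny hard gate per block stands in for that agent purely for illustration, and the depth, width, and gating rule are assumptions.

```python
import torch
import torch.nn as nn

class SkippableEncoder(nn.Module):
    """ViT-style encoder whose blocks can be skipped per sample.
    A small linear gate makes a hard keep/skip decision as an
    illustrative stand-in for the paper's RL agent."""
    def __init__(self, dim=192, depth=12, heads=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True) for _ in range(depth)])
        self.gates = nn.ModuleList([nn.Linear(dim, 1) for _ in range(depth)])

    def forward(self, x):                                  # x: (B, N, D)
        for block, gate in zip(self.blocks, self.gates):
            keep = (gate(x.mean(dim=1)) > 0).squeeze(-1)   # (B,) per-sample keep/skip decision
            if keep.any():                                 # run the block only for kept samples
                y = x.clone()
                y[keep] = block(x[keep])
                x = y
        return x

x = torch.randn(4, 197, 192)                               # 4 samples, CLS + 14x14 patches
print(SkippableEncoder()(x).shape)                         # torch.Size([4, 197, 192])
```
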
Fri 1:40 a.m. - 2:30 a.m.
FQDet: Fast-converging Query-based Detector ([1st] Poster session)
Recently, two-stage Deformable DETR introduced the query-based two-stage head, a new type of two-stage head different from the region-based two-stage heads of classical detectors such as Faster R-CNN. In query-based two-stage heads, the second stage selects one feature per detection processed by a transformer, called the query, as opposed to pooling a rectangular grid of features processed by CNNs as in region-based detectors. In this work, we improve the query-based head by improving the prior of the cross-attention operation with anchors, significantly speeding up convergence while increasing performance. Additionally, we empirically show that by improving the cross-attention prior, the auxiliary losses and iterative bounding box mechanisms typically used by DETR-based detectors are no longer needed. By combining the best of both the classical and the DETR-based detectors, our FQDet head peaks at 45.4 AP on the 2017 COCO validation set when using a ResNet-50+TPN backbone, after training for only 12 epochs using the 1x schedule. We outperform other high-performing two-stage heads such as Cascade R-CNN, while using the same backbone and while being computationally cheaper. Additionally, when using the large ResNeXt-101-DCN+TPN backbone and multi-scale testing, our FQDet head achieves 52.9 AP on the 2017 COCO test-dev set after only 12 epochs of training. Code will be released.
Cédric Picron · Punarjay Chakravarty · Tinne Tuytelaars

Fri 1:40 a.m. - 2:30 a.m.
[1st] Poster session

Fri 2:30 a.m. - 3:00 a.m.
[2nd Invited Talk] Cordelia Schmid

Fri 3:00 a.m. - 3:30 a.m.
[3rd Invited Talk] Rita Cucchiara

Fri 3:30 a.m. - 3:45 a.m.
Matryoshka Representations for Adaptive Deployment ([2nd] Oral Presentation)
Learned representations are a central component in modern ML systems, serving a multitude of downstream tasks. When training such representations, it is often the case that computational and statistical constraints for each downstream task are unknown. In this context, rigid fixed-capacity representations can be either over- or under-accommodating to the task at hand. This leads us to ask: can we design a flexible representation that can adapt to multiple downstream tasks with varying computational resources? Our main contribution is Matryoshka Representation Learning (MRL), which encodes information at different granularities and allows a single embedding to adapt to the computational constraints of downstream tasks. MRL minimally modifies existing representation learning pipelines and imposes no additional cost during inference and deployment. MRL learns coarse-to-fine representations that are at least as accurate and rich as independently trained low-dimensional representations. The flexibility of the learned Matryoshka Representations offers: (a) up to 14× smaller embedding size for ImageNet-1K classification at the same level of accuracy; (b) up to 14× real-world speed-ups for large-scale retrieval on ImageNet-1K and 4K; and (c) up to 2% accuracy improvements for long-tail few-shot classification, all while being as robust as the original representations. Finally, we show that MRL extends seamlessly to web-scale datasets (ImageNet, JFT) across various modalities – vision (ViT, ResNet), vision + language (ALIGN), and language (BERT).
Aniket Rege · Aditya Kusupati · Gantavya Bhatt · Matthew Wallingford · Aditya Sinha · Vivek Ramanujan · William Howard-Snyder · Kaifeng Chen · Sham Kakade · Prateek Jain · Ali Farhadi

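The core MRL idea of training nested prefixes of a single embedding can be sketched as follows; the chosen granularities, equal loss weighting, and linear classifier heads are assumptions made for the sketch rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatryoshkaHead(nn.Module):
    """Classify from nested prefixes of one embedding so that every prefix
    is a usable representation (illustrative sketch of the MRL loss)."""
    def __init__(self, dim=2048, num_classes=1000, granularities=(8, 16, 64, 256, 2048)):
        super().__init__()
        self.granularities = granularities
        self.heads = nn.ModuleList([nn.Linear(g, num_classes) for g in granularities])

    def forward(self, z, target):                  # z: (B, dim) embeddings, target: (B,)
        losses = [F.cross_entropy(head(z[:, :g]), target)
                  for g, head in zip(self.granularities, self.heads)]
        return sum(losses) / len(losses)           # equal-weighted nested losses

z = torch.randn(32, 2048)                          # backbone embeddings for a batch
target = torch.randint(0, 1000, (32,))
loss = MatryoshkaHead()(z, target)
loss.backward()
```

At deployment, any prefix z[:, :g] can then be used on its own, trading accuracy for embedding size.
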
Fri 3:30 a.m. - 4:00 a.m.
[2nd] Oral Presentation

Fri 3:45 a.m. - 4:00 a.m.
TPFNet: A Novel Text In-painting Transformer for Text Removal ([2nd] Oral Presentation)
Text erasure from an image is helpful for various tasks such as image editing and privacy preservation. In this paper, we present TPFNet, a novel one-stage (end-to-end) network for text removal from images. Our network has two parts. Since noise can be more effectively removed from low-resolution images, part 1 operates on low-resolution images. The output of part 1 is a low-resolution text-free image. Part 2 uses the features learned in part 1 to predict a high-resolution text-free image. In part 1, we use "pyramidal vision transformer" (PVT) as the encoder. Further, we use a novel multi-headed decoder that generates a high-pass filtered image and a segmentation map, in addition to a text-free image. The segmentation branch helps locate the text precisely, and the high-pass branch helps in learning the image structure. To precisely locate the text, TPFNet employs an adversarial loss that is conditional on the segmentation map rather than the input image. On the Oxford, SCUT, and SCUT-EnsText datasets, our network outperforms recently proposed networks on nearly all the metrics.
Onkar Susladkar · Dhruv Makwana · Gayatri Deshmukh · Sparsh Mittal · Sai Chandra Teja R · Rekha Singhal

Fri 4:00 a.m. - 4:30 a.m.
[4th Invited Talk] Kristen Grauman

Fri 4:30 a.m. - 5:00 a.m.
[5th Invited Talk] Laura Leal-Taixé

Fri 5:00 a.m. - 5:10 a.m.
Coffee Break

Fri 5:10 a.m. - 5:50 a.m.
PatchRot: A Self-Supervised Technique for Training Vision Transformers ([2nd] Poster Session)
Vision transformers require a huge amount of labeled data to outperform convolutional neural networks. However, labeling a huge dataset is a very expensive process. Self-supervised learning techniques alleviate this problem by learning features similar to supervised learning in an unsupervised way. In this paper, we propose PatchRot, a self-supervised technique crafted for vision transformers. PatchRot rotates images and image patches and trains the network to predict the rotation angles. The network learns to extract both global and local features from an image. Our extensive experiments on different datasets show that PatchRot training learns rich features which outperform both supervised learning and the compared baselines.
Sachin Chhabra · Prabal Bijoy Dutta · Hemanth Venkateswara · Baoxin Li

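The pretext task is easy to picture: rotate the whole image and its individual patches by random multiples of 90° and ask the network to predict the angles. The sketch below builds such rotation targets; the patch size, label layout, and the choice to rotate the image and all patches in one pass are assumptions, not necessarily PatchRot's exact training recipe.

```python
import torch

def patchrot_targets(images, patch_size=16):
    """Rotate each image and each of its patches by a random multiple of 90°
    and return the rotation labels as self-supervised targets (sketch)."""
    B, C, H, W = images.shape
    # Global rotation: one label per image.
    img_labels = torch.randint(0, 4, (B,))
    rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2))
                           for img, k in zip(images, img_labels)])
    # Patch rotations: one label per patch.
    patches = rotated.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.contiguous().view(B, C, -1, patch_size, patch_size)  # (B, C, N, p, p)
    patch_labels = torch.randint(0, 4, (B, patches.size(2)))
    for b in range(B):
        for n in range(patches.size(2)):
            patches[b, :, n] = torch.rot90(patches[b, :, n], int(patch_labels[b, n]), dims=(1, 2))
    return rotated, img_labels, patches, patch_labels

imgs = torch.randn(2, 3, 224, 224)
rotated, img_labels, patches, patch_labels = patchrot_targets(imgs)
print(patches.shape, patch_labels.shape)   # torch.Size([2, 3, 196, 16, 16]) torch.Size([2, 196])
```
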
Fri 5:10 a.m. - 5:50 a.m.
Multimodal Transformer for Parallel Concatenated Variational Autoencoders ([2nd] Poster Session)
In this paper, we propose a multimodal transformer using a parallel concatenated architecture. Instead of using patches, we use column stripes of the R, G, and B image channels as the transformer input. The column stripes keep the spatial relations of the original image. We incorporate the multimodal transformer with a variational autoencoder for synthetic cross-modal data generation. The multimodal transformer is designed using multiple compression matrices, and it serves as encoders for Parallel Concatenated Variational AutoEncoders (PC-VAE). The PC-VAE consists of multiple encoders, one latent space, and two decoders. The encoders are based on random Gaussian matrices and do not need any training. We propose a new loss function based on the interaction information from partial information decomposition. The interaction information evaluates the input cross-modal information and the decoder output. The PC-VAE is trained by minimizing the loss function. Experiments are performed to validate the proposed multimodal transformer for PC-VAE.
Stephen Liang · Jerry Mendel

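As a rough illustration of the column-stripe tokenization with fixed random Gaussian compression matrices described above, consider the sketch below; the one-column stripe width, the per-channel projection layout, and the compressed dimension are all assumptions made for the sketch.

```python
import torch

def column_stripe_tokens(images, compressed_dim=64, seed=0):
    """Turn each RGB channel into column-stripe tokens and project them with
    fixed random Gaussian matrices that need no training (illustrative)."""
    B, C, H, W = images.shape                        # e.g. (B, 3, 224, 224)
    g = torch.Generator().manual_seed(seed)
    # One fixed Gaussian compression matrix per channel: maps H -> compressed_dim.
    proj = torch.randn(C, H, compressed_dim, generator=g)
    stripes = images.permute(0, 1, 3, 2)             # (B, C, W, H): one stripe per image column
    tokens = torch.einsum("bcwh,chd->bcwd", stripes, proj)
    return tokens.reshape(B, C * W, compressed_dim)  # token sequence for the transformer

tokens = column_stripe_tokens(torch.randn(2, 3, 224, 224))
print(tokens.shape)                                  # torch.Size([2, 672, 64])
```
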
Fri 5:10 a.m. - 5:50 a.m.
Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets ([2nd] Poster Session)
Vision Transformers have attracted a lot of attention recently, following the successful application of the Vision Transformer (ViT) to vision tasks. With vision Transformers, specifically the multi-head self-attention modules, networks can capture long-term dependencies inherently. However, these attention modules normally need to be trained on large datasets, and vision Transformers show inferior performance on small datasets when trained from scratch compared with widely dominant backbones like ResNets. Note that the Transformer model was first proposed for natural language processing, which carries denser information than natural images. To boost the performance of vision Transformers on small datasets, this paper proposes to explicitly increase the input information density in the frequency domain. Specifically, we select channels by calculating channel-wise heatmaps in the frequency domain using the Discrete Cosine Transform (DCT), reducing the size of the input while keeping most of the information and hence increasing the information density. As a result, 25% fewer channels are kept while better performance is achieved compared with previous work. Extensive experiments demonstrate the effectiveness of the proposed approach on five small-scale datasets, including CIFAR-10/100, SVHN, Flowers-102, and Tiny ImageNet. Accuracy is boosted by up to 17.05% with the Swin and Focal Transformers.
Xiangyu Chen · Ying Qin · Wenju Xu · Andrés Bur · Cuncong Zhong · Guanghui Wang

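To make the frequency-domain idea concrete, the sketch below converts an image into blockwise DCT frequency channels and keeps only a subset of them, so the input becomes spatially smaller but informationally denser. Keeping the highest-energy channels is an illustrative stand-in for the paper's heatmap-based channel selection, and the block size and number of kept channels are assumptions.

```python
import numpy as np
from scipy.fft import dctn

def dct_channel_select(image, block=8, keep=48):
    """Blockwise 2-D DCT turns an HxW image into block*block frequency
    channels at reduced spatial resolution; keep only the highest-energy
    channels (illustrative stand-in for heatmap-based selection)."""
    H, W = image.shape
    h, w = H // block, W // block
    blocks = image[:h * block, :w * block].reshape(h, block, w, block).transpose(0, 2, 1, 3)
    freq = dctn(blocks, axes=(2, 3), norm="ortho")                    # DCT within each block
    channels = freq.reshape(h, w, block * block).transpose(2, 0, 1)   # (64, h, w) frequency channels
    energy = np.abs(channels).mean(axis=(1, 2))
    top = np.argsort(energy)[::-1][:keep]                             # keep the densest channels
    return channels[np.sort(top)]                                     # (keep, h, w)

x = np.random.rand(224, 224)
print(dct_channel_select(x).shape)                                    # (48, 28, 28)
```
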
Fri 5:10 a.m. - 5:50 a.m.
Learning Explicit Object-Centric Representations with Vision Transformers ([2nd] Poster Session)
With the recent successful adaptation of transformers to the vision domain, particularly when trained in a self-supervised fashion, it has been shown that vision transformers can learn impressive object-reasoning-like behaviour and features expressive for the task of object segmentation in images. In this paper, we build on the self-supervision task of masked autoencoding and explore its effectiveness for explicitly learning object-centric representations with transformers. To this end, we design an object-centric autoencoder using transformers only and train it end-to-end to reconstruct full images from unmasked patches. We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
Oscar Vikström · Alexander Ilin

Fri 5:10 a.m. - 5:50 a.m.
[2nd] Poster Session

Fri 5:50 a.m. - 6:00 a.m.
Best Paper Announcement and Closing Remarks