Workshop
Gaze Meets ML
Amarachi Blessing Mbakwe · Joy T Wu · Dario Zanca · Elizabeth Krupinski · Satyananda Kashyap · Alexandros Karargyris
Room 240 - 241
Eye gaze has proven to be a cost-efficient way to collect large-scale physiological data that can reveal underlying human attentional patterns in real-life workflows, and it has therefore long been explored as a signal for directly measuring human cognition in various domains. Physiological data (including but not limited to eye gaze) offer new perception capabilities that could be used in several ML domains, e.g., egocentric perception, embodied AI, and NLP. They can help infer human perception, intentions, beliefs, goals, and other cognitive properties that are much needed for human-AI interaction and agent coordination. In addition, large collections of eye-tracking data have enabled data-driven modeling of human visual attention mechanisms, for both saliency and scan path prediction, with twofold advantages: from the neuroscientific perspective, to better understand biological mechanisms; and from the AI perspective, to equip agents with the ability to mimic or predict human behavior and to improve interpretability and interaction.
The Gaze Meets ML workshop aims to bring together an active research community to collectively define and address core problems in gaze-assisted machine learning. This year the workshop returns to NeurIPS for its 2nd edition, attracting a diverse group of researchers from academia and industry presenting novel work in this area.
Schedule
Sat 6:15 a.m. - 6:30 a.m.
Opening Remarks
Sat 6:30 a.m. - 7:15 a.m.
Bertram Emil Shi (HKUST) (Invited Talk)
Sat 7:15 a.m. - 8:00 a.m.
Vidhya Navalpakkam (Google) - Accelerating human attention research via smartphones (Invited Talk)
Attention and eye movements are thought to be a window to the human mind, and have been extensively studied across Neuroscience, Psychology and HCI. However, progress in this area has been severely limited, as the underlying methodology relies on specialized hardware that is expensive (up to $30,000) and hard to scale. In this talk, I will present our recent work from Google, which shows that ML applied to smartphone selfie cameras can enable accurate gaze estimation, comparable to state-of-the-art hardware-based devices, at 1/100th the cost and without any additional hardware. Via extensive experiments, we show that our smartphone gaze tech can successfully replicate key findings from prior hardware-based eye movement research in Neuroscience and Psychology, across a variety of tasks including traditional oculomotor tasks, saliency analyses on natural images, and reading comprehension. We also show that smartphone gaze could enable applications in improved health/wellness, for example, as a potential digital biomarker for detecting mental fatigue. These results show that smartphone-based attention has the potential to unlock advances by scaling eye movement research, and enabling new applications for improved health, wellness, and accessibility, such as gaze-based interaction for patients with ALS/stroke who cannot otherwise interact with devices.
Vidhya Navalpakkam
Sat 8:00 a.m. - 8:30 a.m.
Coffee Break
Sat 8:00 a.m. - 8:30 a.m.
Poster Session
Sat 8:30 a.m. - 8:45 a.m.
Interaction-aware Dynamic 3D Gaze Estimation in Videos (Oral)
Human gaze in in-the-wild and outdoor human activities is a continuous and dynamic process driven by anatomical eye movements such as fixations, saccades, and smooth pursuit. However, learning gaze dynamics in videos remains a challenging task, as annotating human gaze in videos is labor-intensive. In this paper, we propose a novel method for dynamic 3D gaze estimation in videos that utilizes human interaction labels. Our model contains a temporal gaze estimator built upon autoregressive Transformer structures. In addition, our model learns the spatial relationship of gaze among multiple subjects by constructing a Human Interaction Graph from predicted gaze and updating the gaze features with a structure-aware Transformer. Our model predicts future gaze conditioned on historical gaze and the gaze interactions in an autoregressive manner. We propose a multi-state training algorithm that alternately updates the interaction module and the dynamic gaze estimation module when training on a mixture of labeled and unlabeled sequences. We show significant improvements in both within-domain gaze estimation accuracy and cross-domain generalization on the state-of-the-art, physically unconstrained, in-the-wild Gaze360 gaze estimation benchmark.
Chenyi Kuang · Jeffrey O Kephart · Qiang Ji
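For readers who want a concrete handle on the temporal gaze estimator described above, here is a minimal sketch of an autoregressive, causal-Transformer gaze predictor. Everything here is an assumption for illustration (layer sizes, the class name AutoregressiveGaze), and the paper's human-interaction-graph module is omitted entirely.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoregressiveGaze(nn.Module):
    """Causal Transformer that predicts the next 3D gaze direction from the
    gaze history. Sizes are illustrative assumptions, not the paper's."""
    def __init__(self, d=64, heads=4, layers=3):
        super().__init__()
        self.embed = nn.Linear(3, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.out = nn.Linear(d, 3)

    def forward(self, gaze):                  # gaze: (batch, T, 3) unit vectors
        T = gaze.shape[1]
        # boolean causal mask: position t may only attend to positions <= t
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.encoder(self.embed(gaze), mask=causal)
        return F.normalize(self.out(h), dim=-1)  # next-step gaze at each position

pred = AutoregressiveGaze()(F.normalize(torch.randn(2, 30, 3), dim=-1))
```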
Sat 8:45 a.m. - 9:00 a.m.
SuperVision: Self-Supervised Super-Resolution for Appearance-Based Gaze Estimation (Oral)
Gaze estimation is a valuable tool with a broad range of applications in various fields, including medicine, psychology, virtual reality, marketing, and safety. Therefore, it is essential to have gaze estimation software that is cost-efficient and high-performing. Accurately predicting gaze remains a difficult task, particularly in real-world situations where images are affected by motion blur, video compression, and noise. Super-resolution (SR) has been shown to remove these degradations and improve image quality from a visual perspective. This work examines the usefulness of super-resolution for improving appearance-based gaze estimation and demonstrates that not all SR models preserve the gaze direction. We propose a two-step framework for gaze estimation based on the SwinIR super-resolution model. The proposed method consistently outperforms the state-of-the-art, particularly in scenarios involving low-resolution or degraded images. Furthermore, we examine the use of super-resolution through the lens of self-supervised learning for gaze estimation and propose a novel architecture, “SuperVision”, by fusing an SR backbone network with a ResNet18. While using only 20% of the data, the proposed SuperVision architecture outperforms the state-of-the-art GazeTR method by 15.5%.
Galen O'Shea · Majid Komeili
Sat 9:00 a.m. - 9:15 a.m.
EG-SIF: Improving Appearance Based Gaze Estimation using Self Improving Features (Oral)
Gaze estimation is vital in various applications, but factors like poor lighting and low-resolution images challenge the performance of estimation models. We introduce, for the first time, an Eye Gaze Estimation with Self-Improving Features (EG-SIF) method. EG-SIF segregates images based on their quality, generates pairs of good and adverse images, and applies multitask training with image enhancement using the generated pairs, where the task is to reconstruct the good image given a poor one. This innovative approach outperforms existing methods, significantly improving gaze estimation angular error on challenging datasets: from 4.64 to 4.53 on MPIIGaze and from 7.44 to 7.41 on RTGene.
Vasudev Singh · Chaitanya Langde · Sourav Lakhotia · Vignesh Kannan · Shuaib Ahmed
Sat 9:15 a.m. - 9:30 a.m.
Planning by Active Sensing (Oral)
Flexible behavior requires rapid planning, but planning requires a good internal model of the environment. Learning this model by trial and error is impractical when acting in complex environments. How do humans plan action sequences efficiently when there is uncertainty about model components? To address this, we asked human participants to navigate complex mazes in virtual reality. We found that the paths taken to gather rewards were close to optimal even though participants had no prior knowledge of these environments. Based on the sequential eye movement patterns observed when participants mentally compute a path before navigating, we develop an algorithm that is capable of rapidly planning under uncertainty by active sensing, i.e., visually sampling information about the structure of the environment. New eye movements are chosen in an iterative manner by following the gradient of a dynamic value map, which is updated based on the previous eye movement, until the planning process reaches convergence. In addition to bearing hallmarks of human navigational planning, the proposed algorithm is sample-efficient: the number of visual samples needed for planning scales linearly with the path length regardless of the size of the state space.
Kaushik Lakshminarasimhan · Seren Zhu · Dora Angelaki
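The gradient-following loop in the abstract above can be sketched compactly. The toy below is one hypothetical reading, not the authors' algorithm: the function name, discount factor, and plain value-iteration update are all assumptions; each simulated eye movement samples the unrevealed cell with the steepest gradient of a dynamic value map, which is then re-updated until convergence.

```python
import numpy as np

def plan_by_active_sensing(maze, start, goal, gamma=0.9, tol=1e-4):
    """Toy active-sensing planner on a grid maze (0 = free, 1 = wall)."""
    H, W = maze.shape
    revealed = np.zeros_like(maze)               # cells sampled by fixations
    value = np.zeros((H, W)); value[goal] = 1.0  # dynamic value map
    fixations = [start]
    while True:
        prev = value.copy()
        for r in range(H):
            for c in range(W):
                if (r, c) == goal:
                    continue
                if revealed[r, c] and maze[r, c]:
                    value[r, c] = 0.0            # known wall carries no value
                    continue
                nbrs = [value[r+dr, c+dc] for dr, dc in ((1,0),(-1,0),(0,1),(0,-1))
                        if 0 <= r+dr < H and 0 <= c+dc < W]
                value[r, c] = gamma * max(nbrs)
        # next eye movement: unrevealed cell with the steepest value gradient
        gy, gx = np.gradient(value)
        score = np.abs(gy) + np.abs(gx)
        score[revealed == 1] = -1.0
        fix = np.unravel_index(np.argmax(score), score.shape)
        fixations.append(fix); revealed[fix] = 1
        if np.max(np.abs(value - prev)) < tol:   # planning has converged
            return value, fixations
```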
Sat 9:30 a.m. - 9:45 a.m.
Crafting Good Views of Medical Images for Contrastive Learning via Expert-level Visual Attention (Oral)
Recent contrastive learning methods have shown significant improvements by focusing on minimizing the distances between different views of the same image. These methods typically craft two randomly augmented views of the same image as a positive pair, expecting the model to capture the inherent representation of the image. However, random data augmentation might not fully preserve image semantic information and can lead to a decline in the quality of the augmented views, thereby affecting the effectiveness of contrastive learning. This issue is particularly pronounced in the domain of medical images, where lesion areas can be subtle and are susceptible to distortion or removal. To address this issue, we leverage insights from radiologists' expertise in diagnosing medical images and propose Gaze-Conditioned Augmentation (GCA) to craft high-quality contrastive views of medical images given the radiologist's visual attention. Specifically, we track the gaze movements of radiologists and model their visual attention when reading X-ray images for diagnosis. The learned model can predict the visual attention of a radiologist presented with a new X-ray image and further guide attention-aware augmentation, ensuring that it pays special attention to preserving disease-related abnormalities. Our proposed GCA can significantly improve the performance of contrastive learning methods on knee X-ray images, revealing its potential in medical applications.
Sheng Wang · Zihao Zhao · Lichi Zhang · Dinggang Shen · Qian Wang
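One plausible way to realize attention-aware augmentation of the kind described above is rejection sampling of crops against the predicted attention map. The sketch below is an illustration only; the function name, crop size, and the 60% attention-retention threshold are assumptions, not the paper's GCA.

```python
import numpy as np

def gaze_conditioned_crop(image, attention, crop=224, keep=0.6, rng=None):
    """Sample a random crop retaining at least `keep` of the predicted
    attention mass (hypothetical re-implementation, not the authors' GCA)."""
    rng = rng or np.random.default_rng()
    H, W = attention.shape
    total = attention.sum()
    for _ in range(100):                         # rejection sampling
        y = rng.integers(0, H - crop + 1)
        x = rng.integers(0, W - crop + 1)
        if attention[y:y+crop, x:x+crop].sum() >= keep * total:
            return image[y:y+crop, x:x+crop]
    # fall back to the crop centred on the attention peak
    py, px = np.unravel_index(np.argmax(attention), attention.shape)
    y = int(np.clip(py - crop // 2, 0, H - crop))
    x = int(np.clip(px - crop // 2, 0, W - crop))
    return image[y:y+crop, x:x+crop]
```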
Sat 9:45 a.m. - 11:30 a.m.
Evaluating Peripheral Vision as an Input Transformation to Understand Object Detection Model Behavior (Poster)
Incorporating aspects of human gaze into deep neural networks (DNNs) has been used to both improve and understand the representational properties of models. We extend this work by simulating peripheral vision -- a key component of human gaze -- in object detection DNNs. To do so, we modify a well-tested model of human peripheral vision (the Texture Tiling Model, TTM) to transform a subset of the MS-COCO dataset to mimic the information loss from peripheral vision. This transformed dataset enables us to (1) evaluate the performance of a variety of pre-trained DNNs on object detection in the periphery, (2) train a Faster-RCNN with peripheral vision input, and (3) test trained DNNs for corruption robustness. Our results show that simulating peripheral vision helps us understand how different DNNs perform under constrained viewing conditions. In addition, we show that one benefit of training with peripheral vision is increased robustness to geometric and high-severity image corruptions, but decreased robustness to noise-like corruptions. Altogether, our work makes it easier to model human peripheral vision in DNNs to understand both the role of peripheral vision in guiding gaze behavior and the benefits of human gaze in machine learning.
Anne Harrington · Vasha DuTell · Mark Hamilton · Ayush Tewari · Simon Stent · Bill Freeman · Ruth Rosenholtz
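For intuition about what such an input transformation does, peripheral information loss is often approximated by eccentricity-dependent blur. The sketch below is a far simpler stand-in for the Texture Tiling Model used in the paper (TTM synthesizes textures rather than blurring); the function name and sigma schedule are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def radial_blur(image, fix_y, fix_x, scale=0.05):
    """Crude eccentricity-dependent blur on a grayscale image: blur strength
    grows linearly with distance from the fixation point (illustration only)."""
    H, W = image.shape
    yy, xx = np.mgrid[0:H, 0:W]
    ecc = np.hypot(yy - fix_y, xx - fix_x)       # eccentricity per pixel
    sigmas = [0, 1, 2, 4, 8]                     # precomputed blur levels
    stack = np.stack([image.astype(float) if s == 0 else
                      gaussian_filter(image.astype(float), sigma=s)
                      for s in sigmas])
    # per pixel, pick the level whose sigma best matches scale * eccentricity
    idx = np.abs(np.array(sigmas)[:, None, None] - scale * ecc[None]).argmin(axis=0)
    return np.take_along_axis(stack, idx[None], axis=0)[0]
```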
Sat 9:45 a.m. - 11:30 a.m.
Leveraging Multi-Modal Saliency and Fusion for Gaze Target Detection (Poster)
Gaze target detection (GTD) is the task of predicting where a person in an image is looking. This is a challenging task, as it requires the ability to understand the relationship between the person's head, body, and eyes, as well as the surrounding environment. In this paper, we propose a novel method for GTD that fuses multiple pieces of information extracted from an image. First, we project the 2D image into a 3D representation using monocular depth estimation. We then extract a depth-infused saliency map, which highlights the most salient (attention-grabbing) regions in the image for the subject under consideration. We also extract face and depth modalities from the image, and finally fuse all the extracted modalities to identify the gaze target. We quantitatively evaluated our method, including ablation analyses, on three publicly available datasets, namely VideoAttentionTarget, GazeFollow, and GOO-Real, and showed that it outperforms other state-of-the-art methods. This suggests that our method is a promising new approach for GTD.
Athul Mathew · Arshad Ali Khan · Thariq Khalid · Faroq AL-Tam · Riad Souissi
Sat 9:45 a.m. - 11:30 a.m.
Exploring Foveation and Saccade for Improved Weakly-Supervised Localization (Poster)
Deep neural networks have become the de facto choice as feature extraction engines, ubiquitously used for computer vision tasks. The current approach is to process every input with uniform resolution in a one-shot manner and make all of the predictions at once. However, human vision is an "active" process that not only actively switches from one focus point to another within the visual field, but also applies spatially varying attention centered at such focus points. To bridge the gap, we propose incorporating the bio-plausible mechanisms of foveation and saccades to build an active object localization framework. While foveation enables it to process different regions of the input with variable degrees of detail, saccades allow it to change the focus point of such foveated regions. Our experiments show that these mechanisms improve the quality of predicted bounding boxes by capturing all the essential object parts while minimizing unnecessary background clutter. Additionally, they enable the resiliency of the method by allowing it to detect multiple objects while being trained only on data containing a single object per image. Finally, using the interesting "duck-rabbit" optical illusion, we show that our method manifests human-like behavior.
Timur Ibrayev · Manish Nagaraj · Amitangshu Mukherjee · Kaushik Roy
Sat 9:45 a.m. - 11:30 a.m.
SAM meets Gaze: Passive Eye Tracking for Prompt-based Instance Segmentation (Poster)
The annotation of large new datasets for machine learning is a very time-consuming and expensive process. This is particularly true for pixel-accurate labelling of, e.g., segmentation masks. Prompt-based methods have been developed to accelerate this label generation process by allowing the model to incorporate additional clues from other sources such as humans. The recently published Segment Anything foundation model (SAM) extends this approach by providing a flexible framework with a model that was trained on more than 1 billion segmentation masks, while also being able to exploit explicit user input. In this paper, we explore the use of a passive eye tracking system to collect gaze data during unconstrained image inspections, which we integrate as a novel prompt input for SAM. We evaluated our method on the original SAM model and finetuned the prompt encoder and mask decoder for different gaze-based inputs, namely fixation points, blurred gaze maps, and multiple heatmap variants. Our results indicate that the acquisition of gaze data is faster than other prompt-based approaches, while the segmentation performance stays comparable to the state-of-the-art performance of SAM. Code is available at https://XXXXXXXX.
Daniel Beckmann · Jacqueline Kockwelp · Joerg Gromoll · Friedemann Kiefer · Benjamin Risse
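As a concrete example of the gaze-map prompt variants the abstract compares, one common construction is a duration-weighted Gaussian density over fixation points. The helper below is a sketch; the sigma, the duration weighting, and the normalization are assumptions rather than the paper's settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixations_to_heatmap(fixations, shape, sigma=25.0, weight_by_duration=True):
    """Turn raw fixations [(y, x, duration_ms), ...] into a dense gaze map
    (one plausible prompt variant; parameters are illustrative)."""
    heat = np.zeros(shape, dtype=float)
    for y, x, dur in fixations:
        heat[int(y), int(x)] += dur if weight_by_duration else 1.0
    heat = gaussian_filter(heat, sigma=sigma)   # blur point hits into a map
    return heat / heat.max() if heat.max() > 0 else heat
```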
Sat 9:45 a.m. - 11:30 a.m.
GazeSAM: Interactive Image Segmentation with Eye Gaze and Segment Anything Model (Poster)
Interactive image segmentation aims to assist users in efficiently generating high-quality data annotations through user-friendly interactions such as clicking, scribbling, and bounding boxes. However, mouse-based interaction methods can induce user fatigue during large-scale dataset annotation and are not entirely suitable for some domains, such as radiology. This study introduces eye gaze as a novel interactive prompt for image segmentation, different from previous model-based applications. Specifically, leveraging the real-time interactive prompting feature of the recently proposed Segment Anything Model (SAM), we present the GazeSAM system, which enables users to collect target segmentation masks by simply looking at the region of interest. GazeSAM tracks users' eye gaze and utilizes it as the input prompt for SAM, generating target segmentation masks in real time. To the best of our knowledge, GazeSAM is the first work to combine eye gaze and SAM for interactive image segmentation. Experimental results demonstrate that GazeSAM can improve efficiency by nearly 50% in 2D natural image and 3D medical image segmentation tasks. The code is available at https://github.com/Anonymous.
Bin Wang · Armstrong Aboah · Zheyuan Zhang · Hongyi Pan · Ulas Bagci
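The core coupling this abstract describes, gaze as a SAM point prompt, can be sketched with the public segment-anything API. The checkpoint path and model size below are placeholders, and the real-time eye-tracker plumbing is omitted; this is a minimal sketch, not the GazeSAM implementation.

```python
# Requires: pip install git+https://github.com/facebookresearch/segment-anything
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

def segment_at_gaze(image_rgb, gaze_xy, checkpoint="sam_vit_b.pth"):
    """Feed the current gaze point to SAM as a single positive point prompt
    (checkpoint path and 'vit_b' size are placeholder assumptions)."""
    sam = sam_model_registry["vit_b"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)                       # HxWx3 uint8, RGB
    masks, scores, _ = predictor.predict(
        point_coords=np.array([gaze_xy], dtype=float),   # (x, y) pixel coords
        point_labels=np.array([1]),                      # 1 = foreground point
        multimask_output=True)
    return masks[np.argmax(scores)]                      # best-scoring mask
```

In a live setting one would load the model once and call predict per fixation, since set_image dominates the per-frame cost.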
Sat 9:45 a.m. - 11:30 a.m.
FoVAE: Reconstructive Foveation as a Self-Supervised Variational Inference Task for Visual Representation Learning (Poster)
We present the first steps toward a model of visual representation learning driven by a self-supervised reconstructive foveation mechanism. Tasked with looking at one visual patch at a time while reconstructing the current patch, predicting the next patch, and reconstructing the full image after a set number of timesteps, FoVAE learns to reconstruct images from the MNIST and Omniglot datasets, while inferring high-level priors about the whole image. In line with theories of Bayesian predictive coding in the brain and prior work on human foveation biases, the model combines bottom-up input processing with top-down learned priors to reconstruct its input, choosing foveation targets that balance local feature predictability with global information gain. FoVAE is able to transfer its priors and foveation policy across datasets to reconstruct samples from untrained datasets in a zero-shot transfer-learning setting. By showing that robust and domain-general policies of generative inference and action-based information gathering emerge from simple biologically-plausible inductive biases, this work paves the way for further exploration of the role of foveation in visual representation learning.
Ivan Vegner · Siddharth N · Leonidas Doumas
Sat 9:45 a.m. - 11:30 a.m.
Human-like multiple object tracking through occlusion via gaze-following (Poster)
State-of-the-art multiple object tracking (MOT) models have recently been shown to behave in qualitatively different ways from human observers. They exhibit superhuman performance for large numbers of targets and subhuman performance when targets disappear behind occluders. Here we investigate whether human gaze behavior can help explain differences in human and model behavior. Human subjects watched scenes with objects of various appearances. They tracked a designated subset of the objects, which moved continuously and frequently disappeared behind static black-bar occluders, reporting the designated objects at the end of each trial. We measured eye movements during tracking and tracking accuracy. We found that human gaze behavior is clearly guided by task relevance: designated objects were preferentially fixated. We compared human performance to that of cognitive models inspired by state-of-the-art MOT models with object slots, where each slot represents the model's probabilistic belief about the location and appearance of one object. In our model, incoming observations are unambiguously assigned to slots using the Hungarian algorithm. Locations are tracked probabilistically (given the hard assignment) with one Kalman filter per slot. We equipped the computational models with a fovea, yielding high-precision observations at the center and low-precision observations in the periphery. We found that constraining models to follow the same gaze behavior as humans (imposing the measured human fixation sequences) best captures human behavioral phenomena. These results demonstrate the importance of gaze behavior, allowing the human visual system to optimally use its limited resources.
Benjamin Peters · Eivinas Butkus · Nikolaus Kriegeskorte
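To make the slot-model description concrete, here is a minimal sketch in the spirit of the abstract: hard Hungarian assignment of observations to slots, one constant-velocity Kalman filter per slot, and higher observation noise in the periphery. The noise values, fovea radius, and class name are assumptions, not the authors' fitted model.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

class SlotTracker:
    """Slot-based tracker: Hungarian assignment + per-slot Kalman filter,
    with a foveated observation model (all parameters are illustrative)."""
    def __init__(self, init_xy, q=1.0):
        n = len(init_xy)                                   # init_xy: (n, 2)
        self.x = np.hstack([init_xy, np.zeros((n, 2))])    # [x, y, vx, vy]
        self.P = np.tile(np.eye(4), (n, 1, 1))
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = 1.0
        self.Q = q * np.eye(4)
        self.H = np.eye(2, 4)                              # observe position only

    def step(self, obs_xy, gaze_xy, fovea_r=50.0):
        # predict all slots forward one step
        self.x = self.x @ self.F.T
        self.P = self.F @ self.P @ self.F.T + self.Q
        # high-precision observations near the fovea, noisy in the periphery
        r = np.where(np.linalg.norm(obs_xy - gaze_xy, axis=1) < fovea_r, 1.0, 100.0)
        # hard assignment of observations to slots (Hungarian algorithm)
        cost = np.linalg.norm(self.x[:, None, :2] - obs_xy[None], axis=-1)
        slots, obs = linear_sum_assignment(cost)
        for i, j in zip(slots, obs):                       # Kalman update per slot
            R = r[j] * np.eye(2)
            S = self.H @ self.P[i] @ self.H.T + R
            K = self.P[i] @ self.H.T @ np.linalg.inv(S)
            self.x[i] += K @ (obs_xy[j] - self.H @ self.x[i])
            self.P[i] = (np.eye(4) - K @ self.H) @ self.P[i]
```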
Sat 9:45 a.m. - 11:30 a.m.
Inverting cognitive models with machine learning to infer preferences from fixations (Poster)
Inferring an individual’s preferences from their observable behavior is a key step in the development of assistive decision-making technology. Although machine learning models such as neural networks could in principle be deployed toward this inference, a large amount of data is required to train such models. Here, we present an approach in which a cognitive model generates simulated data to augment limited human data. Using these data, we train a neural network to invert the model, making it possible to infer preferences from behavior. We show how this approach can be used to infer the value that people assign to food items from their eye movements when choosing between those items. We demonstrate first that neural networks can infer the latent preferences used by the model to generate simulated fixations, and second that simulated data can be beneficial in pretraining a network for predicting human-reported preferences from real fixations. Compared to inferring preferences from choice alone, this approach confers a slight improvement in predicting preferences and also allows prediction to take place prior to the choice being made. Overall, our results suggest that using a combination of neural networks and model-simulated training data is a promising approach for developing technology that infers human preferences.
Evan Russek · Frederick Callaway · Tom Griffiths
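The simulate-then-invert recipe above is easy to illustrate end to end. In the sketch below, a toy attentional drift-diffusion-style simulator stands in for the paper's cognitive model, and a small network learns to map fixation statistics back to the latent preference; every name and parameter here is an illustrative assumption.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def simulate_trial(values, n_fix=6, theta=0.3, rng=None):
    """One simulated two-item choice trial: the fixated item's value counts
    fully, the unfixated one is discounted by theta (toy model, not the paper's)."""
    rng = rng or np.random.default_rng()
    fix = rng.integers(0, 2, size=n_fix)        # which of the 2 items is fixated
    drift = np.where(fix == 0, values[0] - theta * values[1],
                               theta * values[0] - values[1])
    evidence = np.cumsum(drift + rng.normal(0, 1, n_fix))
    # behavioral features: fixation counts per item plus final evidence
    return np.array([*np.bincount(fix, minlength=2), evidence[-1]])

rng = np.random.default_rng(0)
V = rng.uniform(0, 10, size=(5000, 2))          # latent item values
X = np.array([simulate_trial(v, rng=rng) for v in V])
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
net.fit(X, V[:, 0] - V[:, 1])                   # invert: behavior -> preference
```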
Sat 9:45 a.m. - 11:30 a.m.
Detection of Drowsiness and Impending Microsleep from Eye Movements (Poster)
Drowsiness is a contributing factor in an estimated 12% of all road traffic fatalities. It is known that drowsiness directly affects oculomotor control. We therefore investigate whether drowsiness can be detected based on eye movements. To this end, we develop deep neural sequence models that exploit a person's raw eye-gaze and eye-closure signals to detect drowsiness. We explore three measures of drowsiness ground truth: a widely used sleepiness self-assessment, reaction time, and impending microsleep in the near future. We find that our sequence models are able to detect drowsiness and outperform a baseline that processes established engineered features. We also find that the risk of a microsleep event in the near future can be predicted more accurately than the sleepiness self-assessment or the reaction time. Moreover, a model that has been trained on predicting microsleep also excels at predicting self-assessed sleepiness in a cross-task evaluation, which indicates that upcoming microsleep is a less noisy proxy of the drowsiness ground truth. We investigate the relative contribution of eye-closure and gaze information to the model's performance. To make the topic of drowsiness detection more accessible to the research community, we collect and share eye-gaze data from participants in baseline and sleep-deprived states.
Silvia Makowski · Paul Prasse · Lena A. Jäger · Tobias Scheffer
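A deep sequence model over raw gaze and eye-closure signals, as described above, can be as simple as a recurrent network with a sigmoid head. The sketch below is a minimal stand-in with assumed layer sizes and input layout, not the authors' architecture.

```python
import torch
import torch.nn as nn

class DrowsinessNet(nn.Module):
    """GRU over raw gaze (x, y) and eye-closure signals, predicting the
    probability of an impending microsleep (sizes are assumptions)."""
    def __init__(self, n_features=3, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):        # x: (batch, time, [gaze_x, gaze_y, closure])
        _, h = self.rnn(x)
        return torch.sigmoid(self.head(h[-1]))   # P(microsleep ahead)

model = DrowsinessNet()
p = model(torch.randn(8, 1000, 3))   # 8 sequences of 1000 eye-tracking samples
```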
Sat 9:45 a.m. - 11:30 a.m.
StatTexNet: Evaluating the Importance of Statistical Parameters for Pyramid-Based Texture and Peripheral Vision Models (Poster)
Peripheral vision plays an important role in human vision, directing where and when to make saccades. Although human behavior in the periphery is well predicted by pyramid-based texture models, these approaches rely on hand-picked image statistics that are still insufficient to capture a wide variety of textures. To develop a more principled approach to statistic selection for texture-based models of peripheral vision, we develop a self-supervised machine learning model to determine which set of statistics is most important for representing texture. Our model, which we call StatTexNet, uses contrastive learning to take a large set of statistics and compress them to a smaller set that best represents texture families. We validate our method using depleted texture images where the constituent statistics are already known. We then use StatTexNet to determine the most and least important statistics for natural (non-depleted) texture images using weight interpretability metrics, finding these to be consistent with previous psychophysical studies. Finally, we demonstrate that textures are most effectively synthesized with the statistics identified as important; we see noticeable deterioration when excluding the most important statistics, but minimal effects when excluding the least important. Overall, we develop a machine learning method of selecting statistics that can be used to create better peripheral vision models. With these better models, we can more effectively understand the effects of peripheral vision on human gaze.
Christian Koevesdi · Vasha DuTell · Anne Harrington · Mark Hamilton · Bill Freeman · Ruth Rosenholtz
Sat 9:45 a.m. - 11:30 a.m.
Temporal Understanding of Gaze Communication with GazeTransformer (Poster)
Gaze plays a crucial role in daily social interactions as it allows humans to communicate intentions effectively. We address the problem of temporal understanding of gaze communication in social videos in two stages. First, we develop GazeTransformer, an end-to-end module that infers atomic-level behaviours in a given frame. Second, we develop a temporal module that predicts event-level behaviours in a video using the inferred atomic-level behaviours. Compared to existing methods, GazeTransformer does not require human head and object locations as input. Instead, it identifies these locations in a parallel and end-to-end manner. In addition, it can predict the attended targets of all predicted humans and infer more atomic-level behaviours that cannot be handled simultaneously by previous approaches. We achieve promising performance on both atomic- and event-level prediction on the (M)VACATION dataset. Code will be available at https://github.com/gazetransformer/gazetransformer.
Ryan Anthony de Belen · Gelareh Mohammadi · Arcot Sowmya
Sat 9:45 a.m. - 11:30 a.m.
Interaction-aware Dynamic 3D Gaze Estimation in Videos (Poster)
Human gaze in in-the-wild and outdoor human activities is a continuous and dynamic process driven by anatomical eye movements such as fixations, saccades, and smooth pursuit. However, learning gaze dynamics in videos remains a challenging task, as annotating human gaze in videos is labor-intensive. In this paper, we propose a novel method for dynamic 3D gaze estimation in videos that utilizes human interaction labels. Our model contains a temporal gaze estimator built upon autoregressive Transformer structures. In addition, our model learns the spatial relationship of gaze among multiple subjects by constructing a Human Interaction Graph from predicted gaze and updating the gaze features with a structure-aware Transformer. Our model predicts future gaze conditioned on historical gaze and the gaze interactions in an autoregressive manner. We propose a multi-state training algorithm that alternately updates the interaction module and the dynamic gaze estimation module when training on a mixture of labeled and unlabeled sequences. We show significant improvements in both within-domain gaze estimation accuracy and cross-domain generalization on the state-of-the-art, physically unconstrained, in-the-wild Gaze360 gaze estimation benchmark.
Chenyi Kuang · Jeffrey O Kephart · Qiang Ji
Sat 9:45 a.m. - 11:30 a.m.
SuperVision: Self-Supervised Super-Resolution for Appearance-Based Gaze Estimation (Poster)
Gaze estimation is a valuable tool with a broad range of applications in various fields, including medicine, psychology, virtual reality, marketing, and safety. Therefore, it is essential to have gaze estimation software that is cost-efficient and high-performing. Accurately predicting gaze remains a difficult task, particularly in real-world situations where images are affected by motion blur, video compression, and noise. Super-resolution (SR) has been shown to remove these degradations and improve image quality from a visual perspective. This work examines the usefulness of super-resolution for improving appearance-based gaze estimation and demonstrates that not all SR models preserve the gaze direction. We propose a two-step framework for gaze estimation based on the SwinIR super-resolution model. The proposed method consistently outperforms the state-of-the-art, particularly in scenarios involving low-resolution or degraded images. Furthermore, we examine the use of super-resolution through the lens of self-supervised learning for gaze estimation and propose a novel architecture, “SuperVision”, by fusing an SR backbone network with a ResNet18. While using only 20% of the data, the proposed SuperVision architecture outperforms the state-of-the-art GazeTR method by 15.5%.
Galen O'Shea · Majid Komeili
Sat 9:45 a.m. - 11:30 a.m.
An Attention-based Predictive Agent for Handwritten Numeral/Alphabet Recognition via Generation (Poster)
A number of attention-based models for either classification or generation of handwritten numerals/alphabets have been reported in the literature. However, generation and classification are done jointly in very few end-to-end models. We propose a predictive agent model that actively samples its visual environment via a sequence of glimpses. The attention is driven by the agent's sensory prediction (or generation) error. At each sampling instant, the model predicts the observation class and completes the partial sequence observed until that instant. It learns where and what to sample by jointly minimizing the classification and generation errors. Three variants of this model are evaluated for handwriting generation and recognition on images of handwritten numerals and alphabets from benchmark datasets. We show that the proposed model is more efficient in handwritten numeral/alphabet recognition than human participants in a recently published study, as well as a highly cited attention-based reinforcement model. This is the first known attention-based agent to interact with and learn end-to-end from images for recognition via generation, with a high degree of accuracy and efficiency.
Bonny Banerjee · Murchana Baruah
Sat 9:45 a.m. - 11:30 a.m.
EG-SIF: Improving Appearance Based Gaze Estimation using Self Improving Features (Poster)
Gaze estimation is vital in various applications, but factors like poor lighting and low-resolution images challenge the performance of estimation models. We introduce, for the first time, an Eye Gaze Estimation with Self-Improving Features (EG-SIF) method. EG-SIF segregates images based on their quality, generates pairs of good and adverse images, and applies multitask training with image enhancement using the generated pairs, where the task is to reconstruct the good image given a poor one. This innovative approach outperforms existing methods, significantly improving gaze estimation angular error on challenging datasets: from 4.64 to 4.53 on MPIIGaze and from 7.44 to 7.41 on RTGene.
Vasudev Singh · Chaitanya Langde · Sourav Lakhotia · Vignesh Kannan · Shuaib Ahmed
Sat 9:45 a.m. - 11:30 a.m.
Planning by Active Sensing (Poster)
Flexible behavior requires rapid planning, but planning requires a good internal model of the environment. Learning this model by trial and error is impractical when acting in complex environments. How do humans plan action sequences efficiently when there is uncertainty about model components? To address this, we asked human participants to navigate complex mazes in virtual reality. We found that the paths taken to gather rewards were close to optimal even though participants had no prior knowledge of these environments. Based on the sequential eye movement patterns observed when participants mentally compute a path before navigating, we develop an algorithm that is capable of rapidly planning under uncertainty by active sensing, i.e., visually sampling information about the structure of the environment. New eye movements are chosen in an iterative manner by following the gradient of a dynamic value map, which is updated based on the previous eye movement, until the planning process reaches convergence. In addition to bearing hallmarks of human navigational planning, the proposed algorithm is sample-efficient: the number of visual samples needed for planning scales linearly with the path length regardless of the size of the state space.
Kaushik Lakshminarasimhan · Seren Zhu · Dora Angelaki
Sat 9:45 a.m. - 11:30 a.m.
Crafting Good Views of Medical Images for Contrastive Learning via Expert-level Visual Attention (Poster)
Recent contrastive learning methods have shown significant improvements by focusing on minimizing the distances between different views of the same image. These methods typically craft two randomly augmented views of the same image as a positive pair, expecting the model to capture the inherent representation of the image. However, random data augmentation might not fully preserve image semantic information and can lead to a decline in the quality of the augmented views, thereby affecting the effectiveness of contrastive learning. This issue is particularly pronounced in the domain of medical images, where lesion areas can be subtle and are susceptible to distortion or removal. To address this issue, we leverage insights from radiologists' expertise in diagnosing medical images and propose Gaze-Conditioned Augmentation (GCA) to craft high-quality contrastive views of medical images given the radiologist's visual attention. Specifically, we track the gaze movements of radiologists and model their visual attention when reading X-ray images for diagnosis. The learned model can predict the visual attention of a radiologist presented with a new X-ray image and further guide attention-aware augmentation, ensuring that it pays special attention to preserving disease-related abnormalities. Our proposed GCA can significantly improve the performance of contrastive learning methods on knee X-ray images, revealing its potential in medical applications.
Sheng Wang · Zihao Zhao · Lichi Zhang · Dinggang Shen · Qian Wang
Sat 9:45 a.m. - 11:30 a.m.
Memory-Based Sequential Attention (Poster)
Computational models of sequential attention often use recurrent neural networks, which may lead to information loss over accumulated glimpses and an inability to dynamically reweight glimpses at each step. Addressing the former limitation should result in greater performance, while addressing the latter should enable greater interpretability. In this work, we propose a biologically inspired model of sequential attention for image classification. Specifically, our algorithm contextualizes the history of observed locations from within an image to inform future gaze points, akin to scanpaths in the biological visual system. We achieve this by using a transformer-based memory module coupled with a reinforcement learning-based learning algorithm, improving both task performance and model interpretability. In addition to empirically evaluating our approach on classical vision tasks, we demonstrate the robustness of our algorithm to different initial locations in the image and provide interpretations of sampled locations from within the trajectory.
Jason Stock · Charles Anderson
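The transformer-based memory module the abstract contrasts with an RNN's compressed state can be sketched directly: keep every glimpse embedding and re-attend over the full history at each step. The sizes, head layout, and class name below are assumptions for illustration, and the reinforcement learning training loop is omitted.

```python
import torch
import torch.nn as nn

class GlimpseMemory(nn.Module):
    """Transformer over the history of glimpse embeddings, so every past
    glimpse can be re-weighted at each step (sizes are assumptions)."""
    def __init__(self, d=128, heads=4, layers=2, n_classes=10):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True),
            num_layers=layers)
        self.classify = nn.Linear(d, n_classes)
        self.where = nn.Linear(d, 2)          # next (y, x) gaze point in [-1, 1]

    def forward(self, glimpses):              # (batch, steps_so_far, d)
        ctx = self.encoder(glimpses)[:, -1]   # attend over the full history
        return self.classify(ctx), torch.tanh(self.where(ctx))

logits, next_gaze = GlimpseMemory()(torch.randn(4, 6, 128))  # 6 glimpses so far
```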
Sat 10:00 a.m. - 11:30 a.m.
Lunch Break
Sat 11:30 a.m. - 12:15 p.m.
Tim Rolff (Universität Hamburg) - Gazing into the Crystal Ball: Predicting Future Gaze Events in Virtual Reality (Invited Talk)
Accurately predicting human egocentric gaze events, such as saccades, fixations, and blinks, holds transformative potential for virtual reality (VR) applications. By using eye-trackers integrated into wearable head-mounted displays, it is possible to optimize runtime, improve user experience, and integrate gaze into downstream tasks. However, predicting gaze events remains challenging as the temporal dynamics of egocentric gaze events are multifaceted, for example, influenced by visual stimuli and task demands. While deep learning and machine learning offer promising avenues, the investigation of these approaches in gaze event prediction remains largely uncharted. In this talk, we present recent advances in recurrent time-to-event analysis for gaze event prediction, addressing the pressing challenges of temporal modeling and real-time prediction. This talk will discuss the potential implications, challenges, and techniques of accurate gaze event prediction in the context of egocentric vision with a focus on VR.
Sat 12:15 p.m. - 12:30 p.m.
Memory-Based Sequential Attention (Oral)
Computational models of sequential attention often use recurrent neural networks, which may lead to information loss over accumulated glimpses and an inability to dynamically reweight glimpses at each step. Addressing the former limitation should result in greater performance, while addressing the latter should enable greater interpretability. In this work, we propose a biologically inspired model of sequential attention for image classification. Specifically, our algorithm contextualizes the history of observed locations from within an image to inform future gaze points, akin to scanpaths in the biological visual system. We achieve this by using a transformer-based memory module coupled with a reinforcement learning-based learning algorithm, improving both task performance and model interpretability. In addition to empirically evaluating our approach on classical vision tasks, we demonstrate the robustness of our algorithm to different initial locations in the image and provide interpretations of sampled locations from within the trajectory.
Jason Stock · Charles Anderson
Sat 12:30 p.m. - 12:45 p.m.
An Attention-based Predictive Agent for Handwritten Numeral/Alphabet Recognition via Generation (Oral)
A number of attention-based models for either classification or generation of handwritten numerals/alphabets have been reported in the literature. However, generation and classification are done jointly in very few end-to-end models. We propose a predictive agent model that actively samples its visual environment via a sequence of glimpses. The attention is driven by the agent's sensory prediction (or generation) error. At each sampling instant, the model predicts the observation class and completes the partial sequence observed until that instant. It learns where and what to sample by jointly minimizing the classification and generation errors. Three variants of this model are evaluated for handwriting generation and recognition on images of handwritten numerals and alphabets from benchmark datasets. We show that the proposed model is more efficient in handwritten numeral/alphabet recognition than human participants in a recently published study, as well as a highly cited attention-based reinforcement model. This is the first known attention-based agent to interact with and learn end-to-end from images for recognition via generation, with a high degree of accuracy and efficiency.
Bonny Banerjee · Murchana Baruah
Sat 12:45 p.m. - 1:00 p.m.
Breakout Session (Instructions)
Sat 1:00 p.m. - 1:30 p.m.
Coffee Break
Sat 1:00 p.m. - 2:30 p.m.
Breakout Session
Sat 2:30 p.m. - 2:45 p.m.
Sponsors Talk and Award Ceremony
Sat 2:45 p.m. - 3:00 p.m.
Closing Remarks