Temporal Gaze Dynamics as Zero-Shot Prompts for Volumetric Medical Segmentation
Abstract
Guiding foundation models such as SAM-2 for volumetric medical segmentation typically relies on inefficient manual prompts. We introduce a more efficient, multimodal approach that uses eye gaze, a continuous physiological time series, to steer the model's focus in a zero-shot manner. By fusing a user's temporal gaze stream with spatial image data, we enable dynamic, interactive 3D segmentation. In evaluations with SAM-2 and its medical variant, MedSAM-2, our gaze-based method is significantly more time-efficient than manual bounding boxes (e.g., 62 vs. 88 seconds per volume), with a modest accuracy trade-off. This work establishes a practical framework for incorporating human physiological signals into sequential, human-in-the-loop clinical tasks, paving the way for more intuitive AI interfaces.
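To make the idea of fusing a temporal gaze stream with spatial image data concrete, the following is a minimal illustrative sketch, not the pipeline evaluated in this work. It assumes each gaze sample carries a timestamp, a normalized screen position, and the index of the slice on display when it was recorded, and it collapses the stream into one positive point prompt per slice. The names `GazeSample`, `gaze_to_point_prompts`, and `min_samples` are hypothetical and introduced only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GazeSample:
    """One gaze reading (hypothetical record layout, assumed for this sketch)."""
    t: float        # timestamp in seconds
    x: float        # normalized horizontal gaze position in [0, 1]
    y: float        # normalized vertical gaze position in [0, 1]
    slice_idx: int  # slice displayed when the sample was recorded


def gaze_to_point_prompts(samples, width, height, min_samples=3):
    """Collapse a gaze stream into one point prompt per slice.

    Returns {slice_idx: (px, py)} in pixel coordinates, using the mean
    gaze position on each slice. Slices with fewer than `min_samples`
    valid samples are skipped, on the assumption that they were only
    glanced at in passing.
    """
    by_slice = defaultdict(list)
    for s in samples:
        # Discard samples outside the image area (blinks, off-screen looks).
        if 0.0 <= s.x <= 1.0 and 0.0 <= s.y <= 1.0:
            by_slice[s.slice_idx].append(s)

    prompts = {}
    for idx, group in sorted(by_slice.items()):
        if len(group) < min_samples:
            continue
        mean_x = sum(s.x for s in group) / len(group)
        mean_y = sum(s.y for s in group) / len(group)
        # Scale to pixels and clamp so x == 1.0 or y == 1.0 stays in bounds.
        px = min(int(mean_x * width), width - 1)
        py = min(int(mean_y * height), height - 1)
        prompts[idx] = (px, py)
    return prompts
```

Under these assumptions, each returned `(px, py)` pair could be passed as a positive point prompt for the corresponding slice of a promptable segmentation model. Averaging per slice is the simplest possible aggregation; a fixation detector or a dwell-time weighting would be natural alternatives.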