L$^3$Seg: $\underline{\mathrm{L}}$ean $\underline{\mathrm{L}}$inear $\underline{\mathrm{L}}$ayers for Language-Guided Vision Transformer in Medical Image Segmentation
Rahul Bhardwaj · Utkarsh Tambe · Debanga Neog
Abstract
Vision-language models offer strong potential for medical image segmentation by integrating visual data with clinical text. However, these models typically involve large parameter counts and high computational cost, making them impractical for real-time use. This paper presents L$^3$Seg, a lightweight and efficient vision-language segmentation framework. The key component is the Lean Linear Layer (L$^3$), a linear projection that freezes pretrained weights and biases while learning only a small, token-dependent residual factorized into two low-rank matrices. Unlike conventional parameter-efficient methods, L$^3$ adapts each token representation at minimal extra computational cost. L$^3$Seg replaces all dense linear projections in the vision encoder and the vision-text fusion module with L$^3$, achieving state-of-the-art segmentation with only 8.2M parameters and 5.1 GFLOPs. Experiments demonstrate consistent improvements across X-ray (QaTa-COV19), endoscopy (Kvasir-SEG), and ultrasound (BUSI) datasets, even with limited training data and sparse textual input. The source code is available at: https://github.com/bhardwaj-rahul-rb/l3seg
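The frozen-projection-plus-low-rank-residual idea described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the names `W0`, `b0`, `A`, `B`, the rank `r`, and the exact form of the residual are all hypothetical; the residual here is token-dependent only through the input token `x`, which is one plausible reading of the abstract.

```python
import numpy as np

def lean_linear(x, W0, b0, A, B):
    """Frozen linear projection plus a learned low-rank residual.

    W0 (d_out x d_in) and b0 are frozen pretrained parameters.
    A (r x d_in) and B (d_out x r) are the only trainable matrices,
    with rank r << min(d_in, d_out), so trainable parameters scale as
    r * (d_in + d_out) instead of d_in * d_out. The residual B @ (A @ x)
    varies per token because it depends on the token embedding x.
    All names and the parameterization are illustrative assumptions.
    """
    return W0 @ x + b0 + B @ (A @ x)

# Toy dimensions: d_in = d_out = 8, rank r = 2.
rng = np.random.default_rng(0)
d, r = 8, 2
W0 = rng.standard_normal((d, d))  # frozen pretrained weight
b0 = rng.standard_normal(d)       # frozen pretrained bias
A = np.zeros((r, d))              # trainable; zero init keeps residual at 0
B = rng.standard_normal((d, r))   # trainable
x = rng.standard_normal(d)        # one token embedding

y = lean_linear(x, W0, b0, A, B)
# With A initialized to zero, the layer reproduces the frozen projection.
assert np.allclose(y, W0 @ x + b0)
```

Initializing one factor to zero is a common choice in low-rank adaptation schemes, since it guarantees the adapted layer starts out identical to the pretrained one.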