Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI); their impressive zero-shot capacity for user-tailored tasks gives them immense potential across a range of applications. However, in computer vision, the numerous powerful vision foundation models (VFMs) now available are still restricted to tasks in a pre-defined form and struggle to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these instructions for open-ended tasks. Extensive experiments show that the proposed VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. Notably, with a generalist LLM-based framework, our model achieves over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The code will be released.
Author Information
Wenhai Wang (The Chinese University of Hong Kong)
Zhe Chen (Nanjing University)
Xiaokang Chen (Peking University)
Jiannan Wu (University of Hong Kong)
Xizhou Zhu (Shanghai AI Laboratory)
Gang Zeng (Peking University)
Ping Luo (The University of Hong Kong)
Tong Lu (Nanjing University)
Jie Zhou (Tsinghua University)
Yu Qiao (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences)
Jifeng Dai (Tsinghua University)
More from the Same Authors
-
2021 : An Empirical Investigation of Representation Learning for Imitation »
Xin Chen, Cynthia · Sam Toyer · Cody Wild · Scott Emmons · Ian Fischer · Kuang-Huei Lee · Neel Alex · Steven Wang · Ping Luo · Stuart Russell · Pieter Abbeel · Rohin Shah -
2022 Poster: OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression »
Wanhua Li · Xiaoke Huang · Zheng Zhu · Yansong Tang · Xiu Li · Jie Zhou · Jiwen Lu -
2022 Poster: Compressible-composable NeRF via Rank-residual Decomposition »
Jiaxiang Tang · Xiaokang Chen · Jingbo Wang · Gang Zeng -
2022 Poster: P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting »
Ziyi Wang · Xumin Yu · Yongming Rao · Jie Zhou · Jiwen Lu -
2022 : SEM2: Enhance Sample Efficiency and Robustness of End-to-end Urban Autonomous Driving via Semantic Masked World Model »
Zeyu Gao · Yao Mu · Ruoyan Shen · Chen Chen · Yangang Ren · Jianyu Chen · Shengbo Li · Ping Luo · Yanfeng Lu -
2023 Poster: OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping »
Huijie Wang · Tianyu Li · Yang Li · Li Chen · Chonghao Sima · Zhenbo Liu · Bangjun Wang · Peijin Jia · Yuting Wang · Shengyin Jiang · Feng Wen · Hang Xu · Ping Luo · Junchi Yan · Wei Zhang · Hongyang Li -
2023 Poster: Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection »
Linyan Huang · Zhiqi Li · Chonghao Sima · Wenhai Wang · Jingdong Wang · Yu Qiao · Hongyang Li -
2023 Poster: RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths »
Zeyue Xue · Guanglu Song · Qiushan Guo · Boxiao Liu · Zhuofan Zong · Yu Liu · Ping Luo -
2023 Poster: Flow-Based Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection »
Haibao Yu · Yingjuan Tang · Enze Xie · Jilei Mao · Ping Luo · Zaiqing Nie -
2023 Poster: AD-PT: Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset »
Jiakang Yuan · Bo Zhang · Xiangchao Yan · Botian Shi · Tao Chen · Yikang LI · Yu Qiao -
2023 Poster: Networks are Slacking Off: Understanding Generalization Problem in Image Deraining »
Jinjin Gu · Xianzheng Ma · Xiangtao Kong · Yu Qiao · Chao Dong -
2023 Poster: EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought »
Yao Mu · Qinglong Zhang · Mengkang Hu · Wenhai Wang · Mingyu Ding · Jun Jin · Bin Wang · Jifeng Dai · Yu Qiao · Ping Luo -
2023 Poster: Uncovering and Quantifying Social Biases in Code Generation »
Yan Liu · Xiaokang Chen · Yan Gao · Zhe Su · Fengji Zhang · Daoguang Zan · Jian-Guang Lou · Pin-Yu Chen · Tsung-Yi Ho -
2023 Poster: TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation »
Rongkun Zheng · Lu Qi · Xi Chen · Yi Wang · Kun Wang · Yu Qiao · Hengshuang Zhao -
2023 Poster: Real-World Image Super-Resolution as Multi-Task Learning »
Wenlong Zhang · Xiaohui Li · Guangyuan SHI · Xiangyu Chen · Yu Qiao · Xiaoyun Zhang · Xiao-Ming Wu · Chao Dong -
2023 Poster: Foundation Model is Efficient Multimodal Multitask Model Selector »
fanqing meng · Wenqi Shao · zhanglin peng · Chonghe Jiang · Kaipeng Zhang · Yu Qiao · Ping Luo -
2023 Poster: UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models »
Wenliang Zhao · Lujia Bai · Yongming Rao · Jie Zhou · Jiwen Lu -
2023 Poster: MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory »
Yinan Liang · Ziwei Wang · Xiuwei Xu · Yansong Tang · Jie Zhou · Jiwen Lu -
2023 Poster: JourneyDB: A Benchmark for Generative Image Understanding »
Keqiang Sun · Junting Pan · Yuying Ge · Hao Li · Haodong Duan · Xiaoshi Wu · Renrui Zhang · Aojun Zhou · Zipeng Qin · Yi Wang · Jifeng Dai · Yu Qiao · Limin Wang · Hongsheng Li -
2022 Workshop: Vision Transformers: Theory and applications »
Fahad Shahbaz Khan · Gul Varol · Salman Khan · Ping Luo · Rao Anwer · Ashish Vaswani · Hisham Cholakkal · Niki Parmar · Joost van de Weijer · Mubarak Shah -
2022 Spotlight: Lightning Talks 6B-3 »
Lingfeng Yang · Yao Lai · Zizheng Pan · Zhenyu Wang · Weicong Liang · Chuanyang Zheng · Jian-Wei Zhang · Peng Jin · Jing Liu · Xiuying Wei · Yao Mu · Xiang Li · YUHUI YUAN · Zizheng Pan · Yifan Sun · Yunchen Zhang · Jianfei Cai · Hao Luo · zheyang li · Jinfa Huang · Haoyu He · Yi Yang · Ping Luo · Fenglin Liu · Henghui Ding · Borui Zhao · Xiangguo Zhang · Kai Zhang · Pichao WANG · Bohan Zhuang · Wei Chen · Ruihao Gong · Zhi Yang · Xian Wu · Feng Ding · Jianfei Cai · Xiao Luo · Renjie Song · Weihong Lin · Jian Yang · Wenming Tan · Bohan Zhuang · Shanghang Zhang · Shen Ge · Fan Wang · Qi Zhang · Guoli Song · Jun Xiao · Hao Li · Ding Jia · David Clifton · Ye Ren · Fengwei Yu · Zheng Zhang · Jie Chen · Shiliang Pu · Xianglong Liu · Chao Zhang · Han Hu -
2022 Spotlight: P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting »
Ziyi Wang · Xumin Yu · Yongming Rao · Jie Zhou · Jiwen Lu -
2022 Spotlight: MaskPlace: Fast Chip Placement via Reinforced Visual Representation Learning »
Yao Lai · Yao Mu · Ping Luo -
2022 Spotlight: Lightning Talks 6A-1 »
Ziyi Wang · Nian Liu · Yaming Yang · Qilong Wang · Yuanxin Liu · Zongxin Yang · Yizhao Gao · Yanchen Deng · Dongze Lian · Nanyi Fei · Ziyu Guan · Xiao Wang · Shufeng Kong · Xumin Yu · Daquan Zhou · Yi Yang · Fandong Meng · Mingze Gao · Caihua Liu · Yongming Rao · Zheng Lin · Haoyu Lu · Zhe Wang · Jiashi Feng · Zhaolin Zhang · Deyu Bo · Xinchao Wang · Chuan Shi · Jiangnan Li · Jiangtao Xie · Jie Zhou · Zhiwu Lu · Wei Zhao · Bo An · Jiwen Lu · Peihua Li · Jian Pei · Hao Jiang · Cai Xu · Peng Fu · Qinghua Hu · Yijie Li · Weigang Lu · Yanan Cao · Jianbin Huang · Weiping Wang · Zhao Cao · Jie Zhou -
2022 Spotlight: DOMINO: Decomposed Mutual Information Optimization for Generalized Context in Meta-Reinforcement Learning »
Yao Mu · Yuzheng Zhuang · Fei Ni · Bin Wang · Jianyu Chen · Jianye Hao · Ping Luo -
2022 Spotlight: Lightning Talks 5A-1 »
Yao Mu · Jin Zhang · Haoyi Niu · Rui Yang · Mingdong Wu · Ze Gong · Shubham Sharma · Chenjia Bai · Yu ("Tony") Zhang · Siyuan Li · Yuzheng Zhuang · Fangwei Zhong · Yiwen Qiu · Xiaoteng Ma · Fei Ni · Yulong Xia · Chongjie Zhang · Hao Dong · Ming Li · Zhaoran Wang · Bin Wang · Chongjie Zhang · Jianyu Chen · Guyue Zhou · Lei Han · Jianming HU · Jianye Hao · Xianyuan Zhan · Ping Luo -
2022 Spotlight: Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs »
Jinguo Zhu · Xizhou Zhu · Wenhai Wang · Xiaohua Wang · Hongsheng Li · Xiaogang Wang · Jifeng Dai -
2022 Spotlight: MCMAE: Masked Convolution Meets Masked Autoencoders »
Peng Gao · Teli Ma · Hongsheng Li · Ziyi Lin · Jifeng Dai · Yu Qiao -
2022 Spotlight: Compressible-composable NeRF via Rank-residual Decomposition »
Jiaxiang Tang · Xiaokang Chen · Jingbo Wang · Gang Zeng -
2022 Spotlight: Lightning Talks 2B-1 »
Yehui Tang · Jian Wang · Zheng Chen · man zhou · Peng Gao · Chenyang Si · SHANGKUN SUN · Yixing Xu · Weihao Yu · Xinghao Chen · Kai Han · Hu Yu · Yulun Zhang · Chenhui Gou · Teli Ma · Yuanqi Chen · Yunhe Wang · Hongsheng Li · Jinjin Gu · Jianyuan Guo · Qiman Wu · Pan Zhou · Yu Zhu · Jie Huang · Chang Xu · Yichen Zhou · Haocheng Feng · Guodong Guo · yongbing zhang · Ziyi Lin · Feng Zhao · Ge Li · Junyu Han · Jinwei Gu · Jifeng Dai · Chao Xu · Xinchao Wang · Linghe Kong · Shuicheng Yan · Yu Qiao · Chen Change Loy · Xin Yuan · Errui Ding · Yunhe Wang · Deyu Meng · Jingdong Wang · Chongyi Li -
2022 Poster: Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training »
Renrui Zhang · Ziyu Guo · Peng Gao · Rongyao Fang · Bin Zhao · Dong Wang · Yu Qiao · Hongsheng Li -
2022 Poster: Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs »
Jinguo Zhu · Xizhou Zhu · Wenhai Wang · Xiaohua Wang · Hongsheng Li · Xiaogang Wang · Jifeng Dai -
2022 Poster: DOMINO: Decomposed Mutual Information Optimization for Generalized Context in Meta-Reinforcement Learning »
Yao Mu · Yuzheng Zhuang · Fei Ni · Bin Wang · Jianyu Chen · Jianye Hao · Ping Luo -
2022 Poster: AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition »
Shoufa Chen · Chongjian GE · Zhan Tong · Jiangliu Wang · Yibing Song · Jue Wang · Ping Luo -
2022 Poster: MaskPlace: Fast Chip Placement via Reinforced Visual Representation Learning »
Yao Lai · Yao Mu · Ping Luo -
2022 Poster: MCMAE: Masked Convolution Meets Masked Autoencoders »
Peng Gao · Teli Ma · Hongsheng Li · Ziyi Lin · Jifeng Dai · Yu Qiao -
2022 Poster: AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation »
Yuanfeng Ji · Haotian Bai · Chongjian GE · Jie Yang · Ye Zhu · Ruimao Zhang · Zhen Li · Lingyan Zhang · Wanling Ma · Xiang Wan · Ping Luo -
2022 Poster: Rethinking Resolution in the Context of Efficient Video Recognition »
Chuofan Ma · Qiushan Guo · Yi Jiang · Ping Luo · Zehuan Yuan · Xiaojuan Qi -
2022 Poster: Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline »
Penghao Wu · Xiaosong Jia · Li Chen · Junchi Yan · Hongyang Li · Yu Qiao -
2022 Poster: HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions »
Yongming Rao · Wenliang Zhao · Yansong Tang · Jie Zhou · Ser Nam Lim · Jiwen Lu -
2022 Poster: Large-batch Optimization for Dense Visual Predictions: Training Faster R-CNN in 4.2 Minutes »
Zeyue Xue · Jianming Liang · Guanglu Song · Zhuofan Zong · Liang Chen · Yu Liu · Ping Luo -
2021 Poster: Rethinking the Pruning Criteria for Convolutional Neural Network »
Zhongzhan Huang · Wenqi Shao · Xinjiang Wang · Liang Lin · Ping Luo -
2021 Poster: Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language »
Mingyu Ding · Zhenfang Chen · Tao Du · Ping Luo · Josh Tenenbaum · Chuang Gan -
2021 Poster: Model-Based Reinforcement Learning via Imagination with Derived Memory »
Yao Mu · Yuzheng Zhuang · Bin Wang · Guangxiang Zhu · Wulong Liu · Jianyu Chen · Ping Luo · Shengbo Li · Chongjie Zhang · Jianye Hao -
2021 Poster: Spectrum-to-Kernel Translation for Accurate Blind Image Super-Resolution »
Guangpin Tao · Xiaozhong Ji · Wenzhuo Wang · Shuo Chen · Chuming Lin · Yun Cao · Tong Lu · Donghao Luo · Ying Tai -
2021 Poster: DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification »
Yongming Rao · Wenliang Zhao · Benlin Liu · Jiwen Lu · Jie Zhou · Cho-Jui Hsieh -
2021 Poster: Revitalizing CNN Attention via Transformers in Self-Supervised Visual Representation Learning »
Chongjian GE · Youwei Liang · YIBING SONG · Jianbo Jiao · Jue Wang · Ping Luo -
2021 Poster: Global Filter Networks for Image Classification »
Yongming Rao · Wenliang Zhao · Zheng Zhu · Jiwen Lu · Jie Zhou -
2021 Poster: Compressed Video Contrastive Learning »
Yuqi Huo · Mingyu Ding · Haoyu Lu · Nanyi Fei · Zhiwu Lu · Ji-Rong Wen · Ping Luo -
2021 Poster: SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers »
Enze Xie · Wenhai Wang · Zhiding Yu · Anima Anandkumar · Jose M. Alvarez · Ping Luo -
2020 Poster: Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection »
Xiang Li · Wenhai Wang · Lijun Wu · Shuo Chen · Xiaolin Hu · Jun Li · Jinhui Tang · Jian Yang -
2017 Poster: Runtime Neural Pruning »
Ji Lin · Yongming Rao · Jiwen Lu · Jie Zhou -
2014 Poster: Multi-View Perceptron: a Deep Model for Learning Face Identity and View Representations »
Zhenyao Zhu · Ping Luo · Xiaogang Wang · Xiaoou Tang