VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Jiannan Wu*2,1, Muyan Zhong*3, Sen Xing*3, Zeqiang Lai*4, Zhaoyang Liu*5,1, Zhe Chen*6,1, Wenhai Wang*1, Xizhou Zhu3,7,1, Lewei Lu7,1, Tong Lu6, Ping Luo2, Yu Qiao1, Jifeng Dai† 3,1
OpenGVLab, Shanghai AI Laboratory1, The University of Hong Kong2, Tsinghua University3,
Beijing Institute of Technology4, The Hong Kong University of Science and Technology5,
Nanjing University6, SenseTime Research7

*Equal Contribution, †Corresponding Author

Abstract

We present VisionLLM v2, an end-to-end generalist multimodal large language model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs limited to text output, VisionLLM v2 significantly broadens its application scope. It excels not only in conventional visual question answering (VQA) but also in open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing. To this end, we propose a new information transmission mechanism termed "super link", which serves as a medium connecting the MLLM with task-specific decoders. It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios. In addition, to support this diverse range of tasks, we carefully collected and curated training data from hundreds of public vision and vision-language tasks. In this way, our model can be jointly trained end-to-end on hundreds of vision-language tasks and generalizes across them with a single set of shared parameters, driven by different user prompts, achieving performance comparable to task-specific models. We believe VisionLLM v2 will offer a new perspective on the generalization of MLLMs.

Teaser image.

VisionLLM v2

Architecture. VisionLLM v2 consists of four parts: (1) an image encoder and a region encoder that encode image-level and region-level information; (2) a large language model (LLM) that models the multimodal inputs and generates textual responses; (3) a series of task-specific decoders for performing downstream tasks; and (4) a super link that uses routing tokens and super-link queries for efficient, conflict-free information transmission.
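
A minimal sketch of how these four parts could be composed in a forward pass is shown below. The module and argument names (image_encoder, region_encoder, task_decoders, VisionLLMv2Sketch) are illustrative assumptions for exposition, not the released implementation.

    # Sketch under assumed module names; not the official VisionLLM v2 code.
    import torch.nn as nn

    class VisionLLMv2Sketch(nn.Module):
        def __init__(self, image_encoder, region_encoder, llm, task_decoders):
            super().__init__()
            self.image_encoder = image_encoder            # (1) image-level features
            self.region_encoder = region_encoder          # (1) region-level features
            self.llm = llm                                # (2) multimodal LLM
            self.decoders = nn.ModuleDict(task_decoders)  # (3) task-specific decoders
            # (4) the super link (routing tokens + super-link queries) is sketched
            #     in the next code block.

        def forward(self, image, regions, text_tokens):
            img_feats = self.image_encoder(image)
            reg_feats = self.region_encoder(image, regions) if regions is not None else None
            # The LLM fuses visual features with the text prompt and returns the
            # textual response plus hidden states consumed by the super link.
            return self.llm(img_feats, reg_feats, text_tokens)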

Super Link Technique. The super link is proposed to tackle the challenges of selecting the appropriate decoder, avoiding task conflicts, and enabling effective information transmission between the LLM and the decoders. It comprises two parts: (1) Routing Tokens. We add routing tokens, e.g., [DET], [SEG], [GEN], as special tokens to the LLM vocabulary. They act as triggers that select the appropriate task-specific decoder when the model intends to complete a downstream task. (2) Super-Link Queries. For each decoder, we define the super-link queries as a fixed set of embeddings. The super-link queries are automatically appended after the input embedding of the routing token, and their corresponding hidden states serve as the conditions for the decoders.
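
The sketch below illustrates this mechanism. The routing-token strings follow the text above, but the class, method, and parameter names (SuperLink, append_queries, extract_condition, num_queries) are hypothetical and only demonstrate the idea of appending learnable per-decoder queries after a routing token and reading out their hidden states as decoder conditions.

    # Sketch of the super link; names are illustrative, not the released code.
    import torch
    import torch.nn as nn

    # Routing tokens added to the LLM vocabulary, mapped to task decoders.
    ROUTING_TOKENS = {"[DET]": "det", "[SEG]": "seg", "[GEN]": "gen"}

    class SuperLink(nn.Module):
        def __init__(self, hidden_dim, num_queries, decoder_names):
            super().__init__()
            # A fixed set of learnable super-link queries for each decoder.
            self.queries = nn.ParameterDict({
                name: nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
                for name in decoder_names
            })
            self.num_queries = num_queries

        def append_queries(self, input_embeds, decoder_name):
            # Append the decoder's super-link queries right after the embedding of
            # the emitted routing token (assumed here to be the last position).
            q = self.queries[decoder_name].unsqueeze(0).expand(input_embeds.size(0), -1, -1)
            return torch.cat([input_embeds, q], dim=1)

        def extract_condition(self, hidden_states):
            # The hidden states at the query positions condition the task decoder.
            return hidden_states[:, -self.num_queries:, :]

In this reading, when the LLM emits a routing token such as [DET], the corresponding queries are appended to the input embeddings, the LLM produces hidden states for them, and those states are passed to the detection decoder as its condition. Because the queries are ordinary embeddings, gradients from every decoder flow back into the shared LLM, which is what allows end-to-end multi-task training without decoder-specific interfaces.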

Architecture diagram.

Qualitative Results

Multimodal Dialogue


Visual Perception

Object Detection and Segmentation

Object Detection on Multiple Domains

Visual Grounding

Pose Estimation

Grounded Caption

Visual Generation

Image Generation

Image Editing

In-Context Ability

In-Context Visual Recognition

In-Context Captioning

In-Context Segmentation

In-Context Regional Perception

BibTeX


        @article{wu2024visionllmv2,
          title={VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks},
          author={Wu, Jiannan and Zhong, Muyan and Xing, Sen and Lai, Zeqiang and Liu, Zhaoyang and Chen, Zhe and Wang, Wenhai and Zhu, Xizhou and
          Lu, Lewei and Lu, Tong and Luo, Ping and Qiao, Yu and Dai, Jifeng},
          journal={arXiv preprint arXiv:2406.08394},
          year={2024}
        }