We present VisionLLM v2, an end-to-end generalist multimodal large language model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs limited to text output, VisionLLM v2 significantly broadens its application scope. It excels not only at conventional visual question answering (VQA) but also at open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing. To this end, we propose a new information transmission mechanism termed "super link" as a medium to connect the MLLM with task-specific decoders. It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios. In addition, to support this diverse range of tasks, we carefully collected and curated training data from hundreds of public vision and vision-language tasks. In this way, our model can be jointly trained end-to-end on hundreds of vision-language tasks and generalize to them with a single set of shared parameters, driven by different user prompts, achieving performance comparable to task-specific models. We believe VisionLLM v2 will offer a new perspective on the generalization of MLLMs.
Architecture. VisionLLM v2 consists of four parts: (1) an image encoder and a region encoder that encode image-level and region-level information, respectively; (2) a large language model (LLM) that models the multimodal inputs and generates textual responses; (3) a series of task-specific decoders for performing downstream tasks; and (4) a super link that uses routing tokens and super-link queries for efficient and conflict-free information transmission.
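The minimal sketch below illustrates how these four parts could fit together in a single forward pass, assuming PyTorch. All module choices (a flattened-pixel stand-in for the image encoder, a box-based region encoder, the toy Transformer backbone, decoder heads, and shapes) are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

class VisionLLMv2Sketch(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        # (1) image- and region-level encoders
        self.image_encoder = nn.Linear(3 * 224 * 224, d_model)  # stand-in for a ViT backbone
        self.region_encoder = nn.Linear(4, d_model)              # stand-in for a box/mask encoder
        # (2) the LLM backbone, stubbed with a small Transformer
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        # (3) task-specific decoders, keyed by routing token name
        self.decoders = nn.ModuleDict({
            "DET": nn.Linear(d_model, 4),    # e.g. a box head
            "GEN": nn.Linear(d_model, 768),  # e.g. a condition embedding for image generation
        })
        # (4) super link: a fixed set of learnable queries per decoder
        self.super_link_queries = nn.ParameterDict({
            k: nn.Parameter(torch.randn(4, d_model)) for k in self.decoders.keys()})

    def forward(self, image, boxes, text_embeds, task="DET"):
        img_tok = self.image_encoder(image.flatten(1)).unsqueeze(1)       # (B, 1, D)
        reg_tok = self.region_encoder(boxes)                              # (B, R, D)
        queries = self.super_link_queries[task].expand(image.size(0), -1, -1)
        hidden = self.llm(torch.cat([img_tok, reg_tok, text_embeds, queries], dim=1))
        # the hidden states of the appended queries condition the selected decoder
        return self.decoders[task](hidden[:, -queries.size(1):])
```

How the routing token actually selects the decoder at inference time is sketched in the next section.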
Super Link Technique. The super link is proposed to tackle the challenges of selecting the appropriate decoder, avoiding task conflicts, and facilitating effective information transmission between the LLM and the decoders. It comprises two parts: (1) Routing Tokens. We add routing tokens, e.g., [DET], [SEG], [GEN], as special tokens to the LLM vocabulary. They act as triggers that select the appropriate task-specific decoder when the model intends to complete a downstream task. (2) Super-Link Queries. For each decoder, we define the super-link queries as a fixed set of learnable embeddings. The super-link queries are automatically appended after the input embedding of the routing token, and their corresponding hidden states serve as the conditions for the decoders.
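A minimal sketch of this mechanism at inference time follows, assuming PyTorch. The token ids, sizes, and the toy LLM/embedding stand-ins are illustrative assumptions, not the official API: when the generated sequence ends with a routing token, that decoder's super-link queries are appended after it and their hidden states are returned as the decoder condition.

```python
import torch
import torch.nn as nn

D_MODEL, NUM_QUERIES = 256, 4
ROUTING_TOKENS = {"[DET]": 1001, "[SEG]": 1002, "[GEN]": 1003}  # assumed vocabulary ids

# one fixed, learnable set of super-link queries per task decoder
super_link_queries = nn.ParameterDict(
    {name.strip("[]"): nn.Parameter(torch.randn(NUM_QUERIES, D_MODEL)) for name in ROUTING_TOKENS})

# toy stand-ins for the LLM backbone and its token embedding table
llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True), num_layers=2)
embed = nn.Embedding(2048, D_MODEL)

def decoder_condition(input_ids: torch.Tensor):
    """If the sequence ends with a routing token, append that decoder's super-link
    queries after its embedding and return their hidden states as the condition."""
    for name, tok_id in ROUTING_TOKENS.items():
        if input_ids[0, -1].item() == tok_id:
            queries = super_link_queries[name.strip("[]")].unsqueeze(0)  # (1, Q, D)
            seq = torch.cat([embed(input_ids), queries], dim=1)          # queries appended after the routing token
            hidden = llm(seq)
            return name, hidden[:, -NUM_QUERIES:]                        # condition for the matching decoder
    return None, None  # plain text response, no decoder is triggered

# usage: a response ending with [DET] routes to the detection decoder
ids = torch.tensor([[5, 17, 42, ROUTING_TOKENS["[DET]"]]])
task, cond = decoder_condition(ids)
print(task, cond.shape)  # [DET] torch.Size([1, 4, 256])
```

Because the queries are inserted at the embedding level, gradients from each task decoder can flow back through its condition into the LLM, which is the end-to-end training path the super link is meant to provide.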
@article{wu2024visionllmv2,
  title={VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks},
  author={Wu, Jiannan and Zhong, Muyan and Xing, Sen and Lai, Zeqiang and Liu, Zhaoyang and Chen, Zhe and Wang, Wenhai and Zhu, Xizhou and Lu, Lewei and Lu, Tong and Luo, Ping and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2406.08394},
  year={2024}
}