
Alibaba Cloud rolls out open-source vision language model

Alibaba Cloud, a cloud computing company, has released its latest large vision language model (LVLM), Qwen-VL, as open source for developers and other interested users.

In addition to Qwen-VL, Alibaba Cloud has introduced Qwen-VL-Chat, a version fine-tuned for conversation. Both models are designed to accurately recognize images, text, and bounding boxes in prompts, enabling multi-round question answering in both English and Chinese.

Qwen-VL is the multimodal version of Qwen-7B, the 7-billion-parameter model in Alibaba Cloud’s Tongyi Qianwen large language model family, which is also available as open source on ModelScope. Qwen-VL understands both image inputs and text prompts in English and Chinese, so it can perform tasks such as answering open-ended questions about images and generating image captions.


Qwen-VL-Chat is tailored for more complex interactions, such as comparing multiple image inputs and engaging in multi-round question answering. Using alignment techniques, this AI assistant offers a range of creative capabilities, including writing poetry and stories based on input images, summarizing the content of multiple pictures, and solving mathematical questions shown in images.

The Qwen-VL model was pre-trained on image and text datasets. Whereas other open-source large vision language models process images at a resolution of 224×224, Qwen-VL can handle image input at 448×448, which improves image recognition and comprehension.
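
To put the resolution difference in concrete terms, the short sketch below compares the two input sizes. The pixel counts follow directly from the resolutions quoted above. The patch size of 14 is an assumption made only for illustration (the article does not state it), and `image_stats` is a hypothetical helper, not part of any Qwen-VL code.

```python
def image_stats(side: int, patch: int = 14) -> dict:
    """Return pixel and patch counts for a square image.

    `patch` is an assumed patch size for a vision transformer encoder;
    it is an illustrative value, not a figure taken from the article.
    """
    if side % patch != 0:
        raise ValueError("side must be a multiple of the patch size")
    per_side = side // patch
    return {"pixels": side * side, "patches": per_side * per_side}


low = image_stats(224)   # resolution cited for other open-source models
high = image_stats(448)  # resolution cited for Qwen-VL

# Doubling each side quadruples both the pixel count and the patch count.
print(low, high, high["pixels"] / low["pixels"])
```

Under these assumptions, doubling the side length gives the model four times as many pixels to work with, which is consistent with the claim that finer details such as small text become easier to recognize.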

Democratization of AI

As part of its commitment to democratizing AI technologies, Alibaba Cloud has shared the model’s code, weights, and documentation with academics, researchers, and commercial institutions worldwide. This contribution to the open-source community is accessible through Alibaba’s AI model community ModelScope and the collaborative AI platform Hugging Face. Companies with over 100 million monthly active users can request a license from Alibaba Cloud for commercial use.

“The introduction of these models, with their ability to extract meaning and information from images, holds the potential to revolutionize the interaction with visual content,” Alibaba Cloud said. 

For instance, with their image comprehension and question-answering capabilities, the models could in the future help visually impaired people get information while shopping online.

Qwen-VL-Chat has also achieved leading results in both Chinese and English for text-image dialogue and for alignment with humans, according to Alibaba Cloud’s own benchmark test. The test involved over 300 images, 800 questions, and 27 categories.