Technology company Alibaba has released Wan2.2-S2V (Speech-to-Video), an open-source model that creates digital human videos by converting portrait photos into animated avatars.
The tool is part of Alibaba’s Wan2.2 video generation series and can generate videos from a single image and an audio clip. The Wan series has recorded more than 6.9 million downloads on Hugging Face and ModelScope.
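To make the input/output contract concrete, here is a minimal sketch of the single-image-plus-audio workflow. The `WanS2VPipeline`-style wrapper, its arguments, and the `save` method are hypothetical placeholders, not Alibaba's published API; the model cards on Hugging Face and ModelScope document the actual invocation.

```python
# Hypothetical sketch of the speech-to-video workflow: one portrait image
# plus one audio clip in, one animated video out. The pipeline object and
# its keyword arguments are illustrative placeholders, not Alibaba's API.
from pathlib import Path

def generate_avatar_video(pipeline, image_path: str, audio_path: str,
                          prompt: str, out_path: str = "avatar.mp4") -> Path:
    """Drive a speech-to-video pipeline from a single image and an audio clip."""
    video = pipeline(
        image=image_path,   # single portrait, bust, or full-body photo
        audio=audio_path,   # speech or singing that drives the animation
        prompt=prompt,      # optional text guiding actions and environment
    )
    video.save(out_path)    # hypothetical convenience method
    return Path(out_path)
```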
Wan2.2-S2V supports different framing options such as portrait, bust, and full-body views. It can also generate character actions and environmental elements based on prompts. The model is designed for a range of use cases, including storytelling, design, and professional content creation.
With audio-driven animation technology, the model enables avatars to deliver natural dialogue, sing, and perform other expressive movements. It supports multiple characters in a single scene and works with avatars ranging from realistic to cartoon-like styles.
Creators can export videos in 480p and 720p resolutions, making them suitable for both social media and professional presentations. The model combines text-guided global motion control with audio-driven local movements to deliver more natural performances, even in complex scenes.
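As a conceptual illustration of that split, the toy sketch below treats the text prompt as a slow, scene-wide motion signal and the audio as a fast, localized one, then sums them per frame. This is a didactic analogy under assumed representations, not the model's actual architecture.

```python
import numpy as np

def fuse_motion_controls(text_motion: np.ndarray,
                         audio_features: np.ndarray,
                         local_weight: float = 0.5) -> np.ndarray:
    """Toy fusion of text-guided global motion with audio-driven local motion.

    text_motion:    (frames, dims) slow, scene-level trajectory from the prompt
    audio_features: (frames, dims) fast, per-frame cues (e.g., lips, gestures)
    """
    # Global control sets the coarse trajectory; audio adds localized detail.
    return text_motion + local_weight * audio_features

# Example: 48 frames of 8-dimensional motion codes.
rng = np.random.default_rng(0)
global_motion = np.cumsum(rng.normal(0, 0.01, (48, 8)), axis=0)  # smooth drift
local_motion = rng.normal(0, 0.1, (48, 8))                       # fast detail
fused = fuse_motion_controls(global_motion, local_motion)
print(fused.shape)  # (48, 8)
```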
Alibaba said the model uses a frame-processing method that compresses historical frames into a compact representation, reducing computational cost and enabling stable long-video generation. The research team also trained the model on a large audio-visual dataset, using a multi-resolution approach to support vertical, horizontal, and short-form video formats.
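Alibaba has not published the full mechanism in this announcement, but the benefit of compressing history can be shown with a toy sketch: pool a growing window of past frame latents into a fixed-size memory so that per-step conditioning cost stays constant however long the video runs. The function name and the mean-pooling scheme below are assumptions for illustration only.

```python
import numpy as np

def compress_history(frame_latents: np.ndarray, memory_slots: int = 4) -> np.ndarray:
    """Toy stand-in for compressing historical frames into a compact form.

    frame_latents: (n_frames, dim) latents of all frames generated so far.
    Returns a fixed (memory_slots, dim) summary, so conditioning cost does
    not grow with video length. Mean-pooling over equal chunks is purely
    illustrative; the model's actual compression scheme is not detailed here.
    """
    chunks = np.array_split(frame_latents, memory_slots)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])

# Example: 120 generated frames with 16-dim latents compress to 4 slots.
history = np.random.default_rng(1).normal(size=(120, 16))
memory = compress_history(history)
print(memory.shape)  # (4, 16)
```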
Wan2.2-S2V is available on Hugging Face, GitHub, and Alibaba Cloud’s open-source platform, ModelScope. Earlier this year, Alibaba open-sourced the Wan2.1 models in February and the Wan2.2 models in July.