Alibaba DAMO Academy recently launched SeaLLMs, a family of large language models (LLMs) designed specifically for Southeast Asia (SEA). The models, available in 13-billion- and 7-billion-parameter versions, are tailored to the region’s linguistic diversity, supporting languages such as Tagalog, Vietnamese, Indonesian, Thai, Malay, Khmer, Lao, and Burmese.
Developed to resonate with the cultural nuances of each market, SeaLLM-chat models serve as adaptable and culturally sensitive chatbot assistants, aligning with local customs, styles, and legal frameworks. Now open-sourced on Hugging Face, these models are available for both research and commercial use.
“This innovation is set to hasten the democratization of AI (artificial intelligence), empowering communities historically underrepresented in the digital realm,” said Lidong Bing, director of the Language Technology Lab at Alibaba DAMO Academy.
SeaLLMs underwent extensive pre-training on diverse datasets that include SEA languages, giving them a deep understanding of local contexts and communication nuances. Their efficiency in processing non-Latin scripts such as Burmese, Khmer, Lao, and Thai sets SeaLLMs apart, enabling more complex tasks at lower computational cost and with a smaller environmental footprint than comparable models.
Foundational models
“This initiative has the potential to unlock new opportunities for millions who speak languages beyond English and Chinese,” said Luu Anh Tuan, assistant professor in the School of Computer Science and Engineering (SCSE) at Nanyang Technological University, a long-term partner of Alibaba in multi-language AI study. “Alibaba’s efforts in championing inclusive technology have now reached a milestone with SeaLLMs’ launch.”
The SeaLLM-13B model is claimed to outperform comparable open-source models across linguistic, knowledge-related, and safety tasks. Evaluated on benchmarks such as M3Exam and FLORES, SeaLLMs reportedly show superior understanding of subjects ranging from science to economics in SEA languages, and excel at machine translation for low-resource languages.
This groundwork sets the stage for the SeaLLM-chat models, which leverage advanced fine-tuning techniques and a custom multilingual dataset. Chatbot assistants built on these models not only comprehend but also respect and accurately portray the cultural nuances of the regional languages, according to DAMO Academy.
One standout technical advantage of SeaLLMs lies in their efficiency, especially with non-Latin languages.
“They can interpret and process text up to 9 times longer (or fewer tokens for the same text length) compared to other models like ChatGPT when dealing with languages such as Burmese, Khmer, Lao, and Thai,” the Academy said. This enhanced capacity translates into executing more complex tasks, diminishing operational and computational costs, and contributing to a reduced environmental footprint.
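Part of the reason non-Latin scripts are costly for many tokenizers is simple encoding arithmetic. As a minimal, self-contained illustration (this is not SeaLLMs’ actual tokenization scheme), the snippet below shows that each Thai character occupies three bytes in UTF-8, so a byte-level tokenizer with few Thai-specific vocabulary merges starts from roughly three times as many units per character as it does for ASCII English:

```python
# Hypothetical illustration: why non-Latin scripts inflate token counts
# for byte-level tokenizers. Example strings are arbitrary.

def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character in `text`."""
    return len(text.encode("utf-8")) / len(text)

english = "Hello"   # 5 ASCII characters -> 5 bytes
thai = "สวัสดี"       # 6 Thai code points -> 18 bytes (3 bytes each)

print(bytes_per_char(english))  # 1.0
print(bytes_per_char(thai))     # 3.0
```

A tokenizer whose vocabulary includes dedicated merges for Thai, Khmer, Lao, and Burmese can collapse these byte runs into far fewer tokens, which is the kind of gain the Academy’s “up to 9 times” figure appears to describe.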