tsuzumi: An Advanced and Sovereign Japanese LLM
Abstract
We have developed tsuzumi, a series of Japanese large language models built entirely from scratch. The latest model, tsuzumi 2, has 28.6 billion parameters and is trained on more than 10T tokens of a carefully curated multilingual corpus with a strong emphasis on high-quality Japanese data. It demonstrates robust performance in instruction following, in reasoning that can be switched on or off as needed, and in domain-specific tasks. The tokenizer is designed to reflect the structure of Japanese grammar and vocabulary, which substantially improves compression efficiency for Japanese while preserving strong performance in English and other languages. Because the model can be deployed on premises to handle highly sensitive user data securely, it is well suited to enterprises and public organizations.
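The compression-efficiency claim above can be made concrete with a simple measurement: characters of source text covered per token. The sketch below is a minimal illustration of such a measurement, not the paper's evaluation protocol; the tokenizer identifiers are placeholders, since no public Hugging Face id for tsuzumi is given in the source.

```python
# A minimal sketch of measuring tokenizer compression efficiency for
# Japanese text. The tokenizer ids below are hypothetical placeholders;
# substitute real Hugging Face tokenizer ids to run a comparison.
from transformers import AutoTokenizer

def chars_per_token(tokenizer, text: str) -> float:
    """Higher is better: more source characters covered per token."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return len(text) / len(ids)

japanese_sample = "日本語の文法と語彙の構造を反映した設計により、圧縮効率が向上します。"

# Compare two candidate tokenizers on the same Japanese sample.
for name in ["tokenizer-id-a", "tokenizer-id-b"]:  # placeholders
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {chars_per_token(tok, japanese_sample):.2f} chars/token")
```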