Updesh Indic Synthetic Text Dataset
Updesh is an Indian language synthetic text dataset released by Microsoft in 2025 to facilitate post-training of Large Language Models (LLMs) for Indian languages.
The dataset contains 6,800,000 inference entries and 2,100,000 generated entries across 13 languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Odia, Punjabi, Tamil, Telugu, and Urdu.