ChineseWebText: Chinese Web Text Dataset
ChineseWebText is presented as the latest and largest Chinese dataset, containing 1.42 TB of data. Each text is assigned a quality score, so large language model researchers can select data using a quality threshold of their own choosing. A cleaner subset, containing 600 GB of Chinese text with quality scores above 90%, is also released. This directory contains the ChineseWebText dataset and the EvalWeb toolchain for processing CommonCrawl data.
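Selecting data by quality score could look roughly like the sketch below. It assumes the data is stored as JSON Lines with one record per line, that the score lives in a field named `score`, and that scores fall in the range 0 to 1 (so "above 90%" becomes `0.9`). None of these details are confirmed by the description above, and `filter_by_quality` is a hypothetical helper, not part of EvalWeb.

```python
import json
from typing import Iterable, Iterator


def filter_by_quality(
    lines: Iterable[str],
    threshold: float = 0.9,      # assumed 0-1 scale; 0.9 stands in for "90%"
    score_key: str = "score",    # assumed field name, check the actual files
) -> Iterator[dict]:
    """Yield records whose quality score is strictly above the threshold."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        score = record.get(score_key)
        # Records without a numeric score are dropped rather than guessed at.
        if isinstance(score, (int, float)) and score > threshold:
            yield record


# Made-up sample records, for illustration only.
sample = [
    '{"text": "示例文本一", "score": 0.95}',
    '{"text": "示例文本二", "score": 0.42}',
    '{"text": "示例文本三"}',
]
kept = list(filter_by_quality(sample))
```

Because the function takes any iterable of lines, an open file handle can be passed in directly, which avoids loading a multi-gigabyte shard into memory at once.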