LCCC Large Clean Chinese Conversational Corpus

Date: 2 years ago
Size: 939.48 MB
Organization: Tsinghua University
Publish URL: github.com
Paper URL: arxiv.org

LCCC (Large-scale Cleaned Chinese Conversation corpus) was released by Tsinghua University and Samsung China Research Institute in 2020.

The dataset comes in two versions: LCCC-base (6.8 million dialogues) and LCCC-large (12 million dialogues). To ensure dialogue quality, the research team designed a strict filtering pipeline that combines a set of rules with a classifier trained on 110K manually annotated dialogue pairs. The noise it removes includes profanity, special characters, emoticons, ungrammatical sentences, and responses that are irrelevant to their context. The cleaned dataset and pre-trained model are intended to support research on short-text dialogue modeling.
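As a rough illustration of the rule-based stage of such a filter, the following is a minimal sketch, not the authors' pipeline. The blocklist, the regular expressions, the `max_special_ratio` threshold, and the function names `is_clean` and `filter_dialogues` are all assumptions made for this example. The trained classifier stage is not shown, and dropping a whole dialogue when one utterance fails is a simplification.

```python
import re

# Hypothetical blocklist; the word list actually used for LCCC is not given here.
BLOCKLIST = {"badword"}

# Assumed emoticon patterns: bracketed codes such as "[doge]" and a block of emoji code points.
EMOTICON_RE = re.compile(r"\[[^\[\]]{1,8}\]|[\U0001F300-\U0001FAFF]")

# Characters treated as ordinary text: CJK ideographs, ASCII letters and digits,
# common Chinese and ASCII punctuation, and whitespace.
NORMAL_RE = re.compile(r"[\u4e00-\u9fffA-Za-z0-9，。！？、；：,.!?;:\s]")


def is_clean(utterance, max_special_ratio=0.3):
    """Return True if a single utterance passes the illustrative rules."""
    text = utterance.strip()
    if not text:
        return False
    # Rule 1: reject utterances containing a blocklisted word.
    if any(word in text for word in BLOCKLIST):
        return False
    # Rule 2: reject utterances containing emoticons.
    if EMOTICON_RE.search(text):
        return False
    # Rule 3: reject utterances dominated by special characters.
    special = len(text) - len(NORMAL_RE.findall(text))
    return special / len(text) <= max_special_ratio


def filter_dialogues(dialogues):
    """Keep dialogues with at least two turns in which every utterance is clean."""
    return [d for d in dialogues if len(d) >= 2 and all(is_clean(u) for u in d)]


if __name__ == "__main__":
    sample = [
        ["你好", "你好，很高兴认识你"],
        ["哈哈[doge]", "好的"],
        ["@@@###$$$", "嗯"],
        ["只有一句"],
    ]
    for dialogue in filter_dialogues(sample):
        print(dialogue)
```

In a pipeline like the one described, the dialogues that survive these rules would then be scored by the trained classifier to remove ungrammatical or context-irrelevant responses.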

LCCC.torrent
Seeding: 2 | Downloading: 0 | Completed: 289 | Total Downloads: 525
  • LCCC/
    • README.md (1.38 KB)
    • README.txt (2.76 KB)
    • data/
      • lccc.zip (939.48 MB)
