Date

2 years ago

Size

9.02 GB

Tags

This dataset is the incremental pre-training data of the Firefly-LLaMA2-Chinese project. It totals about 22 GB of text and consists mainly of open-source datasets such as CLUE, ThucNews, CNews, COIG, and Wikipedia, along with ancient poems, prose, classical Chinese texts, and other material collected by the research team. The data distribution is shown in the figure below.

firefly-pretrain-dataset.torrent

Seeding: 1 | Downloading: 0 | Completed: 167 | Total Downloads: 261

firefly-pretrain-dataset/
- README.md
  1.04 KB
- README.txt
  2.09 KB

This dataset is contributed by community users and is intended for educational and informational purposes only. If any content involves copyright infringement, please contact us at [email protected] for prompt review and removal.

Firefly Chinese LLaMA2 Incremental Pre-training Dataset
