Proof-Pile-2 Mathematical Dataset

Proof-Pile-2 is a 55-billion-token dataset of mathematical and scientific documents. It blends scientific papers, math-related web content, and mathematical code, with content current as of April 2023 (except for the Lean proof-steps subset). It was created to train the Llemma 7B and Llemma 34B models.
It consists of three subsets:
arxiv (29B tokens): RedPajama's ArXiv subset.
open-web-math (15B tokens): OpenWebMath, a dataset of high-quality mathematical text from the Internet.
algebraic-stack (11B tokens): A new dataset of mathematical code covering numerical computing, computer algebra, and formal mathematics.
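As a quick sanity check, the three subset sizes listed above add up to the stated 55B-token total. The following is a minimal sketch that computes each subset's share of the mixture; the variable names are illustrative only, and the sizes are the rounded figures from this page rather than exact token counts.

```python
# Subset sizes in billions of tokens, taken from the rounded figures above.
subset_tokens_b = {
    "arxiv": 29,
    "open-web-math": 15,
    "algebraic-stack": 11,
}

# Total size of the dataset in billions of tokens.
total_b = sum(subset_tokens_b.values())

# Each subset's share of the full dataset, in percent.
shares = {name: 100 * size / total_b for name, size in subset_tokens_b.items()}

for name, pct in shares.items():
    print(f"{name}: {subset_tokens_b[name]}B tokens ({pct:.1f}%)")
print(f"total: {total_b}B tokens")
```

By these figures, scientific papers make up a little over half of the dataset, with web text and code splitting the remainder.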