Institutional Books 1.0 Book Dataset
Institutional Books 1.0 is a growing corpus of public domain books to be released by Harvard University in 2025. The accompanying paper is "Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability".
The dataset consists of 983,004 public domain books in 254 languages, published mainly in the 19th and 20th centuries. It contains 242 billion tokens across 386 million pages of text and is available in both original and post-processed OCR export formats.
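To give a sense of scale, the figures above can be turned into rough per-book and per-page averages. The following is a minimal sketch that uses only the numbers quoted in this description; the token and page counts are rounded, so the results are approximate, and the helper name `per_unit` is purely illustrative.

```python
# Figures quoted in the dataset description (token and page counts are rounded).
BOOKS = 983_004
TOKENS = 242_000_000_000
PAGES = 386_000_000


def per_unit(total: int, units: int) -> float:
    """Return the average amount of `total` per one of `units`."""
    return total / units


tokens_per_book = per_unit(TOKENS, BOOKS)
pages_per_book = per_unit(PAGES, BOOKS)
tokens_per_page = per_unit(TOKENS, PAGES)

print(f"~{tokens_per_book:,.0f} tokens per book")
print(f"~{pages_per_book:,.0f} pages per book")
print(f"~{tokens_per_page:,.0f} tokens per page")
```

Under these rounded inputs, an average book comes to roughly 246,000 tokens and about 390 pages, or somewhere around 630 tokens per page.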