Book data for pretraining
Paper-only books, OCR'd and structured into pretraining-ready bundles. Net-new tokens — text that has never entered a digital corpus, not the data every model already trained on.
The case
The open web is picked over and deduplicated to exhaustion. The marginal token that still moves a model is one it has never seen.
Long-form, edited, professionally written prose is the densest such source — and the part still trapped on paper is the part no crawler has reached. We supply exactly that: books we own physically, turned into clean text and structure, verifiable title by title against our inventory.
The bundle
Every book is a self-contained bundle — machine-readable end to end, consistent title to title.
book.json
Metadata and parsed structure: chapters, sections, figures, footnotes.
pages.jsonl
Per-page text, quality-scored, with language detected per page.
Cover images
Front, back, spine, and inside flaps where present.
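As a rough illustration of how a bundle like this might be read, the sketch below loads book.json and streams pages.jsonl one line at a time. Only the two file names come from the description above. The bundle directory layout and the function names load_book and iter_pages are assumptions made for the example, and no particular field names are assumed.

```python
import json
from pathlib import Path
from typing import Iterator


def load_book(bundle_dir: str) -> dict:
    """Read book.json (metadata and parsed structure) from a bundle directory.

    Assumption: both files sit directly inside bundle_dir.
    """
    with open(Path(bundle_dir) / "book.json", encoding="utf-8") as f:
        return json.load(f)


def iter_pages(bundle_dir: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of pages.jsonl.

    Streaming line by line keeps memory flat even for very long books.
    """
    with open(Path(bundle_dir) / "pages.jsonl", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because every title is described as having the same shape, the same two functions should work unchanged across the whole corpus. Which keys appear inside each record is something to confirm against a real sample.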
Throughput
More on lineage and licensing in provenance & licensing, and on non-English coverage in the multilingual corpus.
FAQ
It's OCR'd from paper-only books that have never entered a digital corpus — not in Common Crawl, archive.org, Google Books, or LibGen. That makes it text no model has trained on, verifiable title by title.
Each title is a self-contained bundle: book.json (metadata and parsed structure) and pages.jsonl (per-page, quality-scored text), plus cover images. The shape is the same for every title, so it drops into a pretraining pipeline without bespoke parsing.
Every page is quality-scored and the score ships with the text, so you can filter on ingestion.
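Filtering on ingestion could look something like the sketch below. The key name "quality", the 0 to 1 scale, and the 0.8 threshold are all assumptions made for illustration. The actual field name and score range should be taken from a sample bundle.

```python
import json
from typing import Iterable, Iterator


def filter_pages(
    lines: Iterable[str],
    min_quality: float = 0.8,    # hypothetical threshold on an assumed 0-1 scale
    score_key: str = "quality",  # hypothetical key name for the per-page score
) -> Iterator[dict]:
    """Yield only the pages whose score meets the threshold.

    `lines` is any iterable of JSON Lines strings, e.g. an open pages.jsonl file.
    Pages with a missing or non-numeric score are dropped rather than guessed at.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        page = json.loads(line)
        score = page.get(score_key)
        if isinstance(score, (int, float)) and score >= min_quality:
            yield page
```

Passing an open file handle (for example `filter_pages(open("pages.jsonl", encoding="utf-8"))`) keeps the filter streaming, so low-scoring pages never need to be held in memory.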
Yes. We send a sample of EANs and you dedup against your own corpus locally — nothing of yours ever leaves your side.
Send a sample size and we'll return a list of EANs to dedup against your own corpus.
Run the overlap check
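On the receiving side, the local overlap check could be sketched roughly as follows. It assumes the sample arrives as a list of EAN-13 strings and that your corpus can be reduced to a set of EANs. The function names are hypothetical. The checksum rule is the standard EAN-13 one (weights 1 and 3 alternating from the left, total divisible by 10).

```python
from typing import Iterable, Set

_DIGITS = "0123456789"


def normalize_ean(raw: str) -> str:
    """Keep ASCII digits only, dropping hyphens, spaces and other separators."""
    return "".join(ch for ch in raw if ch in _DIGITS)


def is_valid_ean13(ean: str) -> bool:
    """Check length and the standard EAN-13 check digit."""
    if len(ean) != 13 or any(ch not in _DIGITS for ch in ean):
        return False
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(ean))
    return total % 10 == 0


def overlap(sample_eans: Iterable[str], corpus_eans: Iterable[str]) -> Set[str]:
    """Return the valid EANs present in both the sample and the local corpus."""
    sample = {normalize_ean(e) for e in sample_eans}
    corpus = {normalize_ean(e) for e in corpus_eans}
    return {e for e in sample & corpus if is_valid_ean13(e)}
```

Everything runs locally on two sets of identifiers, so no corpus text is involved in the comparison. An empty result for a sample means none of the sampled titles matched by EAN. It does not rule out overlap with editions catalogued under a different identifier.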