Book data for LLM pretraining

Paper-only books, OCR'd and structured into pretraining-ready bundles. Net-new tokens — text that has never entered a digital corpus, not the data every model already trained on.

books → structured, pretraining-ready data

The case

Why book data

The open web is picked over and deduplicated to exhaustion. The marginal token that still moves a model is one it has never seen.

Long-form, edited, professionally written prose is the densest such source — and the part still trapped on paper is the part no crawler has reached. We supply exactly that: books we own physically, turned into clean text and structure, verifiable title by title against our inventory.

The bundle

What you get

Every book is a self-contained bundle — machine-readable end to end, consistent title to title.

`book.json`

Metadata and parsed structure: chapters, sections, figures, footnotes.

`pages.jsonl`

Per-page text, quality-scored, with language detected per page.

Cover images

Front, back, spine, and inside flaps where present.
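
As a rough illustration of how a bundle might be read on the receiving side, here is a minimal Python sketch. It assumes `book.json` and `pages.jsonl` sit together in one directory per title, which is an assumption about layout rather than a documented guarantee, and `load_bundle` is a hypothetical helper name. It relies only on `book.json` holding a single JSON object and `pages.jsonl` holding one JSON object per line.

```python
import json
from pathlib import Path


def load_bundle(bundle_dir):
    """Read one title's bundle: book.json plus every record in pages.jsonl.

    Hypothetical helper. Assumes both files sit directly inside bundle_dir.
    """
    root = Path(bundle_dir)
    # book.json: a single JSON object (metadata and parsed structure).
    book = json.loads((root / "book.json").read_text(encoding="utf-8"))
    pages = []
    # pages.jsonl: one JSON object per line, one line per page.
    with (root / "pages.jsonl").open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                pages.append(json.loads(line))
    return book, pages
```

Because every title has the same shape, the same function should apply to each bundle directory without per-title branching.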

Throughput

Scale & format

1,000+ books per day at target throughput
JSON + JSONL, the same shape every title — drops into a pretraining pipeline without bespoke parsing
Any language or script; per-page quality scores ship with the text

For lineage and licensing, see “provenance & licensing”; for non-English coverage, see “the multilingual corpus”.

FAQ

Common questions

What makes this data “net-new”?

It's OCR'd from paper-only books that have never entered a digital corpus — not in Common Crawl, archive.org, Google Books, or LibGen. That makes it text no model has trained on, verifiable title by title.

What format do you deliver?

Each title is a self-contained bundle: book.json (metadata and parsed structure) and pages.jsonl (per-page, quality-scored text), plus cover images — the same shape every title, so it drops into a pretraining pipeline without bespoke parsing.

How is text quality handled?

Every page is quality-scored and the score ships with the text, so you can filter on ingestion.
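
Filtering on ingestion could look something like the sketch below. The field names `quality` and `text`, the 0-to-1 scale with higher meaning better, and the 0.8 cutoff are all assumptions made for illustration; the actual keys and score range are defined by the delivered `pages.jsonl`.

```python
import json


def filter_pages(lines, min_quality=0.8, score_key="quality"):
    """Yield parsed page records whose score is at least min_quality.

    score_key and min_quality are illustrative assumptions, not the
    documented schema.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        page = json.loads(line)
        score = page.get(score_key)
        # Pages with a missing score are dropped rather than guessed at.
        if score is not None and score >= min_quality:
            yield page


# Two made-up records standing in for lines of pages.jsonl.
sample = [
    '{"text": "clean page", "quality": 0.93}',
    '{"text": "sm3ared pa9e", "quality": 0.41}',
]
kept = list(filter_pages(sample))
```

Since the function accepts any iterable of lines, an open file handle on `pages.jsonl` can be passed in directly and pages are filtered as they stream.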

Can we check novelty before committing?

Yes. We send a sample of EANs and you dedup against your own corpus locally — nothing of yours ever leaves your side.
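
One way the local check might be run is sketched below, assuming your corpus metadata carries ISBNs or EANs for the books it already contains. The function names are hypothetical. The EAN-13 check-digit rule and the ISBN-10 to EAN-13 conversion (prefix 978, recompute the check digit) are the standard ones; everything else is illustrative.

```python
def ean13_check_digit(first12):
    """Standard EAN-13 check digit for a 12-digit string."""
    # Weights alternate 1, 3, 1, 3, ... from the leftmost digit.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def normalize_ean(raw):
    """Return a 13-digit EAN string, or None if the input is not usable."""
    s = raw.replace("-", "").replace(" ", "").upper()
    if not s.isascii():
        return None
    if len(s) == 10 and s[:9].isdigit():
        # ISBN-10 -> EAN-13: prefix 978, drop the old check digit, recompute.
        body = "978" + s[:9]
        return body + ean13_check_digit(body)
    if len(s) == 13 and s.isdigit() and ean13_check_digit(s[:12]) == s[12]:
        return s
    return None


def split_sample(sample_eans, own_eans):
    """Return (net_new, overlap) as sorted lists of normalized EANs."""
    own = {e for e in map(normalize_ean, own_eans) if e}
    sample = {e for e in map(normalize_ean, sample_eans) if e}
    return sorted(sample - own), sorted(sample & own)
```

Calling `split_sample(sample_list, your_identifiers)` runs entirely on your side; only the EAN sample crosses the boundary, and it travels inbound.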

See what's net-new to your corpus

Send a sample size and we'll return a list of EANs to dedup against your own corpus.
