Hyperion — Alexandria's data line

Books that were never digitized.

We OCR and structure our own paper-only books and ship them as pretraining-ready bundles — text that has never entered a digital corpus. Net-new tokens, not the data every model already trained on.

books → data · text that was never digitized

The pipeline

How it works

  1. Acquire

    We buy physical books in volume.

  2. Capture

    Every page and cover, photographed.

  3. OCR

    Text extracted in any language or script.

  4. Structure & score

    Chapters and footnotes parsed; every page quality-scored.

  5. Ship

    Delivered as a per-book bundle.

The bundle

What we ship

Every book is a self-contained bundle — machine-readable end to end, consistent title to title.

  • book.json: metadata and parsed structure (chapters, sections, figures, footnotes).
  • pages.jsonl: per-page text, quality-scored.
  • Cover images: front, back, spine, and inside flaps where present.

Our OCR and structuring handle any language and script — not just English. The bundle shape stays identical title to title.

book.json
{
  "ean": "9782253096337",
  "title": "Les Misérables",
  "author": "Victor Hugo",
  "language": "fr",
  "pages": 1488,
  "structure": { "volumes": 5, "chapters": 365, "footnotes": 211 },
  "quality": { "mean_page_score": 0.98, "flagged": 3 }
}
pages.jsonl
{"page": 31, "kind": "body", "text": "En 1815, M. Myriel était évêque de Digne…", "score": 0.99}
{"page": 32, "kind": "body", "text": "Quoique ce détail ne touche au fond…", "score": 0.98}
Illustrative sample — one book, structured.
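
To make the "consistent shape" point concrete, here is a minimal sketch of reading a bundle in Python. It assumes the field names shown in the illustrative sample (`ean`, `page`, `kind`, `text`, `score`); the function names `load_bundle` and `body_text` and the `min_score` threshold are hypothetical, chosen only for this example, and a real bundle may carry more fields.

```python
import json

# Records copied from the illustrative sample; real bundles may differ.
BOOK_JSON = """{
  "ean": "9782253096337",
  "title": "Les Misérables",
  "author": "Victor Hugo",
  "language": "fr",
  "pages": 1488,
  "structure": { "volumes": 5, "chapters": 365, "footnotes": 211 },
  "quality": { "mean_page_score": 0.98, "flagged": 3 }
}"""

PAGES_JSONL = """{"page": 31, "kind": "body", "text": "En 1815, M. Myriel était évêque de Digne…", "score": 0.99}
{"page": 32, "kind": "body", "text": "Quoique ce détail ne touche au fond…", "score": 0.98}"""


def load_bundle(book_text, pages_text):
    """Parse book.json as one object and pages.jsonl as one object per line."""
    book = json.loads(book_text)
    pages = [json.loads(line) for line in pages_text.splitlines() if line.strip()]
    return book, pages


def body_text(pages, min_score=0.985):
    """Join body pages scoring at least min_score, in page order.

    min_score is a hypothetical cut-off for this sketch, not a recommended value.
    """
    kept = [p for p in pages if p["kind"] == "body" and p["score"] >= min_score]
    kept.sort(key=lambda p: p["page"])
    return "\n".join(p["text"] for p in kept)


book, pages = load_bundle(BOOK_JSON, PAGES_JSONL)
text = body_text(pages)  # keeps page 31 only at the default threshold
```

Because the shape is the same for every title, the same two functions would apply to any bundle; only the score threshold is a per-project choice.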

The difference

Why we're different

Tokens no one else has

The open web is picked over. For an LLM, the marginal value now lies in text that was never digitized, and much of our inventory is paper-only: absent from any digital corpus, including Common Crawl, archive.org, Google Books, and LibGen. Net-new to every model, and verifiable per title.

Provenance & licensing →

Any language, not English-first

Most book-data sources are English-tuned. Ours isn't. The inventory is multilingual and the pipeline processes whatever language you need — one supplier instead of stitching several together.

Multilingual corpus →

Structured, not just scanned

You get parseable structure, not a text dump — chapter-aware, quality-scored per page. The shape is consistent book to book, so it drops into a pretraining pipeline without bespoke parsing per source.

Book data for pretraining →

Overlap check

Verify the net-new before you commit

Tell us a sample size and we'll send a list of EANs from our inventory. You check the overlap against your corpus on your side — whatever isn't already there is exactly what we'd ship. No data of yours ever leaves your side.

dedup protocol local-only
  1. You tell us a sample size

    For example, 25,000 EANs across the languages you care about.

  2. We send EANs from our inventory

    Identifiers only; no text leaves us until you ask for it.

  3. You dedup against your corpus, locally

    The non-overlap is your net-new set. That's what we ship.
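
The local step can be as small as a set difference. The sketch below is one way to run it in Python, assuming you hold your corpus identifiers as a list of 13-digit EAN strings; `is_valid_ean13` and `net_new` are hypothetical helper names for this example, not part of any tooling we ship. The checksum follows the standard EAN-13 rule (weights 1 and 3 alternating over the first twelve digits).

```python
def is_valid_ean13(code):
    """Return True if code is 13 ASCII digits with a correct EAN-13 check digit."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(c) for c in code]
    # Weights alternate 1, 3, 1, 3, ... over the first twelve digits.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


def net_new(supplier_eans, corpus_eans):
    """Return valid supplier EANs missing from the corpus, in supplier order."""
    have = {e.strip() for e in corpus_eans}
    seen = set()
    result = []
    for ean in supplier_eans:
        ean = ean.strip()
        if is_valid_ean13(ean) and ean not in have and ean not in seen:
            seen.add(ean)
            result.append(ean)
    return result


# Placeholder inputs: the first EAN is the one from the sample bundle,
# the second stands in for a title your corpus already holds.
supplier_list = ["9782253096337", "9780306406157"]
your_corpus = ["9780306406157"]
print(net_new(supplier_list, your_corpus))
```

Everything here runs on your side against the list we send. Note that matching on identifiers only finds the same edition; a text-level comparison is a separate step outside this sketch.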

Start here

See what's net-new to your corpus

Send a sample size and we'll return a list of EANs to dedup against your own corpus — nothing of yours leaves your side.

Run the overlap check