Provenance & licensing

Where the data comes from — and how you license it.

Every record traces back to a physical book we acquired and keep. Clear lineage, clear rights, terms that fit how you want to buy.

physical originals, kept — lineage you can re-check

Where it comes from

Provenance

We buy physical books in volume and keep the originals — we never discard the source.

Each shipped record traces back to a specific physical copy and its EAN. Because we retain the originals, any title can be re-checked against its source, and re-OCR'd later with a better engine without re-acquiring anything. Lineage is concrete, not asserted.

How you license it

Rights & licensing

Non-exclusive by default

You license the same dataset other customers do, and it keeps improving as we re-OCR it over time. The lowest-friction way to start.

Exclusivity, where it fits

Exclusive terms are available by language or segment when they make sense for both sides.

Re-OCR optionality

Because we keep the originals, a v2 of any dataset — better engine, richer structure — stays possible.

Verify first

Check the overlap before you commit

Tell us a sample size and we'll send a list of EANs from our inventory. You check the overlap against your corpus on your side; whatever isn't already there is exactly what we'd ship. None of your data leaves your side.

  1. You tell us a sample size
  2. We send EANs from our inventory
  3. You dedup against your corpus, locally
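Step 3 can be as small as a set lookup. Here is a minimal sketch in Python, assuming you already hold the EANs for your corpus as plain strings; the function names and the example EANs are illustrative and not part of any format we ship.

```python
from typing import Iterable, List, Tuple


def normalize_ean(raw: str) -> str:
    """Keep digits only, so '978-3-16-148410-0' matches '9783161484100'."""
    return "".join(ch for ch in raw if ch.isdigit())


def split_by_overlap(
    sample_eans: Iterable[str], corpus_eans: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split a sample of EANs into (already_have, new) relative to a corpus.

    Everything runs locally: only the EAN list we sent is read,
    and nothing from the corpus is transmitted anywhere.
    """
    corpus = {normalize_ean(e) for e in corpus_eans}
    already_have: List[str] = []
    new: List[str] = []
    seen = set()
    for raw in sample_eans:
        ean = normalize_ean(raw)
        if not ean or ean in seen:
            continue  # skip blank lines and repeated EANs
        seen.add(ean)
        (already_have if ean in corpus else new).append(ean)
    return already_have, new


# Hypothetical inputs: a sample we sent, and the EANs you already hold.
sample = ["9783161484100", "978-0-306-40615-7", "9780306406157", "9781861972712"]
corpus = ["978-3-16-148410-0"]

already_have, new = split_by_overlap(sample, corpus)
overlap_rate = len(already_have) / (len(already_have) + len(new))
print(f"overlap: {overlap_rate:.0%}, new titles: {len(new)}")
```

The `new` list is what a shipment would add to your corpus. If your corpus is keyed by something other than EAN (ISBN-10, internal IDs), you would need to map those to EANs first, which this sketch does not attempt.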

See also book data for pretraining and the multilingual corpus.

FAQ

Common questions

Where does the data come from?

Every shipped record traces back to a specific physical book we acquired and keep, with its EAN. Lineage is concrete, not asserted.

Do you keep the original books?

Yes. We retain the physical originals, so any title can be re-checked against its source or re-OCR'd later with a better engine — without re-acquiring anything.

What licensing terms are available?

Non-exclusive by default — the lowest-friction way to start. Exclusive terms are available by language or segment where they make sense for both sides.

Can a dataset be improved over time?

Because we keep the originals, a v2 of any dataset — better engine, richer structure — stays possible.

Clear data, clear terms

Tell us how you want to license, and we'll send a sample of EANs to verify against your corpus.

Talk to us