Provenance & licensing
Every record traces back to a physical book we acquired and keep. Clear lineage, clear rights, terms that fit how you want to buy.
Where it comes from
We buy physical books in volume and keep the originals — we never discard the source.
Each shipped record traces back to a specific physical copy and its EAN. Because we retain the originals, any title can be re-checked against its source, and re-OCR'd later with a better engine without re-acquiring anything. Lineage is concrete, not asserted.
How you license it
Non-exclusive by default: you license the same dataset as other customers, and it is improved and re-OCR'd over time. This is the lowest-friction way to start.
Exclusive terms are available by language or segment when they make sense for both sides.
Because we keep the originals, a v2 of any dataset — better engine, richer structure — stays possible.
Verify first
Tell us a sample size and we'll send a list of EANs from our inventory. You check the overlap against your corpus on your side — whatever isn't already there is exactly what we'd ship. No data of yours leaves your side.
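The overlap check can be run entirely on your side with a few lines of code. The sketch below is one possible way to do it, not a prescribed tool: it assumes both the list we send and your corpus index are available as plain EAN-13 strings, and the function names (normalize_ean, is_valid_ean13, split_by_overlap) are hypothetical. It strips hyphens and spaces, drops entries that fail the standard EAN-13 check digit, and splits the offered list into titles you already hold and titles that would be new.

```python
from typing import Iterable, List, Tuple

ASCII_DIGITS = "0123456789"


def normalize_ean(raw: str) -> str:
    """Keep only ASCII digits so '978-0-306-40615-7' equals '9780306406157'."""
    return "".join(ch for ch in raw if ch in ASCII_DIGITS)


def is_valid_ean13(ean: str) -> bool:
    """Check length and the standard EAN-13 check digit (weights 1,3,1,3,...)."""
    if len(ean) != 13 or any(ch not in ASCII_DIGITS for ch in ean):
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(ean[:12]))
    return (10 - total % 10) % 10 == int(ean[12])


def split_by_overlap(
    offered: Iterable[str], corpus: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (already_held, new_to_you) for the offered EANs.

    Assumption: both inputs are iterables of EAN strings; how you export
    them from your own corpus index is up to you.
    """
    offered_set = {normalize_ean(e) for e in offered}
    corpus_set = {normalize_ean(e) for e in corpus}
    valid_offered = {e for e in offered_set if is_valid_ean13(e)}
    already_held = sorted(valid_offered & corpus_set)
    new_to_you = sorted(valid_offered - corpus_set)
    return already_held, new_to_you


if __name__ == "__main__":
    # Illustrative values only; the third entry has a wrong check digit.
    offered = ["978-0-306-40615-7", "9783161484100", "9780306406158"]
    corpus = ["9780306406157"]
    held, new = split_by_overlap(offered, corpus)
    print("already held:", held)
    print("new to you:", new)
```

Because the comparison uses only sets of identifiers held locally, nothing from your corpus needs to be sent anywhere; only the count or list of new titles has to be shared if you choose to.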
See also book data for pretraining and the multilingual corpus.
FAQ
Every shipped record traces back to a specific physical book we acquired and keep, with its EAN. Lineage is concrete, not asserted.
Yes. We retain the physical originals, so any title can be re-checked against its source or re-OCR'd later with a better engine — without re-acquiring anything.
Non-exclusive by default — the lowest-friction way to start. Exclusive terms are available by language or segment where they make sense for both sides.
Because we keep the originals, a v2 of any dataset — better engine, richer structure — stays possible.
Tell us how you want to license, and we'll send a sample of EANs to verify against your corpus.
Talk to us