Multilingual corpus

Non-English book data, at the source.

A multilingual physical catalog and a pipeline that processes any language or script — one supplier instead of stitching several together.

Any language, any script, at native quality

The gap

Why non-English is scarce

Web crawls skew heavily English. Non-English text, especially long-form books, is comparatively rare, and what does exist is often lower quality.

Native book corpora are the strongest non-English pretraining source and the least digitized. Sourcing them physically, in-market, is the reliable way to get them — which is exactly what an EU bookseller with a multilingual catalog is positioned to do.

Languages

Coverage

French & European

Sourced directly from our EU inventory — French and other European languages at native quality.

Chinese, native-reviewed

Chinese titles processed with in-house native-speaker review through our Chinese subsidiary.

Any language

The pipeline OCRs and structures whatever language or script you need — not English-first.

Quality

Per-language quality

  1. Language detected per page, not assumed per book
  2. Every page quality-scored; the score ships with the text
  3. Native review on flagged non-Latin pages before delivery
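The three steps above can be sketched as a per-page record plus a review gate. This is a minimal illustration, not the production pipeline: the `PageRecord` fields, the `REVIEW_THRESHOLD` value, and the rule that "flagged" means "scored below a cutoff" are all assumptions made for the example, and the script check is a simple majority heuristic built on Unicode character names.

```python
from dataclasses import dataclass
import unicodedata

# Hypothetical cutoff: pages scoring below it count as "flagged" in this sketch.
REVIEW_THRESHOLD = 0.8


@dataclass
class PageRecord:
    """One page of one book. All field names are illustrative assumptions."""
    book_ean: str
    page_number: int
    text: str
    language: str         # detected for this page, not inherited from the book
    quality_score: float  # 0.0 to 1.0; delivered alongside the text


def is_non_latin(text: str) -> bool:
    """Return True when most alphabetic characters are outside the Latin script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    latin = sum(1 for ch in letters if "LATIN" in unicodedata.name(ch, ""))
    return latin / len(letters) < 0.5


def needs_native_review(page: PageRecord) -> bool:
    """Route a page to native review if it is flagged and non-Latin."""
    flagged = page.quality_score < REVIEW_THRESHOLD
    return flagged and is_non_latin(page.text)
```

Because language and score live on the page rather than the book, a bilingual edition or a book with a French preface and Chinese body is handled without special cases.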

See also: book data for pretraining, and provenance & licensing.

FAQ

Common questions

Why is non-English book data scarce?

Web crawls skew heavily English, so long-form non-English text is comparatively rare, and what does exist is often lower quality. Native book corpora are the strongest non-English source and the least digitized.

Which languages do you cover?

French and other European languages from our EU inventory, native-reviewed Chinese, and any language or script the pipeline OCRs and structures — not English-first.

Is language detected per book or per page?

Per page. Language is never assumed from the book as a whole, and every page is quality-scored.

How is non-Latin quality ensured?

Flagged non-Latin pages get native review before delivery.

Non-English tokens no one else has

Send a sample size and a target language; we'll return a list of EANs to deduplicate against your corpus.

Run the overlap check
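On your side, the overlap check can be as simple as a set comparison on normalized EANs. The sketch below assumes both lists are plain EAN-13 strings, possibly hyphenated; the function names are hypothetical and the format of the list we return is not specified here.

```python
def normalize_ean(raw: str) -> str:
    """Keep only ASCII digits, dropping hyphens and spaces."""
    return "".join(ch for ch in raw if ch in "0123456789")


def is_valid_ean13(ean: str) -> bool:
    """Check length and the standard EAN-13 check digit (weights 1, 3, 1, 3, ...)."""
    if len(ean) != 13 or any(ch not in "0123456789" for ch in ean):
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(ean[:12]))
    return (10 - total % 10) % 10 == int(ean[12])


def split_by_overlap(candidate_eans, corpus_eans):
    """Split candidates into (new, already_in_corpus); invalid EANs are skipped."""
    corpus = {normalize_ean(e) for e in corpus_eans}
    new, already_in_corpus = [], []
    for raw in candidate_eans:
        ean = normalize_ean(raw)
        if not is_valid_ean13(ean):
            continue
        (already_in_corpus if ean in corpus else new).append(ean)
    return new, already_in_corpus
```

Matching on EAN catches identical editions only; a different edition or translation of the same work carries a different EAN and would need a separate check.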