This repository hosts the resources for two related papers on multilingual embeddings that remain robust under OCR noise:
- ACL 2025 (Findings) — Cheap Character Noise for OCR-Robust Multilingual Embeddings
- LREC 2026 — A Recipe for Adapting Multilingual Embedders to OCR-Error Robustness and Historical Texts — extends the denoising training to a wider set of languages and historical newspapers. (link to be updated)
You’ll find fine-tuned models, noisy evaluation and training datasets, and utilities for simulating character-level OCR noise.
- Repository Structure
- Models
- Datasets
- Reproducing the Experiments
- Citation
- Support & Contributing
- About Impresso
- License
```
├── clean_evaluation_datasets/
│   └── ACL/                          # Clean CLSD (WMT19/21) and X-STS files
├── noisy_evaluation_datasets/
│   └── ACL/                          # OCR-degraded CLSD variants (MN, BL/SD, SnP)
├── noisy_finetuning_data/
│   ├── ACL/                          # TED and X-News with random character noise
│   └── LREC/                         # Historical newspaper + MLSum noised data,
│                                     # plus Luxembourgish parallel pairs (JSONL)
├── generate_random_character_noise/  # Script for stochastic character-level noise generation
├── ocr_simulator/                    # Library for inducing realistic OCR noise in texts
├── adapt_model.py                    # Fine-tuning / adaptation procedure
├── evaluate_embedding_model.py       # Evaluation of embedding models
├── sample_training.ipynb             # Example training notebook
├── sample_config.json                # Example training configuration
└── requirements.txt
```
Our OCR-robust models are available on Hugging Face:
| Paper | Model |
|---|---|
| ACL 2025 | impresso-project/OCR-robust-gte-multilingual-base |
| LREC 2026 | impresso-project/OCR-diversely-robust-gte-multilingual-base |
The ACL 2025 model was fine-tuned on TED with random character noise.
Each dataset folder contains its own README with full file-level details (columns, row counts, CER levels, schema). The sections below give a high-level overview.
Located in clean_evaluation_datasets/ACL/:
- CLSD — Cross-Lingual Semantic Discrimination benchmark on WMT19/21 DE–FR news, used for retrieval-style Precision@1 evaluation. See Michail et al., 2025b.
- X-STS — SemEval-2017 Task 1 cross-lingual STS pairs (AR–EN, EN–ES, ES–EN, TR–EN), used as a control task to check that OCR adaptation does not harm clean-text performance. See Cer et al., 2017.
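The retrieval-style Precision@1 used for CLSD can be computed with plain cosine similarity over sentence embeddings. A minimal sketch, assuming aligned source/target rows and raw NumPy arrays (the function name is hypothetical, not the repository's evaluation code):

```python
import numpy as np

def precision_at_1(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Fraction of source sentences whose nearest target embedding
    (by cosine similarity) is the gold-aligned one at the same index."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                 # (n_src, n_tgt) cosine similarities
    predictions = sims.argmax(axis=1)  # nearest target per source sentence
    gold = np.arange(len(src))
    return float((predictions == gold).mean())
```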
Located in noisy_evaluation_datasets/ACL/:
- Noisy CLSD — OCR-degraded CLSD variants at three noise levels: Minimal Noise (MN), Blackletter/Scanned Distorted (BL/SD), and Salt-and-Pepper (SnP). Generated by printing, visually degrading, and re-OCR-ing the clean CLSD texts with Tesseract.
- HISTLUX bitext mining — Historical Luxembourgish articles (1840–1950) sentence-aligned to modern DE/EN/FR translations, evaluated as bitext mining. See Michail et al., 2025c.
Note: The six HISTLUX bitext mining JSONL files (350–425 MB each) exceed GitHub’s 100 MB limit and are not included in the repository. Download them from Google Drive and place them in `noisy_evaluation_datasets/ACL/`.
Located in noisy_finetuning_data/, split by paper:
- ACL/ — TED2020 and X-News parallel DE–FR texts with random character noise at CER ≈ 4%, 5%, 10%, 15%. Includes a 10K stratified sample per corpus and a full concatenated TED corpus.
- LREC/ — Resources for both stages of the LREC 2026 recipe:
  - Luxembourgish ↔ DE/EN/FR parallel sentence pairs from historical newspapers (Stage A)
  - DE and FR historical newspaper articles with random character noise, plus a noised MLSum sample (Stage B)
The Luxembourgish parallel data originates from impresso/histlux_emb (see prepared_training_sentences/) and is also available as HistLuxAlign on Hugging Face.
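The CER levels above follow the usual definition: character-level edit distance divided by reference length. A small self-contained sketch of that computation (an illustrative implementation, not taken from the repository):

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = Levenshtein distance / number of reference characters."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution / match
        prev = curr
    return prev[n] / m if m else 0.0
```

A CER target of, say, 10% means the noise process is tuned so that this ratio comes out near 0.10 on average over the corpus.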
- Adaptation / fine-tuning: `adapt_model.py`
- Evaluation: `evaluate_embedding_model.py`
Usage instructions are documented inline in each script.
If you use these resources, please cite:
```bibtex
@inproceedings{michail-etal-2025-cheap,
    title = "Cheap Character Noise for {OCR}-Robust Multilingual Embeddings",
    author = "Michail, Andrianos and
      Opitz, Juri and
      Wang, Yining and
      Meister, Robin and
      Sennrich, Rico and
      Clematide, Simon",
    editor = "Che, Wanxiang and
      Nabende, Joyce and
      Shutova, Ekaterina and
      Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.609/",
    pages = "11705--11716",
    ISBN = "979-8-89176-256-5"
}
```
The LREC 2026 citation will be added upon publication.

We plan to continue work on multilingual embedding models that are robust across a broader range of conditions. For contributions or access to material not yet released, contact andrianos.michail@cl.uzh.ch.
Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.
Copyright © 2025 The Impresso team. Released under the GNU Affero General Public License v3 or later.
