This repository hosts the resources for two related papers on multilingual embeddings that remain robust under OCR noise:
- ACL 2025 (Findings) — Cheap Character Noise for OCR-Robust Multilingual Embeddings
- LREC 2026 — A Recipe for Adapting Multilingual Embedders to OCR-Error Robustness and Historical Texts — extends the denoising training to a wider set of languages and historical newspapers. (link to be updated)
You’ll find fine-tuned models, noisy evaluation and training datasets, and utilities for simulating character-level OCR noise.
- Repository Structure
- Models
- Datasets
- Reproducing the Experiments
- Citation
- Support & Contributing
- About Impresso
- License
```
├── clean_evaluation_datasets/
│   └── ACL/                          # Clean CLSD (WMT19/21) and X-STS files
├── noisy_evaluation_datasets/
│   └── ACL/                          # OCR-degraded CLSD variants (MN, BL/SD, SnP)
├── noisy_finetuning_data/
│   ├── ACL/                          # TED and X-News with random character noise
│   └── LREC/                         # Historical newspaper + MLSum noised data,
│                                     # plus Luxembourgish parallel pairs (JSONL)
├── generate_random_character_noise/  # Script for stochastic character-level noise generation
├── ocr_simulator/                    # Library for inducing realistic OCR noise in texts
├── adapt_model.py                    # Fine-tuning / adaptation procedure
├── evaluate_embedding_model.py       # Evaluation of embedding models
├── sample_training.ipynb             # Example training notebook
├── sample_config.json                # Example training configuration
└── requirements.txt
```
Our OCR-robust models are available on Hugging Face:
| Paper | Model |
|---|---|
| ACL 2025 | impresso-project/OCR-robust-gte-multilingual-base |
| LREC 2026 | impresso-project/OCR-diversely-robust-gte-multilingual-base |
The ACL 2025 model was fine-tuned on TED with random character noise.
Each dataset folder contains its own README with full file-level details (columns, row counts, CER levels, schema). The sections below give a high-level overview.
Located in clean_evaluation_datasets/ACL/:
- CLSD — Cross-Lingual Semantic Discrimination benchmark on WMT19/21 DE–FR news, used for retrieval-style Precision@1 evaluation. See Michail et al., 2025b.
- X-STS — SemEval-2017 Task 1 cross-lingual STS pairs (AR–EN, EN–ES, ES–EN, TR–EN), used as a control task to check that OCR adaptation does not harm clean-text performance. See Cer et al., 2017.
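The retrieval-style Precision@1 used for CLSD can be computed with plain cosine similarity over sentence embeddings. A minimal sketch, assuming aligned source/target rows and raw NumPy arrays (the function name is hypothetical, not the repository's evaluation code):

```python
import numpy as np

def precision_at_1(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Fraction of source sentences whose nearest target embedding
    (by cosine similarity) is the gold-aligned one at the same index."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                 # (n_src, n_tgt) cosine similarities
    predictions = sims.argmax(axis=1)  # nearest target per source sentence
    gold = np.arange(len(src))
    return float((predictions == gold).mean())
```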
Located in noisy_evaluation_datasets/ACL/:
- Noisy CLSD — OCR-degraded CLSD variants at three noise levels: Minimal Noise (MN), Blackletter/Scanned Distorted (BL/SD), and Salt-and-Pepper (SnP). Generated by printing, visually degrading, and re-OCR-ing the clean CLSD texts with Tesseract.
- HISTLUX bitext mining — Historical Luxembourgish articles (1840–1950) sentence-aligned to modern DE/EN/FR translations, evaluated as bitext mining. See Michail et al., 2025c.
Note: The six HISTLUX bitext mining JSONL files (350–425 MB each) exceed GitHub’s 100 MB limit and are not included in the repository. Download them from Google Drive and place them in `noisy_evaluation_datasets/ACL/`.
Located in noisy_finetuning_data/, split by paper:
- ACL/ — TED2020 and X-News parallel DE–FR texts with random character noise at CER ≈ 4%, 5%, 10%, 15%. Includes a 10K stratified sample per corpus and a full concatenated TED corpus.
- LREC/ — Resources for both stages of the LREC 2026 recipe:
  - Luxembourgish ↔ DE/EN/FR parallel sentence pairs from historical newspapers (Stage A)
  - DE and FR historical newspaper articles with random character noise, plus a noised MLSum sample (Stage B)
The Luxembourgish parallel data originates from impresso/histlux_emb (see prepared_training_sentences/) and is also available as HistLuxAlign on Hugging Face.
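The CER levels above follow the usual definition: character-level edit distance divided by reference length. A small self-contained sketch of that computation (an illustrative implementation, not taken from the repository):

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = Levenshtein distance / number of reference characters."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution / match
        prev = curr
    return prev[n] / m if m else 0.0
```

A CER target of, say, 10% means the noise process is tuned so that this ratio comes out near 0.10 on average over the corpus.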
- Adaptation / fine-tuning: `adapt_model.py`
- Evaluation: `evaluate_embedding_model.py`
Usage instructions are documented inline in each script.
If you use these resources, please cite:
```bibtex
@inproceedings{michail-etal-2025-cheap,
    title = "Cheap Character Noise for {OCR}-Robust Multilingual Embeddings",
    author = "Michail, Andrianos and
      Opitz, Juri and
      Wang, Yining and
      Meister, Robin and
      Sennrich, Rico and
      Clematide, Simon",
    editor = "Che, Wanxiang and
      Nabende, Joyce and
      Shutova, Ekaterina and
      Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.609/",
    pages = "11705--11716",
    ISBN = "979-8-89176-256-5"
}
```
The LREC 2026 citation will be added upon publication.

We plan to continue work on multilingual embedding models that are robust across a broader range of conditions. For contributions or access to material not yet released, contact andrianos.michail@cl.uzh.ch.
Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.
Copyright © 2025 The Impresso team. Released under the GNU Affero General Public License v3 or later.
