
OCR-Robust Multilingual Embeddings — Datasets, Resources, and Adapted Models

ACL 2025 (Vienna) · LREC 2026 · License: AGPLv3+

This repository hosts the resources for two related papers on multilingual embeddings that remain robust under OCR noise:

  • Cheap Character Noise for OCR-Robust Multilingual Embeddings (Findings of ACL 2025)
  • the follow-up LREC 2026 paper (citation to be added below)

You’ll find fine-tuned models, noisy evaluation and training datasets, and utilities for simulating character-level OCR noise.


Repository Structure

├── clean_evaluation_datasets/
│   └── ACL/                        # Clean CLSD (WMT19/21) and X-STS files
├── noisy_evaluation_datasets/
│   └── ACL/                        # OCR-degraded CLSD variants (MN, BL/SD, SnP)
├── noisy_finetuning_data/
│   ├── ACL/                        # TED and X-News with random character noise
│   └── LREC/                       # Historical newspaper + MLSum noised data,
│                                   #   plus Luxembourgish parallel pairs (JSONL)
├── generate_random_character_noise/
│   └── Script for stochastic character-level noise generation.
├── ocr_simulator/
│   └── Library for inducing realistic OCR noise in texts.
├── adapt_model.py                  # Fine-tuning / adaptation procedure
├── evaluate_embedding_model.py     # Evaluation of embedding models
├── sample_training.ipynb           # Example training notebook
├── sample_config.json              # Example training configuration
└── requirements.txt

Models

Our OCR-robust models are available on Hugging Face:

Paper       Model
ACL 2025    impresso-project/OCR-robust-gte-multilingual-base
LREC 2026   impresso-project/OCR-diversely-robust-gte-multilingual-base

The ACL 2025 model was fine-tuned on TED with random character noise.
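
Both checkpoints load directly with sentence-transformers. A minimal usage sketch (trust_remote_code is assumed to be required because the underlying GTE architecture ships custom modeling code):

```python
from sentence_transformers import SentenceTransformer

# The underlying GTE architecture ships custom modeling code,
# hence trust_remote_code (an assumption; drop it if not needed).
model = SentenceTransformer(
    "impresso-project/OCR-robust-gte-multilingual-base",
    trust_remote_code=True,
)

# A clean sentence and an OCR-corrupted variant should land close together.
sentences = [
    "Die Zeitung berichtete über das Ereignis.",
    "Die Zeitunq bericbtete über das Ereiqnis.",  # simulated OCR errors
]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings @ embeddings.T)  # cosine similarity matrix
```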


Datasets

Each dataset folder contains its own README with full file-level details (columns, row counts, CER levels, schema). The sections below give a high-level overview.

Clean Evaluation Datasets

Located in clean_evaluation_datasets/ACL/:

  • CLSD — Cross-Lingual Semantic Discrimination benchmark on WMT19/21 DE–FR news, used for retrieval-style Precision@1 evaluation (a sketch of this protocol follows the list). See Michail et al., 2025b.
  • X-STS — SemEval-2017 Task 1 cross-lingual STS pairs (AR–EN, EN–ES, ES–EN, TR–EN), used as a control task to check that OCR adaptation does not harm clean-text performance. See Cer et al., 2017.
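
For orientation, here is a minimal sketch of the retrieval-style Precision@1 protocol; names and data are illustrative, and the official procedure lives in evaluate_embedding_model.py:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def precision_at_1(model, sources, candidates, gold_indices):
    """Fraction of sources whose nearest candidate (by cosine) is the gold target."""
    src = model.encode(sources, normalize_embeddings=True)
    cand = model.encode(candidates, normalize_embeddings=True)
    predictions = np.argmax(src @ cand.T, axis=1)
    return float(np.mean(predictions == np.asarray(gold_indices)))

# Illustrative toy data: one DE source, its FR translation plus a distractor.
model = SentenceTransformer(
    "impresso-project/OCR-robust-gte-multilingual-base", trust_remote_code=True
)
sources = ["Der Vertrag wurde gestern unterzeichnet."]
candidates = ["Le traité a été signé hier.", "Le match a eu lieu hier soir."]
print(precision_at_1(model, sources, candidates, gold_indices=[0]))
```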

Noisy Evaluation Datasets

Located in noisy_evaluation_datasets/ACL/:

  • Noisy CLSD — OCR-degraded CLSD variants at three noise levels: Minimal Noise (MN), Blackletter/Scanned Distorted (BL/SD), and Salt-and-Pepper (SnP). Generated by printing, visually degrading, and re-OCR-ing the clean CLSD texts with Tesseract; a conceptual sketch of this pipeline follows the note below.
  • HISTLUX bitext mining — Historical Luxembourgish articles (1840–1950) sentence-aligned to modern DE/EN/FR translations, evaluated as bitext mining. See Michail et al., 2025c.

Note: The six HISTLUX bitext mining JSONL files (350–425 MB each) exceed GitHub’s 100 MB limit and are not included in the repository. Download them from Google Drive and place them in noisy_evaluation_datasets/ACL/.
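
For intuition only, a conceptual sketch of a render–degrade–re-OCR pipeline, assuming Pillow and pytesseract with a local Tesseract install and German language data; the actual degradation steps used for the benchmark differ in detail:

```python
import random

from PIL import Image, ImageDraw, ImageFilter
import pytesseract  # requires a local Tesseract install with 'deu' data

def render_degrade_reocr(text, noise_fraction=0.02, lang="deu"):
    """Render text to an image, add salt-and-pepper noise and blur, then re-OCR it."""
    img = Image.new("L", (1400, 60), color=255)
    ImageDraw.Draw(img).text((10, 20), text, fill=0)  # default bitmap font

    rng = random.Random(0)
    pixels = img.load()
    for _ in range(int(img.width * img.height * noise_fraction)):
        x, y = rng.randrange(img.width), rng.randrange(img.height)
        pixels[x, y] = rng.choice((0, 255))  # salt-and-pepper speckle

    img = img.filter(ImageFilter.GaussianBlur(radius=0.8))
    return pytesseract.image_to_string(img, lang=lang).strip()

print(render_degrade_reocr("Die Zeitung berichtete über das Ereignis."))
```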

Fine-tuning Datasets

Located in noisy_finetuning_data/, split by paper:

  • ACL/ — TED2020 and X-News parallel DE–FR texts with random character noise at CER ≈ 4%, 5%, 10%, 15% (a noising sketch follows this list). Includes a 10K stratified sample per corpus and a full concatenated TED corpus.
  • LREC/ — Resources for both stages of the LREC 2026 recipe:
    • Luxembourgish ↔ DE/EN/FR parallel sentence pairs from historical newspapers (Stage A)
    • DE and FR historical newspaper articles with random character noise, plus a noised MLSum sample (Stage B)
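
A minimal, self-contained sketch of such stochastic character-level noising at a target CER; the repository's generate_random_character_noise script is the authoritative implementation and may differ in detail:

```python
import random
import string

def add_character_noise(text, cer=0.10, seed=None):
    """Corrupt roughly `cer` of the characters via substitution, deletion, or insertion."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits + " "
    out = []
    for ch in text:
        if rng.random() < cer:
            op = rng.choice(("substitute", "delete", "insert"))
            if op == "substitute":
                out.append(rng.choice(alphabet))
            elif op == "insert":
                out.append(ch)
                out.append(rng.choice(alphabet))
            # "delete": drop the character entirely
        else:
            out.append(ch)
    return "".join(out)

print(add_character_noise("Die Zeitung berichtete über das Ereignis.", cer=0.10, seed=42))
```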

The Luxembourgish parallel data originates from impresso/histlux_emb (see prepared_training_sentences/) and is also available as HistLuxAlign on Hugging Face.


Reproducing the Experiments

  • Adaptation / fine-tuning: adapt_model.py
  • Evaluation: evaluate_embedding_model.py

Usage instructions are documented inline in each script.


Citation

If you use these resources, please cite:

@inproceedings{michail-etal-2025-cheap,
    title     = "Cheap Character Noise for {OCR}-Robust Multilingual Embeddings",
    author    = "Michail, Andrianos  and
                 Opitz, Juri  and
                 Wang, Yining  and
                 Meister, Robin  and
                 Sennrich, Rico  and
                 Clematide, Simon",
    editor    = "Che, Wanxiang  and
                 Nabende, Joyce  and
                 Shutova, Ekaterina  and
                 Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month     = jul,
    year      = "2025",
    address   = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url       = "https://aclanthology.org/2025.findings-acl.609/",
    pages     = "11705--11716",
    ISBN      = "979-8-89176-256-5"
}

% LREC 2026 citation — to be added

Support & Contributing

We plan to continue work on multilingual embedding models that are robust across a broader range of conditions. For contributions or access to material not yet released, contact andrianos.michail@cl.uzh.ch.


About Impresso


Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.


License

Copyright © 2025 The Impresso team. Released under the GNU Affero General Public License v3 or later.

