Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • Natural language processing disproportionately favors resource-rich languages.
  • Most modern language technologies are either nonexistent or unreliable for under-resourced languages.
  • OCR is used to convert endangered language documents into machine-readable data, but the output is often noisy.
  • Word alignment models are not built to work under noisy conditions.
  • This work studies existing word-level alignment models under noisy settings and aims to make them more robust.

Paper Content


  • OCR software is good at recognizing documents in high-resource languages
  • OCR is not as good at recognizing documents in less-resourced languages
  • Digitizing material in endangered languages can lead to the creation of NLP technologies
  • Most languages are traditionally oral, so ASR is needed to obtain textual data
  • Having good quality aligned data can improve OCR or ASR
  • Digitizing parallel documents can be beneficial for educational purposes
  • Alignment tools are brittle in the presence of noise
  • A probabilistic model was created to simulate OCR errors
  • 4,101 gold alignments were manually annotated for an endangered language pair
  • Structural knowledge and augmented data were used to reduce the alignment error rate

Problem setting

  • Our work is an extension of previous word-level alignment work.
  • We are given a sequence of words in a source language and a target language.
  • The starting data is the output of an OCR pipeline, which produces noisy parallel data.
  • Our goal is to produce an alignment that is as close as possible to the one we would have obtained in the absence of noise.
  • We measure model performance using alignment error rate.
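
The summary above does not spell out the metric, so the following is a minimal sketch of alignment error rate under the standard definition of Och and Ney (2003), with sure links S, possible links P, and predicted links A. The function name and the representation of links as (source index, target index) pairs are assumptions made for illustration.

```python
def alignment_scores(predicted, sure, possible=None):
    """Return (precision, recall, aer) for sets of (src_idx, tgt_idx) links.

    Standard definitions (Och and Ney, 2003):
      precision = |A & P| / |A|
      recall    = |A & S| / |S|
      AER       = 1 - (|A & S| + |A & P|) / (|A| + |S|)
    """
    predicted = set(predicted)
    sure = set(sure)
    # By convention every sure link also counts as a possible link.
    possible = sure | set(possible or ())
    if not predicted or not sure:
        raise ValueError("need at least one predicted link and one sure link")

    hits_sure = len(predicted & sure)
    hits_possible = len(predicted & possible)

    precision = hits_possible / len(predicted)
    recall = hits_sure / len(sure)
    aer = 1.0 - (hits_sure + hits_possible) / (len(predicted) + len(sure))
    return precision, recall, aer


if __name__ == "__main__":
    gold = {(0, 0), (1, 1), (2, 2)}
    hypothesis = {(0, 0), (1, 1), (2, 3)}
    print(alignment_scores(hypothesis, gold))
```

When a gold standard contains only sure links, P equals S and AER reduces to one minus the F1 score.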


  • Created synthetic data to mimic OCR-like noise
  • Built probabilistic models based on edit distance measures to capture real OCR errors
  • Applied error-introducing model on clean parallel data to create synthetic OCR-like data
  • Used synthetic data to train/finetune alignment models

OCR error modeling

  • OCR errors can be classified into three types: insertions, deletions, and substitutions.
  • Deletions and substitutions are more common than insertions.
  • Levenshtein distance is used to calculate the probability of errors in the corpus.
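
One way to turn Levenshtein distance into error probabilities is to backtrace a minimal edit path for each clean/OCRed pair and count what each clean character was turned into. The sketch below illustrates that idea only; the function names, the tie-breaking order in the backtrace, and the dictionary layout are assumptions, not details taken from the paper.

```python
from collections import Counter, defaultdict


def edit_ops(clean, noisy):
    """Return (op, clean_char, noisy_char) tuples along one minimal edit path."""
    n, m = len(clean), len(noisy)
    # dist[i][j] = Levenshtein distance between clean[:i] and noisy[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if clean[i - 1] == noisy[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # keep / substitution

    # Walk back from the bottom-right corner, preferring the diagonal move.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (clean[i - 1] != noisy[j - 1]):
            op = "keep" if clean[i - 1] == noisy[j - 1] else "sub"
            ops.append((op, clean[i - 1], noisy[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", clean[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", noisy[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def estimate_error_model(pairs):
    """Map each clean character to a distribution over its observed outputs.

    An output of "" means the character was deleted; the key "" collects
    inserted characters (hypothetical layout chosen for this sketch).
    """
    counts = defaultdict(Counter)
    for clean, noisy in pairs:
        for _, clean_char, noisy_char in edit_ops(clean, noisy):
            counts[clean_char][noisy_char] += 1
    return {c: {o: k / sum(ctr.values()) for o, k in ctr.items()}
            for c, ctr in counts.items()}
```

Because several minimal edit paths can exist for one pair, the counts depend on the tie-breaking rule; preferring the diagonal, as here, favors substitutions over insertion/deletion pairs.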

Data augmentation

  • Synthetically noised data can be created by using probability distributions from a clean corpus.
  • For each character, an error can be randomly introduced based on its error distribution.
  • Synthetic text is realistic and can be seen in Figure 1.
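
The per-character sampling described above can be sketched as follows. The function name, the model layout (clean character mapped to a distribution over output strings, with "" meaning deletion), and the toy probabilities are all hypothetical; real distributions would be estimated from clean/OCRed pairs.

```python
import random


def add_ocr_noise(text, char_model, rng=None):
    """Return a noised copy of text, sampling each character's output.

    char_model[ch] is a dict {output_string: probability}. Characters
    without an entry are copied unchanged.
    """
    rng = rng or random.Random()
    out = []
    for ch in text:
        dist = char_model.get(ch)
        if not dist:
            out.append(ch)
            continue
        outputs = list(dist)
        weights = [dist[o] for o in outputs]
        # Draw one output for this character according to its distribution.
        out.append(rng.choices(outputs, weights=weights, k=1)[0])
    return "".join(out)


if __name__ == "__main__":
    # Toy, made-up error distributions for illustration only.
    toy_model = {
        "e": {"e": 0.90, "c": 0.07, "": 0.03},  # substitution or deletion
        "m": {"m": 0.95, "rn": 0.05},           # one character read as two
    }
    print(add_ocr_noise("some members of the committee", toy_model,
                        rng=random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the generated corpus reproducible, which helps when comparing alignment models on the same synthetic data.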

Model improvement

  • Synthetically noised parallel data can be used to train or finetune a word alignment model.
  • Unsupervised models such as the IBM translation models, fast-align, and Giza++ can be trained on the concatenation of all available data.
  • Neural alignment model of Dou and Neubig (2021) is based on Multilingual BERT (mBERT).
  • Supervised and unsupervised finetuning can be used with the neural alignment model.
  • In low-resource scenarios, a diagonal bias can be introduced to improve model performance.
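
The summary does not give the exact form of the diagonal bias. As an illustration only, the sketch below borrows the relative-position term used by fast-align's diagonal prior, -|i/m - j/n|, and adds it to a source-by-target score matrix before a row-wise softmax. The function names, the `strength` parameter, and the 0-based indexing are assumptions.

```python
import math


def diagonal_bias(m, n, strength=1.0):
    """Bias matrix that penalizes links far from the diagonal.

    Entry (i, j) is -strength * |i/m - j/n| for a source sentence of
    length m and a target sentence of length n (0-based positions).
    """
    return [[-strength * abs(i / m - j / n) for j in range(n)]
            for i in range(m)]


def biased_row_softmax(scores, strength=1.0):
    """Add the diagonal bias to raw scores, then normalize each row."""
    m, n = len(scores), len(scores[0])
    bias = diagonal_bias(m, n, strength)
    result = []
    for i in range(m):
        logits = [scores[i][j] + bias[i][j] for j in range(n)]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


if __name__ == "__main__":
    # With uninformative scores, the bias alone pulls mass to the diagonal.
    flat = [[0.0] * 3 for _ in range(3)]
    for row in biased_row_softmax(flat, strength=4.0):
        print([round(p, 3) for p in row])
```

A larger `strength` makes the alignment more monotone; setting it to zero recovers the unbiased softmax.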

Languages and datasets

  • English-French language pair studied
  • English-German language pair studied
  • Griko-Italian language pair studied
  • Ainu-Japanese language pair studied

Dataset for error extraction

  • ICDAR 2019 Competition on Post-OCR Text Correction dataset provides clean and OCRed text for English-French and English-German.
  • 800 OCRed noisy and clean sentences for Griko and Ainu provided by Rijhwani et al. (2020).
  • Substitutions are the most common errors.
  • Griko and Ainu have seemingly lower scores than the high-resource languages, which is attributed to high-quality scans.

Synthetic data

  • Synthetic data is created by applying OCR noise to clean text.
  • Clean text for English, French, and German comes from Europarl v8 corpus.
  • For Ainu, 816 clean sentences from Rijhwani et al. (2020) are used.

Test set and gold alignment

  • Test set and gold alignment for English-French come from Mihalcea and Pedersen (2003).
  • Test set for English-German comes from Europarl v7 corpus (Koehn, 2005) and gold alignments from Vilar et al. (2006).
  • Synthetically-noised test sets created for both language pairs by applying noise on one side or both.
  • Rijhwani et al. (2020) provide about 800 parallel sentence pairs for low-resource language pairs.
  • 4,101 gold word-level alignment pairs manually created for Griko-Italian test set.
  • Silver alignments were obtained from awesome-align for Ainu-Japanese, as no gold alignment is available.


  • Method results in reducing AER
  • Multiple experiments were conducted to support this

Experimental setup

  • IBM Models 1 and 2: classic statistical word alignment models
  • Underpinned many other statistical machine translation and word alignment models
  • Can measure how different results are when comparing alignments on clean vs noisy data
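
For readers unfamiliar with these baselines, here is a compact sketch of IBM Model 1 trained with expectation-maximization on a toy bitext. It is a textbook-style illustration, not the implementation used in the experiments; the function names, the NULL source token, and the uniform initialization are choices made for this example.

```python
from collections import defaultdict


def train_ibm1(bitext, iterations=10):
    """Estimate lexical translation probabilities t[(tgt_word, src_word)].

    bitext is a list of (source_tokens, target_tokens) pairs. A NULL
    source token (None) is added to every sentence.
    """
    t = defaultdict(lambda: 1.0)  # uniform start; any constant works
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in bitext:
            src_null = [None] + list(src)
            for w in tgt:
                # E-step: share each target word among the source words.
                norm = sum(t[(w, s)] for s in src_null)
                for s in src_null:
                    frac = t[(w, s)] / norm
                    count[(w, s)] += frac
                    total[s] += frac
        # M-step: renormalize the expected counts per source word.
        t = defaultdict(float,
                        {(w, s): c / total[s] for (w, s), c in count.items()})
    return t


def align(src, tgt, t):
    """Link each target word to its most probable source word."""
    return [(max(range(len(src)), key=lambda i: t[(w, src[i])]), j)
            for j, w in enumerate(tgt)]


if __name__ == "__main__":
    toy = [("das haus".split(), "the house".split()),
           ("das buch".split(), "the book".split()),
           ("ein buch".split(), "a book".split())]
    table = train_ibm1(toy)
    print(align("das haus".split(), "the house".split(), table))
```

Because Model 1 relies only on word co-occurrence, a character error that turns a word into a new surface form splits its counts, which is one intuition for why such models are sensitive to OCR noise.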

The effect of OCR-like noise

  • We use Griko-Italian as our main evaluation pair due to its gold alignments.
  • We compare the clean and OCRed versions of Griko-Italian against a manually created gold alignment.
  • Five different models are benchmarked.
  • Clean text always results in better alignment for all models.
  • Giza++ performs best among the models, but suffers the largest drop in performance with noisy text.
  • Awesome-align performs the worst, with results no different from those of a simple IBM Model 1.
  • OCR error does impact alignment quality for both statistical and neural based alignment models.

Additional data on statistical models

  • Training with additional data can help statistical models for endangered languages.
  • Evaluating model performance on Griko-Italian showed that training with additional clean text improved AER.
  • Training with additional noisy text hurt the models.
  • Statistical models rely on clean text to improve, which is often unavailable for endangered languages.

Analysis and discussion

  • Incorporating diagonal bias improves test cases for endangered language pairs
  • Attention score is increased significantly by adding bias
  • Diagonal bias is only applied in low-resource settings
  • Quantitative analyses show better alignment with more data
  • Higher CER leads to greater AER
  • Mixing with different degrees of CER produces better results than fixed CER
  • Giza++ and fast-align outperform vanilla neural awesome-align in monotone alignment
  • Awesome-align’s performance degrades more when both sides of parallel data are noisy
  • Word alignment research started with statistical models
  • Neural network based alignment models gaining in popularity
  • Previous works involve improving word-level alignment for low-resource languages
  • Augmenting training data is not new and has been applied in many areas
  • Applying structure alignment bias on statistical and neural models is well-studied


  • Benchmarked several popular word alignment models under noisy OCR settings with high- and low-resource language pairs
  • Proposed a simple yet effective approach to create realistic OCR-like synthetic data
  • Released 4,101 ground-truth word alignments for Griko-Italian
  • Evaluated four test sets on English-French and English-German for awesome-align
  • Created eight 100k English-French synthetic datasets with different CER
  • Calculated precision, recall and alignment error rate
  • No significant performance differences in most cases for unsupervised setting
  • High levels of noise hurt supervised finetuning approach
  • Introduced diagonal bias and applied it on top of awesome-align’s attention layer
  • Synthetic data manages to mimic real OCR noise
  • Overall, most models’ performance is close to the baseline