Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Natural language processing disproportionately favors resource-rich languages.
Most modern language technologies are either nonexistent or unreliable for under-resourced languages.
OCR is used to convert endangered language documents into machine-readable data, but the output is often noisy.
Word alignment models are not built to work under noisy conditions.
This work studies existing word-level alignment models under noisy settings and aims to make them more robust.

Paper Content

Introduction

OCR software is good at recognizing documents in high-resource languages
OCR is not as good at recognizing documents in less-resourced languages
Digitizing material in endangered languages can lead to the creation of NLP technologies
Most languages are traditionally oral, so ASR is needed to obtain textual data
Having good quality aligned data can improve OCR or ASR
Digitizing parallel documents can be beneficial for educational purposes
Alignment tools are brittle in the presence of noise
A probabilistic model was created to simulate OCR errors
4,101 gold alignments were manually annotated for an endangered language pair
Structural knowledge and augmented data was used to reduce the alignment error rate

Problem setting

Our work is an extension of previous word-level alignment work.
We are given a sequence of words in a source language and a target language.
The starting data is the output of an OCR pipeline, which produces noisy parallel data.
Our goal is to produce an alignment that is as close to the alignment we would have obtained without the presence of noise.
We measure model performance using alignment error rate.

Method

Created synthetic data to mimic OCR-like noise
Built probabilistic models based on edit distance measures to capture real OCR errors
Applied error-introducing model on clean parallel data to create synthetic OCR-like data
Used synthetic data to train/finetune alignment models

Ocr error modeling

OCR errors can be classified into three types: insertions, deletions, and substitutions.
Deletions and substitutions are more common than insertions.
Levenshtein distance is used to calculate the probability of errors in the corpus.

Data augmentation

Synthetically noised data can be created by using probability distributions from a clean corpus.
For each character, an error can be randomly introduced based on its error distribution.
Synthetic text is realistic and can be seen in Figure 1.

Model improvement

Synthetically noised parallel data can be used to train or finetune a word alignment model.
Unsupervised models like IBM translation models, fast-align, and Giza++ can be trained on concatenation of all available data.
Neural alignment model of Dou and Neubig (2021) is based on Multilingual BERT (mBERT).
Supervised and unsupervised finetuning can be used with the neural alignment model.
In low-resource scenarios, a diagonal bias can be introduced to improve model performance.

Languages and datasets

English-French language pair studied
English-German language pair studied
Griko-Italian language pair studied
Ainu-Japanese language pair studied

Dataset for error extraction

ICDAR 2019 Competition on Post-OCR Text Correction dataset provides clean and OCRed text for English-French and English-German.
800 OCRed noisy and clean sentences for Griko and Ainu provided by Rijhwani et al. (2020).
Substitutions are the most common errors.
Griko and Ainu have seemingly lower scores than other high-resource languages due to high-quality scans.

Synthetic data

Synthetic data is created by applying OCR noise to clean text.
Clean text for English, French, and German comes from Europarl v8 corpus.
For Ainu, 816 clean sentences from Rijhwani et al. (2020) are used.

Test set and gold alignment

Test set and gold alignment for English-French come from Mihalcea and Pedersen (2003).
Test set for English-German comes from Europarl v7 corpus (Koehn, 2005) and gold alignments from Vilar et al. (2006).
Synthetically-noised test sets created for both language pairs by applying noise on one side or both.
Rijhwani et al. (2020) provide about 800 parallel sentence pairs for low-resource language pairs.
4,101 gold word-level alignment pairs manually created for Griko-Italian test set.
Silver alignments obtained from awesome-align for Ainu-Japanese as no existing gold alignment available.

Experiments

Method results in reducing AER
Multiple experiments conducted to prove this

Experimental setup

IBM model 1&2: classic statistical word alignment models
Underpinned many other statistical machine translation and word alignment models
Can measure how different results are when comparing alignments on clean vs noisy data

The effect of ocr-like noise

We use Griko-Italian as our main evaluation pair due to its gold alignments.
We compare the clean and OCRed versions of Griko-Italian against a manually created gold alignment.
Five different models are benchmarked.
Clean text always results in better alignment for all models.
Giza++ performs best among the models, but suffers the largest drop in performance with noisy text.
Awesome-align performs the worst, not being different than a simple IBM 1.
OCR error does impact alignment quality for both statistical and neural based alignment models.

Addtional data on statistical models

Training with additional data can help statistical models for endangered languages.
Evaluating model performance on Griko-Italian showed that training with additional clean text improved AER.
Training with additional noisy text hurt the models.
Statistical models rely on clean text to improve, which is often unavailable for endangered languages.

Analysis and discussion

Incorporating diagonal bias improves test cases for endangered language pairs
Attention score is increased significantly by adding bias
Diagonal bias is only applied in low-resource settings
Quantitative analyses show better alignment with more data
Higher CER leads to greater AER
Mixing with different degrees of CER produces better results than fixed CER
Giza++ and fast-align outperform vanilla neural awesome-align in monotone alignment
Awesome-align’s performance degrades more when both sides of parallel data are noisy
Word alignment research started with statistical models
Neural network based alignment models gaining in popularity
Previous works involve improving word-level alignment for low-resource languages
Augmenting training data is not new and has been applied in many areas
Applying structure alignment bias on statistical and neural models is well-studied

Conclusion

Benchmarking several popular word alignment models under OCR noisy settings with high-and low-resource language pairs
Proposed a simple yet effective approach to create realistic OCR-like synthetic data
Released 4,101 ground truth word alignment data for Griko-Italian
Evaluated four test sets on English-French and English-German for awesome-align
Created eight 100k English-French synthetic datasets with different CER
Calculated precision, recall and alignment error rate
No significant performance differences in most cases for unsupervised setting
High levels of noise hurt supervised finetuning approach
Introduced diagonal bias and applied it on the top of awesome-align’s attention layer
Synthetic data manage to mimic the real OCR noise
Overall, most models’ performance is close to the baseline

Link to paper#

Abstract#

Paper Content#

Introduction#

Problem setting#

Method#

Ocr error modeling#

Data augmentation#

Model improvement#

Languages and datasets#

Dataset for error extraction#

Synthetic data#

Test set and gold alignment#

Experiments#

Experimental setup#

The effect of ocr-like noise#

Addtional data on statistical models#

Analysis and discussion#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Problem setting

Method

Ocr error modeling

Data augmentation

Model improvement

Languages and datasets

Dataset for error extraction

Synthetic data

Test set and gold alignment

Experiments

Experimental setup

The effect of ocr-like noise

Addtional data on statistical models

Analysis and discussion

Conclusion