Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Natural language processing disproportionately favors resource-rich languages.
- Most modern language technologies are either nonexistent or unreliable for under-resourced languages.
- OCR is used to convert endangered language documents into machine-readable data, but the output is often noisy.
- Word alignment models are not built to work under noisy conditions.
- This work studies existing word-level alignment models under noisy settings and aims to make them more robust.
Paper Content
Introduction
- OCR software is good at recognizing documents in high-resource languages
- OCR is not as good at recognizing documents in less-resourced languages
- Digitizing material in endangered languages can lead to the creation of NLP technologies
- Most languages are traditionally oral, so ASR is needed to obtain textual data
- Having good quality aligned data can improve OCR or ASR
- Digitizing parallel documents can be beneficial for educational purposes
- Alignment tools are brittle in the presence of noise
- A probabilistic model was created to simulate OCR errors
- 4,101 gold alignments were manually annotated for an endangered language pair
- Structural knowledge and augmented data was used to reduce the alignment error rate
Problem setting
- Our work is an extension of previous word-level alignment work.
- We are given a sequence of words in a source language and a target language.
- The starting data is the output of an OCR pipeline, which produces noisy parallel data.
- Our goal is to produce an alignment that is as close to the alignment we would have obtained without the presence of noise.
- We measure model performance using alignment error rate.
Method
- Created synthetic data to mimic OCR-like noise
- Built probabilistic models based on edit distance measures to capture real OCR errors
- Applied error-introducing model on clean parallel data to create synthetic OCR-like data
- Used synthetic data to train/finetune alignment models
Ocr error modeling
- OCR errors can be classified into three types: insertions, deletions, and substitutions.
- Deletions and substitutions are more common than insertions.
- Levenshtein distance is used to calculate the probability of errors in the corpus.
Data augmentation
- Synthetically noised data can be created by using probability distributions from a clean corpus.
- For each character, an error can be randomly introduced based on its error distribution.
- Synthetic text is realistic and can be seen in Figure 1.
Model improvement
- Synthetically noised parallel data can be used to train or finetune a word alignment model.
- Unsupervised models like IBM translation models, fast-align, and Giza++ can be trained on concatenation of all available data.
- Neural alignment model of Dou and Neubig (2021) is based on Multilingual BERT (mBERT).
- Supervised and unsupervised finetuning can be used with the neural alignment model.
- In low-resource scenarios, a diagonal bias can be introduced to improve model performance.
Languages and datasets
- English-French language pair studied
- English-German language pair studied
- Griko-Italian language pair studied
- Ainu-Japanese language pair studied
Dataset for error extraction
- ICDAR 2019 Competition on Post-OCR Text Correction dataset provides clean and OCRed text for English-French and English-German.
- 800 OCRed noisy and clean sentences for Griko and Ainu provided by Rijhwani et al. (2020).
- Substitutions are the most common errors.
- Griko and Ainu have seemingly lower scores than other high-resource languages due to high-quality scans.
Synthetic data
- Synthetic data is created by applying OCR noise to clean text.
- Clean text for English, French, and German comes from Europarl v8 corpus.
- For Ainu, 816 clean sentences from Rijhwani et al. (2020) are used.
Test set and gold alignment
- Test set and gold alignment for English-French come from Mihalcea and Pedersen (2003).
- Test set for English-German comes from Europarl v7 corpus (Koehn, 2005) and gold alignments from Vilar et al. (2006).
- Synthetically-noised test sets created for both language pairs by applying noise on one side or both.
- Rijhwani et al. (2020) provide about 800 parallel sentence pairs for low-resource language pairs.
- 4,101 gold word-level alignment pairs manually created for Griko-Italian test set.
- Silver alignments obtained from awesome-align for Ainu-Japanese as no existing gold alignment available.
Experiments
- Method results in reducing AER
- Multiple experiments conducted to prove this
Experimental setup
- IBM model 1&2: classic statistical word alignment models
- Underpinned many other statistical machine translation and word alignment models
- Can measure how different results are when comparing alignments on clean vs noisy data
The effect of ocr-like noise
- We use Griko-Italian as our main evaluation pair due to its gold alignments.
- We compare the clean and OCRed versions of Griko-Italian against a manually created gold alignment.
- Five different models are benchmarked.
- Clean text always results in better alignment for all models.
- Giza++ performs best among the models, but suffers the largest drop in performance with noisy text.
- Awesome-align performs the worst, not being different than a simple IBM 1.
- OCR error does impact alignment quality for both statistical and neural based alignment models.
Addtional data on statistical models
- Training with additional data can help statistical models for endangered languages.
- Evaluating model performance on Griko-Italian showed that training with additional clean text improved AER.
- Training with additional noisy text hurt the models.
- Statistical models rely on clean text to improve, which is often unavailable for endangered languages.
Analysis and discussion
- Incorporating diagonal bias improves test cases for endangered language pairs
- Attention score is increased significantly by adding bias
- Diagonal bias is only applied in low-resource settings
- Quantitative analyses show better alignment with more data
- Higher CER leads to greater AER
- Mixing with different degrees of CER produces better results than fixed CER
- Giza++ and fast-align outperform vanilla neural awesome-align in monotone alignment
- Awesome-align’s performance degrades more when both sides of parallel data are noisy
- Word alignment research started with statistical models
- Neural network based alignment models gaining in popularity
- Previous works involve improving word-level alignment for low-resource languages
- Augmenting training data is not new and has been applied in many areas
- Applying structure alignment bias on statistical and neural models is well-studied
Conclusion
- Benchmarking several popular word alignment models under OCR noisy settings with high-and low-resource language pairs
- Proposed a simple yet effective approach to create realistic OCR-like synthetic data
- Released 4,101 ground truth word alignment data for Griko-Italian
- Evaluated four test sets on English-French and English-German for awesome-align
- Created eight 100k English-French synthetic datasets with different CER
- Calculated precision, recall and alignment error rate
- No significant performance differences in most cases for unsupervised setting
- High levels of noise hurt supervised finetuning approach
- Introduced diagonal bias and applied it on the top of awesome-align’s attention layer
- Synthetic data manage to mimic the real OCR noise
- Overall, most models’ performance is close to the baseline