Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- ChatGPT is evaluated for machine translation
- Candidate prompts generally work well
- ChatGPT performs competitively with commercial translation products on high-resource European languages
- ChatGPT lags behind significantly on low-resource or distant languages
- ChatGPT does not perform as well as commercial systems on biomedical abstracts or Reddit comments
Paper Content
Introduction
- ChatGPT is an intelligent chatting machine
- It is trained to follow instructions and provide detailed responses
- It can answer followup questions, admit mistakes, challenge incorrect premises, and reject inappropriate requests
- It can do various natural language processing tasks, including question answering, storytelling, logic reasoning, code debugging, and machine translation
Evaluation setting
- Compared 3 commercial translation products
- Evaluated on Flores-101, WMT19 Biomedical Translation Task, WMT20 Robustness Task
- Sampled 50 sentences from each set for evaluation
- Used BLEU score, ChrF++, and TER as metrics
Translation prompts
- ChatGPT was asked to provide ten concise prompts or templates for machine translation
- Three candidate prompts were summarized from the results, with an extra added to one of them
- The three candidate prompts were compared on a Chinese-to-English translation task, with TP3 performing the best in terms of all three metrics
Multilingual translation
- Four languages are evaluated: German, English, Romanian, and Chinese
- 12 directions of translation are tested
- German-English translation is considered a high-resource task
- Romanian-English translation is considered a low-resource task
- ChatGPT performs competitively for German-English translation
- ChatGPT lags behind for Romanian-English translation
- Translating between different language families is harder than within the same language family
Translation robustness
- ChatGPT was evaluated on WMT19 Bio and WMT20 Rob2 and Rob3 test sets.
- WMT19 Bio test set contains Medline abstracts, WMT20 Rob2 contains comments from reddit.com, and WMT20 Rob3 contains a crowdsourced speech recognition corpus.
- ChatGPT does not perform as well as Google Translate or DeepL Translate on WMT19 Bio and WMT2 Rob2 test sets.
Conclusion
- Studied ChatGPT for machine translation
- ChatGPT performs competitively on high-resource European languages, but not low-resource or distant languages
- ChatGPT not as good as commercial systems on biomedical abstracts or Reddit comments, but good for spoken language
- Future work includes investigating impact of historical context and iterative refinement of translation