Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • ChatGPT is evaluated for machine translation
  • Candidate prompts generally work well
  • ChatGPT performs competitively with commercial translation products on high-resource European languages
  • ChatGPT lags behind significantly on low-resource or distant languages
  • ChatGPT does not perform as well as commercial systems on biomedical abstracts or Reddit comments

Paper Content

Introduction

  • ChatGPT is an intelligent chatting machine
  • It is trained to follow instructions and provide detailed responses
  • It can answer followup questions, admit mistakes, challenge incorrect premises, and reject inappropriate requests
  • It can perform various natural language processing tasks, including question answering, storytelling, logical reasoning, code debugging, and machine translation

Evaluation setting

  • ChatGPT was compared with three commercial translation products
  • Evaluated on Flores-101, WMT19 Biomedical Translation Task, WMT20 Robustness Task
  • Sampled 50 sentences from each set for evaluation
  • Used BLEU score, ChrF++, and TER as metrics
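
The sampling and scoring steps above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the function names, the fixed seed, and the whitespace tokenization are all assumptions, and `simple_edit_rate` is a simplified stand-in for TER that counts word-level insertions, deletions, and substitutions but omits TER's shift operation. BLEU and ChrF++ are not reproduced here; in practice an established metrics toolkit would be used for all three.

```python
import random


def sample_test_set(pairs, k=50, seed=0):
    """Return k (source, reference) pairs drawn without replacement.

    The seed is fixed (an assumption) so every system is scored on the
    same subset.
    """
    rng = random.Random(seed)
    return rng.sample(pairs, k)


def word_edit_distance(hyp_tokens, ref_tokens):
    """Levenshtein distance over word tokens."""
    prev = list(range(len(ref_tokens) + 1))
    for i, h in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, r in enumerate(ref_tokens, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simple_edit_rate(hypothesis, reference):
    """Edits per reference word; lower is better. No shift operation."""
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return word_edit_distance(hypothesis.split(), ref_tokens) / len(ref_tokens)
```

For example, `simple_edit_rate("the cat sat on mat", "the cat sat on the mat")` counts one missing word against a six-word reference.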

Translation prompts

  • ChatGPT was asked to provide ten concise prompts or templates for machine translation
  • Three candidate prompts (TP1, TP2, TP3) were distilled from ChatGPT's suggestions, with an extra instruction added to one of them
  • The three candidate prompts were compared on a Chinese-to-English translation task, with TP3 performing best on all three metrics
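
To make the idea of a prompt template concrete, here is a small sketch of how candidate templates with source and target language slots could be filled in. The template wording, the `[SRC]`/`[TGT]` placeholder convention, and the `build_prompt` helper are illustrative assumptions and may differ from the exact prompts used in the paper.

```python
# Illustrative templates only; the wording is an assumption, not a quotation.
TEMPLATES = {
    "TP1": "Translate these sentences from [SRC] to [TGT]:",
    "TP2": "Answer with no quotes. What do these sentences mean in [TGT]?",
    "TP3": "Please provide the [TGT] translation for these sentences:",
}


def build_prompt(template, src_lang, tgt_lang, sentences):
    """Fill the language slots and append one sentence per line."""
    header = template.replace("[SRC]", src_lang).replace("[TGT]", tgt_lang)
    return header + "\n" + "\n".join(sentences)


if __name__ == "__main__":
    print(build_prompt(TEMPLATES["TP3"], "Chinese", "English", ["你好。"]))
```

Keeping the templates in a dictionary makes it straightforward to run the same test sentences through each candidate and compare the resulting scores.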

Multilingual translation

  • Four languages are evaluated: German, English, Romanian, and Chinese
  • 12 directions of translation are tested
  • German-English translation is considered a high-resource task
  • Romanian-English translation is considered a low-resource task
  • ChatGPT performs competitively for German-English translation
  • ChatGPT lags behind for Romanian-English translation
  • Translating between different language families is harder than within the same language family
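
The count of 12 directions follows from taking every ordered pair of the four languages (4 × 3 = 12). A minimal sketch, with variable names chosen here for illustration:

```python
from itertools import permutations

LANGUAGES = ["German", "English", "Romanian", "Chinese"]

# Every ordered (source, target) pair with distinct languages.
directions = list(permutations(LANGUAGES, 2))

if __name__ == "__main__":
    for src, tgt in directions:
        print(f"{src} -> {tgt}")
```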

Translation robustness

  • ChatGPT was evaluated on the WMT19 Bio and the WMT20 Rob2 and Rob3 test sets
  • The WMT19 Bio test set contains Medline abstracts, WMT20 Rob2 contains comments from reddit.com, and WMT20 Rob3 contains a crowdsourced speech recognition corpus
  • ChatGPT does not perform as well as Google Translate or DeepL Translate on the WMT19 Bio and WMT20 Rob2 test sets

Conclusion

  • Studied ChatGPT for machine translation
  • ChatGPT performs competitively on high-resource European languages, but lags behind on low-resource or distant languages
  • ChatGPT is not as good as commercial systems on biomedical abstracts or Reddit comments, but performs well on spoken language
  • Future work includes investigating impact of historical context and iterative refinement of translation