The translation industry is adopting machine translation (MT) at an increasing rate. Yet a prerequisite for efficient adoption, the evaluation of MT output quality, remains a major challenge for all. Each year, many research publications investigate novel approaches to calculating quality automatically. The handful of techniques that have entered the industry over the years are commonly thought to be of limited use.
Human evaluation is relatively expensive, time-consuming and prone to subjectivity. However, when done well, it is still felt to be more trustworthy than automated metrics. There are no established best practices for MT quality evaluation, and no reliable benchmarking data is yet available to enable cross-industry comparisons of users’ performance. Furthermore, there is little open sharing of learning between industry and the research community.
Automated metrics often assume a single correct output, because only on rare occasions are resources available to produce more than one reference translation. The most commonly used metrics range from word error rate or edit distance computation to a myriad of string similarity comparisons, the latter including the well-known BLEU and METEOR variants.
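To make this concrete, here is a minimal, self-contained Python sketch of the kind of computation behind word error rate: a word-level edit distance normalized by the reference length. It is illustrative only; the sentences are invented, and real implementations add tokenization, casing and punctuation handling, and corpus-level aggregation.

```python
# Word-level edit distance, the core of WER-style metrics (illustrative sketch).

def word_edit_distance(hypothesis: str, reference: str) -> int:
    """Levenshtein distance between two sentences, counted in words."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = edits needed to turn hyp[:i] into ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(hyp)][len(ref)]

def wer(hypothesis: str, reference: str) -> float:
    """Word error rate: edit distance normalized by reference length."""
    return word_edit_distance(hypothesis, reference) / max(len(reference.split()), 1)

print(wer("the cat sat on the mat", "the cat is on the mat"))
# ~0.17: one substitution out of six reference words
```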
These metrics are simple, cheap and fast, but they are not very accurate. More importantly, they are rarely connected to the notions of quality that are relevant for the intended use of such translations. For a long time, the research community has been using such metrics on standard – often artificial – datasets for which human translations are available.
BLEU is by far the most popular option, despite a number of well-known limitations, such as its low correlation with human judgements at the sentence level and its inability to recognize synonyms as valid matches. Only rarely are translations also assessed manually to verify whether improvements according to such metrics are indeed observed.
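The synonym issue can be illustrated with a short sketch using NLTK's sentence-level BLEU (assuming the nltk package is installed; the sentences are invented). A candidate that differs from the reference by a single synonym loses a substantial share of its score even though the meaning is unchanged.

```python
# Sentence-level BLEU penalizes a synonym as heavily as any other mismatch.
# Requires the nltk package; sentences are invented for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the new engine produced very fast translations".split()
exact     = "the new engine produced very fast translations".split()
synonym   = "the new engine produced very quick translations".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short segments

print(sentence_bleu([reference], exact, smoothing_function=smooth))
# -> 1.0 (identical to the reference)
print(sentence_bleu([reference], synonym, smoothing_function=smooth))
# -> roughly 0.64: a single synonym costs over a third of the score,
#    even though the meaning is preserved
```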
Metrics used for quality assessment of MT in production settings tend to be more application-oriented, commonly using information derived from the post-editing of automatic translations (such as the edit distance between the MT system output and its post-edited version, or post-editing time). They are applied to production-relevant datasets, as opposed to artificial datasets for which reference translations are available.
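As a rough illustration of such post-editing-based measures, the sketch below approximates an edit rate between raw MT output and its post-edited version using Python's standard difflib. It is only a stand-in for proper TER/HTER tooling (which also accounts for block shifts), and the example sentences are invented.

```python
# Rough post-editing effort measure: the share of words that changed between
# the raw MT output and its post-edited version (an HTER-like edit rate).
# Uses only the standard library; real HTER/TER tooling also handles shifts.
from difflib import SequenceMatcher

def pe_edit_rate(mt_output: str, post_edit: str) -> float:
    mt, pe = mt_output.split(), post_edit.split()
    matched = sum(block.size for block in
                  SequenceMatcher(a=mt, b=pe).get_matching_blocks())
    edits = max(len(mt), len(pe)) - matched  # words inserted, deleted or replaced
    return edits / max(len(pe), 1)           # normalize by post-edit length

mt_output = "the contract must is signed before friday"
post_edit = "the contract must be signed by friday"
print(pe_edit_rate(mt_output, post_edit))  # ~0.29: two of seven words were edited
```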
Overall, metrics based on post-edited machine translations provide a good proxy for human judgements of quality, but in practice their use is limited: they cannot be applied while the system is being developed, especially for the optimization of parameters in statistical approaches, where millions of sentences need to be scored quickly and repeatedly as the algorithm iterates over possible parameter values.
In addition, post-editing is only one of the possible ways in which automatic translations can be exploited. The use of raw MT for assimilation or gisting is becoming more popular, and thus using post-editing as a quality metric is not always appropriate.
The divergence between metrics used for MT system development and metrics used in production is far from ideal: MT systems should be developed and optimized against metrics that reflect real production needs. To bridge this gap, more advanced metrics are needed: metrics that take the quality requirements at hand into account, yet remain cheap and fast to run.
Such metrics can be useful both for developing or improving MT systems and in production, in cases where manual evaluation is not feasible or needs to be minimized. These are “trained” metrics, that is, metrics that rely on relevant data at design time as a way of learning how to address specific quality requirements. Once trained, however, they can be applied to new data with the same quality requirements, without the need for human reference translations. These metrics are commonly referred to as “quality estimation” metrics.
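As a simplified illustration of what a trained, reference-free metric looks like (a hypothetical sketch using scikit-learn, not the general framework mentioned below), the example learns to predict a made-up post-editing effort score from a handful of shallow features of the source and the MT output, and then scores new segments without any human translation.

```python
# Minimal sketch of a "trained" quality estimation metric: predict a quality
# label from shallow, reference-free features of the source and the MT output.
# The features, data and effort scores are illustrative assumptions only.
from sklearn.ensemble import RandomForestRegressor

def features(source: str, mt_output: str) -> list:
    src, tgt = source.split(), mt_output.split()
    return [
        len(src),                          # source length in words
        len(tgt),                          # target length in words
        len(tgt) / max(len(src), 1),       # length ratio
        len(set(tgt)) / max(len(tgt), 1),  # target type/token ratio
    ]

# Past assessments: (source, MT output, observed post-editing effort score).
train = [
    ("the contract is signed", "le contrat est signé", 0.05),
    ("please restart the server", "veuillez redémarrer le serveur", 0.10),
    ("the the report report is late", "le le rapport rapport est en retard", 0.60),
]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit([features(s, t) for s, t, _ in train], [y for _, _, y in train])

# Once trained, the metric scores new segments without any reference translation.
print(model.predict([features("the invoice is overdue", "la facture est en retard")]))
```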
Significant research on quality estimation metrics has been carried out in recent years: a general framework for building such metrics is available and can be customized to specific language pairs, text types and domains, and quality requirements. However, these metrics have only been tested in very narrow scenarios, for a couple of language pairs and datasets commonly used by the MT research community.
Work in this area has been held back by the limited availability of relevant data for training metrics. Relevant data consists of a fairly small number of examples (1,000 or more) pairing source segments with their translations (preferably at the sentence level) for which a quality assessment has already been performed. This quality assessment can take various forms: post-editing (the actual post-edited translations or statistics from the process, such as time measurements, logs of edits, or edit distance), accuracy/fluency judgements, error counts, Likert scores, etc.
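For illustration only, one possible way to organize such assessment records before training is sketched below; the field names are hypothetical rather than a prescribed schema.

```python
# One possible layout for assessment records before training a quality
# estimation metric; field names are hypothetical, not a prescribed schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class QualityRecord:
    source: str                             # source-language segment
    mt_output: str                          # raw machine translation
    post_edit: Optional[str] = None         # post-edited translation, if any
    pe_seconds: Optional[float] = None      # post-editing time
    edit_distance: Optional[float] = None   # e.g. edit rate against the post-edit
    likert: Optional[int] = None            # e.g. 1-5 adequacy/fluency judgement
    error_count: Optional[int] = None       # annotated error count

record = QualityRecord(
    source="the invoice is overdue",
    mt_output="la facture est en retard",
    pe_seconds=12.4,
    likert=4,
)
```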
This type of data is often abundant among providers and buyers of automatic translations, since they routinely need to assess translations for quality assurance. Research on better, reference-free automatic evaluation metrics would therefore greatly benefit from a closer relationship between industry and academia.
As a first step, data of this type provided by industry could be used to train a number of variants of quality estimation metrics using the existing framework. Industry collaborators could then validate these metrics, for example by comparing their scores directly against those given by humans, or by using them to select relevant data samples for manual assessment (e.g. the cases estimated to have the lowest quality).
Feedback to researchers on the quality of the metrics, and on how they need to be adapted to particular scenarios, could drive further improvements. The benefits for the industry include better automatic metrics to support or minimize the need for human assessment, and potentially better MT systems.
Platforms and tools such as DQF can facilitate such a collaboration between industry and academia, by providing systematic ways of collecting and storing quality assessments (according to specific requirements for a given content type, audience, purpose, etc.) that can be directly used to train quality estimation metrics. Additionally, quality estimation metrics could be integrated into such platforms to support human evaluation.
Would you like to learn more about this topic? Consider signing up for our MT user group!
Attila Görög worked on various national and international language technology projects. He has a solid background in quality evaluation, post-editing and terminology management. As Director of Enterprise Member Services, he worked mostly with large enterprises involved in the TAUS community and hosted TAUS user groups until 2017.