The goal of quality estimation (QE) is to measure the quality of a (machine) translation without having access to a reference translation. In this blog post, we explain how the QE score is created and how to interpret it.
What is the difference between quality evaluation and quality estimation?
While the words are fairly similar and are often used interchangeably, in fact they refer to two fundamentally different processes, particularly in the context of machine translation:
What is the QE score based on?
The TAUS QE score is mostly based on semantic similarity. To calculate it, we use sentence embedding vectors that represent the meaning of each segment and measure how similar the source and target segments are. To achieve maximum accuracy and language coverage, TAUS uses embeddings from multiple language models.
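To make the idea of embedding-based similarity more concrete, here is a minimal sketch in Python. It is not the TAUS implementation: the function names, the toy vectors, and the choice to average cosine similarity across models and clip it to the 0 to 1 range are all assumptions made for illustration. In practice, the vectors would come from real multilingual embedding models.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector carries no meaning to compare
    return dot / (norm_u * norm_v)


def similarity_score(source_vecs, target_vecs):
    """Average the cosine similarity over several embedding models.

    source_vecs[i] and target_vecs[i] are the embeddings of the source and
    target segment produced by the i-th (hypothetical) model.
    Averaging and clipping to [0, 1] are illustrative assumptions.
    """
    sims = [cosine_similarity(s, t) for s, t in zip(source_vecs, target_vecs)]
    mean = sum(sims) / len(sims)
    return max(0.0, min(1.0, mean))


# Toy 2-dimensional "embeddings" from two hypothetical models.
source_vecs = [[1.0, 0.0], [0.6, 0.8]]
target_vecs = [[1.0, 0.0], [0.8, 0.6]]
print(round(similarity_score(source_vecs, target_vecs), 2))
```

The first model sees the two segments as identical in meaning and the second as very close, so the combined score ends up near the top of the range.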
The trained model provides a QE score for each segment. The scores range from 0 to 1, and can be interpreted as follows:
Are the QE scores similar to the translation memory (TM) matches?
While the concept of a QE score bears some resemblance to a TM match score, especially in terms of the application, the underlying logic and interpretations of these scores diverge significantly:
How reliable is the QE score?
As the word “estimation” suggests, the QE score is an approximation. This means that the value provided by a QE model depends on the context in which it is used. With generic models, where vast multilingual training data is available, the model tries to learn the intrinsic mathematical representations of sentences in various languages. It then assigns a score based on the similarity between two sentences, signifying how equivalent they are in meaning. When applying this in a post-editing workflow, human reviewers need to be aware of how the score range correlates with human judgment. That correlation can then serve as a guide for interpretation, for example when deciding whether a score of 85% or 90% should count as good.
Model customization offers the flexibility to tailor this score according to specific requirements and scenarios, which allows more adaptability and gives more certainty to the ranges. Read here how MotionPoint set out to reduce their post-editing effort for a specific customer.
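One way to approach the question of whether 85% or 90% should count as good is to compare QE scores with human judgments on a small sample. The sketch below assumes you have segments that reviewers marked as acceptable or not. The sample data, the function name, and the use of precision as the yardstick are hypothetical choices for illustration, not a description of how TAUS calibrates its models.

```python
def precision_at_threshold(scored_segments, threshold):
    """Share of segments scoring at or above the threshold that
    human reviewers judged acceptable. Returns None if no segment
    reaches the threshold."""
    kept = [ok for score, ok in scored_segments if score >= threshold]
    if not kept:
        return None
    return sum(kept) / len(kept)


# Hypothetical sample: (QE score, human judged the segment acceptable?)
sample = [
    (0.95, True),
    (0.91, True),
    (0.88, True),
    (0.86, False),
    (0.80, False),
    (0.70, False),
]

for threshold in (0.85, 0.90):
    print(threshold, precision_at_threshold(sample, threshold))
```

On this made-up sample, three out of four segments at or above 0.85 were acceptable, while every segment at or above 0.90 was. A team that wants to skip post-editing for high-scoring segments would lean towards the stricter threshold, at the cost of sending more segments to review.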
What are the options for the QE score categorization?
TAUS can create custom models that are fine-tuned to a specific domain and language pair. The training data should be labeled, but the type and values of the labels can vary per use case. Labels can be discrete, such as "poor", "below average", "average", "good", "excellent", or 1, 2, 3, 4, or they can be continuous. While it is possible to train a single model to work for many language pairs or topics/domains, we have found that the best results are obtained by training custom models that are both topic-/domain- and language-pair specific, e.g., French-German for the Health domain.
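To illustrate how discrete and continuous labels relate, the following sketch maps the five example labels onto evenly spaced values between 0 and 1 and back again. The even spacing and the function names are assumptions made for this example. A real project would define its own scale together with the people labeling the data.

```python
LABELS = ["poor", "below average", "average", "good", "excellent"]


def label_to_target(label):
    """Map a discrete label to an evenly spaced value in [0, 1]."""
    return LABELS.index(label) / (len(LABELS) - 1)


def score_to_label(score):
    """Map a continuous score back to the nearest discrete label."""
    clipped = max(0.0, min(1.0, score))  # guard against out-of-range input
    index = round(clipped * (len(LABELS) - 1))
    return LABELS[index]


print(label_to_target("average"))
print(score_to_label(0.8))
```

With this mapping, "average" sits in the middle of the scale and a continuous score of 0.8 falls into the "good" bucket.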
Dace is a product and operations management professional with 15+ years of experience in the localization industry. Over the past 7 years, she has taken on various roles at TAUS, ranging from account management to product and operations management. Since 2020, she has been a member of the Executive Team and leads the strategic planning and business operations of a team of 20+ employees. She holds a Bachelor’s degree in Translation and Interpreting and a Master’s degree in Social and Cultural Anthropology.