The year 2020 is set to be the most difficult for most of us in decades. While the future of the translation industry after the COVID-19 pandemic depends on many factors, there is no doubt that its technologies are set to evolve radically. Indeed, in many ways, the machine translation (MT) journey is just beginning: the end of 2019 and the beginning of 2020 were full of fresh, eye-opening perspectives on tomorrow’s MT and other language technologies.
So, what trends should you keep an eye on in 2020?
We all know that context is the key to unlocking new potential for MT. Sentence-by-sentence MT, which is used almost everywhere in production these days, cannot always choose the best possible translation. This is because it does not have access to the previous or (in the case of document translation) following sentences during the translation process. This is a real limitation, as context can contain critical information. For example, when translating the Spanish sentence “Estas son sus gafas” (“These are his/her/their glasses”) without access to the previous context, it’s impossible for a human or a machine to choose the right translation of the word “sus”.
In academia, several studies on NMT have attempted to use document-level information to incorporate the context of the previous and in some cases following sentences into the translation process - keep an eye on Elena Voita’s work for example. But we can already see the rise of context-aware MT in industry: in 2019 DeepL started using previous sentences as context during production. This is most likely to evolve and spread from DeepL to its competitors, and then from consumer apps to business systems. I’m sure we will see more of this capability emerging in the course of 2020.
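One simple way to give a sentence-level system some context is to prepend the preceding sentences to each input. The sketch below illustrates only that idea; the function name, the window size and the `<sep>` token are assumptions made for illustration, and it is not a description of how DeepL or any other vendor implements context-aware MT.

```python
from typing import List

# Hypothetical separator token; a real system would define its own.
SEP = " <sep> "


def with_context(sentences: List[str], window: int = 1) -> List[str]:
    """Prefix each sentence with up to `window` preceding sentences."""
    inputs = []
    for i, sentence in enumerate(sentences):
        # Slice of the previous sentences (empty for the first sentence).
        context = sentences[max(0, i - window):i]
        inputs.append(SEP.join(context + [sentence]))
    return inputs


doc = ["María perdió sus gafas.", "Estas son sus gafas."]
for line in with_context(doc):
    print(line)
```

With the previous sentence attached, the second input carries the information ("María") needed to resolve “sus”.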
Quality estimation (QE) is a machine learning technology that automatically assigns a quality or risk assessment to MT output without having access to a human-generated reference translation. It’s been around for a while, but only a handful of companies are using or experimenting with QE in production environments. It’s very likely that in 2020, QE for machine translation will be productized at scale, and that we will see the rise of truly hybrid MT-QE systems.
It’s also very probable that MT users who can afford to either discard or correct sentences that were translated badly by MT will start seriously investing in QE to boost the overall quality of their MT output. This year, there are finally both relatively stable open-source and commercial solutions, which significantly lower the barrier to experimenting with QE technology.
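The discard-or-correct workflow can be pictured as a simple triage over QE scores. This is a minimal sketch under stated assumptions: the thresholds are arbitrary, and `toy_score` is a hypothetical stand-in (a crude length ratio), not a real QE model and not the API of any QE product.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (source sentence, MT output)


def toy_score(source: str, mt: str) -> float:
    """Hypothetical stand-in for a QE model: token length ratio in [0, 1]."""
    a, b = len(source.split()), len(mt.split())
    return min(a, b) / max(a, b, 1)


def triage(pairs: List[Pair],
           score_fn: Callable[[str, str], float],
           accept: float = 0.8,
           reject: float = 0.3) -> Dict[str, List[Pair]]:
    """Route each MT output by its estimated quality score."""
    buckets: Dict[str, List[Pair]] = {"publish": [], "post_edit": [], "discard": []}
    for source, mt in pairs:
        score = score_fn(source, mt)
        if score >= accept:
            buckets["publish"].append((source, mt))
        elif score < reject:
            buckets["discard"].append((source, mt))
        else:
            buckets["post_edit"].append((source, mt))
    return buckets


pairs = [
    ("This is his book .", "Este es su libro ."),
    ("Where is the station ?", "Dónde"),
    ("Thank you very much", "Muchas gracias"),
]
result = triage(pairs, toy_score)
print({name: len(items) for name, items in result.items()})
```

In practice `score_fn` would be replaced by a trained QE model, and the thresholds would be tuned to the cost of an error in the given use case.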
To quote Adam Bittlingmayer from ModelFront, one of the pioneers of QE commercialization: "Knowing when MT is risky or just wrong is the key to reliability, efficient hybrid translation, better evaluation, and more. Now OpenKiwi and ModelFront are making this tech available to all players, just as seq2seq and AutoML did for NMT."
Let’s face it: there is a huge elephant in the QE room. BLEU has lost our trust in the NMT era. It doesn’t correlate significantly with human judgement; nor does it account for the synonymy, paraphrasing and clause reordering that NMT is famous for; and it simply ignores both the meaning and structure of sentences. All previous attempts to establish a more meaningful but equally convenient (in terms of speed and ubiquity) metric as a new standard have failed, so BLEU is still widely used as an evaluation and optimization function, not only in academia but also in many industrial applications.
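The paraphrasing problem is easy to see with the clipped n-gram precision that sits at the core of BLEU. The sketch below computes only that component (no brevity penalty, no averaging across n-gram orders), so it is not a full BLEU implementation; the example sentences are invented for illustration.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis: str, reference: str, n: int) -> float:
    """Share of hypothesis n-grams also found in the reference (counts clipped)."""
    hyp_counts = ngrams(hypothesis.split(), n)
    ref_counts = ngrams(reference.split(), n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return overlap / total


reference = "the meeting was postponed until next week"
paraphrase = "they delayed the meeting to the following week"

print(clipped_precision(paraphrase, reference, 1))
print(clipped_precision(paraphrase, reference, 2))
```

The paraphrase conveys the same meaning as the reference, yet it shares only three of its eight words and one of its seven bigrams with it, so a surface-overlap metric scores it poorly.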
However, this year expectations are high: the availability of pre-trained models (BERT, for example) and the moderate success in leveraging ideas from the quality estimation field are opening up pathways for a new generation of translation quality metrics. Unbabel, for example, is working on a new internal metric called “COMET”, which adapts the predictor-estimator architecture of the open-source quality estimation framework OpenKiwi to train a neural reference-based sentence-level MT QE metric. COMET shows high levels of correlation with various types of human judgements, such as Multidimensional Quality Metrics (MQM), edit-distance and direct assessment scores.
In 2019 there were a lot of attempts to train large and very deep Transformer-based models with billions of parameters, but it seems that this trend has now reversed. Those huge models very often turned out to be massively over-parameterized and can often do a much better job when they are seriously compressed in size.
One of the strategies to explore the trade-off between the model size and quality is knowledge distillation. The idea here is to first develop a large “teacher” model and then train and deploy a smaller “student” model, which would mimic the teacher’s behavior. Knowledge distillation has already attracted a lot of attention both in academia and industry and can be instrumental in reducing time and computational power for NMT inference.
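The teacher-student idea can be made concrete with the soft-target loss commonly used for distillation: the student is penalized for diverging from the teacher’s temperature-softened output distribution. This is a minimal, framework-free sketch for a single prediction; the logits and the temperature are made-up values, and the temperature-squared scaling often applied in practice is omitted. NMT systems frequently use other variants, such as training the student on the teacher’s translations.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float],
                      teacher_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL divergence from the teacher's soft targets to the student's distribution."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)


teacher_logits = [4.0, 1.0, 0.5]   # illustrative scores over a 3-word vocabulary
student_logits = [2.0, 2.0, 0.5]

print(distillation_loss(student_logits, teacher_logits))
print(distillation_loss(teacher_logits, teacher_logits))
```

The loss falls to zero when the student reproduces the teacher’s distribution exactly, which is the behavior the training procedure pushes the smaller model towards.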
Another very fresh and promising idea is the Lottery Ticket Hypothesis, articulated as: "dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations." This establishes a generalized heuristic foundation for new ways of reducing the size and complexity of deep learning models (including MT models) without the need for training an initial large network that is then distilled or pruned with existing methods. Lottery Ticket training is still at the research stage. However, it’s possible that by the end of 2020 we will see a number of production-ready frameworks that significantly speed up MT training. It seems that the time of small and efficient neural networks is coming!
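To make “winning ticket” concrete: in the original experiments, as I understand them, tickets were found by training the dense network, pruning its smallest-magnitude weights, and resetting the surviving weights to their initial values before retraining. The sketch below shows only the pruning and reset step on a flat list of weights; the weight values and function names are illustrative assumptions, and training itself is left out.

```python
from typing import List


def magnitude_mask(trained: List[float], keep_fraction: float) -> List[int]:
    """Mask that keeps the largest-magnitude trained weights (ties may keep extra)."""
    keep = max(1, round(len(trained) * keep_fraction))
    threshold = sorted((abs(w) for w in trained), reverse=True)[keep - 1]
    return [1 if abs(w) >= threshold else 0 for w in trained]


def winning_ticket(initial: List[float],
                   trained: List[float],
                   keep_fraction: float) -> List[float]:
    """Reset surviving weights to their initial values and zero out the rest."""
    mask = magnitude_mask(trained, keep_fraction)
    return [w0 * m for w0, m in zip(initial, mask)]


initial = [0.1, -0.2, 0.3, -0.4]     # illustrative random initialization
trained = [0.05, -0.9, 0.01, 0.7]    # illustrative weights after training

print(magnitude_mask(trained, 0.5))
print(winning_ticket(initial, trained, 0.5))
```

The resulting sparse network would then be retrained in isolation; the open research question is how to find such subnetworks cheaply, ideally without paying for the dense training run first.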
Data shift is a curse for all production machine learning systems, and MT is no exception. Data shift occurs when there is a discrepancy between the data the MT system was trained on and the data it is asked to translate. There are solutions on the market which control the correctness and fluency of source texts, but there are no systems that can notify users (or a higher-level application) that there is a difference between the data that the model expects to see as its input and what it really sees.
As MT solutions become more integrated and verticalized, we will see more examples of more complete source data analysis. One example is a solution that informs users or downstream applications that the MT can fail because of a problem with the source texts (e.g. poorly-written English, or texts from a different domain). Another is a solution that automatically corrects user input and selects the MT configuration most suitable for the content for which the translation was requested.
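A very rough version of such a warning can be built from the out-of-vocabulary rate: the share of incoming tokens never seen in the training data. This is a deliberately naive sketch; the whitespace tokenization, the toy corpus and the 0.3 threshold are all assumptions, and a production system would more likely compare richer statistics or learned representations.

```python
from typing import Iterable, Set


def build_vocab(corpus: Iterable[str]) -> Set[str]:
    """Collect the lowercased whitespace tokens seen in the training data."""
    return {token for line in corpus for token in line.lower().split()}


def oov_rate(text: str, vocab: Set[str]) -> float:
    """Fraction of tokens in `text` that never occurred in the training data."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token not in vocab for token in tokens) / len(tokens)


def shift_warning(text: str, vocab: Set[str], threshold: float = 0.3) -> bool:
    """Flag input whose vocabulary differs strongly from the training data."""
    return oov_rate(text, vocab) > threshold


training_corpus = ["the hotel room was clean", "the room has a sea view"]
vocab = build_vocab(training_corpus)

print(shift_warning("the room was clean", vocab))
print(shift_warning("patient shows acute symptoms", vocab))
```

A downstream application could use the flag to route the text to a different MT configuration or to warn the user before translating.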
One of the nice features of NMT is its ability to handle translation between more than one language pair and domain in a single system. Previous research has shown that multilingual training brings benefits to low-resource languages, and can be especially effective for similar languages. On the other hand, the accuracy of translation for language pairs with rich linguistic resources can be negatively impacted by a multilingual architecture.
These findings pave the way for:
(a) a more effective and efficient use of available linguistic resources, by increasing the average quality of the multilingual system when applied deterministically, and
(b) decreasing the number of models handled in production.
The same logic may be applied to multiple domains, and it’s very likely that in 2020 we will see more multilingual and multi-domain NMT configurations moving from labs to industry.
So far the vast majority of modern MT systems remain static once deployed in production. As a result, they can’t react quickly to the changing context. Also, with the growth of the client base, the number of MT models handled in production increases dramatically, as MT providers need to handle hundreds if not thousands of MT models for multiple language pairs, domains, content types, clients and brands.
It’s unlikely that a single technology could help solve those two problems, but we are already seeing a lot of effort towards what is called dynamic adaptation. For example, translation pieces and other forms of leveraging translation memories can call up previously-seen translation examples and instruct the MT to use them explicitly during translation. Another example is instance weighting, which forces the MT system to give more attention to the most recent or closer-to-the-target-domain content in order to make translations more relevant to the use case in question. In 2020, we will very probably see more examples of adaptive MT driven primarily by the industry.
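Instance weighting by recency can be sketched as a per-example weight that decays with the age of the training example and scales its contribution to the loss. The exponential decay and the 30-day half-life are assumptions chosen for illustration; weighting by closeness to the target domain would follow the same pattern with a different weight function.

```python
from typing import List


def recency_weights(ages_in_days: List[float], half_life: float = 30.0) -> List[float]:
    """Exponentially decaying weights: an example loses half its weight per half-life."""
    return [0.5 ** (age / half_life) for age in ages_in_days]


def weighted_loss(losses: List[float], weights: List[float]) -> float:
    """Weighted average of per-example losses."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(loss * weight for loss, weight in zip(losses, weights)) / total


ages = [0.0, 30.0, 60.0]       # days since each training example was collected
losses = [1.0, 2.0, 4.0]       # illustrative per-example training losses

weights = recency_weights(ages)
print(weights)
print(weighted_loss(losses, weights))
```

Recent examples dominate the average, so gradient updates computed from this loss pull the model towards the newest content.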
This list is by no means exhaustive. For example, I’ve not mentioned pre-trained language models, such as BERT and its family and XLNet, as applied to MT. There is a high chance that we will see some papers describing algorithms that combine pre-trained models and neural MT in a single system as early as WMT and EMNLP 2020. However, with some exceptions, BERT and the like have not demonstrated success in improving the quality of MT so far.
All these near-to-production efforts suggest that MT is moving to a new quality level - faster, easier to experiment with, more collaborative and interactive. In the post-COVID world, it will be as important as ever to stay on top of emerging technology trends in order to make the right investment decisions and keep your business in shape for the next round of global changes.
Maxim Khalilov is currently head of R&D at Glovo, a Spanish on-demand courier service unicorn. Prior to that he was a director of applied artificial intelligence at Unbabel, a company disrupting the customer service market with machine translation, and worked as a product owner in data science at Booking.com, responsible for the collection and exploitation of digital content for the hospitality market. Maxim is also a co-founder of the Natural Language Processing company NLPPeople.com, has a Ph.D. from Polytechnic University of Catalonia (Barcelona, 2009), an MBA from IE Business School (Madrid, 2016) and is the author of more than 30 scientific publications.