The Future is Multilingual, Multimodal, Massive

After three years of travel restrictions, the TAUS Massively Multilingual Conference made its way back to San Jose, California on 11-13 October 2022. TAUS Founder and CEO Jaap van der Meer gave the opening keynote, addressing the changes the world has gone through in the last few years: from major geolinguistic shifts, the fragmentation of globalization, and rising populist movements to more migration than ever. Referencing the debate between Nicholas Ostler and Lane Greene at the 2013 TAUS event about whether English or MT was the new lingua franca, he asked where English stands as a language today, given all these changes. Without a doubt, the world is massively multilingual today, and technology should focus on new ways of supporting this new world. Jaap asked “how do we support a world with total diversity and inclusion?” and answered, “we need a lot of data and humans in the loop.”

In fact, this opening speech summed up the main takeaway of the whole conference. But let’s dive deeper into the other themes and topics that speakers and attendees kept returning to.

Multilinguality and Scale

To get everyone on the same page before the discussions, we started with a mini-lecture on what everyone should know about encoders, decoders, transformer models, and other key concepts in the world of MT. It was given by John DeNero, Co-founder and Chief Scientist at Lilt and professor at UC Berkeley, who has a unique way of simplifying complex concepts. Following this learning session, he was joined on stage by Hany Hassan Awadalla (Microsoft), Paco Guzmán (Meta), and Macduff Hughes (Google). This panel highlighted multilinguality and scale in NMT and talked about the rise of single models that can translate 1000+ language pairs, including dialects, styles, etc. This tied in nicely with the Integrative AI models, similar to massively multilingual MT models, presented by Hany Hassan Awadalla. It was interesting to see the focus shift from language-pair-specific models to massively multilingual models.

DeMT™ Evaluate API

New models keep entering the MT scene, but how do we evaluate their performance and understand how the datasets we use impact them? The next session was all about the DeMT™ Evaluate API, presented by Anderson Vaz (CTO at TAUS) and Achim Ruopp (Director at Polyglot Technology). TAUS takes a data-first approach to enhancing MT engines: with TAUS DeMT™ Translate, MT engines are trained in real time with clean, domain-specific datasets to generate improved outputs. Achim Ruopp shared that, based on his independent analysis of the TAUS DeMT™ results, he concluded that TAUS DeMT™ improves the BLEU score for all language pairs by more than 10 points, or 25% on average and by more than 5 points at minimum over the worst-performing engine, or 11% on average, and by one point at minimum over the best-performing engine. Now, with the DeMT™ Evaluate API, users can get essential information on the quality of a given translation at the segment level. Because the models can be customized with users’ data, the results will match the organization’s expectations for accuracy, tone, and authenticity.
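The analysis above quotes gains both in BLEU points and as percentages. As a rough illustration of how the two figures relate, here is a minimal Python sketch. The function name `bleu_gain` and the scores used are hypothetical and are not taken from the analysis itself.

```python
from typing import Tuple


def bleu_gain(baseline: float, improved: float) -> Tuple[float, float]:
    """Return (absolute gain in BLEU points, relative gain in percent)."""
    if baseline <= 0:
        raise ValueError("baseline BLEU must be positive")
    points = improved - baseline          # absolute difference in BLEU points
    percent = 100.0 * points / baseline   # the same gain relative to the baseline
    return points, percent


# Hypothetical scores for illustration only (not from the analysis).
points, percent = bleu_gain(baseline=40.0, improved=50.0)
print(f"+{points:.1f} BLEU points, {percent:.1f}% relative")
```

The same absolute gain is a larger relative gain over a weaker baseline, which is why point and percentage figures are best read together.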

Standardizing MT Evaluation Methods

This presentation was followed by a panel of quality evaluation experts: Markus Freitag (Google), Gretchen Markiewicz (Raytheon BBN), and Philipp Koehn (Meta AI), moderated by Olga Beregovaya (Smartling). The main message of this panel was how hard it is to standardize evaluation methods when even humans cannot agree on how to annotate an error. The same error might be classified as a morphology error by one person and a grammatical error by another. The goal is often to establish a correlation between automatic evaluation criteria and human judgment. Another issue that adds to the challenge, the panel said, is evaluating sentences in their given context rather than in isolation. NLP systems keep improving, but the panel questioned whether crowdworkers are qualified enough to evaluate these outputs and touched upon the issue of bias that produces prejudiced models.
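To make the idea of correlating automatic scores with human judgment concrete, here is a minimal Python sketch that computes a Pearson correlation over segment-level scores. It is only an illustration, not a method the panel described: the helper name `pearson` and all the scores are made up.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical segment-level scores: an automatic metric vs. human ratings.
metric_scores = [0.62, 0.48, 0.81, 0.35, 0.70]
human_ratings = [4.0, 3.0, 5.0, 2.0, 4.0]
print(f"Pearson r = {pearson(metric_scores, human_ratings):.3f}")
```

A value close to 1 would suggest the metric ranks segments much as the human raters do. When annotators disagree with each other, as the panel noted, the human scores are themselves noisy, which limits how high this correlation can go.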

Multilinguality in Enterprise Solutions

On the business side, we heard about enterprise solutions from Watson Srivathsan (Amazon AI), Sebastian Stüker (Zoom) and Murali Nathan (Avery Dennison), each focusing on multilinguality and the expansion of machine translation/AI services into more languages.

Transformation and Scale

The second business panel featured Wayne Bourland (Dell), Zhenhui Chao (VMware), and Loïc Dufresne de Virel (Intel), moderated by Renato Beninatto (Nimdzi Insights). With 800+ platforms and technologies available in the industry, it is challenging for big organizations to match each request with the right solution. Each panelist explained the specific issues they face: delivering content in languages they do not often work in, source analysis and process optimization, identifying use cases where AI can contribute, and new use cases emerging daily. They agreed that AI has great potential to help with all of these challenges. With new layers of complexity such as inclusivity in content added to the mix, they strive to connect with their organization’s KPIs and sell AI to internal services at the same time. Wayne stated that their goal is to deliver scale to the company by transforming existing content into another language rather than just translating it.

Language Expansion into the Long Tail

The most repeated phrase throughout the two days was “multilingual models”, which inevitably brought us to the topic of language expansion and going beyond the traditional languages. In the panel conversation with Vedanuj Goswami (Meta), Simona Beccaletto (TAUS), and Casper Grathwohl (Oxford Languages), moderated by Gráinne Maycock (Acolad), the main themes were generating data in long-tail languages, the ethical and operational challenges of crowdsourcing such data, and a look into Meta’s ongoing research project on scaling human-centered machine translation. The audience asked many questions about the crowdsourcing aspect of language expansion projects. Simona provided detailed insights into the TAUS HLP Platform, including the ethical and operational challenges of forming communities to do the data work and the types of tasks they perform on the platform.

Making Large Models Smaller

Next on stage were the NLP experts András Aponyi (TAUS), Adam Bittlingmayer (ModelFront), and Sunil Mallya (Flip AI), moderated by JP Barraza (Systran). They discussed transfer learning, transformers, low-code tools for NLP creation, multilingual NLP, combining supervised and unsupervised ML, training models with reinforcement learning, automating customer service, and content moderation: social media monitoring, sentiment analysis, and detecting fake news, misinformation and cyber-bullying. They highlighted the efforts around building a single model to translate a great number of languages, namely multilingual models. Sunil Mallya put it nicely by saying “our challenge now is to make large models smaller”. The discussion with the audience brought up the emergence of new modalities beyond text and text data. The industry now has to work on multimodal models, which creates a need for data in different modalities. Building models requires a lot of resources and generates carbon emissions.

World-Readiness Contest

One of the exciting parts of the conference was the World-Readiness Contest, in which the following presenters pitched their solutions: Konstantin Savenkov (Intento), Adam Bittlingmayer (ModelFront), Zak Nyberg and Ana De Agostini (The Church of Jesus Christ of Latter-Day Saints), Achim Ruopp (Polyglot Technology), Mei Zheng (Smartling), JP Barraza (Systran), Todd Flaska (Lingoport), and Karni Berlad Cohen (Lexicala). The audience was then asked to vote on the impact of the solutions presented according to several criteria: scalability, quality, interoperability, language data, and business impact. Based on the final votes, the winner was Mei Zheng from Smartling with her presentation on Quantifying Quality with Translation Memories.

New Ideas and Heightened Awareness

These two days of insightful conversation took place in a living room setup on stage, in tune with the work-from-home trend that has been the reality for most of us in this industry over the last couple of years. From the conference hall to the opening reception sponsored by SYSTRAN and the networking dinner sponsored by Smartling, held at a beautiful winery overlooking a breathtaking sunset, I am confident in saying that the attendees had a fulfilling and inspiring experience. They took home a suitcase full of new ideas and a heightened awareness of what is in store for the industry in the days to come.

Notes from the TAUS Massively Multilingual Conference 2022