Every company sits on a mountain of language data in translation memories and content management systems. But that data is locked up in legacy formats and templates, which makes it hard to access and of limited use in modern machine translation scenarios...
Agenda
Problems in data (why cleaning is required)
The available tools and their limitations
Cleaning based on sentence embeddings (LASER, LaBSE)
Comparison with examples
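
As a rough illustration of the embedding-based cleaning item in the agenda, the sketch below keeps a translation pair only when the source and target vectors are close in cosine similarity. It is a minimal sketch under stated assumptions, not the method presented in the talk: the function names `cosine` and `filter_pairs`, the threshold of 0.7, and the three-dimensional toy vectors are all hypothetical. In practice the vectors would come from a multilingual encoder such as LASER or LaBSE.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_pairs(pairs, src_vecs, tgt_vecs, threshold=0.7):
    """Keep the pairs whose source and target embeddings are similar enough.

    The threshold is an illustrative assumption; a real value would be
    tuned on a sample of the data being cleaned.
    """
    kept = []
    for pair, u, v in zip(pairs, src_vecs, tgt_vecs):
        if cosine(u, v) >= threshold:
            kept.append(pair)
    return kept


# Toy data: the second pair is a misaligned segment (a page footer as "translation").
pairs = [
    ("Hello world", "Hallo Welt"),
    ("Save the file", "Seite 3 von 10"),
]
# Placeholder vectors standing in for real multilingual sentence embeddings.
src_vecs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
tgt_vecs = [[0.8, 0.2, 0.1], [0.9, 0.0, 0.3]]

clean_pairs = filter_pairs(pairs, src_vecs, tgt_vecs)
print(clean_pairs)
```

With these toy vectors the misaligned second pair falls below the threshold and is dropped, while the first pair is kept.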