TAUS has just published a new report on the role of language data in the AI paradigm – LD4AI. It explores the origins and scale-up of language data moderation in translation pipelines that are driven by machine learning and supported by “humans in the loop.” One finding is that large-scale data management will expand the kinds of jobs required. In this respect, it may be useful to understand how language also produces data beyond the moment of translation, which will likely foster new types of work for language professionals. Let's look a little closer at why “language data” is a richer concept than you might think.
Language becomes data in two distinct ways – let’s call them HLD (Human Language Data) and DLD (Digital Language Data).
When we use DLD to drive a translation process, we select a body of bilingual text and use it to train an algorithm to find patterns in the data. The machine can then help translate a new source text into the target language for the same language pair.
To improve quality and machine-readability, we first clean up and enrich the source data by tagging phenomena such as untranslatables or ambiguous expressions, removing any unwanted gender or racial references, annotating named entities, and so on. This human-moderated source data is then ready to enter the machine process, which learns from these data points and translates them appropriately into another language. Data moderation therefore optimizes DLD for a machine learning or AI operation.
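As a rough illustration of the moderation step described above, here is a minimal Python sketch that attaches "do not translate" and named-entity tags to a source segment. Everything in it is an assumption made for illustration: the `Segment` and `Annotation` classes, the `annotate` function, the label names, and the tiny hard-coded term lists are hypothetical and do not come from the LD4AI report. A real pipeline would rely on curated glossaries, trained recognizers, and human review.

```python
import re
from dataclasses import dataclass, field

# Hypothetical term lists, for illustration only.
UNTRANSLATABLES = {"TAUS", "LD4AI"}      # terms to leave untouched in translation
NAMED_ENTITIES = {"Europe": "LOC"}       # entity text -> entity type


@dataclass
class Annotation:
    start: int   # character offset where the tagged span begins
    end: int     # character offset just past the tagged span
    label: str   # e.g. "DO_NOT_TRANSLATE" or "ENTITY:LOC"


@dataclass
class Segment:
    text: str
    annotations: list = field(default_factory=list)


def annotate(text: str) -> Segment:
    """Tag untranslatables and named entities in one source segment."""
    segment = Segment(text)
    for term in UNTRANSLATABLES:
        for match in re.finditer(rf"\b{re.escape(term)}\b", text):
            segment.annotations.append(
                Annotation(match.start(), match.end(), "DO_NOT_TRANSLATE")
            )
    for name, kind in NAMED_ENTITIES.items():
        for match in re.finditer(rf"\b{re.escape(name)}\b", text):
            segment.annotations.append(
                Annotation(match.start(), match.end(), f"ENTITY:{kind}")
            )
    # Keep annotations in reading order so a human reviewer can scan them.
    segment.annotations.sort(key=lambda a: a.start)
    return segment


if __name__ == "__main__":
    seg = annotate("TAUS published LD4AI in Europe.")
    for a in seg.annotations:
        print(a.label, repr(seg.text[a.start:a.end]))
```

Storing tags as character offsets alongside the untouched text, rather than rewriting the text itself, is one possible design choice: it lets a human moderator correct or remove a tag without altering the source segment.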
Data as Signals
Back in the social world of encounters between content, people and language, that same translated content will have a particular impact on each human reader. For them, language is not the mass of word embeddings and vectors familiar to neural MT engineers; it is a medium for messaging in a specific human tongue for some further purpose: informing, engaging, seducing, evaluating, making decisions, entertaining. And telling lies.
Human language, in other words, is always grounded in speech acts that create various psychological effects. And in today’s online life, the reactions of readers and listeners to a text – slowing down or speed-reading, hesitating over an unknown word, eyeballing a certain proper name for more than two seconds, “liking” it, requoting it, and so on – all become useful new data for the publisher of that text. These reactions are not the producer’s language data in our LD4AI sense, but information about a receiver’s behavior that signals attitude and sentiment, engagement or rejection. Surveillance data, if you prefer, though the term has dark connotations.
Signals as Data
Surveillance of this type is a constant in our own conversations: we instinctively scan each other’s faces and body language to spot signs of assent, discord, doubt, collusion, or rejection. We have evolved to be alert to unusual word choices, voice tones, and hesitations. When we scan a tweet we note tell-tale signs in the humor, register, misspellings, or word choices. Not all these micro-signals are encoded clearly in the language, but they are easily inferred from the overall communicative experience. Indeed, one of the distinguishing marks of digital network life as a whole has been the automated surveillance of content for signs that yield useful data for other purposes. This is especially true for our acts of speaking and writing, reading and listening. Even silence can speak volumes...
Now that content owners, marketers, communicators, and internauts globally are all able to draw more insights from tracking the reactions of reader-users to the varied signals encoded in acts of language, they will inevitably attempt to control the game by designing forms of language communication that amplify the desired signals. The aim is to optimize such audience reactions, even weaponize them. This applies not only to written text but, even more effectively, to the spoken language now spreading through all our new voice channels. This form of DLD will also expand the range of potentially translatable content.
As part of this transition to LD4AI, therefore, we are entering a virtuous circle of mutual reinforcement between data and signals. Translation suppliers are already providing language data moderation services to better inform the machines that speak, write and translate; their journey may soon include harvesting new types of speech, signed and text data derived from human reactions to their clients’ translated content as well.
Long-time European language technology journalist, consultant, analyst and adviser.