Domain Adaptation: Types and Methods

19/12/2022

Domain adaptation can be classified into three types (supervised, semi-supervised, and unsupervised) and three methods (model-centric, data-centric, and hybrid).

Data-hungry neural models require large amounts of labeled data, which are still often unavailable, and in some domains and languages even unlabeled data is scarce. In addition, variation across domains makes it difficult to apply machine learning models trained on data from one domain to data from another. Together, these factors considerably reduce the portability of many NLP models. To address this challenge, various methods of domain adaptation have been proposed and adapted for many natural language processing applications.

Domain adaptation is a sub-discipline of transfer learning that deals with scenarios in which a statistical or neural model trained on a source distribution is used in the context of a different (but related) target distribution. When this happens, we usually speak of a domain shift. 

Technically, domain shift is a violation of the general principle in (supervised) machine learning that the training dataset should be drawn from the same distribution as the previously unseen data instances on which the trained classifier will make predictions. In simple terms, this means that the training and test sets must be sufficiently similar to each other.

For instance, if the goal is to predict the sentiment of tweets using an ML algorithm, the algorithm should be trained on tweets in the first place, as opposed to some other kind of text such as news articles or movie reviews. When this principle is violated, the performance of learning algorithms is expected to drop significantly, because they can no longer generalize beyond the training data.
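One rough way to make domain shift visible is to compare the word distributions of two text collections. The sketch below is only an illustration, not a method from this article: it assumes whitespace tokenization, uses invented toy corpora, and picks Jensen-Shannon divergence as one possible distance measure. The function names are hypothetical.

```python
import math
from collections import Counter


def unigram_distribution(texts):
    """Return a word -> relative frequency mapping for a list of texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    0.0 means identical distributions; 1.0 means no shared vocabulary.
    """
    divergence = 0.0
    for word in set(p) | set(q):
        p_w, q_w = p.get(word, 0.0), q.get(word, 0.0)
        m_w = 0.5 * (p_w + q_w)  # mixture of the two distributions
        if p_w > 0:
            divergence += 0.5 * p_w * math.log2(p_w / m_w)
        if q_w > 0:
            divergence += 0.5 * q_w * math.log2(q_w / m_w)
    return divergence


# Toy corpora, invented for illustration only.
tweets = ["omg this phone is so good", "ugh this update is so bad"]
news = ["the company reported strong quarterly earnings",
        "the government announced a new policy"]

tweet_dist = unigram_distribution(tweets)
news_dist = unigram_distribution(news)

same_domain = js_divergence(tweet_dist, tweet_dist)   # identical distributions
cross_domain = js_divergence(tweet_dist, news_dist)   # different domains
```

A higher value for the cross-domain comparison than for the same-domain one is a hint, not proof, that a model trained on one collection may struggle on the other.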

It is important to note here that the word “domain” is rather loosely defined within the NLP community. Most often it refers to some coherent kind of text collection, such as texts that can be grouped together according to topic, style, genre, or linguistic register. In this article, we are not going to attempt to define the word “domain” any further, but rather rely on this existing, loose definition. In addition, the concepts “source domain” and “target domain” usually refer to the domain on which a given ML model is trained and the domain with a different distribution on which it is tested, respectively.

Types of Domain Adaptation

Domain adaptation approaches fall into three categories according to the level of supervision used during training. This mirrors the standard three-way categorization of machine learning models along the same axis.

  1. In the supervised setting, labeled data is available for both the source and the target domain. In practice, when domain adaptation techniques are needed, a large amount of labeled target data is usually not available.
  2. In the semi-supervised setting, a large amount of labeled data is available in the source domain, but only a much smaller amount of labeled data is available from the target domain, often alongside unlabeled target data.
  3. In unsupervised domain adaptation, no labels are available for the target domain: the model has to learn from labeled source data and unlabeled target data only. This makes the setting the most similar to real-world scenarios.

Domain Adaptation Methods

Domain adaptation can be further divided into categories according to the method used to transfer knowledge from the source to the target domain. These approaches can be classified as model-centric, data-centric, or hybrid.

Model-centric approaches achieve domain adaptation by redesigning parts of the model. They include feature-centric and loss-centric methods.

The following are the most prominent feature-centric approaches.

  • Feature augmentation: e.g. the use of pivots (features that are common across source and target domains) to find an alignment between the two domains.
  • Feature generalization: the data is projected into a lower-dimensional feature space, which is computed based on the features of both domains. The resulting latent representations can then be used to transfer knowledge from one domain to the other. Autoencoders are commonly used for this purpose: they are neural networks that learn such representations efficiently by producing intermediate encodings from which they can regenerate their input.
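The idea of pivots can be sketched in a few lines. The example below is a simplified assumption, not a full pivot-based method: it treats a word as a pivot when it appears in at least `min_count` texts of both domains. The helper names, the threshold, and the toy reviews are invented for illustration.

```python
from collections import Counter


def document_frequency(texts):
    """Count in how many texts each word appears."""
    counts = Counter()
    for text in texts:
        counts.update(set(text.lower().split()))
    return counts


def select_pivots(source_texts, target_texts, min_count=2):
    """Return words seen in at least `min_count` texts of BOTH domains."""
    source_df = document_frequency(source_texts)
    target_df = document_frequency(target_texts)
    shared = set(source_df) & set(target_df)
    return sorted(word for word in shared
                  if source_df[word] >= min_count
                  and target_df[word] >= min_count)


# Toy data: movie reviews as source, product reviews as target.
source_reviews = ["a great movie with a great cast",
                  "boring plot and a terrible ending",
                  "great acting but terrible pacing"]
target_reviews = ["great battery and a sharp screen",
                  "terrible battery life",
                  "great value but terrible support"]

pivots = select_pivots(source_reviews, target_reviews)
print(pivots)  # ['great', 'terrible']
```

Words such as "great" and "terrible" behave the same way in both domains, so they can serve as anchors for aligning domain-specific features (for example "plot" and "battery") that co-occur with them.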

Loss-centric methods focus on altering the loss function of the model in some way:

  • Domain adversaries: inspired by generative adversarial networks (GANs), these methods train the model to produce feature representations from which a domain classifier cannot tell whether an input came from the source or the target domain, which reduces the differences between the two domains. Domain adversarial neural networks (DANNs) have been applied in a variety of NLP tasks, including sentiment analysis, language identification, relation extraction, and stance detection.
  • Reweighting: this approach is based on the idea that weights can be assigned to individual data instances in the source domain based on their proportional similarity to the target domain. Instances can also be discarded unless they meet a particular threshold of relevance in both domains.
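Reweighting can be illustrated with a minimal sketch. It assumes, purely for illustration, that similarity to the target domain is measured as the cosine between a source text's bag of words and the summed word counts of the target texts. The function names and the threshold value are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Word counts for a whitespace-tokenized, lowercased text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count mappings."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reweight_source(source_texts, target_texts, threshold=0.1):
    """Weight each source text by its similarity to the target domain.

    Returns (text, weight) pairs; texts below `threshold` are discarded.
    """
    target_profile = Counter()
    for text in target_texts:
        target_profile.update(bag_of_words(text))
    weighted = []
    for text in source_texts:
        weight = cosine(bag_of_words(text), target_profile)
        if weight >= threshold:
            weighted.append((text, weight))
    return weighted


# Toy data, invented for illustration only.
source = ["the screen is great", "the plot was dull",
          "stock markets fell sharply"]
target = ["the battery is great", "the screen is dim"]

weighted_source = reweight_source(source, target)
```

In a real training loop, each weight would typically scale the loss of its instance, so that source examples resembling the target domain contribute more to the update.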

On the other hand, data-centric methods make use of certain aspects of the data rather than changing the model architecture or its loss function.

  • Pseudo-labeling refers to the process of using a model that was trained on labeled data to automatically predict labels for a data set which is then treated as a kind of “pseudo-gold standard”. Labels generated this way are called “pseudo” or “proxy” labels.
  • Data selection aims to select the data from the source domain which most closely matches the target domain. Although this area of research is relatively unexplored, it has been applied in machine translation before.
  • Pre-training is perhaps the most popular domain adaptation method in NLP today. Ever since large, Transformer-based pre-trained models became available a few years ago, fine-tuning these general models to more specific tasks using small amounts of labeled data has become standard practice. Pre-training has been shown to work very well in a variety of applications, but many open questions and challenges still remain.
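Pseudo-labeling, the first item above, can be sketched as follows. The word-count "model" here is a deliberately tiny stand-in for a real classifier, and the confidence threshold of 0.8 is an arbitrary assumption. All names and data are invented for illustration.

```python
from collections import Counter, defaultdict


def train(labeled_texts):
    """Toy model: per-label word counts from (text, label) pairs."""
    model = defaultdict(Counter)
    for text, label in labeled_texts:
        model[label].update(text.lower().split())
    return model


def predict(model, text):
    """Return (label, confidence).

    Confidence is the winning label's share of all word matches,
    or 0.0 (with label None) if no word is known to the model.
    """
    words = text.lower().split()
    scores = {label: sum(counts[word] for word in words)
              for label, counts in model.items()}
    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / total


def pseudo_label(model, unlabeled_texts, min_confidence=0.8):
    """Keep only the predictions the model is confident about."""
    pseudo = []
    for text in unlabeled_texts:
        label, confidence = predict(model, text)
        if label is not None and confidence >= min_confidence:
            pseudo.append((text, label))
    return pseudo


# Labeled source domain (movies) and unlabeled target domain (products).
source = [("great movie", "pos"), ("terrible movie", "neg")]
target_unlabeled = ["great phone", "terrible battery",
                    "movie night", "new phone"]

model = train(source)
pseudo = pseudo_label(model, target_unlabeled)

# Retrain on source data plus the confident pseudo-labels.
adapted_model = train(source + pseudo)
```

After retraining, the toy model has seen target-domain words such as "phone" and "battery". The confidence filter matters because wrong pseudo-labels would otherwise be learned as if they were correct.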

Another possibility is to combine training data from multiple source domains, which can increase the chances that a model performs well on a different target domain. This approach is known as multi-source domain adaptation.

Finally, hybrid models combine model-centric and data-centric approaches, and they are currently being studied extensively.

Conclusion

Domain adaptation offers a large variety of techniques that can help increase the performance of NLP models in scenarios where little or no training data is available for the target domain. By bridging the gap between source and target domains, these methods are increasingly being used to produce more and more efficient NLP applications.

Author

Anne-Maj van der Meer is a marketing professional with over 10 years of experience in event organization and management. She has a BA in English Language and Culture from the University of Amsterdam and a specialization in Creative Writing from Harvard University. Before her position at TAUS, she was a teacher at primary schools in regular as well as special needs education. Anne-Maj started her career at TAUS in 2009 as the first TAUS employee, where she became a jack of all trades, taking care of bookkeeping and accounting as well as creating and managing the website and customer services. For the past 5 years, she has worked as Events Director, chief content editor, and designer of publications. Anne-Maj has helped organize more than 35 LocWorld conferences, where she takes care of the program for the TAUS track and hosts and moderates these sessions.
