Data for AI

Train your LLMs and MT engines with high-quality domain-specific multilingual corpora, carefully curated by TAUS data experts. Explore our offering below or get in touch to obtain the data you need.

Grow your model with TAUS quality data

TAUS offers developers of AI models, LLMs and MT engines a core collection of 7.4 billion words of high-quality multilingual training data, covering 483 language pairs, at very attractive prices.

Download Data Catalog & Pricing

Customize your model with TAUS domain-specific datasets

TAUS has a library of hundreds of domain-specific datasets in dozens of languages ready to be used for the customization of your MT engines and LLMs. If we don’t have the dataset for your domain or use case, we can create the right training dataset on demand.

Contact TAUS to request a list of domain-specific datasets

Expand your model into new languages

TAUS offers off-the-shelf colloquial datasets in 26 low-resource languages. Through the Human Language Project (HLP) platform and its global community of over 40,000 HLP workers, TAUS can deliver datasets in under-resourced languages and domains on demand.

Contact TAUS to request a list of low-resource datasets