Explore Our Language Data Repository - 7.4B Words, 450+ Pairs

Data for AI

Train your LLMs and MT engines with high-quality domain-specific multilingual corpora, carefully curated by TAUS data experts. Explore our offering below or get in touch to obtain the data you need.

Grow your model with TAUS quality data

TAUS offers a core collection of 7.4 billion words (483 language pairs) of high-quality multilingual training data at very attractive prices to developers of AI models, LLMs and MT engines.

Download Data Catalog & Pricing

Customize your model with TAUS domain-specific datasets

TAUS has a library of hundreds of domain-specific datasets in dozens of languages ready to be used for the customization of your MT engines and LLMs. If we don’t have the dataset for your domain or use case, we can create the right training dataset on demand.

Contact TAUS to request a list of domain-specific datasets

Expand your model into new languages

TAUS offers off-the-shelf colloquial datasets in 26 low-resource languages. Through the Human Language Project (HLP) platform and its global community of over 40,000 HLP workers, TAUS can deliver datasets in under-resourced languages and domains on demand.

Contact TAUS to request a list of low-resource datasets

Learn More About Data

Training data is one of the most integral pieces of machine learning and artificial intelligence. Without it, models could not learn, make predictions, or extract useful information.

Learn more

Can't find the data you're looking for?

Get in touch with our NLP team to see if we can help collect the right data for your needs.

Learn more