The amount of training data you need depends on many variables: the model you use, the task you perform, the performance you want to achieve, the number of features available, the noise in the data, the complexity of the model, and more.
While there is no set answer to how much training data a given machine learning application needs, there are some key guidelines. The first rule of thumb is that, in general, the more training data a model has, the better the outcome. The higher the volume of training data, the less likely a model is to overfit, that is, to capture noise in place of the data’s true signal. In other words, more data mainly reduces variance. It does not by itself fix high bias (when a model makes oversimplified assumptions), which usually calls for a more expressive model or more informative features.
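The effect of training set size on overfitting can be illustrated with a small synthetic experiment. The sketch below is an assumption-laden toy, not a benchmark: the ground truth `sin(3x)`, the noise level, the polynomial degree, and the helper name `fit_and_score` are all hypothetical choices made for illustration. It fits the same flexible model to progressively larger samples and prints the gap between training and test error.

```python
import numpy as np

def fit_and_score(n_train, degree=8, noise=0.3, seed=0):
    """Fit a polynomial to n_train noisy samples; return (train_mse, test_mse)."""
    rng_train = np.random.default_rng(seed)
    rng_test = np.random.default_rng(seed + 1)  # same held-out set for every n_train

    # Hypothetical ground truth: y = sin(3x) plus Gaussian noise.
    x_train = rng_train.uniform(-1.0, 1.0, n_train)
    y_train = np.sin(3 * x_train) + rng_train.normal(0.0, noise, n_train)
    x_test = rng_test.uniform(-1.0, 1.0, 2000)
    y_test = np.sin(3 * x_test) + rng_test.normal(0.0, noise, 2000)

    # Least-squares polynomial fit of the given degree.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse

for n in (12, 50, 200, 1000):
    train_mse, test_mse = fit_and_score(n)
    gap = test_mse - train_mse
    print(f"n={n:5d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}  gap={gap:.3f}")
```

With only a handful of samples, a model this flexible tends to fit the noise, so training error looks good while test error does not. As the sample grows, the two errors should move closer together. Plotting these numbers against `n` gives what is usually called a learning curve, and a curve that has flattened out suggests that additional data will help less.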
Next, domain expertise can help you narrow down a suitably sized training set. Training data should ideally be independent and identically distributed, so that it is representative of the data the model will see in use, and the classes should be reasonably balanced to avoid the problems of an imbalanced dataset. Accordingly, the training set should contain enough data to capture all the relationships that may exist, so that the model can effectively map the inputs to the predicted outputs.
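One quick check before training is to look at how the labels are distributed. The following is a minimal sketch using only the Python standard library. The function name `class_balance` and the example labels are hypothetical, and what counts as "too imbalanced" depends on the task.

```python
from collections import Counter

def class_balance(labels):
    """Return each class's share of the dataset and the majority/minority ratio."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    # Ratio of the most frequent class to the least frequent class.
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return shares, imbalance_ratio

# Hypothetical labels for a binary task: 90 negatives, 10 positives.
labels = ["neg"] * 90 + ["pos"] * 10
shares, ratio = class_balance(labels)
print(shares)
print(f"majority/minority ratio: {ratio:.1f}")
```

A large ratio suggests that the minority class may be underrepresented, in which case collecting more examples of that class is likely to matter more than increasing the overall dataset size.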
Lastly, intuition about your chosen machine learning model can help you understand how much training data it needs. While there is no golden rule, some machine learning models are known to need more training data than others. For regression problems, a common suggestion is to have at least ten times as many data points as features. For image classification problems, tens of thousands of images are typically needed to build a robust classifier. For natural language processing problems, tens of thousands of samples are typically needed for the model to see enough variation in text data.
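The ten-to-one heuristic for regression is simple arithmetic, and it can be written as a quick sanity check. This is only a sketch of the rule of thumb stated above. The function name `meets_rule_of_ten` and the `factor` parameter are hypothetical, and passing the check does not guarantee that a dataset is large enough.

```python
def meets_rule_of_ten(n_samples, n_features, factor=10):
    """Check a dataset against the 'factor times as many rows as features' heuristic.

    Returns (passes, required_samples).
    """
    required = factor * n_features
    return n_samples >= required, required

# Hypothetical dataset: 150 rows, 20 features.
passes, required = meets_rule_of_ten(n_samples=150, n_features=20)
print(f"required: {required}, passes: {passes}")
```

Here a dataset with 20 features would need at least 200 rows under the heuristic, so 150 rows falls short. In that situation, the options are to collect more data or to reduce the number of features.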
Husna is a data scientist who studied Mathematical Sciences at the University of California, Santa Barbara. She also holds a master’s degree in Engineering, Data Science from the University of California, Riverside. She has experience in machine learning, data analytics, statistics, and big data. She enjoys technical writing when she is not working and is currently responsible for the data science-related content at TAUS.