How Much Data Do You Really Need?

Deep learning models are notorious for their endless appetite for training data. Acquiring high-quality annotated data consumes many types of resources, most notably cash. The growing volume of data as a machine learning project progresses also leads to other undesired consequences, such as slowing down R&D as a whole. Therefore, veteran project leaders always look at the overall performance gains brought by each additional increment of their dataset. More often than not, especially when the new data is relatively similar to the existing data, one will encounter the phenomenon of Diminishing Returns.

The law of diminishing returns states that as you continuously add more input to a productive process, each addition yields a progressively smaller increase in output. This phenomenon was mentioned by 18th-century economists such as Turgot and Adam Smith, and articulated in 1815 by the British economist David Ricardo. When addressing the influence of training data volume on model performance, the law of diminishing returns suggests that each increment in training set size will tend to contribute less to the predetermined success metric.
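
In the context of learning curves, this behaviour is commonly approximated by a power law in the training set size n. This is a modeling assumption introduced here for illustration, not a claim made above:

```latex
% Assumed power-law learning curve: err(n) is the validation error after
% training on n examples; a, b > 0 are fitted constants and c is the
% irreducible error the model cannot improve past with more data alone.
\mathrm{err}(n) \;\approx\; a\,n^{-b} + c,
\qquad
\frac{\mathrm{d}\,\mathrm{err}}{\mathrm{d}n} \;\approx\; -\,a\,b\,n^{-(b+1)} \;\longrightarrow\; 0
\quad \text{as } n \to \infty .
```

The derivative shrinking toward zero is exactly the diminishing-returns statement: each additional example buys a smaller improvement than the one before it.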

When a project leader can monitor, and even quantify, the diminishing-returns effect in their machine learning project, they gain finer control over it throughout its lifetime. For example: estimating how much data is required to reach the project goal; avoiding redundant training sessions; or even predicting whether the current model architecture will be able to achieve the target metric. This knowledge effectively provides a tool for optimal management of time, manpower, and computing resources.
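
One way to put numbers on this, sketched below under the power-law assumption from the previous section (it is not a method prescribed by this text), is to train the same model on growing subsets of the data, fit the resulting learning curve, and extrapolate it toward the target metric. All dataset sizes and error values in the snippet are made-up illustrative figures.

```python
# Quantifying diminishing returns: fit err(n) ~ a * n**(-b) + c to a few
# measured (training set size, validation error) points, then extrapolate
# the dataset size needed to reach a target error.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """Power-law learning curve: validation error as a function of training set size n."""
    return a * np.power(n, -b) + c

# Hypothetical measurements from training the same model on growing subsets.
train_sizes = np.array([1_000, 2_000, 5_000, 10_000, 20_000], dtype=float)
val_errors  = np.array([0.310, 0.255, 0.205, 0.178, 0.160])

# Fit the curve; the bounds keep a, b, c non-negative so the fit stays sensible.
(a, b, c), _ = curve_fit(
    power_law, train_sizes, val_errors,
    p0=[1.0, 0.5, 0.1], bounds=([0, 0, 0], [np.inf, np.inf, 1.0]),
)

target_error = 0.14
if target_error <= c:
    # The fitted asymptote c is the best this model can reach with more data alone,
    # so the target cannot be met by adding data: consider changing the architecture.
    print(f"Target {target_error} is below the fitted asymptote {c:.3f}.")
else:
    # Invert err(n) = target_error to estimate the required training set size.
    required_n = (a / (target_error - c)) ** (1.0 / b)
    print(f"Estimated examples needed to reach error {target_error}: {required_n:,.0f}")
```

The fitted asymptote also speaks to the architecture question above: if the target metric lies beyond it, no realistic amount of additional similar data is expected to close the gap.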