Any prudent investor carefully weighs investment options. In data science, data is a key ingredient, and any prudent data scientist has to decide how to invest in it. So we asked leading data scientists: How would you spend the next US$1, on more data (quantity) or on better data (quality)?
Figure 1: Data quantity or quality?
The response is crystal clear: Quality beats quantity. So far, so good. However, once we drill deeper to find out how this result translates into practice, the picture gets murky. True to the adage commonly attributed to Peter Drucker, “you cannot manage what you can’t measure,” the next question is: How do you measure data quantity, and how do you measure data quality? Again, we asked those same data experts.
Figure 2: How to measure data quality?
This time, clarity is lost. In fact, the results confirm that measuring data quality remains a mystery.
The data science community confirms the old adage of “garbage in, garbage out” (GIGO): “Dirty data” is seen as the most common problem for workers in data science, according to a Kaggle survey with 16,000 responses (Kaggle 2017). In data analytics, all insights are extracted from the data. Therefore, it is imperative to ensure that any raw data used actually contains the information required for those insights. One analogy is iron ore: To produce iron, one needs rocks so rich in iron oxides that metallic iron can be extracted. Without iron oxide in it, a rock is simply a rock, not iron ore.
Yet, despite the importance of data quality, very little progress has been made to operationalize this insight. The literature uses concepts such as the “3 Vs” of volume, velocity and variety (McAfee & Brynjolfsson 2012), and more Vs are being added, like variability and value (e.g., Yin & Kaynak 2015). However, from an operational perspective, that is, from the point of view of an analytics application, the Vs have remained conceptual and qualitative. They may be useful for a first assessment, such as a pre-test or a triage-type data selection. But to gauge outcomes in terms of performance, that is, to estimate the likelihood of effects (x improves y), the size of effects (x improves y by a lot) and their significance (improvements are real, not random), the Vs carry too little information.
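To make the difference between a qualitative label and a measurable outcome concrete, here is a minimal sketch of how effect size and significance could be estimated. It is an illustration only, not a method from the cited literature: the accuracy scores are made up, the function names are hypothetical, and Cohen's d and a permutation test are just one common choice among several.

```python
import random
import statistics


def cohens_d(a, b):
    """Standardized mean difference (b minus a) using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(b) - statistics.mean(a)) / pooled_var ** 0.5


def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """One-sided permutation test: how often does a random relabeling of the
    scores produce a mean difference at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = sum(b) / len(b) - sum(a) / len(a)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if sum(right) / len(right) - sum(left) / len(left) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical model accuracies over eight evaluation runs (made-up numbers):
# "baseline" uses the original data, "improved" uses a higher-quality data set.
baseline = [0.71, 0.69, 0.73, 0.70, 0.72, 0.68, 0.71, 0.70]
improved = [0.75, 0.74, 0.77, 0.73, 0.76, 0.74, 0.75, 0.72]

effect = statistics.mean(improved) - statistics.mean(baseline)  # size of effect
d = cohens_d(baseline, improved)                                # standardized size
p = permutation_p_value(baseline, improved)                     # real or random?

print(f"mean improvement: {effect:.3f}, Cohen's d: {d:.2f}, p-value: {p:.4f}")
```

A label such as "high variety" cannot be fed into a calculation like this. A numeric quality score attached to each data set could be.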
Quality scoring could make data quality manageable (Crosby & Schlueter Langdon 2019, Schlueter Langdon & Sikora 2019). It would help both data scientists and management make better investment decisions on data. Quality scoring is already a well-established business, just not yet for data. As consumers, most of us are probably familiar with Consumer Reports in the US (“Stiftung Warentest” in Germany), which tests and rates consumer products. Car buyers are likely to check J.D. Power’s quality scores from its Initial Quality Study (IQS, problems after 3 months) and Vehicle Dependability Study (VDS, problems after 3 years). Home buyers worry about credit scores (FICO in the US, SCHUFA in Germany) and are familiar with credit scoring agencies, such as Equifax. And finally, bond investors watch the creditworthiness ratings of corporate bonds issued by specialists like Moody’s (Aa1) and Standard & Poor’s (AA+).
Crosby, L., and C. Schlueter Langdon. 2019. Data as a Product to be Managed. Marketing News, American Marketing Association (April 24th)
Kaggle. 2017. The State of Data Science & Machine Learning
McAfee, A., and E. Brynjolfsson. 2012. Big Data: The Management Revolution. Harvard Business Review (October): 60-68
Schlueter Langdon, C., and R. Sikora. 2019. Creating Data Factories for Data Products. Proceedings of 18th Workshop on E-Business, ICIS Munich, Germany
Yin, S., and O. Kaynak. 2015. Big Data for Modern Industry: Challenges and Trends. Proceedings of the IEEE 103(2): 143-146