Data is touted as the next big business (e.g., Wall 2019, Gartner 2018a). Investment banks, analysts and consultants feed the frenzy with big revenue forecasts. In terms of data monetization opportunities, consultants McKinsey & Company estimate that car-generated data alone will be worth between US$450 billion and US$750 billion by 2030, less than two vehicle generations away (McKinsey 2016). Consumer data is already a big business today: Google and Facebook live off the data that users create on their platforms, and almost all of their revenue comes from advertising – selling “eyeballs” and user engagement to advertisers.
All of the above is just the beginning. A further Big Data boost is expected from Internet of Things (IoT) data: IoT is essentially turning ordinary objects into websites. Historically, the web and website tracking created a first wave of Big Data (which in turn spawned new technology to store and process it, such as Hadoop). Cars are a case in point: connected and autonomous vehicles are projected to generate four terabytes (TB) of data a day (Krzanich 2016). This IoT boom is fueled by a confluence of trends in information systems, such as the miniaturization of sensors like lidar (light detection and ranging, used in autonomous cars; NOAA 2020), advances in device technology such as edge computing, and the new 5G cellular mobile communications standard.
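To put the 4 TB/day figure in perspective, a quick back-of-envelope conversion (assuming decimal terabytes) shows the sustained data rate a single vehicle would have to generate:

```python
# Back-of-envelope check of the 4 TB/day figure (Krzanich 2016):
# what sustained data rate does a connected vehicle generate?
TB = 10**12                      # terabyte, decimal convention
bytes_per_day = 4 * TB
seconds_per_day = 24 * 60 * 60   # 86,400 seconds

rate_mb_per_s = bytes_per_day / seconds_per_day / 10**6
print(f"{rate_mb_per_s:.1f} MB/s sustained")  # ≈ 46.3 MB/s
```

Roughly 46 MB every second, around the clock – a useful sense of scale for why IoT is expected to dwarf the first, web-driven wave of Big Data.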
Figure 1: The productivity problem in data analytics
A key mechanism to release value from data is analytics. With websites, it took tools like Google Analytics (formerly known as Urchin) to benefit from website tracking and attract advertising budgets. Google Analytics is mostly descriptive analytics. Far more value is generated in the consecutive stages of predictive and prescriptive analytics (McKinsey 2018, Gartner 2018b). Examples include product recommendations using machine learning as an amplifier of word-of-mouth marketing (e.g., on Amazon and Netflix), and the application of deep learning or neural network methods across many domains to the recognition of text (sentiment analysis), images (automatic license plate recognition, ALPR), video (autonomous vehicles) and speech (Amazon’s Alexa virtual assistant). Yet despite the media hype, a quick review of time spent in data analytics projects reveals a big problem.
Companies have gone from databases to data warehouses and now to data lakes (Porter & Heppelmann 2015). And they seem to be drowning in data (it is even unclear how to quantify the size of data; see “Data: How to measure it?,” link). If “time is money,” as famously noted by one of the founding fathers of the United States (Franklin 1748), then data analytics is a disaster. Today, according to the literature, more than 80% of the time budget of a data analytics project is spent on data processing and wrangling – not on algorithms (Press 2016, Vollenweider 2016). This turns the 80/20 Pareto principle, a cornerstone of business efficiency, upside down (e.g., Newman 2005). Figure 1 illustrates the data analytics productivity crisis.
We even conducted our own analysis using surveys. We acknowledge that surveys are often a weak means to support an argument: they are popular because they are quick and easy, but results are often poor and misleading. Problems with surveys range from data collection (questionable representativeness, abysmal response rates, etc.) and design (bias in survey instruments, leading or ambiguous questions, inadequate response options, etc.) to the interpretation and extrapolation of results (lack of statistical significance, rating level inconsistencies, etc.). Knowing about these pitfalls, we put a premium on representativeness and simple, unambiguous questions. Our sample is a convenience sample but was chosen to maximize its representativeness: because our focus is on data analytics in business, data was collected at data science events specifically targeted at data experts in business – not at an academic or research audience. Figure 2 depicts our survey questions and results.
Figure 2: Time spent in the data analytics pipeline
Our survey of data experts confirms the problem. If an analytics project is broken into the three phases of (a) data processing, (b) analytics modeling & evaluation, and (c) deployment, then time shares are reported as 48%, 32% and 20%, respectively (n = 65). The implications are clear: for data analytics to become successful, the data productivity problem has to be solved. Other industries offer clues on how to solve it. Consider the auto business: data processing for AI remains handmade and made-to-order, just like cars before Henry Ford industrialized automaking. Gottlieb Daimler invented the motor car in 1886, but it was Henry Ford who invented the modern auto business about 20 years later (Womack et al. 1990).
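As a sketch of the aggregation behind such figures, per-phase shares can be computed as simple means across respondents. The two responses below are hypothetical, chosen only so the averages reproduce the reported 48/32/20 split; they are not actual survey data:

```python
# Illustrative aggregation of survey responses on time spent per
# project phase: (data processing, modeling & evaluation, deployment).
# Each respondent's tuple sums to 100 (percent of project time).
def mean_shares(responses):
    """Average per-phase time shares across respondents."""
    n = len(responses)
    totals = [sum(r[i] for r in responses) for i in range(3)]
    return [t / n for t in totals]

# Hypothetical respondents, for illustration only:
responses = [(50, 30, 20), (46, 34, 20)]
print(mean_shares(responses))  # [48.0, 32.0, 20.0]
```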
Henry Ford evolved automaking from a hand-made affair into mass production. He invented the auto business with factories. A factory is about automation and productization. Automation is the obvious part: the moving assembly line is probably its most visible and striking feature. Less obvious is that, for automation to work, Ford critically required interchangeability of parts, which in turn required metrics (Clark and Fujimoto 1991). Parts had to be made to precise measurements so that any copy of a part could be attached to cars coming down the line quickly, without lengthy calibration and refitting work. Mechanical engineering introduced the notion of tolerance as “the range of variation permitted in maintaining a specified dimension in machining a piece” (Webster 2019). Parts were specified (“specced”) in engineering drawings or “blueprints” and then manufactured within precise tolerances to make them interchangeable. The challenges with data pertain to both measurement and automation (Crosby & Schlueter Langdon 2019). As of 2020, data attributes have remained qualitative and subjective. For emerging quality metrics, see “Data: Quantity or quality?,” link. For first solutions for data productization and automation, see “Data factory for data products,” link.
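A minimal sketch of what a tolerance check looks like, with hypothetical spec values, and how the same quantitative discipline could in principle be applied to a data attribute:

```python
# Mechanical-engineering-style tolerance check. The spec values below
# are hypothetical; the point is that interchangeability requires a
# quantitative spec (nominal value) plus a permitted range (tolerance).
def within_tolerance(measured, nominal, tol):
    """True if a measured value lies within nominal ± tol."""
    return abs(measured - nominal) <= tol

# A crankshaft journal specced at 50.000 mm ± 0.025 mm:
print(within_tolerance(50.018, 50.000, 0.025))  # True: part is usable

# The same idea applied to a data attribute, e.g. sensor timestamp
# drift specced at 0 ms ± 50 ms (hypothetical data-quality spec):
print(within_tolerance(120.0, 0.0, 50.0))       # False: out of spec
```

The contrast motivates the argument: machined parts have had such quantitative specs for a century, while data attributes, as of 2020, mostly do not.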
Clark, K. B., and T. Fujimoto. 1991. Product Development Performance: Strategy, Organization, and Management in the World Auto Industry. Harvard Business School Press: Boston, MA
Crosby, L., and C. Schlueter Langdon. 2019. Data as a Product to be Managed. Marketing News, American Marketing Association (April 24th), link
Franklin, B. 1748. Advice to a Young Tradesman. Printed in George Fisher, The American Instructor: or Young Man’s Best Companion. … The Ninth Edition Revised and Corrected. Philadelphia: Printed by B. Franklin and D. Hall, at the New-Printing-Office, in Market-Street, pp. 375–7, link
Gartner. 2018a. Gartner Top 10 Strategic Technology Trends for 2019 (October 15th), link
Gartner. 2018b. Gartner Forecasts Worldwide Public Cloud Revenue to Grow 17.3 Percent in 2019 (September 12th), link
Krzanich, B. 2016. Data is the New Oil in the Future of Automated Driving. Intel Newsroom (November 15th), link
McKinsey Global Institute. 2018. Notes from the AI frontier – Insights from hundreds of use cases. McKinsey & Company (April), link
McKinsey & Company. 2016. Monetizing car data. Advanced Industries Report (September), link
Newman, M. E. 2005. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics 46(5): 323–351
NOAA, National Oceanic and Atmospheric Administration. 2020. What is LIDAR?, link
Porter, M. E., and J. E. Heppelmann. 2015. How Smart, Connected Products Are Transforming Companies. Harvard Business Review (October), link
Press, G. 2016. Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says. Forbes (March 23rd)
Vollenweider, M. 2016. Mind+Machine: A Decision Model for Optimizing and Implementing Analytics. John Wiley & Sons: Hoboken, NJ
Wall, M. 2019. Tech trends 2019: The end of truth as we know it? BBC (April 1st), link
Womack, J. P., D. T. Jones, and D. Roos. 1990. The Machine That Changed the World: The Story of Lean Production. Free Press, Simon & Schuster: New York, NY