The Pareto principle, also known as the 80-20 rule, asserts that 80% of consequences come from 20% of causes, leaving the remainder far less impactful.
Those working with data may have heard a different rendition of the 80-20 rule: a data scientist spends 80% of their time at work cleaning up messy data, as opposed to doing actual analysis or generating insights. Imagine a 30-minute drive stretched to two and a half hours by traffic jams, and you’ll get the picture.
While most data scientists spend more than 20% of their time at work on actual analysis, they still sink countless hours into turning a trove of messy data into a tidy dataset ready for analysis. This process can include removing duplicate records, making sure all entries are formatted correctly, and doing other preparatory work.
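To make that preparatory work concrete, here is a minimal sketch of it in Python with pandas; the file name and column names are hypothetical stand-ins for whatever a real dataset contains.

```python
import pandas as pd

# Hypothetical raw export; in practice this could be any messy source.
df = pd.read_csv("orders.csv")

# Drop exact duplicate rows.
df = df.drop_duplicates()

# Normalize formatting: trim stray whitespace, parse dates consistently.
df["customer_name"] = df["customer_name"].str.strip()
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Entries that failed to parse are marked NaT and set aside for review.
unparsed = df[df["order_date"].isna()]
```

Multiply a few lines like these across dozens of columns and sources, and the hours add up quickly.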
On average, this workflow stage takes up about 45% of the total time, a recent Anaconda survey found. An earlier poll by CrowdFlower put the estimate at 60%, and many other surveys cite figures in this range.
None of this is to say data preparation is not important. “Garbage in, garbage out” is a well-known rule in computer science circles, and it applies to data science, too. In the best-case scenario, the script will just return an error, warning that it cannot calculate the average spending per client, because the entry for customer #1527 is formatted as text, not as a numeral. In the worst case, the company will act on insights that have little to do with reality.
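To see how that best case plays out in code, here is a minimal sketch, again in pandas; the spending figures and customer IDs are illustrative. Coercing the column surfaces the offending entry instead of crashing the calculation.

```python
import pandas as pd

# One spending entry was keyed in as text rather than a number.
df = pd.DataFrame({
    "customer_id": [1525, 1526, 1527],
    "spending": [120.50, 89.99, "n/a"],  # customer #1527: text, not a numeral
})

# df["spending"].mean() would raise a TypeError on this mixed-type column.
# Coercion exposes the bad entry as NaN instead:
spending = pd.to_numeric(df["spending"], errors="coerce")
print(df[spending.isna()])   # the rows that need manual cleanup
print(spending.mean())       # average over the valid entries only
```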
The real question to ask here is whether re-formatting the data for customer #1527 is really the best use of a well-paid expert’s time. The average data scientist earns between $95,000 and $120,000 per year, according to various estimates. Having an employee at that pay level focus on mind-numbing, non-expert tasks wastes both their time and the company’s money. Besides, real-world data has a shelf life: if a dataset for a time-sensitive project takes too long to collect and process, it can be outdated before any analysis is done.
What’s more, companies’ quests for data often waste the time of non-data-focused personnel as well, with employees asked to help fetch or produce data instead of working on their regular responsibilities. More than half of the data companies collect often goes unused entirely, meaning the time of everyone involved in collecting it produced nothing but operational delays and the associated losses.
The data that does get collected, on the other hand, is often used only by a designated data science team that is too overworked to go through everything available.
All for data, and data for all
The issues outlined here all stem from the fact that, save for data pioneers like Google and Facebook, companies are still wrapping their heads around how to re-imagine themselves for the data-driven era. Data is pulled into huge databases, data scientists are left with a lot of cleaning to do, and the others, whose time was spent helping fetch the data, rarely benefit from it.
The truth is, we are still early in the data transformation. The success of tech giants that put data at the core of their business models set off a spark that is only beginning to spread. And even though the results are mixed for now, this is a sign that companies have yet to master thinking with data.
Data holds much value, and businesses are very much aware of it, as evidenced by the appetite for AI experts at non-tech companies. Companies just have to do it right, and one of the key tasks in this respect is to start focusing on people as much as on AI.
Data can enhance the operations of virtually any component within the organizational structure of any business. As tempting as it may be to imagine a future with a machine learning model for every business process, we do not need to go that far right now. The goal for any company looking to tap data today comes down to getting it from point A to point B: point A is the part of the workflow where data is collected, and point B is the person who needs it for decision-making.
Importantly, point B does not have to be a data scientist. It could be a manager trying to figure out the optimal workflow design, an engineer looking for flaws in a manufacturing process, or a UI designer doing A/B testing on a specific feature. All of these people need to have the data at hand at all times, ready to be processed for insights.
People can thrive with data just as well as models can, especially if the company invests in them and makes sure to equip them with basic analysis skills. In this approach, accessibility must be the name of the game.
Skeptics may claim that big data is nothing but an overused corporate buzzword, but advanced analytics capabilities can enhance the bottom line of any company, as long as they come with a clear plan and appropriate expectations. The first step is to focus on making data accessible and easy to use, not on hauling in as much data as possible.
In other words, an all-around data culture is just as important for an enterprise as the data infrastructure.