So you went to college and during your math and computer science classes you were drawn to the beautiful art and science of data, machine-learning and artificial intelligence.
With those two majors combined, who wouldn’t want to use the magic of machine-learning and artificial intelligence to help the world predict a patient’s risk of a hospital-acquired disease like sepsis, which kills 200,000 people each year in the US alone?
Who wouldn’t want to use machine-learning to prevent the catastrophic failure of an oil rig in the Gulf of Mexico or develop deep learning models to help machines detect cancer in pathology images?
What brilliant math and CS major in their right mind wouldn’t want to find that gem hiding in massive data sources, the one that triggers the insight that changes the course of a lethal disease or sparks some other world-shattering advancement?
These college grads typically end up as data scientists and go on their way into the professional world with the hope of becoming heroes.
Welcome to the Real World for Data Scientists
So now you’ve landed your dream job as a data scientist, with a great starting salary and a spot on one of the hottest teams in your company. You get your first project and access to piles of data. You can’t wait to dig in, find precious nuggets, build the career-defining model, and prove to your boss that you’ve got what it takes to change the trajectory of the company.
But then the unthinkable happens. Instead of being handed a dataset, like the one in your Computer Science 351 class, you’re given a list of all the massive data sources you have to search before you can even train the model. Panic sets in. One is a product database sitting in an SAP instance in France. There’s also a MongoDB database under your colleague’s desk, two Hadoop clusters, and a Teradata data warehouse. These data sources are where the nuggets of data may hide.
“Data scientists… spend from 50 percent to 80 percent of their time …preparing unruly data, before it can be explored for useful nuggets.” – For Big-Data Scientists, ‘Janitor Work’ is Key Hurdle to Insights, The New York Times
The thought of these 6 to 7 data sources sounds like a goldmine of information but the reality of finding the relevant data in them to build your model is daunting. You begin to wonder, “Have I been duped?” Should you call your old professor to ask him why he didn’t warn you that you would have to wrangle together your own highly complex, massive dataset from a myriad of sources? Data sources that most of your colleagues don’t know much about?
Welcome to today, where the explosion of complexity and chaos from data residing across many silos reigns supreme.
Your job just changed from being an admired data scientist to being someone saddled with accessing and discovering all the relevant data across those 6 to 7 sources in a hurry, and only then getting to the real work of building models on this data.
Each time you take a swing at these data sources to grab the relevant data and miss, you end up building and training your model on a subset. If the model output is uninteresting, you swing at the sources again, grab more data and other dimensions, and make the round trip all over again.
This vicious cycle continues and you realize that the vast majority of your time and brainpower is being spent on data access, data discovery, data blending and data mash-ups, and many frustrating attempts to just find the ‘Dataset Unicorn’.
“76% of data scientists view data preparation as the least enjoyable part of their work.” –Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says, Forbes
And so today, the official description of a Data Scientist is just that: someone who spends an enormous amount of time on data prep and data blending to eventually find their Dataset Unicorn.
Stop the madness!!
If only there were an easier way, one that used machine-learning and artificial intelligence to actually cut down the time you spend wrangling data and give you the glory you so richly deserve!
Is there any good news for the unsung data wrangling heroes out there? The answer is yes. Many organizations have found a way to flip the 80:20 data preparation ratio on its head.
Now, Data Scientists are spending 80% of their time uncovering those world-changing nuggets of information, while machine intelligence does the data preparation dirty work. That means automating data access, inference, profiling, transformation, blending and discovery with machine-learning, which not only speeds up the entire process but also provides a scalable way to extend it to more models, more data and more insights.
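To make “data profiling” and “data blending” a little more concrete, here is a minimal hand-written sketch in plain Python. It is not how any particular product works; the function names (`infer_type`, `profile`, `blend`), the `sku` key and the sample records are all hypothetical, and a real tool would automate these steps across far larger and messier sources.

```python
def infer_type(values):
    """Guess a column type from its non-empty values (hypothetical helper)."""
    non_empty = [v for v in values if v not in ("", None)]
    if not non_empty:
        return "unknown"
    # Try the strictest type first, then fall back to looser ones.
    for label, cast in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return label
        except ValueError:
            continue
    return "string"


def profile(rows):
    """Summarize each column: inferred type, missing count, distinct count."""
    columns = {key for row in rows for key in row}
    report = {}
    for col in sorted(columns):
        values = [row.get(col, "") for row in rows]
        present = [v for v in values if v not in ("", None)]
        report[col] = {
            "type": infer_type(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


def blend(left, right, key):
    """Left-join two lists of dict records on a shared key column."""
    index = {row[key]: row for row in right}
    return [{**row, **index.get(row[key], {})} for row in left]


# Made-up records standing in for two separate sources.
products = [
    {"sku": "A1", "price": "19.99"},
    {"sku": "B2", "price": ""},
    {"sku": "C3", "price": "5"},
]
ratings = [
    {"sku": "A1", "rating": "4"},
    {"sku": "C3", "rating": "5"},
]

blended = blend(products, ratings, "sku")
report = profile(blended)

for column, stats in report.items():
    print(column, stats)
```

Even on this toy data, the profile surfaces the kind of thing you would otherwise discover the hard way: a missing price, and a product with no rating at all after the blend.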
“BI and analytic teams can barely keep up with the demand for, well, everything: more data, new tools, modify this report, migrate a source system … everything! We all need to work smarter…” – Pervasive BI and Analytics: Are We There Yet?, Gartner Inc.
Want to learn more about the secret weapon data scientists and business leaders alike are using? Watch the video to learn more about AI-Powered Data Preparation and Data Blending or try it on your own data.