In our work with our users, we often find that people across organizations and industries resist getting started with machine learning until all their data is “perfect”. Arguments about perfect take different forms, but the assertion itself (“I need to wait until X”) is all too common.
In this post, we’ll explain what “perfect data” is and why perfection doesn’t exist, and then walk you through some of the most common objections to getting started so you can make sure you haven’t fallen victim to any of these ways of thinking yourself.
What is “perfect data”?
What makes data perfect? In our experience, when people start talking about waiting until X time in the future or until Y Initiative is in place (codewords for “perfect data”), they often mean that they have every possible or potential parameter accounted for in their dataset. Once they have all of these datapoints, they think, they’ll be ready to undertake a data science project.
The specific data points and parameters that folks want to wait for vary by industry and use case. It might be waiting until there’s a process in place to reliably clean data, or it might be building out a system to understand data access and validate where data is coming from. It could be waiting until a new sensor is installed, or until a new data warehouse schema is finished.
Although there are lots of reasons people think they need to wait to start a machine learning project, reasons that come down to waiting for better data rarely stand up to a bit of digging. How can we be so sure?
Because perfection doesn’t exist
“Perfection” is impossible to achieve—your data could always have more data points, it could always be cleaner, you could always have a better understanding of where the data came from, and so on, and so on until you’ve gotten nothing done.
Add to that the fact that, given the ever-evolving nature of data, even if you were to get to a state of perfect data, your data requirements would likely shift by the time a project was finished—and sometimes before a project even gets off the ground. Deciding to wait for perfect is essentially deciding not to reap the benefits of machine learning in your business, meaning that you’re falling behind your competitors.
Not convinced? Let’s take a look at a few of the most common reasons we’ve heard for people waiting to start a machine learning project, and why they don’t stand up to scrutiny.
But my data isn’t clean enough!
Data is never clean. Every data point you record has flaws. The reasons are manifold. Customers move to other cities, sensors break, or employees use different spellings for the same word. For a data scientist, it’s normal to work with shortcomings in our data, and something we know how to address.
Although it would be nice if this data-oriented janitorial work wasn’t necessary, in reality, business and data understanding are part of the CRISP-DM process for a good reason. During the analysis, those working with the data will have a chance to get familiar with the data, allowing them to find issues and make a plan to cope with them during model training.
The data doesn’t need to be perfect, but it obviously needs to fall within certain tolerances. Those tolerances depend on the industry, function, and use case. A pharmaceutical manufacturer is likely to have a low tolerance for error, while a marketing technology provider might be able to work with a larger degree of error in the data. Data that’s only 75% perfect may be acceptable for tactical, non-critical business predictions.
It’s important to find a middle path to make sure you have useful data without spending an eternity on getting things fixed up front, but it is possible!
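To make the idea of a tolerance concrete, here is a minimal sketch of a "good enough" check. The records, the field names, and the helper functions are all hypothetical, and the 75% threshold simply reuses the illustrative figure above rather than any recommended value.

```python
from collections import Counter

# Hypothetical customer records with typical flaws: a missing value and
# inconsistent spellings of the same city.
records = [
    {"customer": "A", "city": "Boston", "spend": 120.0},
    {"customer": "B", "city": "boston ", "spend": None},
    {"customer": "C", "city": "BOSTON", "spend": 80.0},
    {"customer": "D", "city": "Cali", "spend": 95.0},
]

# Assumed tolerance for a non-critical use case (illustrative only).
TOLERANCE = 0.75


def normalize_city(raw):
    """Collapse case and whitespace variants into one spelling."""
    return raw.strip().title()


def usable_share(rows, required_fields):
    """Return the fraction of rows where every required field is present."""
    usable = [r for r in rows if all(r.get(f) is not None for f in required_fields)]
    return len(usable) / len(rows)


# Fix what is cheap to fix, then measure what is left.
cleaned = [{**r, "city": normalize_city(r["city"])} for r in records]
city_counts = Counter(r["city"] for r in cleaned)
share = usable_share(cleaned, ["city", "spend"])
good_enough = share >= TOLERANCE

print(city_counts)
print(f"usable rows: {share:.0%}, good enough: {good_enough}")
```

The point of a check like this is to turn "is my data clean enough?" into a number you can compare against a threshold your use case actually requires.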
But my data isn’t complete!
Data’s never complete. There will always be more information that could help with your use case. The daily rainfall at a mining site, the estimated income of your customer households, or the region’s average unemployment rate could all be useful depending on what you’re looking at, but might not be readily accessible.
What’s more, adding that information might only have a small effect on your final model—something you won’t know unless you dive in and start the model building process right away. Don’t wait for the perfect set before you start experimenting!
But my data storage and access pipelines aren’t ready!
Data storage problems will never be fixed. Perhaps a few years ago, IT started creating the perfect representation for your data in your data warehouse. The warehouse has datamarts where users can select and download standardized and validated data easily. If you just wait for another quarter, they’ll have the perfect schema ready for you.
But then there’s a delay. A change in plans. A change in direction. And on and on until you’ve lost thousands of dollars that you could have saved if you’d implemented a machine learning project early on and iterated as things changed.
The point is that centrally maintained data sources are a good idea, but reality is different. You often have more than one source system, and you can’t wait for some central repository to pull it all together.
Plus, you might have some data that’s only available in an Excel file (or handwritten) that you also want to incorporate. Tools like RapidMiner Studio act as a kind of Swiss army knife for collating data from different sources, without waiting for someone else to make your data look nice.
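Outside of any particular tool, the underlying idea is just a join across sources that were never designed to fit together. Here is a minimal sketch using only the Python standard library. The two extracts, their column names, and the `clean_key` helper are hypothetical, and the spreadsheet is assumed to have been saved as CSV.

```python
import csv
import io

# Hypothetical extracts: one from a source system, one from a spreadsheet
# saved as CSV. In practice these would be files or database queries.
warehouse_csv = """machine_id,site,output
M1,North,410
M2,North,385
M3,South,402
"""
spreadsheet_csv = """Machine ID,Last Service
m1,2021-03-02
M3 ,2021-01-15
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_key(raw):
    """Make keys comparable across sources (case and stray spaces)."""
    return raw.strip().upper()


# Index the spreadsheet by its cleaned machine ID.
service_by_machine = {
    clean_key(row["Machine ID"]): row["Last Service"]
    for row in read_rows(spreadsheet_csv)
}

# Enrich each warehouse row with the spreadsheet data where it exists.
combined = []
for row in read_rows(warehouse_csv):
    key = clean_key(row["machine_id"])
    combined.append({
        "machine_id": key,
        "site": row["site"],
        "output": int(row["output"]),
        # None marks machines the spreadsheet does not cover.
        "last_service": service_by_machine.get(key),
    })

for row in combined:
    print(row)
```

Keeping the unmatched rows (with `None`) rather than dropping them means the gap in coverage stays visible instead of silently shrinking the dataset.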
Then what’s the solution?
Rather than asking “what does perfect data look like”, it’s better to ask “what makes data good enough for my machine learning project to have an impact?” This question is much easier to answer, and to tie directly to business impacts.
If you can have business impact, you should move forward with your project
Even if the impact is small (say, $10k/month in cost savings), you’re still having some positive impact, you’re learning how to create and operationalize a machine learning project, and, because machine learning is an inherently iterative process, it will be quicker to get more impact in the future if you already have a model up and running that you can iterate on. There’s no reason to wait, even if your current impact is small.
Don’t overengineer requirements; be agile
Data science, like software engineering, is an iterative process. Don’t expect that your minimum viable product (MVP) will solve all of the problems you have on its first iteration. Sometimes, you have challenging requirements on your deliverables, and you need a final product that is going to both be challenging to build and do a lot when it’s finished.
Nevertheless, it’s useful to build an MVP first, even if it doesn’t check all the boxes. This way, you have a model that you can show to people, and which can demonstrate its impact even if it’s not complete, letting stakeholders see how your project can drive value and reduce risks. If this minimum viable product is good, it’s easy to keep iterating toward your final state: you have more buy-in, and you’re already generating value.
Don’t blindly trust legacy solutions nor wait for the perfect one
“But we have an existing solution that works okay” is something that we often hear. And sure, since you run a business, you must have some solution in place. But AI is revolutionizing the ability of computing systems.
You should always carefully check your current, good-enough implementation against what is possible with machine learning.
The current solution is often something like an Excel sheet with a very simple equation. With a few clicks in the RapidMiner data science platform, you can easily import all of that and quickly generate something that’s much more impactful than your previous solution.
A real-world example
Take the example of Pontificia Universidad Javeriana, a university in Cali, Colombia. As one of the objectives of a long-term strategic plan, the university wanted to strengthen decision making by leveraging the power of machine learning and big data to predict which students were in danger of dropping out of school before finishing their degree.
You can probably imagine some of the countless possibilities for collecting “imperfect data” inherent in such a project. Not only would the data scientists get data from controlled sources like high school and university transcripts, attendance records, and financial aid applications, but they would also be pulling data from outside sources like extracurriculars, social media networks, etc., that could be used to enrich the data set to help make more accurate predictions.
The university IT team, which was responsible for the project, worked with the rest of the administration to develop a set of goals, along with a series of hypotheses that would help explain specific reasons why a student population might drop out. Then, they collected data to help support those hypotheses, both ‘perfect’ data from inside the university and ‘imperfect’ data from outside.
Did it work? They think they can easily reach a 10% reduction in the student dropout rate.
The effort of finding and using perfect data has to be weighed against the cost of waiting to use it. In the case above, the university realized that the value of rapid recognition and intervention for students at risk of dropping out far outweighed any causes for delay. And, if more data becomes available (for example, as the pandemic changes enrollment patterns), they’re ready to iterate quickly to improve their model.
An iterative approach is the most logical one when dealing with machine learning, and that’s especially true when dealing with imperfect (read: all) data. Business leaders should consider what process they want to improve, or which business decisions would benefit from a predictive analysis approach, and dive right in to see what kind of impact they can have.
Then, together with the data science and/or IT teams, they can look at what data is needed, the sources of that data, and how that data can be translated into practicable information and predictive analysis.
Trying to figure out how to make use of imperfect data to improve your business? Request a live demo with one of our data science experts and learn how RapidMiner can help.