Data science can be hard, and we get it. That’s why we’re so invested in making it accessible to everyone with tools like RapidMiner Studio. If you’re new to data science, you might feel like everyone else is an expert who gets everything right on the first try, and that you’ll never catch up.
But trust us, that’s not the case.
Every data scientist has made mistakes in their career that they’ve been able to learn from and take into their new projects, building their expertise over time.
Four data science mistakes and how to avoid them
To help you, we polled some of the top thinkers at RapidMiner, asking them about memorable data science mistakes that they made and what they learned from the experience. This way, you can learn from their mistakes, without having to make your own.
1. Not comparing your model with a simple baseline
Ingo Mierswa, Founder
You might have a favorite kind of model. Maybe it’s something complicated like a convolutional neural network. Don’t get me wrong, that’s not a problem in and of itself! But if something complicated, time consuming, and labor intensive is your go-to model, you might consider whether you’re choosing this type of model too often. Is it just because you like it? Is there a simpler, easier-to-build model type that might work just as well?
I’ve been guilty of doing something like this more than once, jumping to something complicated and spending a lot of time tuning parameters and retraining the model, only to discover later that a simple regression model performed nearly as well.
The lesson: Don’t always jump to the big, cool thing. Sometimes the basics are all you need to move a project forward quickly and effectively, so compare your model against a simple baseline before investing in something complicated.
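To make the idea concrete, here is a minimal sketch of a baseline comparison in plain Python. It is an illustration, not the workflow used in the project described here: the labels and the "complex model" predictions are invented, and the baseline is a majority-class predictor, which is even simpler than the regression model mentioned in the anecdote.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(train_labels, n_predictions):
    """Predict the most common training label for every test row."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    return [majority_label] * n_predictions


# Hypothetical labels, invented for illustration only.
train_labels = ["stay", "stay", "stay", "churn", "stay", "churn"]
test_labels = ["stay", "churn", "stay", "stay", "churn"]

# Stand-in for the output of a complicated, carefully tuned model.
complex_model_predictions = ["stay", "churn", "stay", "stay", "stay"]

baseline_predictions = majority_baseline(train_labels, len(test_labels))
baseline_score = accuracy(test_labels, baseline_predictions)
complex_score = accuracy(test_labels, complex_model_predictions)

print(f"baseline accuracy:  {baseline_score:.2f}")
print(f"complex accuracy:   {complex_score:.2f}")
print(f"gain over baseline: {complex_score - baseline_score:.2f}")
```

If the gain over the baseline is small, that is a hint that the extra tuning time may not be paying for itself.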
2. Having multiple observations for the same item
Martin Schmitz, Head of Data Science Services
I remember one mistake fairly well. We did a maintenance contract churn analysis for construction equipment. The initial pass at training the model did a really poor job of predicting churn, and I couldn’t figure out why. It took me several hours of staring at the data to realize the problem.
Many maintenance contracts are, of course, renewed, which meant that one company could be in the data set multiple times with different values: 2015-Renewed, 2016-Renewed, 2017-Renewed, 2018-Renewed, 2019-Churned. This creates a similar issue to the one that I addressed in a recent blog post about batched data.
Essentially, you’re confusing the model because it assumes that all of your observations are independent, but they aren’t. In this case, you’re trying to predict for one company, but you’ve got conflicting information about that company in the data. Plus, you’re inflating the number of Renewed cases: the company above contributes four Renewed rows for a single Churned row, which skews the model towards predicting Renewed rather than Churned.
The lesson: If you’re seeing a large overprediction for one class in your trained model, check that all your data points are independent of each other. If you have this kind of problem, there are a few options to solve it, which you can read about in the previously mentioned batched data post.
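As an illustration, here is one possible way to remove the repeated observations: keep only the most recent row per company. This is a sketch under assumptions, not necessarily one of the options from the batched data post. The field names (`company`, `year`, `status`) and the sample rows are hypothetical.

```python
def latest_observation_per_company(rows):
    """Keep only the most recent row for each company."""
    latest = {}
    for row in rows:
        current = latest.get(row["company"])
        if current is None or row["year"] > current["year"]:
            latest[row["company"]] = row
    return list(latest.values())


# Hypothetical contract history; company names are invented.
rows = [
    {"company": "A", "year": 2015, "status": "Renewed"},
    {"company": "A", "year": 2016, "status": "Renewed"},
    {"company": "A", "year": 2017, "status": "Renewed"},
    {"company": "A", "year": 2018, "status": "Renewed"},
    {"company": "A", "year": 2019, "status": "Churned"},
    {"company": "B", "year": 2018, "status": "Renewed"},
    {"company": "B", "year": 2019, "status": "Renewed"},
]

deduplicated = latest_observation_per_company(rows)
for row in deduplicated:
    print(row["company"], row["year"], row["status"])
```

Keeping one row per company throws away the renewal history, so it is a trade-off. An alternative is to keep all rows but make sure every row for a given company lands on the same side of the train/test split.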
3. Not having a clear understanding of the business case
Yuanyuan Huang, Data Scientist
Once, we were trying to predict churn for a pre-paid cell phone company. We got really solid results and were able to predict when people were going to churn.
Unfortunately, it turned out that the vast majority of people we predicted to churn were tourists who were buying temporary SIM cards to use just while they were on vacation. It’s perhaps an interesting insight to know why churn is so high, but we didn’t need a model to be able to tell us that.
Plus, there were no actionable steps we could take to reduce churn in these cases, so it wasn’t of any benefit to the business. A good example of a “nice try, but not helpful” data science solution.
The lesson: Always make sure that you’re solving a business problem, not just a data science problem. It’s possible to do great data science but uncover an insight that’s not at all useful for the business, and a data scientist’s job is to deliver business value in the form of return on investment. Make sure you’re investing in models that will improve the bottom line.
4. Not fully understanding your data
David Arnu, Lead Data Scientist
I once worked on a project with a very large and obscure dataset. It seemed like it was going to be a challenging model to train, but after the training, the predictions were really good. Way too good. Unrealistically good.
It took quite some time, but I eventually realized that the original data dump to get the training data was performed twice—the second half of the data set was just a duplicate of the first half. This meant that the exact same data point sometimes ended up in both the training and test sets after splitting for validation, giving the model the right answer for a whole bunch of cases.
The lesson: Ensure that you understand where your data came from, and what’s in it, before you even think about starting to train models.
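A quick sanity check along these lines can be sketched in a few lines of Python. The rows below are invented, and the two helper functions are hypothetical names for this example: the first counts exact duplicate rows, the second lists rows that appear in both the training and test sets.

```python
from collections import Counter


def duplicated_rows(rows):
    """Return each row that appears more than once, with its count."""
    counts = Counter(tuple(row) for row in rows)
    return {row: n for row, n in counts.items() if n > 1}


def leaked_rows(train_rows, test_rows):
    """Return rows that appear in both the training and the test set."""
    return set(map(tuple, train_rows)) & set(map(tuple, test_rows))


# Hypothetical data dump in which the second half repeats the first half.
first_half = [(1, 0.5, "ok"), (2, 0.7, "fail"), (3, 0.1, "ok")]
data = first_half + first_half

print("duplicates:", duplicated_rows(data))

# A naive split of the duplicated dump puts identical rows on both sides.
train, test = data[:4], data[4:]
print("leaked into test:", leaked_rows(train, test))
```

Running a check like this before training would flag both the duplicated dump and the overlap between the two sets, which is exactly what makes the predictions look unrealistically good.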
There you have it: four easily avoidable data science mistakes that even the best and brightest have made. Hopefully, with this knowledge in hand, you’ll be able to avoid these mistakes in your own work.
If you’re still learning about data science and machine learning and are wondering what kinds of impact it could have on your business, check out 50 Ways to Impact Your Business with AI to see the effect that AI is having across industries.