19 October 2020

Common Data Science Problems: How to Defeat the Monsters in the Machine

You’re ready to tackle a machine learning project—congratulations! But before you get your model out into the world where it can have an impact on your business, you’ll have to mash some of the most common data science monsters.

Here’s how you can beat them, without getting spooked.

The Master of Illusion

The Master of Illusion whispers in your ear. You’ll be able to start a machine learning project just as soon as you get the perfect data lined up, which should be just around the corner… Next week… Maybe next month…

Plug your ears, cover your eyes—this is just an illusion! You should start with the data you have available now, regardless of what it looks like. The data science process is inherently an iterative one, and you’ll be able to train your models with improved data as soon as it becomes available. There’s no need to wait!

The Mad Scientist

You know you should be building a cross-functional team to get your data science project off the ground, but as you toil away writing code and training models, you realize you’ve been glued to your chair for as long as you can remember. You can’t even remember the last time you checked in with others on your team. You’ve become a Mad Scientist, locked away in your lab!

To overcome your Mad Scientist leanings, remember that data science is a team sport—don’t fall into the trap of thinking that working solo is easier and more effective. Reach out to other teams early, even if you aren’t sure you’ll need their help. Also ensure that you’re choosing platforms and tools that include version control and allow teams to work together, including easily sharing code and models, to drive business impact.

The Leaders

Your project will succeed, you know it! But the Leaders aren’t convinced. If you want to make a model and get it into production, you’ll have to convince them before you’ve even started.

The Leaders have two weaknesses. The first is a demonstrated ROI for your project that leadership will have trouble saying no to. The second is a clear plan of action. Make sure to have both in hand before anyone asks you to take them to your leader. That way, you can show that you’re aligned with the Leaders’ digital transformation objectives, and they won’t be able to say no.

The Detail Devil

“Sure, I’d be fine to reduce shipping costs by 5%, but what if instead we replaced our entire workforce—including you—with AI?” The Detail Devil shows up late in the planning process, has lots of ideas, and doesn’t want to take no for an answer.

But by having a clear set of project goals and outcomes defined from the beginning, you can overcome the Detail Devil. There’s always time for a new project after the current one finishes, but for this project, stay focused on the original plans so that you can demonstrate value as quickly as possible.

Phantom Data

If you’ve overcome the four previous monsters, you probably feel ready to get started! But there’s another monster that you’ll have to overcome as soon as you start: Phantom Data. Your data could be riddled with missing values!

Missing values can really put a damper on machine learning projects. Validate your data beforehand so you know what data you have, and what data’s missing, so that you can make a plan to account for those gaps the right way.
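As a rough illustration of that validation step, here is a minimal sketch in plain Python that counts missing values per column. The sample records and the `missing_report` helper are made up for this example, and it assumes missing entries show up as `None` or an empty string; your own data may mark gaps differently.

```python
from collections import Counter

# Hypothetical sample records; None marks a missing value.
rows = [
    {"order_id": 1, "weight_kg": 2.5, "region": "north"},
    {"order_id": 2, "weight_kg": None, "region": "south"},
    {"order_id": 3, "weight_kg": 1.2, "region": None},
    {"order_id": 4, "weight_kg": None, "region": "east"},
]


def missing_report(records):
    """Count missing (None or empty-string) values for every column."""
    columns = set()
    for record in records:
        columns.update(record)

    counts = Counter()
    for record in records:
        for column in columns:
            # A column absent from a record also counts as missing.
            if record.get(column) in (None, ""):
                counts[column] += 1

    return {column: counts[column] for column in sorted(columns)}


report = missing_report(rows)
print(report)
```

A report like this tells you which columns need a plan before any model training starts.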

The Scarecrow

Your data is neat and organized into clear rows and columns, like a field of corn—but what’s that sticking up in the middle? The Scarecrow is an outlier, distorting your results and making your data less powerful than it could be otherwise.

Data preprocessing is a critical step for machine learning projects to help mow down any Scarecrows hiding in there. Knowing what your data looks like ahead of time will help you address any of these from the very beginning.
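One common way to spot a Scarecrow is the interquartile range (IQR) rule, which flags values far outside the middle half of the data. The sketch below is only one possible approach, not a prescribed method; the `delivery_days` values and the 1.5 multiplier are assumptions chosen for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical delivery times in days; one value is clearly out of line.
delivery_days = [2, 3, 3, 4, 4, 4, 5, 5, 6, 48]
outliers = iqr_outliers(delivery_days)
print(outliers)
```

Whether a flagged value should be removed, capped, or kept depends on whether it is an error or a genuine rare event, so treat the output as a list of rows to inspect.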

The Gemini

Just when you think you’ve solved all of your data problems, the Gemini shows up—identical rows sprinkled across your dataset. The Gemini hopes that you’ll chop it in half, with some of the duplicates ending up in your training set, and some in your test, completely ruining your model development process.

Don’t let the Gemini in! Validate your data ahead of time to check not only for missing values and outliers, but also to ensure that you haven’t created any Geminis while putting together your dataset.
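To show why order matters here, the following sketch removes exact duplicate rows before splitting the data, so the same row cannot land in both the training and test sets. The records and both helper functions are hypothetical, and the code assumes every value in a row is hashable.

```python
import random


def deduplicate(records):
    """Drop exact duplicate rows, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into train and test lists."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


rows = [
    {"customer": "a", "spend": 10},
    {"customer": "b", "spend": 25},
    {"customer": "a", "spend": 10},  # exact duplicate of the first row
    {"customer": "c", "spend": 40},
    {"customer": "d", "spend": 15},
]

unique_rows = deduplicate(rows)
train, test = train_test_split(unique_rows)

# With duplicates removed first, no row can sit in both sets.
train_keys = {tuple(sorted(r.items())) for r in train}
test_keys = {tuple(sorted(r.items())) for r in test}
print(len(rows), len(unique_rows), train_keys.isdisjoint(test_keys))
```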

The Werewolf

Your data finally looks good—no missing values, no duplicates. You’re ready! But wait, what’s this? It looks like all of the empty values in your dataset have been transformed into “99999”! The Werewolf is at work, seeding your dataset with false information to sabotage your model.

The silver bullet to defeat the Werewolf is to make sure that you don’t fill your empty values with a generic, made-up value. Instead, you’ll want to impute them from the data that you do have.
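Here is a small sketch of that idea: treat the placeholder as missing, then fill the gaps from the observed values. Median imputation is used only because it is simple; it is an assumption of this example, not the one right technique, and the `incomes` list is invented.

```python
import statistics

SENTINEL = 99999  # the made-up placeholder value from the story above


def impute_median(values, sentinel=SENTINEL):
    """Treat None and the sentinel as missing; fill them with the observed median."""
    observed = [v for v in values if v is not None and v != sentinel]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = statistics.median(observed)
    return [fill if v is None or v == sentinel else v for v in values]


# Hypothetical column with one sentinel and one true gap.
incomes = [42000, 99999, 51000, None, 38000]
cleaned = impute_median(incomes)
print(cleaned)
```

Left in place, the 99999 entry would have dragged the column average far above any real value in the data.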

The Data Horde

You’re ready! Your data is cleaned and sanity checked and you’re ready to start model training. But then the Data Horde shows up, with so many rows and columns that your machine is overrun with values.

To overcome the Data Horde, you’re going to need help—lots of help. If you’re not in a position to summon lots of processing imps to speed up your computations, you’ll want to rely on cloud computing solutions like RapidMiner AI Hub.

Dictionary Gnomes

You’re nearly there: your model is producing coherent results, but something seems off. Maybe the unit for one of your inputs isn’t what you thought it was. Was that units per second or units per hour? You look up your data dictionary to check and… it’s gone! It’s been destroyed by the dreaded Dictionary Gnomes! Or perhaps you failed to make a note in the first place…

By far the easiest way to overcome Dictionary Gnomes is to make sure you’re documenting things and building out your data dictionary correctly from the beginning of each project. Also, ensure that you’re using a data science platform that supports collaboration and documentation so that the answers you need are always at hand.
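A data dictionary does not have to be elaborate to be useful. As a sketch, it can be a small structure kept next to your code that records each field's unit and meaning, plus a check for columns nobody documented. The field names, units, and the `undocumented_columns` helper below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    unit: str
    description: str


# Hypothetical entries; names and units are illustrative only.
DATA_DICTIONARY = {
    spec.name: spec
    for spec in [
        FieldSpec("throughput", "units per hour", "Items processed by the line"),
        FieldSpec("temperature", "degrees Celsius", "Sensor reading at intake"),
    ]
}


def undocumented_columns(columns, dictionary=DATA_DICTIONARY):
    """Return dataset columns that have no entry in the data dictionary."""
    return [c for c in columns if c not in dictionary]


dataset_columns = ["throughput", "temperature", "pressure"]
missing = undocumented_columns(dataset_columns)
print(missing)
print(DATA_DICTIONARY["throughput"].unit)
```

Running a check like this whenever a dataset changes keeps the dictionary from quietly falling behind.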

The Witch

Just a couple of last-minute tweaks and your model’s ready. But hold on: there are three different “final” versions of the model code due to uncoordinated forking of your project code. The Witch has snuck into your project, creating copies of it to confuse you!

To throw water on the Witch and her copies, you’ll want to ensure that you’re using a platform that has version control and model development governance built in so that you can easily roll back to previous states of your data sources, models, and code, as well as resolve any forks that arise.

The Mummy

Your model is up and running, and ready to make predictions when you feed it data. Awesome! The results will be back shortly. Any time now…

Still waiting? It’s possible your pipelines have been clogged up with the cobwebs, wrappings, and slow lumberings of the Mummy, slowing down your model’s responsiveness in production.

There’s only one surefire way to overcome the Mummy’s curse, and that’s to use a platform that offers real-time scoring so your input data and predictions don’t get slowed down while your model is serving up predictions on the fly.

Frankenstein’s (Data Science) Monster

Your model is perfect! It’s tuned to 99.9999% accuracy, and it’s going to change everything—until you let it out into the world, and it turns out to be a monster! What was once the apple of your eye is wreaking havoc with its inaccurate predictions.

To ensure that your model isn’t a monster, your focus shouldn’t be on a single metric like accuracy, but rather on model resilience: how well a model behaves on differing datasets. Using a platform with strong model operations can also support training, testing, and deploying models, and profit-sensitive scoring helps you prioritize ROI over accuracy alone.
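To make the gap between accuracy and ROI concrete, here is a toy sketch that scores two sets of predictions by both accuracy and total payoff. The payoff amounts, labels, and predictions are invented for illustration and do not come from any real project or platform.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical payoff for each (actual, predicted) outcome.
PAYOFF = {
    (1, 1): 100,   # true positive: the problem is caught and handled
    (0, 1): -10,   # false positive: a small wasted effort
    (1, 0): -200,  # false negative: an expensive miss
    (0, 0): 0,     # true negative: nothing to do
}


def total_profit(y_true, y_pred, payoff=PAYOFF):
    """Sum the payoff of every prediction."""
    return sum(payoff[(t, p)] for t, p in zip(y_true, y_pred))


y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
model_a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # always predicts the majority class
model_b = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1]  # three false alarms, no misses

for name, preds in [("A", model_a), ("B", model_b)]:
    print(name, accuracy(y_true, preds), total_profit(y_true, preds))
```

In this made-up case the model with the higher accuracy is the one that loses money, because it misses every costly positive.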

Success!

You made it – you’ve successfully overcome some of the most challenging data science monsters and gotten into production!

If you’re looking for help defeating these monsters on your next journey, check out our Building the Perfect AI Team ebook and learn how to assemble a team that is fit to vanquish all your data science problems.