When it comes to data science, it’s not about what you learn. It’s about what you are able to build with what you’ve learned.
The field of data science has been growing rapidly—especially in the last few years. We see exciting new tools and methods emerge all the time. And while these can be great, I feel they can cause some confusion as well. Why? Because they make data professionals think about the wrong questions.
Asking the wrong questions
What do I mean by asking the wrong questions?
Examples of wrong questions might be:
- What are the coolest new tools to try out?
- What are the most exciting data science problems nowadays?
- How can we fit these into our business (to experiment with them)?
Instead, we want to ask better questions like:
- What business problems (or opportunities) do we have right now?
- How can data help with this?
- Why and how will our data project be useful for the company?
- What should I learn to start building it?
Within data science, there is enormous hype around new tools every time a new machine learning algorithm is released. Or a new cloud-based solution is available. Or a new module is implemented for this or that programming language. And so on.
But aren’t these new tools important? Well, yes, but…
Tools are important, but with a caveat
Let’s think about an example from cooking. You can’t cook soup without a spoon. But when eating the soup, very few people will say: “Hmmm, you have a pretty nice wooden spoon.” Instead, most of them will say: “Yum, this food tastes really good!”
And that’s because, at the end of the day, tools are just tools. You have to learn how to use them…
But that’s not the full sentence. It’s rather:
You have to learn how to use them so you can build useful things with them…
And that’s still not quite all.
You have to learn how to use them so you can build useful things with them that will have a positive impact on your business’s bottom line.
Maybe it sounds obvious written down. And if it is for you, that’s great. But I see many data professionals choose to focus on fancy data science solutions over the ones they actually need. And then they hit a wall.
Unpopular opinion: most data scientists won’t need to know anything about deep learning
Let me give you just one example: deep learning.
I run a data science blog where I publish tutorials for aspiring data scientists on topics like the basics of Python or the basics of SQL, and so on.
And I get this question every week from someone: “When will you publish a tutorial on deep learning?”
And the answer is always the same: never.
Okay, I have to admit, I’ve played around with the idea of quickly drafting an introductory article on the topic… But it was tempting for one reason only: I know I’d get a lot of clicks for that article.
Most people want to learn about deep learning only because it’s popular. Why is it popular? Because it’s used for cool stuff, like self-driving cars at Tesla—and for that reason it gets a huge amount of media attention. That makes people excited and suddenly everyone wants to apply deep learning in their own projects.
But (at least in my opinion) it doesn’t work that way! A data science project should always start by defining the problem you want to solve. And once you have that, then you can choose the best tool to get the job done!
The naked reality is that in most data science projects, there is a much higher demand for more traditional tools, like:
- descriptive analytics and reporting
- data cleaning and data wrangling
- automating your processes
- simple predictions and forecasting
- simple classification methods
I know, at first, these sound less cool than deep learning… But believe me, when you are working on a real project, they are just as exciting (if not more)! Why? Because they get you useful information a lot more quickly than trying to tackle a project with something complicated like deep learning.
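To show just how far the “boring” tools can take you, here’s a minimal sketch of the “simple predictions and forecasting” item above: fitting a linear trend with plain numpy. (The revenue figures are made up purely for illustration.)

```python
import numpy as np

# Hypothetical monthly revenue figures (invented numbers, for illustration only)
revenue = np.array([120, 135, 128, 150, 162, 171, 180, 195])
months = np.arange(len(revenue))

# Fit a simple linear trend -- no deep learning required
slope, intercept = np.polyfit(months, revenue, deg=1)

# Forecast the next month by extending the trend line
next_month = len(revenue)
forecast = slope * next_month + intercept
print(f"Trend: about +{slope:.1f} per month, next month forecast: {forecast:.0f}")
```

A one-line trend fit like this won’t win any Kaggle competitions, but it answers the business question (“are we growing, and by roughly how much?”) in minutes instead of weeks.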
The reality of a data scientist’s job
Don’t get me wrong! I’m not mad at deep learning—nor at deep learning tutorials… (Maybe only at the fact that it’s a bit over-hyped right now.) I picked it as a random example, and only to demonstrate my general point about choosing and learning the right tools.
If I’m being honest here: I, too, find deep learning fascinating in itself, and I know it has a lot of potential. And I hope that when I encounter a project where I really need to use it, I’ll have a chance to learn more about it.
And in fact, that’s my point!
I strongly believe that—at least in data science—the right approach is this: Learn a tool when you have to build something useful with it. And not the other way around like many people do: “I learned this tool because it’s fancy and now that I know it, I’ll find a project to use it in real life, too.”
Let’s not talk about deep learning anymore. I could list many more examples of the tools mattering less than the outcome:
- As a data scientist at a movie streaming company, no one will care whether you know all the features of sklearn’s RandomForestClassifier. They’ll care whether you can serve the best movie recommendations to each user.
- As a data scientist at a small e-commerce shop, no one will care whether you use JOINs or subqueries (or both) in your SQL scripts. They will care whether you can find the one big thing they should do differently in next year’s marketing campaign to attract more customers.
- As a data scientist at a fintech startup, no one will care whether you use numpy, pandas, etc. They will care whether you can detect credit card fraud in real time or not.
And so on…
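To make that last fraud example a bit more concrete: a first working version often needs nothing fancier than a plain classifier on a handful of transaction features. Here’s a minimal sketch using scikit-learn’s logistic regression (the features and the tiny dataset are invented for illustration, not a real fraud model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented example features: [amount, is_foreign_country, seconds_since_last_txn]
X = np.array([
    [25.0, 0, 3600],
    [19.9, 0, 7200],
    [980.0, 1, 12],
    [1200.0, 1, 8],
    [40.0, 0, 5400],
    [999.0, 1, 20],
])
y = np.array([0, 0, 1, 1, 0, 1])  # 1 = fraudulent

# A plain logistic regression is a perfectly reasonable first baseline
model = LogisticRegression()
model.fit(X, y)

# Score a new incoming transaction
new_txn = np.array([[1100.0, 1, 15]])
print("fraud probability:", model.predict_proba(new_txn)[0, 1])
```

Of course, a production system needs much more (feature pipelines, monitoring, retraining), but the point stands: the business outcome comes first, and a simple model that ships beats a fancy one that doesn’t.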
So what to learn next?
If you’re reading this article, there’s a good chance that you already know and use a few data tools in your day-to-day job. That’s great!
The question is what you should learn next and why. Here’s the simple 4-step plan that I follow when I encounter this issue:
- Define the business problem! (What do you want to achieve? Why is it important? Why is it useful? Etc.)
- Can you get the job done with the tools you use right now? If not, why not? What’s missing? (Performance? Features? Integration with other tools? Etc.)
- If you need to learn a new tool, pick one that can fill the gap in your current toolset and learn that.
- When choosing a new tool, pick the one that will give you the most bang for your buck. Rather than being laser-focused on only one problem, consider the other kinds of problems that come up in your business, and how the new tool could help you with those as well.
If you use Excel but it can’t handle the size of your data anymore, maybe you should consider learning and using SQL. It works with the same kind of 2-dimensional, tabular data, only it’s designed for much bigger datasets.
If you use Java for your data science projects, but the prototyping phase has gotten really slow (because of engineering time) and messy (because of the varying skill sets of the team members), then maybe you should try out a tool like RapidMiner that prioritizes collaboration and diverse teams.
And so on and on.
I hope that I managed to articulate my point clearly enough in this article. The hype is huge around many tools in data science—but data scientists should know better than to fall for it! You shouldn’t learn the fancy things just to learn them; you should learn the things you’ll need to use.
In other words, when it comes to data science: learn things so that you can build things.
Other than tooling, a key part of any successful data science strategy is a strong team. Download a copy of Building the Perfect AI Team to learn how to start assembling yours today!
This blog post is guest written by Tomi Mester of data36.com. Tomi Mester has been a practicing data analyst and researcher since 2012. He has worked for Prezi, iZettle, and several smaller companies as an analyst/consultant. He’s the author of the Data36 blog where he writes posts and tutorials on a weekly basis about data science, coding, statistics, and more.