Thoughts on the 2017 KDNuggets Poll on Data Science Tools

Are you a data scientist and do not know KDNuggets?  How is this possible?  OK, go there right now, add a bookmark, and make this part of your daily reading list.  But don’t forget to come back here afterwards to read the rest of this post.

KDNuggets is one of the most popular portals for data science, and is a great source for news and information.  It probably will not be winning a design award any time soon.  But the rich, deep content is why you will go back over and over, and that’s what really matters.

I spend more time on KDNuggets than usual in May, because that’s when the annual KDNuggets poll What data science solution did you use in the past 12 months? comes out. Gregory Piatetsky-Shapiro, the editor of KDNuggets and one of the best-known data scientists in the world, has been doing this poll for 18 years.

Gregory just published the results for 2017, and about 2,900 people have shared their software preferences for data science tools. And as always, there is a lot to learn from those results.

What’s new in data science in 2017?

First things first: RapidMiner was again voted as the most popular general data science platform and this is all thanks to our user community!  33% of all voters said that they are using RapidMiner, which is an amazing result. Many thanks to all of you!

But we know that data scientists are using up to 6 different tools in parallel.  So besides RapidMiner, what other tools are people using?

Let’s start with the programming languages. It should not come as a surprise that R and Python are the two leading languages for data science.  This year, Python got slightly more votes than R, although the difference might not be significant.  But in general Python has shown the bigger growth rates in previous years, and I would not be surprised to see Python take the lead over R in the future.  And then there is of course SQL, which took third place among the programming languages.  SQL will of course never die, so no surprise here.

Connected to Python’s growth is Anaconda, a Python distribution with package management. Big shout out to our friends at Anaconda for growing that quickly!

On an infrastructure level, Apache Spark was used by 23% of all data scientists but Hadoop by only 7%.  And while we are talking about big data, the library MLLib was used by only 5%, much less than many other options.  To be honest, this was a bit of a surprise to me.

Deep Learning is all the rage

Yes, I am guilty of not playing along with the crazy deep learning hype of the past few years.  After all, the technology is much less innovative than most people believe. But I will admit that there is a strong growth trend around deep learning in our field.

This year, more than 32% of all data scientists said that they are using deep learning, up from 18% in 2016 and 9% in 2015.  Roughly doubling every year is impressive growth indeed.
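The growth figures above can be checked with a few lines of Python. This is only a minimal sketch: the names `deep_learning_share` and `growth_factors` are illustrative, and 32 is used as a lower bound because the 2017 share is given only as "more than 32%".

```python
# Shares (in percent) of poll respondents reporting deep learning use,
# as quoted in the text. The 2017 value is stated as "more than 32%",
# so 32.0 is an assumed lower bound, not an exact figure.
deep_learning_share = {2015: 9.0, 2016: 18.0, 2017: 32.0}


def growth_factors(shares):
    """Return {year: share / previous year's share} for consecutive years."""
    years = sorted(shares)
    return {
        year: shares[year] / shares[prev]
        for prev, year in zip(years, years[1:])
    }


factors = growth_factors(deep_learning_share)
for year, factor in factors.items():
    print(f"{year}: {factor:.2f}x the previous year's share")
```

With these inputs the share exactly doubles from 2015 to 2016 and grows by a factor of about 1.8 from 2016 to 2017, so "doubling" is a fair approximation rather than an exact description.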

There are now a dozen or so deep learning libraries.  The most widely used one of course is Google’s Tensorflow, now used by 20% of all data scientists.

RapidMiner’s history with the KDNuggets poll

I view this poll a bit like a sporting event. It won’t make or break a vendor, but I at least take it seriously. I think all vendors should, and it looked like more of them did this year.

The history of RapidMiner in the poll is interesting as well.  In 2006, our co-founder Ralf Klinkenberg was already asking why YALE was not an option in the poll (YALE was the former name of RapidMiner, and an acronym for “Yet Another Learning Environment”).  Who could have known that only 11 years later machine learning would be all the hype?

RapidMiner was first included in the poll in 2007, and YALE was the most widely used open source platform from the start.  Some of our commercial competitors like SAS and SPSS were ahead of us back then, but thanks to our loyal community and user base this changed quite quickly.  In 2008, we ended up just shy of SPSS Clementine (which later became SPSS Modeler).  We remained in the top 3 for a couple of years, and during that time other open source solutions like R started to gain more traction in the poll.

Starting in 2011, RapidMiner took over first place among all data science platform tools, and we have been able to keep this position since then.  One of the great things, however, is that data scientists now have many different approaches and often mix and match the different solutions.  There are clearly leading data science platforms like RapidMiner, and in addition we have two great programming languages for data science, namely R and Python.

And then there are dozens of libraries like MLLib or Tensorflow, most of them accessible through RapidMiner as well.  So, you will be able to find the right tool for your problem and this is a wonderful situation to be in for data scientists.  Compare this to software offerings in the earlier years of this poll (check out the links above).

It’s a great time to be a data scientist indeed!

  • Jimmy Moon

    Great insight, and frank talk that didn’t feel like marketing.

    By the way, you mentioned using RapidMiner with Tensorflow. Could you tell me where I can find materials about that?
