David Weisman, Ph.D. is the real deal. He’s a data scientist with over 30 years of accomplishment in IT consulting and computer science. His current projects include text and social media data mining, churn prediction, customer segmentation, and enterprise-level strategy for predictive analytics. His clients include consumer product companies, large consulting firms, technology platform providers, financial services firms, and biotechnology companies. David has been working with RapidMiner for several years, and most of his work is in consulting, using RapidMiner to build predictive models and do data analytics for clients across a number of different industries. We recently asked him a series of questions about his ongoing work with RapidMiner products and services.
David, can you tell us what some of the recent challenges are that you face as a consultant trying to do data analysis for your clients?
As a consultant, I live off of my laptop. I also have a desktop machine in my office, but there’s a fundamental requirement of running large computational predictive analytical models, and that is that they take a lot of processing power. On a laptop, that really means a lot of battery drain, so I can have long battery life or I can run big models, but I can’t do both. And so for me, a fundamental necessity is the ability to run these models on a larger server infrastructure – and that has been difficult to manage and do on my own.
What are your initial thoughts about RapidMiner Cloud?
I’m so happy to hear about RapidMiner Cloud, which will allow me to run processes remotely. In reality, I could either buy a lot of hardware and manage that whole infrastructure myself, or I could take these peak load requirements that I have and run them on RapidMiner Cloud! Quite frankly, my usage of hardware runs in bursts – it has very large peak loads but the average utilization is low – and so a Cloud infrastructure really is a perfect model. I can buy as much time as I need for those peak loads, and at the same time the average cost amortized over time is extremely low.
Can you tell us about some of the use cases where you are using or plan to utilize RapidMiner Cloud?
Right now I’m working on customer segmentation in healthcare and we’re taking a large collection of information about patients and trying to cluster them into different groups based on their lifestyle behaviors. The nature of this sort of segmentation is that you build many models while trying to put your finger on the model that best segments this group of customers. So there’s a great deal of computational work needed. We do lots of different types of preprocessing, different clustering algorithms, different parameterizations of the clustering algorithms, different feature selection, cluster validation and so forth. So it’s not like you just press go and you get one answer out. It’s that you have to iterate many, many times over this process and this is exactly where RapidMiner Cloud is valuable. I can now take these different iterations, run them concurrently, use a very high peak load of computing power, and not have to worry about it running on my laptop or my desktop machines.
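The iterate-many-times workflow described above – trying different preprocessing steps, different cluster counts, and validating each run – can be sketched in a few lines. This is a minimal pure-Python illustration, not the actual RapidMiner process: the toy one-dimensional “lifestyle score” data, the tiny k-means, and the silhouette validation are all stand-ins for the real pipeline.

```python
import itertools
import random

random.seed(0)

# Toy stand-in for patient lifestyle features: two natural groups in 1-D.
data = [random.gauss(0.2, 0.05) for _ in range(30)] + \
       [random.gauss(0.8, 0.05) for _ in range(30)]

def kmeans_1d(points, k, iters=25):
    """Lloyd's algorithm in 1-D with deterministic quantile initialization."""
    pts = sorted(points)
    centers = [pts[int((i + 0.5) * len(pts) / k)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(p - centers[c]))
                  for p in points]
        for c in range(k):
            member = [p for p, l in zip(points, labels) if l == c]
            if member:                      # keep old center if cluster empties
                centers[c] = sum(member) / len(member)
    return labels

def silhouette(points, labels, k):
    """Average silhouette width: higher means better-separated clusters."""
    total = 0.0
    for i, p in enumerate(points):
        same = [abs(p - q) for q, l in zip(points, labels) if l == labels[i]]
        a = sum(same) / (len(same) - 1) if len(same) > 1 else 0.0
        others = []
        for c in range(k):
            if c == labels[i]:
                continue
            member = [abs(p - q) for q, l in zip(points, labels) if l == c]
            if member:
                others.append(sum(member) / len(member))
        b = min(others)                     # nearest other cluster
        total += (b - a) / max(a, b)
    return total / len(points)

def identity(xs):
    return xs

def minmax_scale(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# Iterate over preprocessing x parameterization, validate each run, keep best.
best = None
for prep, k in itertools.product([identity, minmax_scale], [2, 3, 4]):
    scaled = prep(data)
    labels = kmeans_1d(scaled, k)
    score = silhouette(scaled, labels, k)
    if best is None or score > best[0]:
        best = (score, prep.__name__, k)

print("best run:", best[1], "k =", best[2])
```

With this toy data the two natural groups score highest, so the loop settles on k = 2; the point is the structure of the search – many candidate models, one validation criterion – rather than any particular algorithm.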
You’ve shared with us in the past the collaborative nature of RapidMiner, particularly as it related to a use case for a financial services client of yours. Perhaps you could tell us a little bit more about that?
Sure. I worked closely with a RapidMiner client in financial services and they wanted to build a churn model, but they were really just getting started in data science. Prior to this project they had done lots of reporting and they had people who could build out dashboards and things like this, but there was really no predictive analytical capability in this large firm. And so we worked on a number of different concepts. The one I want to focus on was a churn model of their clients. We didn’t know what type of data existed. We didn’t know how much coverage there was, or what sorts of data there were. We gathered data from many different stovepipe systems and aggregated it into RapidMiner, and so I had this very tight, iterative feedback loop with the client. We sat there with RapidMiner and we looked at data and they easily understood it. We built out some initial predictive models and we built out decision trees. Part of what was very interesting was to watch these business experts at the client looking at what came out in the decision tree, and they said, “That makes perfect sense.” And so part of that process was facilitated by RapidMiner in its collaborative aspects. But the fact was that we could do visualization in a very ad hoc manner, look at data, and do simple cross tabulations to make sure that we were going down the right paths. It worked out really nicely, and when I left that client they were able to use RapidMiner after about an hour of instruction on my part. I showed them how you take a dataset, bring it into RapidMiner, run a model, look at the decision tree, and print the decision tree. They can now act on those results.
And speaking of the collaboration, what does RapidMiner Cloud mean to you in terms of collaborating with your clients?
Well, in particular, it means that I can build out models and run them in the Cloud, and then, because RapidMiner Cloud has connectivity to Amazon S3, Dropbox, MongoDB, and Cassandra, I can move the data back and forth very easily between my analytical infrastructure and my client’s data stores – and that’s an enormous win. I can also run regular SQL queries through RapidMiner Cloud. So there are lots of ways, from a collaborative aspect, that RapidMiner Cloud satisfies my needs.
You mentioned connectivity to NoSQL databases like Cassandra and MongoDB. You’re referring to some of the new capabilities of RapidMiner Version 6.1, but another feature of 6.1 is Wisdom of Crowds. What do you think of that capability?
Wisdom of Crowds is especially important because I’ve been using RapidMiner for a couple of years, and I’m probably one of the people who contributed to the answers found in Wisdom of Crowds. At the same time, it’s really fun to build out a process in RapidMiner, watch the Wisdom of Crowds window make suggestions, and see it get them right! Another nice feature is that the accuracy will grow over time as the crowd gets bigger and the accumulated wisdom gets more robust.
Another service is the RapidMiner marketplace where the community can develop, share and reuse extensions with one another. In fact, you were telling us earlier that you’ve even created your own extension and will be contributing it to the marketplace in the future. Can you talk a little about this?
The marketplace is very rich. You can go into the marketplace window in RapidMiner, query for something you need, and find the extensions that people have written. For example, there’s a recommender extension, and there’s an R extension that I’ve used very heavily to integrate R and RapidMiner, and I’ve also written an extension that I demonstrated at RapidMiner World this summer that does prescriptive analytics. And so the ability to build and run these extensions is very straightforward. It’s just Java programming, following RapidMiner’s API. I had never done one before, and in a couple of days the roots of it were up and running. It’s not challenging and you can essentially do anything with it. The RapidMiner user just sees traditional RapidMiner operators – these little blocks – and you connect them together with wires and all this magic happens behind the scenes. And so I see the extension marketplace as a real growth opportunity for RapidMiner customers.
What’s your advice for somebody evaluating RapidMiner and RapidMiner Cloud and thinking about onboarding this platform?
I think in both cases it’s really an economic argument. If you were to compare RapidMiner with other data mining technologies, I find the real value-add is the usability. I don’t have to start writing software to do what I need. The fact that I can just point and click and build out a model and validate the model is an incredibly fast means to prototype data mining. In terms of the cloud, it’s also an economic argument. I think people should look at how much time they would spend building out an infrastructure to run these large peak load models, and also compute how much effort it takes to maintain that infrastructure, which would include keeping the software up to date and keeping security patches applied. All of these ongoing costs can be turned into dollars, and then you can compare those dollars against the subscription price for the Cloud and see which one nets out better. I know that in the analysis I’ve done, the Cloud worked out to be a huge win.
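The cost comparison described above can be made concrete with simple arithmetic. This is a hedged sketch of that back-of-the-envelope exercise; every figure below is a hypothetical placeholder, not real RapidMiner Cloud pricing or anyone’s actual infrastructure costs.

```python
def annual_on_prem_cost(hardware, admin_hours, hourly_rate, maintenance):
    """Hardware amortized over 3 years, plus admin labor and upkeep
    (patching, upgrades) per year. All inputs are illustrative."""
    return hardware / 3 + admin_hours * hourly_rate + maintenance

def annual_cloud_cost(peak_hours, rate_per_hour, subscription):
    """Pay only for burst compute during peak loads, plus the subscription.
    Rates here are made up for the sake of the comparison."""
    return peak_hours * rate_per_hour + subscription

# Hypothetical numbers: $9k of hardware, 100 admin hours at $80/hr, $1.5k
# maintenance vs. 300 burst hours at $2/hr plus a $3k subscription.
on_prem = annual_on_prem_cost(hardware=9000, admin_hours=100,
                              hourly_rate=80, maintenance=1500)   # 12500.0
cloud = annual_cloud_cost(peak_hours=300, rate_per_hour=2.0,
                          subscription=3000)                      # 3600.0
print("on-prem:", on_prem, "cloud:", cloud, "cloud cheaper:", cloud < on_prem)
```

The bursty-usage pattern is what drives the result: with low average utilization, the on-prem hardware and upkeep are paid for year-round while the cloud is only paid for during peaks.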
This may seem like a loaded question, but what does RapidMiner mean to you?
That’s actually an easy question. It comes down to usability and speed of getting work done. If I were to compare doing data mining in RapidMiner with traditional programming in R or Python (and I’m very comfortable using those), it would take much longer to develop. I could either write some code in R and it might take me, say, a day, or I could do a little bit of prototyping in RapidMiner that might take me an hour, so it really gets back to that economic opportunity cost. My clients are not measuring my value by how much programming I do. They’re measuring my value-add by how much I can deliver. If I can get eight times the amount of work done in a day with RapidMiner than I can with programming, that’s an enormous win for me and it’s an enormous win for my clients. This is true for people working in small and large companies as well. Their performance objectives as employees are to build good models and to do data analysis. It’s not to work on the minutiae – it’s to show results. And that is the compelling value-add for RapidMiner.