It’s been one week since RapidMiner Wisdom. Time flies when you are both exhausted and inspired. I had a fantastic experience at Wisdom meeting with customers, prospects, partners, and users. But most of all, I enjoyed learning from all the great presentations I attended over the two-day event. Here’s a recap of the presentations I attended on the first day of Wisdom.
Wisdom 2018 kicked off with a keynote from RapidMiner founder Dr. Ingo Mierswa on the benefits and dangers of automated machine learning.
Automated machine learning tools like RapidMiner Auto Model are a great way to empower more people in an organization to deliver data science. Using automated machine learning with the appropriate guard rails in place, business analysts and subject matter experts can build predictive analytics and deliver value to their organization.
But as Ingo pointed out, there are no magic wands. Before decisions are made, or models are put into production, experienced data scientists need to be involved in the process. For this to happen, automated machine learning models cannot be black boxes. Ingo pointed out that there are two types of black boxes. The first is the explainability of the machine learning model itself. Some models are less explainable than others, but many model types like decision trees are easily explainable. But there’s another type of automated machine learning black box that needs to be avoided at all costs, and that’s the process for creating the model itself. How was the data prepared? How was the model validated? How were the parameters tuned? All of these, and more, can have a huge impact on the outcome of a model and must not be hidden inside a black box. #noblackboxes
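To make the first kind of black box concrete: a decision tree is explainable because every prediction can be read back as the chain of rules it followed. Here's a minimal sketch in plain Python; the toy tree, feature names, and thresholds are invented for illustration and say nothing about how Auto Model works internally.

```python
def predict_churn(customer):
    """Walk a toy decision tree, recording each rule on the path taken."""
    path = []
    if customer["monthly_usage_hours"] < 5:
        path.append("monthly_usage_hours < 5")
        if customer["support_tickets"] > 3:
            path.append("support_tickets > 3")
            return "churn", path
        path.append("support_tickets <= 3")
        return "no churn", path
    path.append("monthly_usage_hours >= 5")
    return "no churn", path

# The prediction comes with its own explanation: the rules on the path.
label, path = predict_churn({"monthly_usage_hours": 2, "support_tickets": 5})
print(label)               # churn
print(" AND ".join(path))  # monthly_usage_hours < 5 AND support_tickets > 3
```

A neural network gives you no such readable path, which is why Ingo's distinction between explainable and opaque model types matters.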
Next up was Jeff Dwyer from ezCater, one of the fastest growing startups on the planet.
Jeff is on the growth team at ezCater. He was asked to help ezCater better understand the unit economics of their customer acquisition efforts, in particular how much year 1 revenue they could drive from various marketing campaigns.
Jeff is a software engineer at ezCater, so naturally he started by looking at Python and all the usual libraries like XGBoost and scikit-learn. While he made some initial progress in Python, he quickly realized that it would take a significant time investment to become productive there, so he decided to explore visual workflow solutions like RapidMiner.
Within an hour of trying RapidMiner, Jeff produced a regression model and was able to explain the pipeline to his colleagues by walking them through the process. Ultimately, the model Jeff created helped ezCater move from simply predicting the number of new customers they acquired each week to an accurate prediction on the year 1 revenue added to the business.
Shortly after deploying this model, ezCater raised $100m in venture capital to continue growing. Some might call this correlation, but I’m going with causation 😀.
One of my favorite presentations at Wisdom came from Wes Gahagan and Abilash Gatti of Verizon Wireless. Wes and Abilash work on the customer success team at Verizon Wireless. They used RapidMiner to create an outlier detection model to improve customer support.
As you can imagine, Verizon Wireless collects a stunning amount of data with more than 32 million daily transactions. Hidden in that data are outliers that, if found, can be used to improve customer support. Using RapidMiner together with Oracle, Verizon Wireless built an anomaly detection dashboard that is used daily by the support team.
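Verizon's presentation focused on the dashboard rather than the model internals, but one of the simplest ways to flag outliers in a stream of transaction counts is a z-score check: mark any point that sits too many standard deviations from the mean. This toy sketch (all numbers made up) captures the idea.

```python
import statistics

def zscore_outliers(values, threshold=2.5):
    """Flag points more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return [x for x in values if abs(x - mean) > threshold * stdev]

# Hypothetical daily transaction counts; the spike should be flagged.
daily_counts = [100, 102, 98, 101, 99, 103, 97, 500]
print(zscore_outliers(daily_counts))  # [500]
```

At Verizon's scale (32 million transactions a day), production systems typically use more robust techniques, but the principle of scoring each point against an expected distribution is the same.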
Clearing the Fog around Data Science and Machine Learning: The Usual Suspects in Some Unusual Places
The next keynote at Wisdom came from the great Kirk Borne, a principal data scientist at Booz Allen Hamilton. Kirk is an astrophysicist and during his impressive career he’s worked on the Hubble telescope, at the NASA Goddard Space Flight Center, and has taught data science at a number of universities including George Mason and the University of Maryland. If Kirk were a rockstar, he’d be Mick Jagger or Bono.
Kirk also wins the award for most selfies taken at Wisdom.
Kirk walked through a series of atypical applications for typical machine learning algorithms, including principal component analysis, graph mining, clustering and validation, association mining, and Bayesian belief networks.
My favorite story is an association mining story from Walmart. Walmart studied product sales at its Florida stores during the unusually active 2004 hurricane season. By using association mining, Walmart could make sure their stores were prepared for future hurricanes. Walmart built an immense database system with 460 terabytes of data at a cost of approximately $4 billion, so this was a serious project for them.
Their lesson? Of course Walmart needed to stockpile the usual assortment of survival items like water and canned goods before a hurricane arrived. But there were two items that were 7x more likely to be purchased during a hurricane than usual: beer and strawberry Pop-Tarts. I’ll just leave that there and move on…
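That 7x figure is a ratio against normal behavior, and association mining formalizes exactly that kind of co-occurrence with measures like lift: how much more often two items appear together than independence would predict. Here's a toy lift computation on made-up baskets (not Walmart's data).

```python
def lift(transactions, a, b):
    """lift(A -> B) = P(A and B) / (P(A) * P(B)).
    Values above 1 mean A and B co-occur more often than chance."""
    n = len(transactions)
    p_a = sum(a in t for t in transactions) / n
    p_b = sum(b in t for t in transactions) / n
    p_ab = sum(a in t and b in t for t in transactions) / n
    return p_ab / (p_a * p_b)

# Invented shopping baskets for illustration.
baskets = [
    {"water", "beer", "pop-tarts"},
    {"water", "beer", "pop-tarts"},
    {"water", "canned goods"},
    {"beer", "pop-tarts"},
    {"water"},
    {"canned goods"},
]
print(lift(baskets, "beer", "pop-tarts"))  # 2.0
```

Real association mining tools (like RapidMiner's FP-Growth operator or the classic Apriori algorithm) search all item pairs and larger itemsets for rules with high support, confidence, and lift.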
Next I went to a presentation from Jose Mejias and Gilberto Graciani of Hewlett Packard Enterprise. This was probably my favorite presentation because Jose and Gilberto walked through how HPE decided to select RapidMiner as their corporate data science platform.
HPE started their selection process by using analyst reports, including the Gartner Magic Quadrant for Data Science & Machine Learning Platforms and the Forrester Wave for Multimodal Predictive Analytics & Machine Learning Solutions. After reviewing the analyst reports, they came up with a shortlist of vendors they wanted to take a closer look at. They then narrowed down this list by reaching out to their personal networks and asking for feedback.
Here’s what set RapidMiner apart from the other platforms they evaluated:
- The intuitive user experience of RapidMiner Studio.
- Scalability & open source.
- The ability to run data science processes on top of their Hadoop platform.
- Integration with Python and R.
- Templated use cases like customer churn and predictive maintenance.
- RapidMiner’s “Wisdom of Crowds” feature that guides users at each step in the workflow process.
As the head of marketing at RapidMiner, it’s rare that I get to hear exactly how and why an organization decides to purchase RapidMiner, so thanks Jose and Gilberto!
While the HPE presentation was my biased favorite from Wisdom, the presentation from Maggie from Clarkston Consulting absolutely blew us all away.
Most presentations at data science conferences end up deep into the weeds of algorithms and philosophical debates about the best machine learning techniques. We had a few of those sessions at Wisdom, and to be honest we probably should have had a few more.
But Maggie reminded us that 80% of data science projects have nothing to do with data or math or algorithms. According to Maggie, the hardest part of a data science project is often asking the right question. Otherwise data science projects often turn into slow moving, low impact “curiosity support” projects. Starting with the right questions ensures a clear understanding of the objectives and success criteria.
Another important lesson comes at the end of a data science project, when you are ready to deploy a model into production. Data science teams often get too hung up on fractional improvements to model performance when the model's lift over the current baseline is already substantial.
As Maggie put it “all models are wrong, but some are useful.”
Rodrigo Fuentealba Cartes of The Pegasus Group is a fabled RapidMiner unicorn 🦄 and he also wins the award for the longest travel to attend Wisdom coming from Chile.
Chile is the world’s second-largest farmed-salmon exporter, generating over $4.5b in revenue. But the salmon farming industry is constantly threatened by sea lice, a deadly parasite that damages salmonids and has a significant financial impact on the industry.
Rodrigo uses RapidMiner to understand how the parasite spreads, including building a 4D representation of the ocean in time-series format. Using this model, he can predict which salmon farms are at risk of parasite infection. One of the models he has created is a manually trained decision tree that rates the threat level on a scale of 0 to 10.
Using this model, he expects to help salmon farms save over $24m by 2019.
We put Michael Gloven of Expert Infrastructure Solutions in the unenviable position of delivering the last presentation of the day. But Michael delivered a fascinating session on water pipeline distribution systems to a packed room!
There are more than 1 million miles of water pipeline in the United States, and most were installed in the mid-1900s. So it’s not surprising that there are 240,000+ water main breaks a year, causing a loss of 6 billion gallons a day. Upgrading the US water infrastructure will require a $1 trillion investment over the next 25 years.
That’s where machine learning comes in. Predictive models can help prioritize the infrastructure investments… what to replace, when to replace it, and how much to spend. Michael’s presentation went into the details of his analysis in great depth.
But what made us most proud was his usage of two of the newest capabilities in RapidMiner Studio, Turbo Prep and Auto Model.
Michael started his analysis in Turbo Prep, brought his data into Auto Model to prototype models and form his initial hypothesis, and then ultimately exported the results into RapidMiner Studio to build his production model. #noblackboxes
Thanks so much to everyone for your fantastic presentations. It was obvious that you put a tremendous amount of work into them, and we’re humbled that you chose to spend a few days with us. It’s unfortunate that I couldn’t be in all of the sessions, but luckily you can check out all of the decks here.
Next up, my recap of Day 2.