Matt North, Utah Valley University
Since publication of its first edition back in 2012, Data Mining for the Masses has become a staple gateway text for learning data science and analytics using RapidMiner. This presentation reviews the latest enhancements to the book, now published in its fourth edition as an interactive, smart textbook on the MyEducator platform.
00:03 [music] All right. So I like to be interactive, and so if you want to follow along– I checked this on my phone this morning. It worked on my phone. If you’ve got a laptop or a phone and you want to work right along with me, you can actually do this. So to kind of dovetail with Scott’s introductory story there, going all the way back to 2004, I had a really visionary department chair. I was a software engineer at eBay, and I decided that I was burned out of being a software engineer, and what I really loved to do was teach. And so in January of 2004, I got a college, Washington & Jefferson College outside of Pittsburgh, to hire me as a professor. And my very first department chair was absolutely visionary. He said, “You know–” I wasn’t even Dr. North back then. He just said Matt. He said, “I think this data analytics thing that I keep hearing a lot about–” we actually referred to it as knowledge and data discovery back then, KDD. He said, “I think there’s some teeth to this. I think this is going somewhere. And as an undergraduate program, I think we should be teaching data mining. Can you teach data mining?” “Sure, yeah, I can teach data mining. Why not?” Right? [laughter] Well, we didn’t have a graduate program, and I started looking around and I realized everybody who was teaching data mining back in 2004 was doing it at the graduate level, and all of their graduate students had bachelor’s degrees in math and computer science and statistics. And I had these undergraduate students who were freshmen and sophomores and juniors, some of whom hadn’t taken the freshman stats class yet. And he said, “Can we teach data mining to people who aren’t experts already?” And I said, “Well, I don’t know. I think so.”
01:46 So I spent the next three years working on a PhD, and my dissertation was on teaching data mining to nonexperts, to people who were smart but didn’t necessarily have a pre-built background or foundation in quantitative or computing skills. And so I did a dissertation on this. I used my students and some other students around the country as guinea pigs. And I found that we could, actually. If we had the right tools, we could teach data mining to nonexperts. And my department chair said, “We really need to get this class up and going. What are you going to do about it?” And I said, “Well, all of the textbooks that are out there are really dense and hard to access if you’re not an expert.” The best one at the time was by Micheline Kamber and Jiawei Han. Some of you may have seen that one. It’s called Data Mining Techniques. And I couldn’t see my students being able to use it. It even said in the foreword to that book, “This is for graduate students.” And I said, “Well, I guess I’m going to have to write the book. If I’m going to teach the class, I’m going to have to write the book.”
02:50 So I got started working on a book, and I had several false starts because I was trying to find a tool that I could base the book on. And I tried a couple of different ones. I tried WEKA. I tried one called Tanagra. I tried XLMiner, which was an add-in some of you may have encountered before. And all of those resulted kind of in dead ends for me. And then one day, I found RapidMiner 4. It was a free download. It was from this weird group of people in Germany that appeared. There was maybe six or eight of them working on this thing. It was open source. And I downloaded it, and I started fiddling around with it, and I was like, “Man, this is cool. This works really well.” And the thing that I liked about it the best was that you could drag and drop things to build your processes. There was no coding. There was no scripting. And one really nice feature was when you put it together and you hit Run, it ran, and it didn’t give you a buffer overflow or something. I was like, “This is amazing.”
03:50 So I found myself becoming more and more of an aficionado and even to some extent an expert in RapidMiner 4. Then they released RapidMiner 5, and it was even better. And by that point, it was almost 2011. And I worked the whole next year writing a textbook using RapidMiner as a foundation for people who were not experts. And I started thinking, “What should we call this thing?” Well, I’m trying to teach data mining to people from all different disciplines, all fields, with all kinds of expertise. I’m trying to teach data mining to the masses. So I called my book Data Mining for the Masses, and I found a little publisher down in Georgia that was willing to publish it for me. And then I emailed a guy named Simon Fischer, who was one of the weird group of German guys that was working on this software, and I said, “Hey, I see you have a user conference coming up in Budapest. I just wrote a book about your software. Would you be interested in hearing about it?” And they said, “Yeah, that’s great. Come.” So I went to Budapest, where I’ve never been before, and I brought the book with me. And I wasn’t sure anyone would even be interested in the book, and so I brought one for everyone, just gave it away because I didn’t know if anyone would pay for it.
05:06 Well, to make a long story short, this is the little engine that could. [laughter] People started calling me up and emailing me from all over the world. I got emails from South Africa, from Japan, from Australia. “Hey, we love your book. Can we use it? Is it okay if we print it? Can I translate it into my language?” “Yeah, sure. Translate it into your language. That’s great.” Then I started getting complaints. “Hey, your book’s a little bit out of date. Maybe we should update it.” So in 2015, I put out a second edition. And then I started getting some comments in 2017. “Hey, RapidMiner has some new features. When’s your book going to be brought up to speed?” and I’m like, “Man, this is a lot of work.” But I put out a third edition in New Orleans a year and a half ago. Some of you were there. You got a copy of that third edition. It was the purple one. We had the green one, then the blue one, then the purple one. And in the last year or so, I’ve been getting a lot of requests from people regarding Data Mining for the Masses to enrich the book, basically.
06:04 So I started thinking, “We’re in the technological age. Maybe we ought to embrace a technological platform.” So I partnered with this company, MyEducator, and we’ve released the fourth edition as a digital e-book. It has a number of different enhancements, and I’ll be happy to share those with you in the five minutes I have remaining in my presentation so you can learn a little bit more about the book. Now, the book is definitely written– it was written by me, for me because I was the one who wanted to use it, and it’s worked for a lot of other people. But one of the things that I like about the partnership with MyEducator is, now that it’s on a digital platform, the book is much more flexible. So if you want to use it in corporate training, if you want to take pieces and parts of it– Brian told me you’ve used it in corporate training in the past, haven’t you? So if you want to rearrange chapters or put some chapters in that aren’t there and then maybe take some of my chapters out, that’s fine. You can do that now that it’s in digital format. So you can use this login and password. If you just go to myeducator.com – that’s what I’m going to do right now – I’ll just show you a few of the features that we’ve added. So Scotty Pectol is the director of marketing and sales at RapidMiner, and he set this up so that any of you can log in and take a look at it. It’s just email@example.com, which I think he just made up. And then the password is all lowercase, wisdom. And I hit Enter, and it will take me in. Hopefully. I tested the Wi-Fi earlier. Yeah, there we go.
07:32 So we have Data Mining for the Masses, fourth edition. So there’s all of the chapters. You can see them listed all here. We have a kind of a business understanding and data understanding section at the beginning. We go through several steps of how do we use things, really basic things. Remember, this is for the masses, right? So how do I use a select attributes operator? How do I use a filter examples operator? How do I take a random sample? How do I take a stratified sample? We have operators in RapidMiner that can do that. So that’s kind of the first section. Then I go through categorical modeling techniques, again, relatively basic, correlations, association rules, k-means clustering, and then we get into the predictive stuff, discriminant analysis, linear and logistic regression – let’s see – decision trees, neural networks. I do do text mining, and in Family Feud, we heard that text mining was the number-one add-in for RapidMiner, and I do do that one. Also, in evaluation and deployment, we do cross-validation. And then I have a chapter on ethics, and Ingo actually shook my hand and thanked me for putting that into the book because one of the very first sentences in that chapter said, “This is an introductory text. I have given you enough knowledge to be dangerous. Don’t be.” That’s what it says. I wrote that right in the book because I didn’t want to empower people with tools that could cause them to make bad decisions or make false assumptions or to coerce people using data in ways that people shouldn’t be coerced. And so that was a very important chapter. And, kind of, I put it at the end, not because it was last, but because it was the thing– if you took anything away from the book, I wanted you to take that away. Right? And so I kind of made that the capstone of the book.
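[Editor's sketch, not part of the talk: the difference between the random sample and the stratified sample mentioned above can be illustrated in a few lines of Python. This is only a minimal sketch of the general idea, not how RapidMiner's sampling operators are implemented. The function names, the `label` field, and the toy data are all assumptions made for illustration.]

```python
import random
from collections import defaultdict


def random_sample(rows, n, seed=0):
    """Draw n rows uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


def stratified_sample(rows, label_of, fraction, seed=0):
    """Draw the same fraction from every label group, so the sample
    keeps roughly the same class proportions as the full data set."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    sample = []
    for members in groups.values():
        # Keep at least one row per group so small classes are not lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical toy data: 20 "yes" rows and 80 "no" rows.
rows = [{"id": i, "label": "yes" if i < 20 else "no"} for i in range(100)]

simple = random_sample(rows, 10)
strat = stratified_sample(rows, lambda r: r["label"], 0.1)
yes_in_strat = sum(1 for r in strat if r["label"] == "yes")
```

[With a plain random sample, the number of "yes" rows varies from draw to draw. The stratified version fixes it at 10 percent of each group, which matters when one class is rare.]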
09:18 So all of the chapters have been updated in the digital version. I made videos, which– I am just really handsome, so having myself in videos, it just– actually, I got a complaint because one of the chapters didn’t have a video, and one of the students didn’t like that. He wanted to see me in all of the chapters. [laughter] So I made a video for that chapter as well. There I am. See? Look how handsome I am. I’m like, “errr!” But there’s not just videos of me talking, but if you go into the Modeling section of each chapter, there’s also live walk-throughs. And so you can actually see me. Nope. Where is the video? It’s here somewhere. Maybe it’s on the next page or something. I don’t think it’s– oh, there it is right there. It’s under Data Preparation in this chapter, but there’s a video, and I actually do a live walk-through. Now, in response to additional requests, one version ago, I actually added how to do the same techniques and skills in R. So you can go through it in RapidMiner, and then if you want, you can actually do the exact same process in R as well, and there’s videos to show you how to do that as well.
10:34 Now, videos aren’t the only thing we’ve done to enhance it. The MyEducator platform actually has analytics to see how your students are doing. We don’t have any students in this class, so if we look at the course dashboard – because Scotty just set this up for me to be able to show you this – you can see the analytics for your class. You have total students, how they’re doing on average, what excellent kind of looks like in your class, and then are there some students that are maybe at risk, and that would be based upon their performance completing the different activities in the book. But because we don’t have any students, this is not particularly interesting, but I wanted you to know it’s there. In addition, I’ve also added some resources that would be– they would be considered typical with a larger publishing company like Prentice Hall or McGraw Hill, where you have lecture slides and you have test-bank questions and things. Those are things that I never really provided when it was just a printed book that you could buy on Amazon. But now, we’ve been able to add all of those and integrate these into the MyEducator platform.
11:38 And so you can see there are Knowledge Checks here that students can complete. You can assign them. I have two different exams plus– two midterm exams and a final exam that are there, a couple of different multiple-choice exams as well. So if you don’t want to do the type of essay-style grading where I actually have my students go in and take a test and then write out an answer and explain what they found, that’s a lot of grading. Some faculty felt like they didn’t want to do that much grading. So I told them they were lazy, and then I gave them the multiple choice anyway. [laughter] So there’s a few different examples of how to do this. If you are a teacher or if you know teachers who may be interested in using this, it hooks into all of the major learning platforms like Canvas and Brightspace, Desire2Learn, Blackboard. It hooks into all of those automatically so you can actually push your grades.
12:30 And the next thing that I’m working on with MyEducator now that we’re on this platform, aside from trying to keep it current with all of the new releases that RapidMiner puts out, is where we can have actual homework assignments, where they will go in and build a RapidMiner process and export it as an RMP file and upload it, and it will automatically assess whether or not they built the process correctly and generated the right results. So that will help with reducing the grading load as well. And we’re working actively with the software engineers at MyEducator to be able to deploy that before the end of this year.
13:05 So that’s kind of where things stand right now. And like I said, I built this because I wanted it to be a gateway for people to be able to get into doing data analytics. I wanted to give them something that was a low barrier of entry so that more people can do this type of analysis. So if you think it’s a resource that might be helpful to you, even if you’re not an educator, you don’t work in a university, by all means, ask me for a business card. I have my business cards right up here. I also have some from Scotty at MyEducator if you want to contact him directly. He’s very happy to work with people. And – what did I do with them? – I got a huge stack of flyers as well, so you’re welcome to grab one of these. I’ll also leave them maybe at the front table in case other people who couldn’t attend this session want to grab one. So they’ll be out there. And I guess that’s it. Scotty, did I run out of time? Yeah? [laughter] Any other questions? We good?
14:01 All right. Rock and roll. [music]