Heatherly Carlson & Joo Eng Lee-Partridge, Central Connecticut State University
For many years, sports analytics has demonstrated a robust relationship between NFL draft measurements and NFL success. In particular, NFL teams have been harnessing the power of sports analytics to predict the future value of the quarterback. However, most of these pre-draft metrics deal with the physical prowess or prior physical achievements of the quarterback.
While these measurements may provide some estimate of the quarterback's predicted value in the NFL, we propose looking at other, intangible variables to provide an alternate indicator of future value. Some potential intangible variables that could predict QB success include character risk, injury resilience, psychological variables, environmental variables, adversity, cognitive ability, motivation, and leadership experience.
The authors explore the relationship between player intangibles and quarterback outcome measures such as the number of playoff game appearances, the number of years as starting QB, and whether the player has ever taken their team to a playoff game. They use both supervised and unsupervised machine learning to provide insights into QB success and QB intangible variables, and to provide recommendations for future QB drafts.
00:03 [music] All right. So I like to be interactive, and so if you want to follow along– I checked this on my phone this morning. It worked on my phone. If you’ve got a laptop or a phone and you want to work right along with me, you can actually do this. So to kind of dovetail with Scott’s introductory story there, going all the way back to 2004, I had a really visionary department chair. I was a software engineer at eBay, and I decided that I was burned out of being a software engineer, and what I really loved to do was teach. And so in January of 2004, I got a college, Washington & Jefferson College outside of Pittsburgh, to hire me as a professor. And my very first department chair was absolutely visionary. He said, “You know–” I wasn’t even Dr. North back then. He just said Matt. He said, “I think this data analytics thing that I keep hearing a lot about–” we actually referred to it as knowledge and data discovery back then, KDD. He said, “I think there’s some teeth to this. I think this is going somewhere. And as an undergraduate program, I think we should be teaching data mining. Can you teach data mining?” “Sure, yeah, I can teach data mining. Why not?” Right? [laughter] Well, we didn’t have a graduate program, and I started looking around and I realized everybody who was teaching data mining back in 2004 was doing it at the graduate level, and all of their graduate students had bachelor’s degrees in math and computer science and statistics. And I had these undergraduate students who were freshmen and sophomores and juniors, some of whom hadn’t taken the freshmen stats class yet. And he said, “Can we teach data mining to people who aren’t experts already?” And I said, “Well, I don’t know. I think so.”
01:46 So I spent the next three years working on a PhD, and my dissertation was on teaching data mining to nonexperts, to people who were smart but didn’t necessarily have a pre-built background or foundation in quantitative or computing skills. And so I did a dissertation on this. I used my students and some other students around the country as guinea pigs. And I found that we could, actually. If we had the right tools, we could teach data mining to nonexperts. And my department chair said, “We really need to get this class up and going. What are you going to do about it?” And I said, “Well, all of the textbooks that are out there are really dense and hard to access if you’re not an expert.” The best one at the time was by Micheline Kamber and Jiawei Han. Some of you may have seen that one. It’s called Data Mining: Concepts and Techniques. And I couldn’t see my students being able to use it. It even said in the foreword to that book, “This is for graduate students.” And I said, “Well, I guess I’m going to have to write the book. If I’m going to teach the class, I’m going to have to write the book.”
02:50 So I got started working on a book, and I had several false starts because I was trying to find a tool that I could base the book on. And I tried a couple of different ones. I tried WEKA. I tried one called Tanagra. I tried XLMiner, which was an add-in some of you may have encountered before. And all of those resulted kind of in dead ends for me. And then one day, I found RapidMiner 4. It was a free download. It was from this weird group of people in Germany that appeared. There was maybe six or eight of them working on this thing. It was open source. And I downloaded it, and I started fiddling around with it, and I was like, “Man, this is cool. This works really well.” And the thing that I liked about it the best was that you could drag and drop things to build your processes. There was no coding. There was no scripting. And one really nice feature was when you put it together and you hit Run, it ran, and it didn’t give you a buffer overflow or something. I was like, “This is amazing.”
03:50 So I found myself becoming more and more of an aficionado and even to some extent an expert in RapidMiner 4. Then they released RapidMiner 5, and it was even better. And by that point, it was almost 2011. And I worked the whole next year writing a textbook using RapidMiner as a foundation for people who were not experts. And I started thinking, “What should we call this thing?” Well, I’m trying to teach data mining to people from all different disciplines, all fields, with all kinds of expertise. I’m trying to teach data mining to the masses. So I called my book Data Mining for the Masses, and I found a little publisher down in Georgia that was willing to publish it for me. And then I emailed a guy named Simon Fischer, who was one of the weird group of German guys that was working on this software, and I said, “Hey, I see you have a user conference coming up in Budapest. I just wrote a book about your software. Would you be interested in hearing about it?” And they said, “Yeah, that’s great. Come.” So I went to Budapest, where I’ve never been before, and I brought the book with me. And I wasn’t sure anyone would even be interested in the book, and so I brought one for everyone, just gave it away because I didn’t know if anyone would pay for it.
05:06 Well, to make a long story short, this is the little engine that could. [laughter] People started calling me up and emailing me from all over the world. I got emails from South Africa, from Japan, from Australia. “Hey, we love your book. Can we use it? Is it okay if we print it? Can I translate it into my language?” “Yeah, sure. Translate it into your language. That’s great.” Then I started getting complaints. “Hey, your book’s a little bit out of date. Maybe we should update it.” So in 2015, I put out a second edition. And then I started getting some comments in 2017. “Hey, RapidMiner has some new features. When’s your book going to be brought up to speed?” and I’m like, “Man, this is a lot of work.” But I put out a third edition in New Orleans a year and a half ago. Some of you were there. You got a copy of that third edition. It was the purple one. We had the green one, then the blue one, then the purple one. And in the last year or so, I’ve been getting a lot of requests from people regarding Data Mining for the Masses to enrich the book, basically.
06:04 So I started thinking, “We’re in the technological age. Maybe we ought to embrace a technological platform.” So I partnered with this company, MyEducator, and we’ve released the fourth edition as a digital e-book. It has a number of different enhancements, and I’ll be happy to share those with you in the five minutes I have remaining in my presentation so you can learn a little bit more about the book. Now, the book is definitely written– it was written by me, for me because I was the one who wanted to use it, and it’s worked for a lot of other people. But one of the things that I like about the partnership with MyEducator is, now that it’s on a digital platform, the book is much more flexible. So if you want to use it in corporate training, if you want to take pieces and parts of it– Brian told me you’ve used it in corporate training in the past, haven’t you? So if you want to rearrange chapters or put some chapters in that aren’t there and then maybe take some of my chapters out, that’s fine. You can do that now that it’s in digital format. So you can use this login and password. If you just go to myeducator.com – that’s what I’m going to do right now – I’ll just show you a few of the features that we’ve added. So Scotty Pectol is the director of marketing and sales at RapidMiner, and he set this up so that any of you can log in and take a look at it. It’s just email@example.com, which I think he just made up. And then the password is all lowercase, wisdom. And I hit Enter, and it will take me in. Hopefully. I tested the Wi-Fi earlier. Yeah, there we go.
07:32 So we have Data Mining for the Masses, fourth edition. So there’s all of the chapters. You can see them listed all here. We have a kind of a business understanding and data understanding section at the beginning. We go through several steps of how do we use things, really basic things. Remember, this is for the masses, right? So how do I use a select attributes operator? How do I use a filter examples operator? How do I take a random sample? How do I take a stratified sample? We have operators in RapidMiner that can do that. So that’s kind of the first section. Then I go through categorical modeling techniques, again, relatively basic, correlations, association rules, k-means clustering, and then we get into the predictive stuff, discriminant analysis, linear and logistic regression – let’s see – decision trees, neural networks. I do do text mining, and in Family Feud, we heard that text mining was the number-one add-in for RapidMiner, and I do do that one. Also, in evaluation and deployment, we do cross-validation. And then I have a chapter on ethics, and Ingo actually shook my hand and thanked me for putting that into the book because one of the very first sentences in that chapter said, “This is an introductory text. I have given you enough knowledge to be dangerous. Don’t be.” That’s what it says. I wrote that right in the book because I didn’t want to empower people with tools that could cause them to make bad decisions or make false assumptions or to coerce people using data in ways that people shouldn’t be coerced. And so that was a very important chapter. And, kind of, I put it at the end, not because it was last, but because it was the thing– if you took anything away from the book, I wanted you to take that away. Right? And so I kind of made that the capstone of the book.
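The basic data-prep operators Matt walks through here (select attributes, filter examples, random and stratified sampling) translate directly into a few lines of code. As a rough Python sketch of what the stratified-sampling operator does (my own illustration, not material from the book), the idea is to sample the same fraction from each class so the class proportions are preserved:

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=42):
    # Group rows by stratum (e.g. the class label), then draw the same
    # fraction from each group, roughly like a stratified Sample operator.
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# 80 "yes" rows and 20 "no" rows; a 10% stratified sample keeps the 4:1 ratio.
data = [{"label": "yes"}] * 80 + [{"label": "no"}] * 20
sample = stratified_sample(data, key=lambda r: r["label"], fraction=0.1)
```

A plain random sample of 10 rows might draw zero "no" rows by chance; the stratified version guarantees both classes stay represented in proportion.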
09:18 So all of the chapters have been updated in the digital version. I made videos, which– I am just really handsome, so having myself in videos, it just– actually, I got a complaint because one of the chapters didn’t have a video, and one of the students didn’t like that. He wanted to see me in all of the chapters. [laughter] So I made a video for that chapter as well. There I am. See? Look how handsome I am. I’m like, “errr!” But there’s not just videos of me talking, but if you go into the Modeling section of each chapter, there’s also live walk-throughs. And so you can actually see me. Nope. Where is the video? It’s here somewhere. Maybe it’s on the next page or something. I don’t think it’s– oh, there it is right there. It’s under Data Preparation in this chapter, but there’s a video, and I actually do a live walk-through. Now, in response to additional requests, one version ago, I actually added how to do the same techniques and skills in R. So you can go through it in RapidMiner, and then if you want, you can actually do the exact same process in R as well, and there’s videos to show you how to do that as well.
10:34 Now, videos aren’t the only thing we’ve done to enhance it. The MyEducator platform actually has analytics to see how your students are doing. We don’t have any students in this class, so if we look at the course dashboard – because Scotty just set this up for me to be able to show you this – you can see the analytics for your class. You have total students, how they’re doing on average, what excellent kind of looks like in your class, and then are there some students that are maybe at risk, and that would be based upon their performance completing the different activities in the book. But because we don’t have any students, this is not particularly interesting, but I wanted you to know it’s there. In addition, I’ve also added some resources that would be– they would be considered typical with a larger publishing company like Prentice Hall or McGraw Hill, where you have lecture slides and you have test-bank questions and things. Those are things that I never really provided when it was just a printed book that you could buy on Amazon. But now, we’ve been able to add all of those and integrate these into the MyEducator platform.
11:38 And so you can see there are Knowledge Checks here that students can complete. You can assign them. I have two different exams plus– two midterm exams and a final exam that are there, a couple of different multiple-choice exams as well. So if you don’t want to do the type of essay-style grading where I actually have my students go in and take a test and then write out an answer and explain what they found, that’s a lot of grading. Some faculty felt like they didn’t want to do that much grading. So I told them they were lazy, and then I gave them the multiple choice anyway. [laughter] So there’s a few different examples of how to do this. If you are a teacher or if you know teachers who may be interested in using this, it hooks into all of the major learning platforms like Canvas and Brightspace, Desire2Learn, Blackboard. It hooks into all of those automatically so you can actually push your grades.
12:30 And the next thing that I’m working on with MyEducator now that we’re on this platform, aside from trying to keep it current with all of the new releases that RapidMiner puts out, is where we can have actual homework assignments, where they will go in and build a RapidMiner process and export it as an RMP file and upload it, and it will automatically assess whether or not they built the process correctly and generated the right results. So that will help with reducing the grading load as well. And we’re working actively with the software engineers at MyEducator to be able to deploy that before the end of this year.
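Because RapidMiner saves processes as XML in .rmp files, the automated assessment Matt describes can be sketched simply: parse the submitted file and compare the operators it uses against a reference solution. This is my own toy illustration of the idea, not MyEducator's implementation, and the simplified XML strings below are stand-ins, not complete .rmp files:

```python
import xml.etree.ElementTree as ET

def operator_classes(rmp_xml):
    # An .rmp file is XML; each processing step is an <operator> element
    # whose "class" attribute names the operator type.
    root = ET.fromstring(rmp_xml)
    return sorted(op.get("class") for op in root.iter("operator"))

# Toy reference solution and student submission (heavily simplified XML).
reference = "<process><operator class='retrieve'/><operator class='decision_tree'/></process>"
submission = "<process><operator class='retrieve'/><operator class='naive_bayes'/></process>"
matches = operator_classes(submission) == operator_classes(reference)
```

A real grader would go further, as Matt says: compare operator parameters and check that the process generates the right results, not just that the right operators appear.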
13:05 So that’s kind of where things stand right now. And like I said, I built this because I wanted it to be a gateway for people to be able to get into doing data analytics. I wanted to give them something that was a low barrier of entry so that more people can do this type of analysis. So if you think it’s a resource that might be helpful to you, even if you’re not an educator, you don’t work in a university, by all means, ask me for a business card. I have my business cards right up here. I also have some from Scotty at MyEducator if you want to contact him directly. He’s very happy to work with people. And – what did I do with them? – I got a huge stack of flyers as well, so you’re welcome to grab one of these. I’ll also leave them maybe at the front table in case other people who couldn’t attend this session want to grab one. So they’ll be out there. And I guess that’s it. Scotty, did I run out of time? Yeah? [laughter] Any other questions? We good?
14:01 All right. Rock and roll. [music]
AN EXAMINATION OF THE NFL'S QUARTERBACK SUCCESS
00:04 [music] So good afternoon, everyone. I want to talk a little bit about how we decided sort of where our dataset was going to come from. So in the NFL, most of you maybe‐‐ show of hands, anybody watch the Super Bowl about a week ago? Okay, so we have some fans. Most of you know that when the NFL Super Bowl happens, that that determines sort of which teams draft in what order. So the teams that get eliminated from the playoffs first get to sort of pick first, and then the last team to get eliminated or the team that wins gets to pick sort of last in every round. So when some teams are doing really poorly in the season, they actually kind of go over to their draft mode knowing, “Okay, the playoffs are not going to happen for us. Perhaps it’s best if we just don’t win anymore because that will sort of increase our chance of equalizing when the draft comes.” So the teams that go to the Super Bowl like Kansas City and the 49ers, when they’re done with that game, they’re actually kind of five weeks behind the other teams, right, because they didn’t get a chance to go to the Senior Bowl. They didn’t get a chance to sort of look at the players longer. They were so busy preparing for the playoffs that they’re just sort of behind.
01:21 Our interest in the NFL is sort of trying to find out, if all the teams go to the NFL Combine, and they all watch the drills and the player workouts, and they kind of are there for all of the interviews, how is it that all 32 teams get the same data set, in a sense, but there seems to be sort of a competitive advantage? Meaning some teams are masters at drafting while other teams don’t really do that well. And how does that work? I mean, how is it that some teams have that sustained competitive advantage, while others struggle? So we decided to take a look at it in terms of intangibles versus tangibles. So these tests here‐‐ we’re going to concentrate on the physical tests, and we’re going to call those the tangibles, right?
02:14 So at the NFL Combine, players are basically poked and prodded and measured, and they put them through a series of tests. So we’re just going to run through these so that you can recognize them when we present them in the regression model and the decision trees. So there’s the 40‐yard dash, the vertical jump, the broad jump, the shuttle run, and the three‐cone drill. There are other ones, but we’re only interested in looking at the quarterbacks because asking the question, “What advantage do teams have?” is so broad. It’s so hard to parse it down to all the possible permutations of coaching trees and strategies and that kind of thing. So we’re literally just going to refine our NFL look to just the quarterbacks. And I’ve only put up the ones that the quarterbacks compete in. So, for example, they don’t compete in the bench press. The linebackers do, but not the quarterbacks. So on February 23rd, 337 prospects are going to attend the NFL Combine this year. And we’re interested in the intangibles. So the intangibles are as follows. The yellow highlight is sort of our impression of, “My goodness, if I could get my hands on this intangible.” I’m not‐‐ I don’t know anyone. Maybe somebody knows somebody who would have that medical grade. But that’s basically the data point that we’re interested in. We want to know what is on the line. When you select a player, how likely are they to get injured? How many years are you going to get out of them? Are you only going to get a few seasons? What is the risk and what are you getting? Sort of that cost‐benefit ratio.
03:57 The other variable we’re interested in‐‐ and there’s a multitude of ones. Again, we couldn’t go back and ask the players to assess themselves, but we were interested in character risk, and we really couldn’t find anything that sort of mapped onto that. It’s very hard to pull together their college transcript in terms of how they did on the team. Were they kind of coachable? Were they high‐risk in terms of their character? So we just kind of skipped over that and went straight to the psychological, mental health intangible. So the Wonderlic, for those of you who don’t know what this is, is sort of the IQ test. It was, I think, developed in the ’30s. And most people probably know it as having been used by the military to grade personnel for World War II, trying to figure out who was going to be a pilot. This is just a sample of the battery; the actual test has 50 questions. And the way that this is scored ‐ and they’re under time pressure ‐ is they literally just get one point for every correct answer. So you have to move along at a very fast pace, at about four questions a minute, or you literally won’t get through all the questions.
05:13 So then getting back to the actual RapidMiner environment, now that you’re familiar with the Combine data, we also took in data from their actual NFL debut. And we didn’t really know exactly where to start looking at it because we have some quarterbacks that have been around for 20 years, and then we have other quarterbacks in our data set that have only been around a year. But just to sort of put them on a level playing field, we took a cutoff, like, “We’re going from the NFL draft from the year 2000.” So that was sort of our starting date, and we took only the quarterback subset. We input all the values we could find for the Wonderlic, and we also input an injury value similar to the one you saw, though it wasn’t the actual NFL‐reported value. So when we ran our model, we looked at classification first. But there were problems with our data. First, we found out that half the players in our data set hadn’t actually been drafted. So we kind of wondered, “What do we do with these players?” We decided, ultimately, to impute their draft pick number, which for quarterbacks is, on average, a little earlier in the draft than later. So we imputed that with the average. And we also found that the probability‐of‐injury value that we took from, I think, sportspredictor.com‐‐ sportsinjurypredictor.com, sorry‐‐ was missing for many players. So we had to impute those as well. So there are a lot of limitations in what we were trying to look at.
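The mean imputation described here is the simplest possible fill strategy: replace each missing value with the average of the values that are present. A minimal Python sketch (the numbers are hypothetical, not the actual dataset):

```python
def impute_mean(values):
    # Replace missing entries (None) with the mean of the observed values,
    # as was done here for draft pick number and injury probability.
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Hypothetical draft-pick numbers; None marks undrafted quarterbacks.
picks = [1, 32, None, 76, None]
filled = impute_mean(picks)
```

As the speakers themselves note, this much imputation limits how seriously the downstream results can be taken: every undrafted player is effectively treated as an average-round pick.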
06:51 When we ran the supervised values through the classification, we found that‐‐ sorry. Here we go. We found that the decision tree, optimized using the Gini index, reached about 95% accuracy. And when we ran Naive Bayes, we found that it predicted almost as well, at 92%. And the ROC curves give us a graphical representation of that. When we ran a linear regression‐‐ and this time, our‐‐ sorry, I should back up. I should tell you that the label for the classification analyses was “success.” Success was defined as reaching the postseason. So if you played in any postseason game at all as of about December 30th, we put you into the successful “Yes” category. And if you did not play, then you were considered not successful, irrespective of how long you were in the NFL. And then for the linear regression, we obviously had to go with a continuous variable. So we took the total count of NFL postseason games started, and we predicted it using all of those college variables, like their passing metrics in college, and their Combine values. And we also put in the intangibles. And lo and behold, what actually came out, which was shocking to me‐‐ although, with all the imputation, you can’t take it too seriously‐‐ was that the Wonderlic, which is really sort of my favorite thing ‐ that’s the one that I’m personally invested in ‐ actually came out significantly predicting NFL postseason games.
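The Gini index mentioned above is the impurity measure the decision tree minimizes when choosing splits: for a node whose labels have class proportions p_i, the impurity is 1 minus the sum of the squared p_i. A quick self-contained illustration (mine, not from the talk):

```python
def gini(labels):
    # Gini impurity: 1 - sum(p_i^2) over the class proportions p_i.
    # A candidate split is good if it yields child nodes with low impurity.
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure_node = ["success"] * 10                # all one class: impurity 0.0
mixed_node = ["success"] * 5 + ["no"] * 5   # 50/50 split: impurity 0.5
```

A tree built on the "success" label keeps splitting on attributes (Combine results, Wonderlic, and so on) wherever the weighted impurity of the children drops the most.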
08:39 And then in terms of the unsupervised learning, let’s take a look at the clusters. This is what our clusters came out with: we specified we wanted five, and then we got a nice little break. And when we looked at the actual players‐‐ does anyone want to guess which cluster here was probably the best set of quarterbacks? The ones that you know, the name‐brand quarterbacks that you might be familiar with? Does anyone want to take a guess? Well, you have to‐‐ which cluster number? Which cluster would the Patriots’ quarterbacks fall under? Okay, anybody else have a guess? Okay, let’s take a look. So we have this nice distribution, so the players are all in one cluster. And then when we look at cluster three, lo and behold, you’re correct. The best quarterbacks, sort of the ones that you think of as being the franchise quarterbacks that are usually pretty guaranteed with their contract‐‐ although I know Matthew just left‐‐ Tom Brady, Ben, Drew, all of our favorite quarterbacks kind of shook out in cluster three. So that was pretty exciting. So this is just a preliminary look at our NFL success variable, but we’re hoping that when we go forward, we can optimize our model, look at different parameters, and revisit our design through various techniques. Using some of these subprocesses and some cross‐validation methods, we’ll be able to see if we’re on the right path as far as figuring out what those intangibles are that NFL teams seem to be able to harness. [music]
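The clustering described here is k-means with k set to five. The algorithm itself fits in a few lines: assign each point to its nearest centroid, move each centroid to the mean of its assigned points, and repeat. A generic one-dimensional sketch on toy data (not the Combine dataset):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Lloyd's algorithm on 1-D points: alternate between assigning points
    # to the nearest centroid and moving centroids to their cluster mean.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two well-separated groups; k-means should recover centers near 1 and 10.
pts = [0.9, 1.0, 1.1, 9.8, 10.0, 10.2]
centroids, clusters = kmeans(pts, k=2)
```

On real Combine data each "point" is a vector of measurements rather than a single number, and the distance is Euclidean across all attributes, but the assign-then-recompute loop is the same.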