Getting Data Science Projects Over the Finish Line

Brian Tvenstrup, Lindon Ventures
Cody Lougee, Vantage West
Martin Schmitz, RapidMiner
Paul Simpson, Elliott Davis
Moderated by: Michael Martin, Information Arts

Data science projects often have participants with a range of skills and backgrounds. This means a given team will have a variety of different ideas and expectations about what to do and how to work together.

This panel brings together these different perspectives (domain experts, data scientists, business users, and managers) to discuss how they differ and how they can best work collaboratively to ensure that data science projects have the desired impact and avoid contributing to the Model Impact Disaster.

00:05 Okay, welcome everyone to this panel discussion. We’ve got a great group of really experienced practitioners. And we’re going to discuss a few points that I may bring up, a few points they need to bring up. These are really seasoned experts. And some of the things we could consider are: how are data science projects really different from other types of projects? Do we ever really cross the finish line in a data science project? What lingers on even after a model has gone operational and even after it’s hopefully making money? How do we deal with organizational anxiety? How do we help organizations understand what will change and what stays the same post data science implementation? How do we help keep organizations that we’re working with focused and on message? So let’s go to our panel members. I’ll let each one of them say a little bit about themselves. And let’s just kick this off. We’re going to leave some time at the end for questions. If any one of you would like to share your perspectives or have any questions you’d like to ask of us, please. So why don’t we start with Brian? Off we go.

01:19 Hello there. So just by means of quick introduction, I’ve been doing data science and analytics for over 20 years, though we didn’t call it data science back then. It was just analytics, or maybe even statistics or applied statistics. And that’s my academic background. I worked in a number of different industries, in financial services and healthcare and marketing. And then, four years ago, started my own little boutique consulting company, just focusing on data science and kind of related business strategy type of projects. And what I’ve enjoyed over the many years is working on a number of different types of projects, really all throughout the life cycle of data science projects, and working with both business and technical users and seeing how many projects succeeded, also seeing how many projects failed, or failed to meet the expectations that were initially there. So hopefully, we’ll get a chance to chat a little bit about that on the panel today.

02:20 Hello, my name is Cody Lougee. I’m at Vantage West Credit Union. I’ve been doing data science/analytics for probably about 15 years now in different capacities. Over at Vantage West, I manage a BI team, in charge of reporting. We’re kind of in the early stages of analytics as a whole. I’ve been in the company for about a year, and kind of leading the data science side of things. So that’s been a lot of doing the initial evaluation of different vendors, and we chose RapidMiner, obviously. And then a lot of it has been working with leaders and kind of getting buy-in, understanding with them, getting projects kicked off. We have one project in production, but helping them understand what data science means and what modeling means. And it’s not just a black box, and just trying to kind of bring them along on the journey, make sure that they understand, make sure that they have a say in the project as well.

03:16 I’m Martin, the guy who’s always called out by every second speaker. I’m running the customer-facing data science team at RapidMiner. So if you’re a RapidMiner customer, you’re likely also to work with YY Huang, my American colleague, so I’m doing that mostly on the European side. And what does it mean? I’m working with our clients to onboard them to work on use cases. And essentially, I’m part of customer success. So I want to make our customers successful in building models, in making impact, and putting these models into production. And that’s my day-to-day job.

03:56 Hi, I’m Paul Simpson. I work for Elliott Davis, a top 40 accounting firm, mostly in the southeastern United States. We are just getting into advanced analytics. And that’s something that the Big Four accounting firms are doing really well as far as consulting. But a lot of our clients, they’re going to have use cases, they just maybe don’t know what they are yet. And we’re trying to flesh that out at this point. I’ve been using RapidMiner only since about September when we got it or a little bit before just to experiment. But in my previous role, we were using R and Python and we didn’t have any tool like RapidMiner we could have, and I thought we should have, so I kind of made a matrix and already had a little exposure to it. I definitely am happy to be using an end-to-end platform like this now.

04:50 Well, to get things started, first of all, how many of you out here in the audience are also actively involved in developing data science applications for your organizations? So quite a few people. So when we start, when we get things going, all of these people with their experience, what are some of the things you really look for that warn you that maybe, hey, this could be going really well, or this could be somewhat risky? And what are some of your methodologies for dealing with risk in data science projects, when you recognize different types of risks such as non-alignment or resistance from certain stakeholders? What are some of the tactics and things that each of you have found useful to sort of corral the cats and get everybody on the same page again? Anyone care to comment?

05:41 So there’s something I call the two ivory towers. In the one ivory tower, there are some data scientists. They do stuff, magic. And then there’s the second ivory tower, which is, I’m in Europe, I do a lot of manufacturing. There are some engineers doing their magic. And those ivory towers are not at the same place, they don’t see each other, and everybody’s doing their own stuff. Maybe I have a few more towers, like there’s this data engineering tower, and everybody has fences, and everybody’s shooting at each other. And that’s a big danger. So the biggest danger, I think, is when people do not collaborate. There’s not a team. There’s a part of the team which says, “We want analytics,” and starts to run off without taking everybody in. Most likely, it’s a data science team. There’s a corporate initiative. We hired 10 data scientists. We took some tools, bought some cloud stuff, and then we run ahead. And those guys are forgetting to take along the people who are using it later on, to join forces with them or with the IT people who are taking the model into production at some point. And they’re not working jointly on a project, but there’s only one part of an organization working on a project. And I think a platform like RapidMiner really facilitates working with each other. So no matter which industry you’re looking at, you’ll find what we call data-loving people (Scott is always saying “your data-loving people”), so people who want to interact with data, who want to find patterns, who want to do things, if you give them the right tools. It’s not just the ivory tower of data scientists who want to do things, but there are engineers and auditors. There are so many people who want to take the data, and what you want to do in a project, and I’m curious for your opinion on that from a consultancy point of view, of course, is help the different roles work with each other, each with their strengths, so that together they’re more than the sum of the parts.

07:57 And that’s a beautiful image of the ivory towers. And what would any of the others say are some of the methodologies you’ve found useful for getting these ivory towers to talk to each other? And whose role is it to really facilitate that?

08:16 One of the things that I want to agree with Martin on and underscore is the critical importance of bringing in all of the constituencies. So I personally like CRISP-DM. It’s an old standby at this point, and it’s general enough to be usable for almost any data science or analytics or quantitative project. But one of the early failure signs from my perspective, having done dozens and dozens of these projects over the years, is when the business users are insufficiently engaged, where they might just kind of throw an idea for a data science project over the fence, so to speak, into one of these other towers, whether it’s the data and IT side of the organization, or if they’ve got internal data analytics and data science capabilities, and then think that there’s some mysterious manufacturing process that’s occurring, and then out the other end is going to come this usable product, and it doesn’t work that way. And so, I think, in retrospect, the projects that had much less involvement from the intended end users in the business application were the ones that were less likely to ultimately succeed. And so how do you kind of get around that? Well, you proactively make sure that you are pulling those people in. So again, CRISP-DM actually starts with business understanding. How do you do that? Well, you can have several round-table brainstorming discussions, get those people in the room with the data scientists, who are going to be doing the majority of the work in terms of the feature engineering and the actual modeling and the evaluation more on their own, but get them in that room and engaging in discussion, so it’s clear what the business users are actually expecting to get out of it. And another kind of key part of that that we’ll be talking about in another context later is understanding what the business as usual is doing today. So how are things operating today? What do they expect to improve or change as a consequence of the project?

10:23 And who is going to be responsible for driving that change to do what, in other words.

10:28 And I kind of think that we take on the role of being responsible and kind of taking on all the groups and level setting with them. It’s not just leadership, it’s also the people with the boots on the ground, and letting them know that we’re not here to replace your job or anything of that sort, because there’s a lot of fear and there’s a lot of unknowns, from “AI is going to do everything” all the way to the other end of “I want to know exactly what’s going on.” And just getting everybody in a room, whether it’s several different meetings or whatever, and trying to understand where they’re coming from and how they see things and explaining to them on their level.

11:04 And how have you found it, the best techniques for really engaging management in this because classically, management is sort of responsible for planning and setting initiatives and ensuring compliance. Would anyone care to talk about how do you really get management on your side?

11:24 You just put money on the table? No, if I go to Peter, our CEO and say, “Hey, this model is worth a million. It costs you an investment of 100k,” I doubt he would say no.

11:37 Well, okay.

11:39 Yeah, that’s a good point. I mean, with the performance measurement within RapidMiner now, we can factor in what this is going to cost us or the benefits, and that would be great. But yeah, it is– I came to learn– my previous role was at Northrop Grumman. And it was kind of a consultancy role because we were enterprise analytics services. So we were sort of like business people. But as you know, Northrop Grumman is an aerospace company, and there are a lot of engineers. So there’s that engineer ivory tower as well. But one of my co-workers had said, I’ve learned three things that have to be there. And it turns out, they’re all about buy-in: buy-in from the person who owns the project, and then the key stakeholders, and finally, those who are kind of boots on the ground, who are going to use whatever it is you’re doing. And that just means you’ve got to have a lot of communication. And the people who hired us as a group of data scientists also hired a bunch of other people to do things like organizational change management, because I knew that wasn’t necessarily our forte. So a lot of communication, I know that’s what it’s going to be every time.

12:52 And another thing is you don’t have to do everything at once. You can ease into it, kind of like an agile approach of kind of iterate and release something small, something that’s not millions of dollars, something that they’re confident about, like if this does go bad, it’s not going to impact us, and then just build up the confidence and then you can start growing into the larger things.

13:13 Or is there anything really inherently different about data science projects, though? Do we find– I mean, when we look at the usual dynamics, I’d be curious for each of your thoughts – I had a few thoughts on that – but really interested to hear from any and all of you, is there anything inherently different that just because of that we need to really think differently about?

13:41 Yeah, you were saying CRISP-DM. That’s something that’s just a foreign concept to people who are in web development, like I used to be prior to this. All those loops back. It goes back to the very beginning for business understanding, that I see.

13:59 Yeah, the iterative process of it is something that might distinguish it from many other types of projects. I think another critical one is– and again, this has been discussed at length today already in the keynotes, the interpretability gap, because in many cases, a data science project is going to be replacing some element of human judgment, things that are being done manually, things where there are domain experts who have strong perspectives on the ways that things should or shouldn’t be done. And in order to get the buy-in that’s necessary, you can’t use black-box methods and just say, “Oh, trust us, this will work for you.” So you have to put a lot of thought into how you’re going to cross that interpretability gap in order to gain the confidence that’s necessary for people to actually implement the data science projects, so they don’t end up like the ones that Ingo was talking about that just kind of sit on the shelf. They were proofs of concept, maybe had something promising, but it never actually got implemented.

15:00 And this is really one of the driving differences between BI projects and data science projects. It’s kind of complexity. So when I was applying for RapidMiner six years ago or so, I was also thinking about joining some business intelligence company. Then I looked at what business intelligence is, and I thought it’s counting. That’s what they do. They build sums and averages. And the craziest thing they may do is standard deviations. And I thought that’s boring. I was ignorant. I was very ignorant. While being at RapidMiner and working with enterprises, I’ve realized that a lot of the problem is not how you count. It’s not how to build the linear regression, logistic regression, random forest, you name it, but to get a project into an enterprise, to get end-to-end workflows, where you get data from sensors, they’re transferred by Kafka into a data warehouse, and so on, this complete pipeline. And this is something we as data scientists share with business intelligence people. And I was super ignorant back then. But I had a PhD and I thought it was cool. But the difference is in the result of a BI project. It’s one dashboard. It’s two dashboards. It’s an email notification, and something where I can turn around and go to my head of marketing and say, “Hey, Scott, this is the result.” And he says, “Oh, that’s great. I can get that.” If I do that with a neural network, he might say, “What?” And then saying, “Look, we’ve got this RapidMiner Academy. Do you want to take a two-day course where they explain to you how this all works?” is a pretty harsh statement to stakeholders, to say, “Do a two-day course, and then you’ve got the understanding.” And this is, I think, a big problem with complexity, sometimes also unneeded complexity. And we saw a lot of talks today, I think, saying, “I went for the linear model over the deep learning model because I can look into that and explain that and stuff.” But this complexity then enforces interpretation, and it affects a lot of people who are not used to machine learning but are used to human learning, which is a huge difference. I’m not supposed necessarily to always understand what the method is doing. And yeah, this complexity is a big problem and differentiates us as data scientists from data engineers, BI people, data warehouse managers, and so on.

17:44 Yeah, it really seems to me, also following along the lines of what everyone’s been saying, that by nature these types of projects are very interdisciplinary, particularly when we consider now that we really want to hear the voice of the customer, we want to measure customer engagement, we want to measure the value of that engagement. And so, as Martin was saying, we’re no longer in this read-only world with averages, and when we really get wild, standard deviations. But we’re now bringing in data from what used to be really separate silos within organizations, and the expectation of management is that as those silos are broken down, we’ll be able to bring in metadata from many different parts of the organization and of course, have external metadata coming in from external stakeholders and partners. And then the events or the– sorry, the results are not only going to be shared inside our own four walls, but potentially with external stakeholders, and even sometimes with our customers. So the interdisciplinary nature of these projects can introduce a tremendous amount of complexity. But that’s one reason why there appears to be so much interest and so much promise in successful applications of data science, precisely because everyone recognizes the value that’s there from doing that. And I’m just curious if anyone would like to share any thoughts on how they’ve best managed these types of complexities, how they managed the risks involved with these complexities, any pointers that any one of you would care to bring up?

19:25 Well, specifically to the data point that you were just making, data privacy and data compliance is a huge area of concern, and getting those things hammered out and well understood at the beginning of the project is critical. Again, CRISP-DM puts it right up front: with business understanding, the data understanding is critical. I’ve been involved in projects where you spend several weeks and then later find out that there was some insurmountable compliance objection to bringing together the different data sources that were going to be required, or just technical objections, that the data is separated out in a particular way precisely to prevent people from being able to marry up who’s over here in dataset number one with this other data that’s over here in dataset number two, which is exactly what the data scientist wants to do. So, again, trying to understand what those objections might be early on in the data understanding process, and then laying the groundwork to make sure that that data is going to be in place before you spend a lot of time and energy on some subsequent phase, that’s critical.

20:37 Yeah, I think anybody that comes into data science from the software development world, you need to get a new mindset. And first, it’s okay, that you don’t do everything yourself. In fact, you’ve got to– that iteration or back and forth, I mean, you might only get into data understanding before you have to go to business understanding again and again and again, because that’s how you get the buy-in. And your better understanding of it, too. I don’t know, I’ve developed software projects before where I thought I understood, this was what they wanted. But as soon as you showed it to them, they said, “Oh, feature creep. There’s all these other things that we would like in here, too.” So I can imagine it’s probably 10 times worse with the data science, at least in a different way because now they’re saying, well, that’s not what we wanted at all. That’s not our real problem, because maybe they hadn’t thought through what their problems were.

21:31 And how do we help management prepare their organizations for data science because it seems that– or I’m curious, for the reactions of all of you, when you’re engaging on projects is management prepared? I mean, end users may be or line of business users may be saying, “We need to do this. This is a business need. The competition is creaming us. We’re vulnerable,” or a CEO might say, “There’s a lot of risk in us losing our market share.” But are you finding management prepared to engage on projects? And if they’re not, what can we do to actually help them so that they can do their fair share to clear the ground for us so that when we go in and perform our duties, it raises our chances for success? What do each of you think about that?

22:27 I think that they definitely want to, and they’re there, but they don’t understand the scope of the project. They don’t understand all the teams that need to be involved. They don’t understand that the business knowledge needs to be there, that the data needs to be there, and you have to get a lot of data and you have to talk to a lot of different teams, because not all your data is in one spot. And that’s much more complex than they initially thought. But once you bring them along on the journey and explain that to them, I think that they’re very willing to understand the scope and how long it’s going to take, and they back it, especially when they understand the potential behind it. So just kind of walking them through that, explaining that to them, making sure that they’re along on the journey and they’re not just shown the end result, “Here’s what we’re trying to do.”

23:09 So lots of patience.

23:14 I actually have seen just as many pitfalls really on the other side, which is management that’s all gung-ho for data science, but they don’t really know what that means for their organization. And there’s kind of two different flavors of this. The one is a senior executive who says, “Oh, we need to be doing data science.” But nobody has really thought through what the use cases are for their organization. One of the things I’ve done as a consultant is basically come in and then tried to help a business come up with usable use cases, because they have a directive to do data science, but everyone is looking around saying, “Well, what does that actually mean for us?” And so it’s not lack of executive support at that point. But it’s lack of definition around what data science actually means for that organization. And the related flavor, which was alluded to or discussed in the second keynote this morning, is the understanding by a lot of executives that data science is kind of like software development, where it’s really just a matter of how much time and how many resources you are going to throw at the problem, but there is a final solution. It’s not always that way with data science, right? You might have a desire to have a model to predict a certain thing, but you’re not able ultimately to build a model that’s going to predict it, at least not above whatever meaningful threshold it is for the business. And so not being committed to a certain expectation of performance before you even get into the parts of the project where you’re going to be able to assess that, that’s a big risk in data science projects, I think.

24:47 And do we ever finish– do we ever cross the finish line in data science projects? I mean, if we look at our title, getting them over the finish line, I suppose we can interpret it as saying, “Well, they’re deployed and hopefully making money.” But is that actually the finish line? And how do we– what could we be doing about that? How do we contribute to really getting to the finish line?

25:11 Kind of. So I think most organizations don’t have the resources to say we invest over and over and over again in one project. I mean, the additional gain saturates over time. In the beginning, I need to learn about the data, about business understanding. And then I build a model and the model is mediocre. And then I build another model and run another iteration of CRISP-DM, getting better and better and better in terms of, hopefully, business outcome and area under the curve. But then at some point it saturates, where if I invest yet another cycle, I don’t get that much anymore. And if you go to Kaggle, then you find people who do it over, and yet another iteration, yet another iteration, yet another iteration, because they are hunting for the 0.01 more in accuracy. And then the question is, what do you call the finish line? And I think the reality is simply that you need to assess first, am I good enough for deployment? Does my model increase revenue for my company? Does it reduce risk or prevent costs? And then the second question is a management question: do I invest yet more time into that model to get another 1%? Basically, the data science team needs to evaluate, okay, if we invest yet another week on this project, how much will be the gains we get, hopefully money, from this week? Or do we turn around, look at our pipeline of projects, which are hopefully identified before, and say, okay, we go for the next project in the pipeline, because getting a very decent model, maybe not the perfect model, on the next use case is simply worth more. And then the question is, what do you call the finish line? Is the finish line good enough, or so good that it produces revenue and it’s worth more to spend time on another use case? That’s what I call the finish line. And then I would say, yes, we reached some finish line, even though you can always spend more time on it. See the Netflix recommender: I think they still spend time on making it better, and that’s why Netflix is successful.
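
As an illustration of the trade-off described in this answer (it was not shown in the session), the "one more week on this model versus the next project in the pipeline" decision can be sketched as a comparison of estimated net value. All names and figures below are hypothetical, and the sketch assumes the team can put rough money estimates on each option, which in practice is the hard part.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Option:
    """One way to spend the next week of data science effort (hypothetical)."""
    name: str
    expected_gain: float  # estimated extra business value from the week
    cost: float           # estimated cost of the week

    @property
    def net_value(self) -> float:
        return self.expected_gain - self.cost


def best_use_of_next_week(options: Sequence[Option]) -> Optional[Option]:
    """Return the option with the highest net value, or None if nothing pays off."""
    best = max(options, key=lambda o: o.net_value, default=None)
    if best is None or best.net_value <= 0:
        return None
    return best


# Made-up estimates: the deployed model has saturated, the next use case has not.
options = [
    Option("another iteration on the deployed model", expected_gain=5_000, cost=8_000),
    Option("first decent model for the next use case", expected_gain=40_000, cost=8_000),
]
choice = best_use_of_next_week(options)
print(choice.name if choice else "stop: no option is worth a week")
```

Under these made-up numbers the "finish line" for the first model is reached as soon as another cycle on it is worth less than a first pass on the next use case.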

27:28 And in the various projects that all of you have done, how have you found– what’s the best way to help the customer that you’re working with actually be able to take over maintaining the model themselves, to literally hand that over, or is it something that’s part of what you typically will then do? You will maintain it for them? How do we empower customers? I’m thinking this might be the last question before we go to questions from the other folks here. How do we help customers actually become self-sufficient so that we’ve taught them how to fish, so to speak, in terms of maintaining what’s been built for them? What do you think?

28:12 I–

28:12 For me, it’s easy, it’s very easy. My job is to make all of you successful and good with a platform. So this is simply my job, to make you self-sufficient. So you don’t need me or YY or Michael or any of the data scientists or RapidMiner anymore. That’s what I also call success.

28:29 So this is the guy they call then, right?

28:32 Okay.

28:35 And as a consultant, I’m usually in the same position. It’s not my goal, or my client’s goal, for me to be involved kind of in perpetuity. But the problem is, and again, it’s something that has been discussed at length already today, most companies chronically underinvest in and underestimate the amount of effort that ongoing model maintenance and management is going to take. I worked for FICO for some time, and that’s an organization that’s on the opposite side, there are more people– at any given point in time, FICO does the credit scores here in the United States and globally as well. They’ve spent way more time and way more energy monitoring the performance of their existing scores and calibrating them and re-calibrating them and determining when exactly they need to go on and redevelop. But most organizations are kind of thinking about it the way that Martin was just describing, like, “Okay, we did this project now. We’re going to move on to the next project.” And nobody’s thinking about, “Well, how exactly are we going to be monitoring the project that we just put into production? And what is the threshold at which we would then have to cycle back around and do redevelopment?” And that is a big educational step that’s required for them to then take on that responsibility.
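
The monitoring question raised here ("what is the threshold at which we would then have to cycle back around and do redevelopment?") can be made concrete with a small sketch. This is not how FICO or any panelist does it; the function name, the choice of AUC as the metric, and the default thresholds are all assumptions for illustration.

```python
from typing import Sequence


def needs_redevelopment(
    baseline_auc: float,
    recent_aucs: Sequence[float],
    max_drop: float = 0.05,   # hypothetical tolerance agreed with the business
    min_periods: int = 3,     # require a sustained drop, not a single bad period
) -> bool:
    """Flag a model for redevelopment when the most recent `min_periods`
    monitoring periods all fall more than `max_drop` below the baseline AUC."""
    if len(recent_aucs) < min_periods:
        return False
    window = recent_aucs[-min_periods:]
    return all(baseline_auc - auc > max_drop for auc in window)


# Monthly AUC after deployment (made-up numbers).
print(needs_redevelopment(0.80, [0.79, 0.74, 0.73, 0.72]))  # sustained drop
print(needs_redevelopment(0.80, [0.79, 0.78, 0.74]))        # only one bad month
```

The point of writing the rule down, even this crudely, is that the threshold and the owner of the check are agreed before the model goes into production rather than after it has degraded.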

29:42 And yeah, it sounds like we have a lot– we have the ball in our court to a great deal to really talk about these types of factors with management, so that when they are planning and underwriting various projects and aligning this to planning cycles, these sorts of things, all these various factors, are understood. Would anyone care to add anything else on any other point, or bring up a whole new thread before we go to questions?

30:11 Well, on that point, your last question gave me some things to think about, because our firm is just beginning to do this kind of consulting service for people, and most of our clients, if they had the capability to do data analytics themselves, then they could maintain their own model. So I think a lot of times, we’re going to be giving them back results, maybe more accurate, maybe faster than we used to do it manually. I think that could be very often just more of what they used to have, but, “Oh, can we predict something for you?” Now we’ve got to figure out: are we going to manage that, and just how? We’ve got RapidMiner Server, we can build apps, give them a login to it so that they can run it whenever they want to? I’ve still got questions about that, too, as probably a lot of people do.

31:07 I would like to say one thing, which is basically the opposite of the towers. So when you gave a talk earlier, you were talking about the digital campfire atmosphere, that you’d like to have a digital campfire where everybody’s sitting around the campfire and discussing data science, and have this nice and cozy atmosphere to talk about it and make it better. And I think what we all need to try is to have this campfire atmosphere in our own organizations, where you can discuss data science and everything which is related to it. And also, what I like about the campfire metaphor you put out earlier today is that part of this is, at a campfire you can always say how you failed. Everybody’s talking here– not everybody, like Ingo, in the beginning said, “This is what I screwed up, basically.” But of course, every vendor and everyone is talking about what worked. And part of the culture we need in data science is that we need to talk about what did not work, because making the same mistake over and over again is pretty stupid. And I wouldn’t call that– well, maybe it’s science, you try something, and then you get insight, but then don’t do this anymore. So let’s try to have this digital campfire atmosphere in as many places as possible to get better together.

32:35 I think that’s great. And I think part of that is communicating with everybody that failure happens all the time. Like, you can’t be successful with every single model, and letting people know that upfront, making sure leadership knows that, so that they’re not expecting amazing results every single time. And if that first one doesn’t provide amazing results, they’re prepared. They understand, and they’re not just willing to cut ties right away.

33:05 And just to add to that, as a consultant in general, not just in data science, they say it’s all about managing expectations. And that’s true with any data science project. One of the exercises that I like to do on client projects sometimes is to, at the beginning of the project, say, “Okay, let’s imagine two things. We’re just going to whiteboard this out. What does this look like as a roaring success? What has happened? What’s different about your organization? What has changed? Now let’s imagine this is a failure. Why is it a failure? What went wrong?” And that can help draw out those sometimes unarticulated expectations and identify risks, identify misalignment of goals, identify unrealistic expectations, and again, it’s much better to do that at the front of the project than at the end of the project.

33:56 Okay. Okay. Well, I guess this wraps it up. Thank you all for coming. We appreciate your questions. We appreciate you being here.