
Bridging the Gap: Measuring & Enhancing Integrity with Data Science

Measuring the difference between action and intent

Presented by Jeremy Osinski, Senior Manager, Forensic & Security Services at EY; Todd Marlin, Global Leader of Technology & Innovation at EY; and Mark Beluk, Associate, Forensic Data Analytics at EY

How can you quantify someone’s integrity? How can businesses bridge the gap between their employees’ and stakeholders’ intentions, actions and data?

In this video, you’ll learn how to use data science to make an organization’s compliance and risk management processes more effective and efficient. From integrating multiple, disparate data sources to developing digestible front-end visualizations and case management tools combined with machine learning, you can improve organizational culture and create a better-functioning business environment.

The Problem? 80% of the data in an enterprise is unstructured, and compliance professionals are skeptical of the benefit of using machine learning. EY wants to use RapidMiner to fuse structured and unstructured data, and to prove the usefulness of these programs.

The Solution? Using Microsoft Azure integrated with RapidMiner, EY was able to create an integrated, automated compliance program that accurately assesses risk and gives clients the insights they need to move forward.

Watch the full video below.

[music] Well, thanks, everyone. Thanks for joining us today. So we’ll talk about bridging the gap, right? Measuring, really, the difference between an organization’s intent and an individual’s actions. And then, as Scott mentioned, we’ll hopefully, conditions permitting, give you a brief demonstration of one of our EY machine learning models at work. Just from an overall perspective, I’m Jeremy Osinski, a senior manager within EY’s Forensic and Integrity Services practice, joined by Mark Beluk, one of our lead associates, as well as Todd Marlin, who’s our global leader for technology and innovation across EY’s forensics businesses around the world. Essentially, in terms of the work we do: EY is an organization with 290,000 employees around the world, 4,500 of whom are forensics professionals. So we really help our clients, in a nutshell, with risk management, compliance, and legal needs, both in terms of investigations as well as proactive compliance. Proud to say we’re utilizing machine learning and advanced analytics on most, if not all, of the investigations we’re doing around the world today, for organizations large and small, in sectors ranging from financial services to life sciences, energy, government, manufacturing, and so on and so forth. So in terms of how we think about the role of data science in the compliance context, we really ground ourselves in what we call at EY the integrity agenda. It’s about helping an organization measure its culture, the governance around that culture, and the associated controls. And then the reason we’re all here in this room is really around drawing data insights and using data to help monitor and manage integrity within an organization.

So all of us in this room, right, whether we represent a large organization or a small organization or an academic institution or a startup or none of the above, all of our organizations have policies and procedures around conduct, around ethics, around operating with integrity. And hopefully, our leaders really live that out through a series of formal and informal messaging. But yet, for some reason, we still see numerous investigations, numerous sanctions being levied, executives going to prison, large fines, reductions in market capitalization. These issues, these incidents, keep happening. And so we’re now utilizing data science to help monitor, manage, measure, and hopefully bridge that gap. Todd, I don’t know if you wanted to–?

Yeah. I mean, I would just sort of say as well: EY, how do we fit into this equation? Yes, we’re doing all of these things, but we are really a knowledge provider. We have unique knowledge around how all these terrible things have happened and how to prevent them. And we’re trying to use all of the technology, including RapidMiner as a key part of that, to manage and prevent that, bringing our unique insights. And you’re going to hear today about a delivery system we’ve created that is designed so that our clients don’t have to put together the jigsaw puzzle of all the different pieces of technology, and don’t have to put together the jigsaw puzzle of all the data challenges, so they can begin, and accelerate to the point of really focusing on what they care about: preventing the problems and finding out the facts.

Great. Thanks, Todd. And so really, as we think about the associated data, I’m sure you’ve all seen this statistic. There are a number of them out there. I actually saw one just recently saying this number is more in the realm of 90% now. But essentially, at least 80% of the data in the global datasphere today is unstructured in nature. And so when you’re really looking to mine and model an organization’s or an individual’s activity, that data does not only come from ERP systems and invoices and spreadsheets, right? It also involves bringing in email and phone calls and cybersecurity logs, and doing so in a way in which you’re able to take all of that data, put it together into a single platform, and then be able to mine, risk-rank, and model behavior. We started our journey at EY within the forensic space about four years ago with RapidMiner, as I mentioned, using it today in the majority of our investigations and proactive matters. And one of the key ways in which we’re using it is in this sort of ETL-type capability around fusing unstructured and structured data. We also partner very closely with Microsoft. We’re one of the largest consumers globally, actually, of Azure, and one of the largest consumers of RapidMiner within the Azure cloud. What’s interesting here as well is that the space in which we play is so diverse. The phone rings, and we’re oftentimes deployed on-site the next day or the day after, right? And so that ability to auto-model, the ability to very quickly make sense of unfamiliar data, is really key to us. We have data scientists, and Mark’s just one of many around the world, who are actively creating, building, and curating new models on unfamiliar data sources all day long.
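To make that fusion step concrete, here is a minimal sketch of the idea in Python with pandas: aggregating a structured ERP feed and metadata derived from an unstructured email source down to one feature table per employee, ready for risk ranking. The file names and columns are hypothetical, not EY’s actual schema.

```python
import pandas as pd

# Hypothetical inputs: a structured ERP extract, plus email metadata
# derived from an unstructured source (a mailbox collection).
erp = pd.read_csv("erp_transactions.csv")    # employee_id, amount, category
emails = pd.read_csv("email_metadata.csv")   # employee_id, sent_at, recipient

# Aggregate each source down to one row per employee.
erp["is_entertainment"] = erp["category"].eq("client entertainment")
spend = erp.groupby("employee_id").agg(
    total_spend=("amount", "sum"),
    entertainment_txns=("is_entertainment", "sum"),
)
mail_volume = emails.groupby("employee_id").size().rename("email_volume")

# One fused feature table: the single platform to mine, risk-rank, and model.
features = spend.join(mail_volume, how="left").fillna({"email_volume": 0})
```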

Yeah. I mean, I think, just to add on to what Jeremy said, and you probably all realize, it’s not only structured and unstructured, right? You have semi-structured, right? So within structured data, you often have unstructured data residing inside it. And being able to harness that effectively is also a key part, at least in this area but in other areas as well. The other part is the data challenge itself, right? Everybody’s still grappling to understand what data they have. Then how do you get it into a format to make it useful? And then how do you do that at scale? So not only are we modeling it, but we’ve come up with reusable data models for different problems that make it easier to take the data from all of the formats it’s in, extract what’s useful, and make it part of business as usual to look at these different issues.
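A small illustration of that “unstructured inside structured” point, again as a hedged sketch with made-up column names: an expense table is structured, but its free-text memo field is not, and simple text extraction can turn it into features that sit alongside the structured ones.

```python
import pandas as pd

# Hypothetical expense table: structured columns plus a free-text memo field.
expenses = pd.DataFrame({
    "txn_id": [1, 2],
    "amount": [12000.0, 150.0],
    "memo": ["Dinner w/ client J. Smith - London", "Taxi to airport"],
})

# Mine the unstructured memo text into structured features.
expenses["mentions_client"] = expenses["memo"].str.contains(r"\bclient\b", case=False)
expenses["memo_location"] = expenses["memo"].str.extract(r"-\s*(\w+)\s*$", expand=False)
```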

All right. Thanks, Todd. And really, when we think about the way in which we utilize machine learning and how that works, it’s really part of an overall ecosystem. The ecosystem we’ve built it into is our flagship analytics platform, something we call EY Virtual. We heavily leverage and practice the concept of microservices, and we believe machine learning is probably one of the more powerful microservices within our stack. And we’ll show you momentarily how that all fits together, right? How we’re able to run a model on data, get user input, have, essentially, testing or verification or sign-off by our clients or by our investigative teams around the world, and have that input be fed back into the model. What’s interesting here as well is that oftentimes, the clients who engage us are attorneys or compliance professionals or C-suite executives. We’ve been able to utilize, particularly from a machine learning perspective, the highly visual nature of RapidMiner to, in some cases for the first time, explain and bring machine learning models to regulators around the world. And that’s not to say we’re getting into the nitty-gritty detail of every operator and so on and so forth, but at least that ability to demonstrate that we’re running a model, and here’s what it does and here’s what it doesn’t do, and here are the pitfalls and challenges and opportunities around it, has really been, in some cases, transformational.
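That score, review, retrain loop can be sketched in a few lines of Python. This is only an illustration of the pattern described above, not EY Virtual’s code; the feature list, the sign-off function, and the ten-case batch size are all invented for the example.

```python
import pandas as pd

FEATURES = ["total_spend", "email_volume"]  # hypothetical feature columns

def get_reviewer_signoff(case: pd.Series) -> str:
    """Stand-in for the human step: the client's or investigator's verdict."""
    return "confirmed" if case["total_spend"] > 10_000 else "dismissed"

def review_and_retrain(model, cases: pd.DataFrame, labeled: list):
    # `model` is any scikit-learn-style classifier (e.g., a gradient boosted
    # tree) that has already been fitted once.
    # Score everything; surface only the top of the risk ranking to reviewers.
    scored = cases.assign(risk=model.predict_proba(cases[FEATURES])[:, 1])
    for _, case in scored.nlargest(10, "risk").iterrows():
        verdict = get_reviewer_signoff(case)
        labeled.append((*case[FEATURES], 1 if verdict == "confirmed" else 0))
    # Reviewer decisions flow back in as fresh training rows.
    training = pd.DataFrame(labeled, columns=FEATURES + ["label"])
    return model.fit(training[FEATURES], training["label"])
```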

I think this is a key point to emphasize, which, at least from what I observe in the world today, is that there’s a real emphasis on innovation and change driven by data science, and machine learning and AI are at the heart of it, right? And there are sort of two camps, right? And that doesn’t mean one’s wrong or right. But there are really two broad ways to deal with this. One is the thousand lines of code with Python, R, etc., and open-source libraries, and the other is things like RapidMiner, which make the experience easier and accelerate creating models. And frankly, the two can work together. The challenge that I see is, as Jeremy’s highlighted, in our world, dealing with regulators and very professionally skeptical individuals, it’s not easy or intuitive to take a Python program in its native form, where you’ve downloaded six or seven open-source libraries, and really explain how it works. Yeah, sure, you can throw comments in there, but can you imagine sitting in a conference room with a screen like this and pulling up a Python program and trying to explain that to an accountant or a lawyer? It doesn’t work. But at the end of the day, the visual nature of RapidMiner allows you to take it to a certain level. Are we getting into the level of what this function does or how this data element behaves? No. But we can communicate the general essence of the flow of events, which you cannot easily do in a 10,000-line program.

Great. And so we’ll show you today one of the GBT, gradient boosted tree, models we’re in fact using, in a scenario we’ve mocked up based on a very active real-life client situation of ours. So I’ll turn it over to Mark to take you through the demonstration.

Absolutely. And I think Todd led into it perfectly. These days, firms face a challenge in monitoring and measuring the integrity of their employees and, specifically, third parties: vendors, contractors, or sales folks who may not have the same level of loyalty to a firm because they haven’t worked there throughout. But as Todd mentioned, we present specifically to legal professionals or accountants or regulators, and they don’t need to see the line-by-line details. And so when we bring this to a client, we like to show the process. We like to show what we actually run. But as you can see from this operator right here, it’s data preprocessing. And if you’ve used RapidMiner, if you’ve done anything with data science, you know that this operator is incredibly complicated. We use internal and external sources. It probably took us, as data scientists, the longest time to make that operator function. But they don’t really need to know all that. They need to know where the data comes from, and they need to know that it’s accurate. From there, we can point to, “Okay, this is where we optimize and train your model. We’ve got the model, we’ve got tagged data. This operator does some interesting things.” We can obviously open it up if they have data scientists in the room, but they don’t necessarily need that. And then lastly, we say, “Okay, we format it and ingest it in a way that you can easily digest that information.” And that’s what I’ll pull up now: our EY Virtual solution. This is a role-based application that we can deploy for clients, for their specific needs and their specific resourcing specifications.
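For readers who do want the line-by-line view, the three stages Mark points to (preprocess, optimize and train, format for consumption) look roughly like the following in Python with scikit-learn. This is a simplified stand-in for the RapidMiner flow, with hypothetical file and column names.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# 1) Data preprocessing: in practice the most complicated stage by far.
raw = pd.read_csv("vendor_transactions.csv")        # hypothetical tagged data
data = raw.dropna(subset=["amount", "category", "country", "label"])
X = pd.get_dummies(data[["amount", "category", "country"]])
y = data["label"]                                   # 1 = flagged, 0 = cleared

# 2) Optimize and train the gradient boosted tree on the tagged data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
gbt = GradientBoostingClassifier().fit(X_tr, y_tr)
holdout_accuracy = gbt.score(X_te, y_te)            # evidence for the skeptics

# 3) Format the output so the front end can show a digestible 0-100 risk score.
scores = pd.DataFrame({
    "employee_id": data.loc[X_te.index, "employee_id"],
    "risk_score": (gbt.predict_proba(X_te)[:, 1] * 100).round(1),
}).sort_values("risk_score", ascending=False)
```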

And so let’s say I step into the role of a compliance manager or an executive for the team managing third-party risk. Again, I don’t need to know all the details, but I want to go in every week, every month, and assign specific individuals some cases to review, to dive into. I want those individuals to spend time only on the highest risk. So this thing right in the middle, this risk score, is what we just showed you, right? That’s the output of a data science flow. We have a risk ranking. But these folks, as Todd and Jeremy mentioned, are professionally skeptical of this information. They need to be able to explain these models to the regulators, to their employees: “Well, why am I researching this individual?” So you see, okay, I’ve just filtered on the top three highest risk. You don’t see a lot of information here, but it filters the other visualizations.

So I can see there are two employees coming from the US and one coming from the UK, but the disbursement type is very high for entertainment and client entertainment. This is something that is tied to higher-risk employees. So from a subject matter expert perspective, I can understand this. I don’t need to know why this person was a 94.4 or 99.4 risk rating. That seems risky, but you can see, okay, there’s a lot of information around it, especially around the transactions. Now, say I only want to assign this one case, someone from the London office. I know in my head, okay, this is high risk, and I have a great employee who specializes in London. They have their own regulations. I want to be able to tag it specifically to them. You can then filter it down, get the underlying data, the transaction-by-transaction data, select it all, create the cases, and assign them directly to that individual within the tool. And you don’t need to see on the front end the data science and the algorithms and our complicated gradient boosted tree that went into it. But you can go in: “All right, I’ll assign it to Mark. He’s our expert. This was very high risk. We want the SLA to be very short. This is a high priority.” Obviously, they can fill in additional information, maybe some explanation. The countries being affected are the US and UK. And you can pass that along directly within the site to that individual.
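That triage step reduces to something like the following sketch, reusing the risk ratings and assignment details quoted in the walkthrough; the score table and the case structure are invented for illustration.

```python
import pandas as pd

# Hypothetical risk-ranking output, using the ratings quoted in the demo.
scores = pd.DataFrame({
    "employee_id": ["E01", "E02", "E03", "E04"],
    "risk_score": [99.4, 94.4, 91.0, 40.2],
    "office": ["London", "New York", "Boston", "Chicago"],
})

# Filter to the top three highest risk, as in the dashboard walkthrough.
top = scores.nlargest(3, "risk_score")

# Open a case on the London subject and route it to the London specialist.
case = {
    "subject": top.iloc[0]["employee_id"],
    "assignee": "Mark",            # the expert for that jurisdiction
    "priority": "high",
    "sla_days": 3,                 # short SLA for a high-priority case
    "countries": ["US", "UK"],     # countries affected
}
```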

And this is where some of the case management comes in. The employees are able to go to their case management modules and see the cases that have been assigned to them directly. You can see I have three different cases: low, medium, and high priority. Most likely, I’m going to go in and see, “Okay, I have a new high-priority case. I want to dive into that.” Now, there are a number of different features and workflows. They can delegate this down, or they can delegate it up for approvals. They can add files, attachments, a number of different things. But in terms of RapidMiner, the bread and butter of this is that model underneath, and that’s what really drives the efficiencies. This is where the value really starts to kick in and allows us to take a firm that is afraid of analytics – you say the word analytics, and people will walk out of the room – and get them a little bit more comfortable with it. The individuals can go in, review the case, add copious documents and their own expertise, and say, “This is dismissed. It’s an exception.” It could be for a number of reasons, and they can tag those reasons in. But once they change that, it automatically updates our underlying tables, brings us to where we can actually apply this model, and goes into the training data itself.

So that training data is then stacked with the choices that come from these individuals, from the subject matter experts. And as Scott mentioned, it’s a team sport. We don’t necessarily have the detailed underlying information that is needed. And that’s where we can surface that to the end user and they can access it. Now, one area that we’ve noticed is a challenge with a lot of our clients is that they have these specific rules, global codes of conduct, that they have to follow. For example, it could be that if someone is traveling internationally and spends over $10,000, it automatically needs to be reviewed. That’s where we can start to build out simple rules and start to drive some comfort with the analytics. So what we like to do is start with the rule basis. And they might have an unstructured precedent document, a giant Word doc that says, “Oh, we flagged this one because of the cost. We flagged this one because of some other outlying circumstances.” But those precedents are typically siloed by data source. So we never start with training data. And this is why it’s very important for us to get a holistic view, get the buy-in from different teams, and then start to build the models and train them on the information that comes in. And then over time, that’s when the benefits of the machine learning really, really pay off. And so–
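As a hedged sketch of that bootstrapping idea: the code-of-conduct rule quoted above (international travel over $10,000 is automatically reviewed) can seed the label column that the model later learns from. The table and column names here are hypothetical.

```python
import pandas as pd

# Hypothetical transaction table waiting to be labeled.
txns = pd.DataFrame({
    "employee_id": ["E01", "E02"],
    "amount": [12500.0, 800.0],
    "is_international": [True, False],
})

def rule_flags(df: pd.DataFrame) -> pd.Series:
    # The rule from the transcript; further precedent rules stack the same way.
    return df["is_international"] & (df["amount"] > 10_000)

# Rule hits seed the initial training labels; reviewer decisions then refine
# them over time, which is where the machine learning starts to pay off.
txns["label"] = rule_flags(txns).astype(int)
```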

I think this is a key point, right? And it doesn’t matter what problem space you’re trying to solve; there are always folks who are reluctant to engage in machine learning because of the training curve, right? And it’s not very often that you’re starting with the training data; you’re creating it, right? But then there’s the promise that you’re going to do it better once you get there. So what Mark is talking about is how we’ve blended the two: how do you start with a target-rich environment for what you care about, and curate that set to accelerate the training? Because the problem is, how do you compress that training curve so you don’t feel the pain of learning? And we’ve found that to be very effective for bringing people along on the machine learning journey without sort of freaking them out at the start and saying, “Hey, you’ve got to do this,” right? Because it can be scary if you don’t really understand it. And frankly, there is some merit to that, because in the beginning, there is a lot of learning, right?

Right. And the reality as well, as Todd and Mark mentioned, is that compliance, legal, and internal audit are often cost centers within an organization, right? Using this sort of approach allows those organizations to frankly do more with less, be far more targeted, move away from the historic rules-based tests or sample selections and random samples and the like, and really, frankly, in my view, take a much more defensible approach to their risk management and compliance processes. And even though not all of us in the room necessarily represent risk management or compliance or legal functions in our organizations, oftentimes, as we talked about, the fusion of different data has been able to help organizations bring together multiple parties. So we have the sales teams and operations teams and compliance teams and legal teams and risk management teams collaborating around this data, around these models. And we’ve been able to, in many cases, help organizations, help clients, realize benefits far beyond simple compliance, legal, and integrity: operational benefits, performance improvement benefits, and so on and so forth.

Here’s another thing, which is not directly related to the use case but more to how we approach the situation, and which has led to what Jeremy said: in our role as a knowledge provider, we’ve also aimed to allow our clients to operate on their data how they want and where they want. So we’re not coming into the situation saying, “Give us all your data and then we’ll give you the answer.” Maybe that’s how things used to be 10 years ago. We’re saying, “Let’s go on that journey together. We’ve put the puzzle pieces together to enable your team to be more effective and to work with us, because we have unique knowledge that can complement what you know, so you can get to better decisions.” And we’ve found that that’s been a pretty effective model. And RapidMiner’s a huge part of it. But this digital approach with EY Virtual is a foundational approach that we have for serving clients around data, with legal, compliance, and internal audit, so we can all partner together as an integrated team around the business problem with data science.

Thank you, everyone [music].
