
Data Science for Cybersecurity: Identifying and mitigating threats with RapidMiner

Data science meets cybersecurity to protect your web application from bots.

Presented by Rodrigo Fuentealba Cartes, The Pegasus Group

In this video, Rodrigo explains a proof of concept architecture he uses to score HTTP requests, detect attackers and block them using RapidMiner Real Time Scoring, making use of open source tools such as rsyslog, a small agent written in Python and iptables.

The Problem? A major network suffered a DDoS attack. Because the attack was distributed, there was no way to trace where it was coming from. The company was under a time crunch to protect itself from the attack and get services back up and running for its regular customers.

The Solution? Using the logs of port map data, the team was able to use RapidMiner to approach the problem from another angle. By using stratified sampling, they were able to identify which packets were legitimate and which were part of the DDoS attack.

Watch the full presentation below.

00:04 Well, today, I will tell you about a little case about cybersecurity with RapidMiner. It’s not really a good case. The thing is, for example, how many of you have contact with IT departments? Yeah, please, raise your hand. How many of you are part of IT departments? How many of you are in networking, not social networking– no, no, actual cables? None of you, right? Okay. So this thing will be a little tougher. The thing is just, for networking, we already have good established software. Right? There are SIEMs. There are IDS. There are IPS, etc. Yeah? And probably if you asked your networking partners, they wouldn’t be trusting RapidMiner for those network tools or network monitoring. And why should you? There are lots of other options. But what happens if time is running– you don’t have any of those tools and you have just RapidMiner on your computer, a few logs and other analytics. Here’s a story. My first thing is that Scott Genzer, my good friend over there, told us to not use presentations, but this will take a little longer than four hours if I try to explain everything I did. So I made some screenshots, so sorry about that.

01:43 What’s the backstory? I was going to a meeting at 9:00 AM, so it was terrible, and a customer told us, “Okay. Do you happen to know anything about cybersecurity?” And, “Yeah, I work at a firm that has a cybersecurity team, but I myself am a data scientist.” So basically, the answer was no. “Why?” “We need to consult in this meeting because we are having a terrible DDoS happening right now.” “Okay. I can help you. I know a little Linux. Is that all right?” So the backstory was that we had a little job of 1,500 services exposed to a demilitarized zone. It means there’s no security in this zone. Yeah? It shouldn’t have because those are services that are collected in different things. We normally don’t use firewalls. We normally don’t use intrusion detection systems or anything like that. So the services go freely. Yeah? But if you don’t have all those tools, what could possibly go wrong? Right? The DMZ, the demilitarized zone, it was unprotected, and it wasn’t administered by firewall, other institutions. It’s inside the government. And the DMZ was compromised. So we actually had a DDoS inside a very high-speed network. So it’s a recipe for disaster.

03:24 Normally, when you check DDoS or– do you happen to not know what a DDoS is? Raise your hand. Distributed denial of service is like what happens on a Friday afternoon on a highway when you have thousands of cars and they cannot pass. It’s the same on a network. So that’s the easy explanation. Basically, it’s a network saturation from many computers that are coordinated to send requests to a certain service. Is it more clear? If you don’t understand anything, raise your hand, and I will try to explain. So the DMZ was compromised. There was no security. So nobody knew where the attack was coming from. Those computers were not insulated from the internet, but there was like a three-layer script to actually reach the server. And I think that– and I need a little help to read this number. Again, the same thing happened, 85– 1, 2, 3– so trillions, right, trillion packets were sent through the network in 4 days. That’s around, I calculate, like 4,000 Netflix movies in 4 days. So that’s an astonishing number that we had to deal with.

04:58 What did we have? We had the logs. And what are logs? Logs are data, data about what’s happening on the network. So I asked my friend in that scenario to send me the log, format data, and we have the details over here. So it’s 35 gigabytes of what was happening in an hour, in an hour, of network security. So we were in a rush. We didn’t have the possibility to install any kind of other software. And I told myself, “Okay. Let’s use RapidMiner to see and check and do some other things.” And this use case is really not something you will expect from a data scientist, but from a very creative person. The first thing we did was to create a stratified sampling of the information. Why? Well, actually, a stratified sampling from a sample of the data because we are not going to process 35 gigabytes of data in a speedy manner. So what I did was split off the first 3 million lines and put them in RapidMiner and, well, do a stratified sampling to see which ones are benign, which packets are really part of the network, and which packets were part of the attack. Right? So we had to score. Stratified sampling gave us a really small data set to work with, and it was really good.
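Editor's note: the two-step reduction described here (take the head of the log, then sample each group proportionally) can be sketched in a few lines of Python. This is only an illustration under assumptions, not the RapidMiner process used in the talk: the record layout `(protocol, packet_size)`, the choice of protocol as the stratum, and the 10% fraction are all hypothetical.

```python
import random
from collections import defaultdict
from itertools import islice


def stratified_sample(rows, key, fraction, seed=0):
    """Draw roughly `fraction` of the rows from every stratum defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        # Keep at least one row per stratum so rare groups are not lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical log records: every tenth packet is a large UDP packet.
log = (("udp", 1400) if i % 10 == 0 else ("tcp", 60) for i in range(1000))

# Step 1: keep only the head of the log (a stand-in for "the first 3 million lines").
head = list(islice(log, 500))

# Step 2: sample 10% of each protocol group, preserving the group proportions.
sample = stratified_sample(head, key=lambda row: row[0], fraction=0.1)
print(len(head), "->", len(sample))
```

Because each stratum is sampled at the same rate, the small data set keeps the same mix of groups as the head of the log, which is what makes it usable for training a scoring model.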

06:43 And the thing is that after that, after having like 3 million packets just as a sample, that was my expression, “Are you kidding me? This is insanely lots of data.” Remember, we didn’t have a month to process the data. We didn’t have a year. We had just a few hours. As soon as possible, we needed to solve these attacks. So what model to use? In these cases, when you don’t have time, the kind of model you use is paramount because you have to try for speed, for accuracy. You have to try for doing the best you could. So what I could do, I just ran Auto Model. And I told my people, “Okay. This thing will tell me more or less what kind of model can I use, or how can I make the software learn from such a small data set so I can score the rest. And hopefully, we will have some insights from the simulator or from the information to see what could we do.” So here’s an actual process. I just gave screenshots for this. And I had two models that I could use, the GLM, the Generalized Linear Model, and the Logistic Regression. And I tried to use fast learning but I don’t know if you have success with it, but I never had. So 97.6% of classification errors. So it was really a little tough.
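Editor's note: for readers who want to see what a model of this family does, here is a minimal logistic regression (a generalized linear model with a logit link) written in plain Python. It is a sketch under assumptions, not what Auto Model produced in the talk: the training values, the labels, and the single feature are made up, with average packet size chosen because the speaker later names it as an important attribute.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1) = sigmoid(w*x + b) by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Made-up training data: average packet size in bytes, label 1 = attack.
sizes = [40, 44, 48, 52, 60, 900, 1100, 1200, 1400, 1500]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Standardise the feature so gradient descent converges with a simple step size.
mean = sum(sizes) / len(sizes)
std = (sum((s - mean) ** 2 for s in sizes) / len(sizes)) ** 0.5
w, b = fit_logistic([(s - mean) / std for s in sizes], labels)


def attack_probability(size):
    """Score a new packet size with the fitted model."""
    return sigmoid(w * (size - mean) / std + b)


print(round(attack_probability(50), 3), round(attack_probability(1300), 3))
```

A model this simple trains in well under a second, which matches the constraint described in the talk: with only hours available, speed of training and scoring matters as much as accuracy.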

08:32 So, okay, I went with the GLM, and I went to play with the parameters. So what parameters are more important? And what parameters could I use in a firewall, which wasn’t a firewall, which was actually just iptables? So we had a Linux machine to manage. The average packet size had a very important weight in the data we scored. Yeah? So this is happening in just 15 minutes, more or less the time we have to explain to you. Then I used this amazing simulator to see what kinds of parameters I could use to filter packets. Yeah? So basically, the packet size is a very important thing. And then we won a no-win situation by writing the rules. So I went and wrote the iptables rules to identify the packets, and in another two hours, the attack was stopped. It was a very stressful situation, but I could help with RapidMiner and nothing else. So thank you very much.
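Editor's note: the talk does not show the actual rules, so the following is only a sketch of how a packet-size finding could be turned into an iptables command. It relies on the `length` match extension (`-m length --length min:max`); the helper name `length_drop_rule`, the `INPUT` chain, and the 0 to 64 byte range are hypothetical. The function only builds the command and does not run it.

```python
def length_drop_rule(min_bytes, max_bytes, chain="INPUT"):
    """Build an iptables command that drops packets whose length is in the given range."""
    if not 0 <= min_bytes <= max_bytes <= 65535:
        raise ValueError("expected 0 <= min_bytes <= max_bytes <= 65535")
    return [
        "iptables", "-A", chain,
        "-m", "length", "--length", f"{min_bytes}:{max_bytes}",
        "-j", "DROP",
    ]


# Hypothetical threshold: drop very small packets, as a model might suggest.
rule = length_drop_rule(0, 64)
print(" ".join(rule))
```

Returning the command as a list of arguments makes it straightforward to review the rule before applying it, and to pass it to `subprocess.run` on a host where that is appropriate.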
