Best Practices for Using Predictive Analytics to Extract Value from Hadoop

Turning big data into tangible business value can be difficult even for highly skilled data scientists. Many data scientists and analysts don’t have a deep understanding of Hadoop, so they struggle with solving their analytics problems in a distributed environment. Distributed algorithms are not always easy and intuitive, and there are many different approaches.

Watch this webinar to learn best practices for extracting value from Hadoop. You’ll see:

  • The pros and cons of different approaches to extracting predictive analytics value from Hadoop
  • Specific use cases that are good matches to each approach
  • How a visual predictive analytics platform makes creating and executing predictive analytics in Hadoop a fast and simple process

Hello, everyone. Thank you for joining us today. I’m Yasmina Greco. I’m with O’Reilly Media, and I will be your host for today’s webcast. We’d like to begin today’s webcast by saying a big thank you to RapidMiner for sponsoring our event and letting you all know that RapidMiner, the industry’s number one open source predictive analytics platform, is empowering organizations to include predictive analytics in any business process, closing the loop between insight and action. RapidMiner’s effortless solution makes predictive analytics lightning-fast for today’s modern analysts, radically reducing the time to unearth opportunities and risks. RapidMiner delivers game-changing expertise from the largest worldwide predictive analytics community. Thank you again, RapidMiner. Folks, joining us today, we have Zoltan Prekopcsak, and he’s going to talk to you about best practices for using predictive analytics to extract value from Hadoop. Zoltan is the vice president of big data at RapidMiner, the leader in modern analytics. He has experience in data-driven projects in various industries, including telecommunications, financial services, e-commerce, neuroscience, and many more. Previously, Zoltan was co-founder and CEO of Radoop before its acquisition by RapidMiner, and a data scientist at Secret Sauce Partners Inc., where he created a patented technology for predicting customer behavior. He is also a lecturer at Budapest University of Technology and Economics, his alma mater, with a focus on big data and predictive analytics. Zoltan has dozens of publications and is a regular speaker at international conferences. Folks, we’re very excited to have Zoltan with us today to present this webcast for you all. As we get the event started, I’d like to go over a little housekeeping to help you get the most out of today’s webcast. First, you’ll want to open your group chat widget, if you haven’t already done so. This is where we can interact with each other during the event and where you can submit your questions for Zoltan. We find that our audiences usually have a lot of good knowledge to share, so we encourage you all to chat freely during the event. However, if you have any questions for Zoltan, please preface them with a capital letter Q; that way, we’ll know it’s for him, and we can make sure we see it for Q&A. You can also open, move, and resize any of the other widgets. We do want to let you know that we are recording the webcast today, and we will have the recording ready, usually within 48 hours. And folks, at this time, it is my pleasure to turn the program over to Zoltan. Hello, Zoltan.

Hello, Yasmina. And thank you for the introduction. So welcome, everyone, to this webinar. As introduced, I’m heading the big data and engineering groups inside RapidMiner. What I would like to cover today is a couple of best practices and good approaches for doing predictive analytics on Hadoop. I have seen a lot of projects in the past, and many of those have been failures due to using the wrong tools for the wrong problem. So I would like to summarize the different types of problems and the different approaches that are available for doing predictive analytics on Hadoop, provide some guidance on which one to choose for a given problem, and show how to identify which one you should go for. Towards the end, I will also show how this is all done in the RapidMiner platform. So let’s get started and kick this off with some of the motivation for this topic. One of the recent Gartner surveys found that many of those data lakes, many of those Hadoop installations, are kind of useless. Useless is a big word for this, but at least they are not used to their full potential. As you dig deeper into some of the causes and reasons why this is the case, you have the usual suspects, like the skills gap that we have in data science and big data in general, and the difficulty of identifying how to extract value from those Hadoop clusters. These are the top items when we look for the challenges and causes behind these bad statistics. So what I would like to cover today is really showing some of the failures, bringing you some examples we have seen of why many of those data lakes are kind of useless, and providing some best practices and suggestions on how to move forward with your project to make sure you are not one of those horror stories.

So I prefer to think about these approaches in three different buckets, and these three buckets encapsulate different technical or architectural approaches to doing data science on Hadoop. The three of them are sampling, grid computing, and native distributed algorithms. I will go through each of these and mention pros and cons, when to use them, and why I recommend against them in other cases. So let’s start with the first one, sampling. It’s kind of counterintuitive, when you’re talking about big data and Hadoop, to bring up sampling. You always hear that you should use all your data for your decisions to make sure that you tap into its full potential. And this is mostly true, but there are some cases where this is simply unnecessary and adds additional overhead and additional time to your project, if you’re using full-blown big data solutions for very simple problems that can be solved by working on a data sample. Just to make sure that what I mean by sampling is clear for everyone: imagine that you have a Hadoop cluster, as on the right-hand side of the slide. What your analytics tool does, in this case, is just pull pieces of data out of that Hadoop cluster so you can perform calculations on it locally. So there is, obviously, data movement involved, and all the processing, all the CPU power, of the local machine is used.

So there are some obvious pros of this approach. It’s very simple. If you have a good tool that you like using on your own laptop, you can just pull a data sample from Hadoop and continue using that tool. And it usually works pretty well for data exploration and early prototyping. If you just want to get an idea of what your data looks like, what the typical distributions of the different columns are, it is completely enough to work on a data sample. You do not need huge sample sizes to calculate averages and some basic statistics. And this might sound counterintuitive, but there are some machine learning models that do not benefit from more data. For example, linear models with very few columns: if your dataset is not that wide, you only have a few columns, a few attributes, then if you are using linear models, it really doesn’t make sense to run them on billions of records. Usually, you get the very same models and the very same results if you are just using a smaller subset of those billions of records. Obviously, this sampling mechanism has some limitations. Some of your machine learning models would indeed benefit from more data. And if you need to prepare your data, if you need to clean your data, working on a sample will obviously not be sufficient; you want to make sure that all your data is cleaned up before you draw serious conclusions from it. And in these cases, your Hadoop cluster is just sitting there; it’s kind of a storage layer for you, but you’re not taking advantage of all the processing power that those machines have because you are processing everything locally. So, yeah, there are pros and cons, but, in some cases, this is a very viable approach.
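As a toy illustration of that point about linear models (this is not from the webinar, just a minimal sketch in Python with scikit-learn on synthetic data): the coefficients fitted on a 1% sample of a narrow dataset come out essentially the same as the coefficients fitted on the full data.

```python
# Toy sketch: a linear model on a 1% sample of a narrow dataset
# recovers essentially the same coefficients as the full dataset.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n_full, n_sample = 1_000_000, 10_000

# Synthetic "full" dataset with only three columns
X_full = rng.normal(size=(n_full, 3))
y_full = X_full @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n_full)

# Random 1% sample of the rows
idx = rng.choice(n_full, size=n_sample, replace=False)

coef_full = LinearRegression().fit(X_full, y_full).coef_
coef_sample = LinearRegression().fit(X_full[idx], y_full[idx]).coef_

print("full data coefficients:", np.round(coef_full, 3))
print("1% sample coefficients:", np.round(coef_sample, 3))
```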

So as I said before, for data exploration and data understanding, this is very good. Typically in the first part of any big data project, I recommend just taking a sample of your data, looking at it with your regular tools, inspecting it with charts, and making sure that the data is indeed what you believe it is. In many projects, I have seen people create huge workflows and spend countless days and weeks on processing data which turned out to be something other than they expected it to be. So make sure that the data set is what you want and what you want to build on, and then you can take the next step and maybe use some other distributed methods for analyzing that data. Sampling also works for machine learning when there are only a few basic patterns. If you only have a handful of columns and easy-to-predict outcomes, then there’s really no need for a very large-scale analytics tool. But as your data set grows, either in terms of columns or rows, you may need a different solution than sampling; it will only scale up to your single machine. If you need to do some large-scale data preparation, joining different data sets, then, obviously, a single machine and working on samples will not help you. Because if you just take a 10% sample from two datasets and you try to join them, you may or may not have matching records, while if you did it on your full data sets, you might get a completely different distribution of those joined records. And complex machine learning models like deep learning, which is quite popular nowadays, definitely need lots of data to fit the patterns that they can recognize.

One area of data mining and predictive analytics which is not really compatible with the sampling approach is looking for anomalies, for rare events. For example, fraud and/or failures in your system are typically rare, and hence, if you’re just looking at a sample, you may or may not spot them. So, in those cases, it’s always advisable to look at the whole data set and use everything you have to make sure that you identify all of those; the sampling approach is not good there. There are some horror stories with this sampling approach as well. As I said, if you start taking samples from multiple data sets and start combining them and joining them together, you may very well get to some wrong conclusions just because of the bias in that sampling. As you combine those data sets, you may get very different results than you would get using the whole sets. So use this with some caution, but if it’s really just some data understanding, data exploration, I always recommend starting with this, and moving forward, you can use some different approaches to extend your analytics needs. Typically, the tools that use the sampling approach are data visualization tools. They just reach out to the data source and pull some data so you can look at it, which is great for data exploration. The same goes for many of the programming languages which do not have native integrations with Hadoop or are not typically used with Hadoop: you just pull the data through Hive or Impala over JDBC, and then you start processing locally and drawing conclusions from that. So these are the typical tools that you would see with the sampling approach.
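For readers who want to try the sampling approach outside of a visualization tool, here is a rough sketch of what pulling a sample through Hive might look like in Python. This is not from the webinar; the host name, the flights table, and the dep_time column are placeholders, and it assumes the PyHive (Thrift-based HiveServer2 client) and pandas packages rather than a JDBC driver.

```python
# Minimal sketch of the sampling approach: pull a small sample out of Hive
# and work on it locally with pandas. Host, table, and column names are
# placeholders for whatever your cluster actually exposes.
import pandas as pd
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000,
                       username="analyst")

# A thousand records is usually plenty for basic statistics and charts
sample = pd.read_sql("SELECT * FROM flights LIMIT 1000", conn)

print(sample.describe())           # quick summary statistics on the sample
print(sample["dep_time"].head())   # peek at an individual column
```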

But let’s move on to something which is probably more interesting from the big data perspective, which is grid computing. Grid computing has been around for many decades, so it’s quite important that I define what I mean by it in this case. What I mean here is that the actual processing happens on the Hadoop cluster, so only the results are moved out; the data remains in there. But each of those processing machines works independently. That’s the grid concept: you distribute your processing by just spinning out independent tasks and independent processing jobs to crunch some part of the data. These machines do not communicate with each other. And, in this case, Hadoop is really used for the parallel processing, and not just as a data source. On the other hand, there are some downsides to this. It only works if your data can be processed independently. If you split it up into smaller chunks, for example, you can easily filter out records from each chunk and then combine the results; there is nothing wrong with that. On the other hand, if you need to do some more complex processing or machine learning on that data, there is simply no way to split it up into smaller pieces and do that independently. You need to make sure that your jobs are communicating with each other and calculating a fully optimal result.

So this is only good for those tasks which can be split up, and you can see this approach with some of the legacy statistical engines, for example; they use Hadoop clusters or other distributed environments in the grid fashion. For example, you can do it when you are doing parameter optimization. You just run on the same data but with different parameters, and you want to pick the best parameters. Those are independent calculations, and after all of those independent calculations are finished, you end up with an optimum that you can later use. In this case, you’re not really taking advantage of the innovation that happens in Hadoop; you’re just using these nodes independently as processing and storage power. So it’s not the full potential, but, in some cases, it’s completely enough, and you would just add more complexity if you went for some other solution. It’s great, typically, for computing-intensive data processing. For example, if you are processing time series data and you want to do some transformations on those time series, those are typically quite CPU intensive. Or if you are looking for many different patterns when processing text documents, as another example, or images and videos; in those cases, you’re really processing those objects one by one. You do not need these machines to communicate and share information at that point. You are just pre-processing your data to make sure that you extract the right features and the right things from your original raw data format. On the other hand, it’s not really viable for data-intensive processing. When you need to do some complex machine learning, like deep learning again or some gradient boosted trees, those typically cannot be built in this fashion because you do need those machines to communicate to achieve the best model. And, in general, if you have lots of interdependencies between the data sets, if you’re processing graph data, then it is also not viable to use these machines independently.
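A minimal sketch of the grid idea, assuming nothing beyond the Python standard library: each task evaluates one parameter setting independently and never talks to the others, which is exactly the kind of work a cluster scheduler would farm out to separate nodes. The evaluate function here is a placeholder standing in for a real train-and-score step.

```python
# Grid-style parameter search: independent tasks, no communication between
# them, results combined only at the end. A cluster would run these on
# separate nodes; here local processes stand in for that.
from concurrent.futures import ProcessPoolExecutor

def evaluate(params):
    """Placeholder for training and scoring a model with one parameter setting."""
    depth, learning_rate = params
    score = 1.0 / (1 + abs(depth - 6)) + learning_rate   # stand-in for a real metric
    return params, score

grid = [(depth, lr) for depth in (2, 4, 6, 8, 10) for lr in (0.01, 0.1, 0.3)]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(evaluate, grid))
    best_params, best_score = max(results, key=lambda r: r[1])
    print("best parameters:", best_params, "with score", round(best_score, 3))
```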

In many cases, when someone is familiar with this grid approach and may have been using it way before Hadoop was available, they tend to stick to the same pattern and do every kind of analytics with the grid approach. I have seen lots of those people write Hadoop jobs that do very basic calculations and then call them in a big loop. So, for example, you want to calculate some behavior for users who registered on January 1st, and you just call the same script again for January 2nd, and then you call it again for January 3rd. That is, obviously, the grid-computing mindset from the past, but with these data-intensive projects on top of Hadoop, you have the opportunity to optimize it and do it in one batch, in a more optimized fashion, and calculate that behavior for many user groups at once. On the other hand, there are some very good use cases which suit this grid approach. For example, a famous example from the early days of Hadoop is that The New York Times processed their whole archive with Hadoop. Obviously, that was a huge set of documents that they needed to process, and they could do it independently. So they kind of misused Hadoop in this grid-computing fashion, but that was actually a very cost-effective and powerful solution for their specific use case. As soon as you want to do machine learning on top of that, you need to move to a different approach.
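To make the per-day-loop example concrete, here is a hedged PySpark sketch of the two styles; the user_events table and its columns are assumptions for illustration, not anything shown in the webinar.

```python
# Contrast: calling one job per registration date (the grid habit) versus
# computing the behaviour for every date in a single distributed pass.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-vs-loop").getOrCreate()
events = spark.table("user_events")   # assumed table of user activity

# Anti-pattern: one full scan of the data for every single day
# for day in all_registration_dates:
#     events.filter(F.col("registration_date") == day).agg(...).collect()

# Better: one batch job that covers all registration dates at once
per_day = (events
           .groupBy("registration_date")
           .agg(F.count("*").alias("event_count"),
                F.countDistinct("user_id").alias("active_users")))
per_day.show()
```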

So, as I said, grid computing is pretty popular with some of the legacy analytics or statistical engines that are available. But, obviously, most of the news and most of the discussion around Hadoop is about those native distributed algorithms, which have been created with distribution in mind from the very start. Those are projects like Spark that have been created with the assumption that you will have potentially hundreds or thousands of machines working on the same problem. Just to make sure, again, that I define what I mean by this: in this case, the tool that you’re using is designed in a way that it runs calculations on all of the machines, and all those machines communicate. They share information with each other to get to an optimal solution as fast as they can. So, obviously, in this case, you have a holistic view of all your data and all your patterns. You’re not splitting up your data; you’re not missing any details; and it’s highly scalable. If you design such a processing engine with the assumption from the very start that it’s going to be distributed, you can obviously design it in a way that it scales well. And that’s what Hadoop is all about.

On the other hand, if you need to scale algorithms in this fashion, it’s really hard to develop new algorithms. Back when you had a single machine doing the calculation, everything was rather easy. In this case, you need to keep an eye on how much these machines communicate with each other. If you design an algorithm where all of these machines send lots of data to each other every second, then, obviously, that will overload the Hadoop cluster and your algorithm will not be very performant. There are some algorithms which can be very easily distributed and are a good fit for Hadoop implementations. There are others which are very hard to distribute, and people are writing Ph.D. theses on how to do it. And, in some cases, you need to lower your expectations for some of those algorithms and not ask for a perfect result. So you do not want a perfectly optimized model; in some cases, maybe you just want something that is optimized enough and can be calculated in a distributed way, because the exact solution may be just too complex to get out of the Hadoop cluster. Obviously, the main area where this is probably the only way is when you have complex machine learning models. If you’re doing predictive analytics and you’re using deep learning, then you definitely need something like this if your dataset is large. Obviously, if your data set fits on a single machine, then you do not need to go for this solution. But as long as your data set is large and you want to do some complex machine learning, then you need a native Hadoop implementation and native distributed algorithms to make it happen.

Also, when you have lots of interdependencies inside the data, for graph analytics, for example, you definitely need those machines to communicate with each other. And if you need to do a lot of data preparation, lots of ETL, combining data sets, like pulling user records from some other system and correlating them with your server logs, in those cases it’s really hard to do on a sample or with a grid approach. You really want to have all your data to be able to combine it in various ways and identify the actionable patterns that you’re interested in. But do not use it when your data is small. As I said, if your data fits on a single machine and your algorithm runs in a reasonable time, then there is no reason to run it on a Hadoop cluster. The same goes when a sample would reveal all the interesting patterns. If you have basic patterns in your data, then it’s not worth the effort of going for a full-blown distributed implementation, which may take more time. Actually, there have been some projects I have seen where the company invested so much into Hadoop that they wanted to solve each and every problem with Hadoop and make sure that all of their solutions were distributed and future-proof for scaling, which makes sense when you are expected to scale beyond a single machine. But in many cases, some of your problems may always fit on a single machine. I have seen a project where a complex machine learning model had been developed for many months on a Hadoop cluster, and then it was defeated by a prototype model created in an afternoon on a single machine. And the person creating it wasn’t really an expert in machine learning; the ease of the tools on a single machine simply beat the complexity of having a distributed system and trying to do everything in a distributed way. So that’s why you should sometimes just step back and evaluate: is this really something that needs a full-blown distributed solution, or can I maybe approach it in a different, easier way?

So, as I said, these three different approaches have pros and cons, and they differ in various areas. The native distributed algorithms are typically the Hadoop ecosystem projects: the different SQL engines like Hive and Impala, Spark for general-purpose processing, even the graph processing engine of Spark, the machine learning component of Spark, or H2O. Those are all created with the assumption that they will run in a distributed way. But, on the other hand, they do not provide all the richness that machine learning research has invented for us in the past few decades, and hence, in some cases, they are not necessarily the right choice. So the question obviously comes up: which one do I use for a given use case? I provided some general guidelines, or some examples at least, on when to use which. But I would argue that most of the projects that you will see in practice, especially the bigger ones, will need all three of these. Sampling is great for the early stages, for data exploration and some of the early prototyping, so you can validate your idea: does it really make sense to invest more time and effort into this project, or is it something that we can already rule out based on doing some work on a sample? Then you may need to do some complex pre-processing for time series or video or images; grid computing is great for that. If you have a good image processing package that works on your single machine, distributing it to a Hadoop cluster or some other grid system and making sure that all the images you have are processed is probably the best thing to do. And, at the end, if you happen to do some complex machine learning, trying to predict user behavior, for instance, then you may need some of those native distributed algorithms for deep learning or some gradient boosted trees to make sure that you generate a good model from all the data you have. There may be some projects where you only need two of these, but, in most cases, I would argue that having all three is best. The challenge is that there are many different tools for each of these. You may use some tool for running on a single machine, you may have some grid computing system, and then you have your standard Hadoop stack. So how do you combine all of these approaches and make sure that you get the benefit of all of them? There are a couple of solutions for this, but, obviously, RapidMiner is one of those platforms which helps you with that. We combine all these approaches and also try to provide guidance on when to use which. We have, for example, the Wisdom of Crowds feature, which recommends next steps and helps you decide where to go and which steps to explore further in your workflow. So let me switch to the demo, where I can show some of the basics of the platform and how we implement some of these ideas.


So you should be seeing my screen with RapidMiner Studio, which is our client application for working with all kinds of data sets. It is, in itself, an engine to process data on your single machine, but it also has extensions and ways to work with Hadoop clusters and with different data sources and databases. The point that I will focus on today is really the Hadoop integration; that’s the main topic of this webinar. So I will show you a couple of ways you can implement the sampling approach and the other two approaches in the product. The view that we are seeing is a data exploration view that we specifically created to support this early exploration phase. I will just open up one of my Hadoop connections, and it shows me all the tables that I have on my Hadoop cluster. In this case, for the more technical audience, we are connecting to the Hive server on the Hadoop cluster and fetching the tables from there. Whatever you have access to will show up here. We can also investigate the types of the different attributes that those tables have, which should already give you a good idea of what the data is. But the nice part is that when you just double-click, you get a small sample of that data. In this case, we are just pulling a thousand records. You can change that whenever you wish, but then you can already drill into the data. You get a tabular view, where you see the different values in your data set. In this case, I’m using a flight data set, which became quite popular for demonstrating big data tools. Each record is a flight in the United States. You can see the month and the day of the month, the departure time and the arrival time, both scheduled and actual, the flight details, the carrier, how much time it spent taxiing in and taxiing out, and whether it was canceled or delayed for various reasons. So all kinds of interesting details about each flight. And you can go to the statistics view, where you can investigate some of these details.

So, for example, all of my data is from a single year; it’s 2008, in this case, and the sample I pulled only includes that. But I can see, for example, differences in the scheduled departure time. I can open up a chart where I can see a distribution: there are somewhat more flights in the morning, but otherwise it flattens out and is more or less evenly distributed throughout the day. This is really just my sample, so I can already get some understanding of what the data looks like. I shouldn’t draw any big conclusions from this because it’s really just a thousand records. I can pull some more if I want to, but this is really just to explore and get some initial idea of what the data includes. So this area helps you explore the data a bit, but if you want to do some more extensive processing, you would almost always switch to something we call the Design view. I just right-clicked on the table and said that I want to create a process out of it. This is the process concept, kind of a visual approach that we have here. This is how we differentiate between processing on your laptop and processing on a Hadoop cluster, because it’s important to know where the processing happens. Even as an analyst doing a simple operation, you want to know whether it’s running on your full data set or on a sample. So we have this encapsulation called the Radoop Nest. If you double-click and go inside, whatever you do in there will happen on your Hadoop cluster. Whatever you do outside will happen on your local machine. So as you organize these workflows, you always know where that particular computation will happen. If you want to follow a sampling approach, you build your process outside; if you want grid computing or a full-blown native distributed approach, you build your process inside. Depending on where you build your process, you have different operators available that you can pull in and add to your workflow to achieve different things.

So let me pull up an example workflow that I have created. In this case, most of the processing is done on Hadoop. The only thing that comes out, as you will see when I show the inside of it, is a machine learning model. We bring it back to our local machine and store it, so we can later reuse it. That’s all that happens locally; all the rest happens in a distributed way on Hadoop. So let’s investigate what’s in here. As I double-click here, everything that I am showing is running in a distributed way, either in a grid computing approach or a native distributed approach. The bottom line is that you do not necessarily need to know which one you are using. As an analyst, if I just want to convert or prep my data, I’m not even sure whether that requires all those machines to communicate with each other or just the grid approach, where I can filter some records in a distributed way. Typically, data preparation, if it doesn’t include joins and big aggregations, can be solved through the grid approach, while modeling typically needs the native distributed approach. So let me start this process while I explain what happens in the background. As you put together such a process, as an analyst you typically go through these phases. You have some data understanding and data exploration, which we have just seen in the other perspective. And then you need to prepare your data for modeling: get it into good shape, maybe generate some new features. This “Generate Attributes” operator is doing exactly that. I have some data on the departure time, so, in this case, I’m generating a new attribute that tells me whether it is a morning flight or an afternoon or evening flight. I can also generate a new attribute that tells me whether it is winter. If it’s December or January or February, I want a binary attribute that describes that, because some models may or may not be able to learn that a month of 12 or 1 or 2 means winter. So I may be helping some of the algorithms figure that out.
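In the webinar this step is built visually with RapidMiner operators; purely as an illustration, a rough PySpark equivalent of those generated attributes might look like the sketch below, where the flights table and the dep_time and month columns are assumptions based on the dataset as described.

```python
# Sketch of the "Generate Attributes" step in PySpark: derive a daypart and
# a binary winter flag from the scheduled departure time and the month.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("flight-features").getOrCreate()
flights = spark.table("flights")   # assumed Hive table with the flight data

flights = (flights
    # Scheduled departure time encoded as HHMM, e.g. 1115
    .withColumn("daypart",
                F.when(F.col("dep_time") < 1200, "morning")
                 .when(F.col("dep_time") < 1800, "afternoon")
                 .otherwise("evening"))
    # Explicit winter flag so a model does not have to learn "month is 12, 1, or 2"
    .withColumn("is_winter", F.col("month").isin(12, 1, 2).cast("int")))

flights.select("dep_time", "daypart", "month", "is_winter").show(5)
```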

Also, something that the algorithm definitely cannot figure out: which are the top airports. So I need to either manually assign them and say that if any of these airport abbreviations are included, then that is a top-10 airport, or I can build a kind of subprocess here that calculates the top airports and makes sure that those are handled in a different way. So let me just flag those flights: “Hey, this is a top airport. Those airports might be busy, so keep an eye on those.” And then we are just doing some conversions and selecting the interesting attributes that might be good for modeling. In this case, we are running a Spark decision tree model. So we already have the model built; it is running on millions of records, so it took some time for the Hadoop cluster to process it. But basically, we have a decision tree where we can see the rules that this decision tree has identified on this dataset. What we are trying to predict here is whether the flight will be late or on time. And as you see here, if the scheduled departure time is after 11:15 and it’s winter, then it’s very likely to be late. The blue one is “Late” and the red is “On Time”, so there are proportionally many flights which are late under these circumstances. On the other hand, here are the ones which are mostly on time. If you follow these decision lines: it’s before 11:15, it’s not winter, so this one is the most red one, and it’s even before 8:22, so that is probably the first flight for that airplane. There cannot be earlier delays from other flights that would postpone the departure of this one. So if it’s a very early morning flight and it’s not winter, then it is very likely to be on time. So, as you can see, we could very quickly combine different approaches. We combined more of a grid approach for data pre-processing, by pulling in a couple of these boxes and connecting them, with a real distributed decision tree algorithm that has been designed to scale out to multiple machines. And, for example, if we wanted to, we could pull the result back into memory out here, in this case just the model, and do some post-processing on it, or pull some sample data and do some more data exploration on it. So you really have all these options at your fingertips to use sampling, the grid approach, or the native distributed approach for all your predictive analytics and data preparation needs. So let me switch back to the slides.
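Again purely as an illustration of what the Spark decision tree step does under the hood (RapidMiner generates and submits this kind of job for you), a minimal Spark MLlib sketch might look like this; the prepared table and the feature and label column names are assumptions.

```python
# Sketch of a distributed decision tree with Spark MLlib on the prepared
# flight data; the cluster does the training, only the model comes back.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.appName("flight-delay-tree").getOrCreate()
flights = spark.table("flights_prepared")   # assumed table with generated attributes

assembler = VectorAssembler(
    inputCols=["dep_time", "is_winter", "is_top_airport"],   # assumed feature columns
    outputCol="features")
label_indexer = StringIndexer(inputCol="arr_status", outputCol="label")  # "Late"/"On Time"
tree = DecisionTreeClassifier(featuresCol="features", labelCol="label", maxDepth=5)

model = Pipeline(stages=[assembler, label_indexer, tree]).fit(flights)

# Inspect the split rules the distributed algorithm found
print(model.stages[-1].toDebugString)
```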


So this was really just a brief demo that we could fit into this webinar, where I wanted to highlight that if you have a platform that allows you to combine all these approaches, then it’s much easier to get to your result and achieve what you want. Otherwise, you tend to misuse some of these approaches. The product will guide you to make the right decisions: when to sample, when to use this or that step in your process. For whoever is more interested in the technical details of how it works: whenever you design one of these processes, you can see the process on the left-hand side. Some of these are data preparation steps, which can be expressed easily as SQL queries that we just submit to Hive or Impala. You may have some custom Pig scripts that you have received from your data scientists, doing some very specific data transformation that may not be available in SQL; you can freely use those. You can combine it with machine learning models from Spark or use some clustering mechanisms from Mahout. The beauty of this is that each of these components follows a slightly different approach; some may be more grid oriented, some may be more native. But as an analyst, you do not really need to worry about all these pieces, all these components on Hadoop. You just need to pull in these boxes. Conceptually, you want to filter some records; conceptually, you want to join; then you want to build a machine learning model, and the product will take care of translating that to the specific Hadoop component. So you do not need to become an expert in Spark to be able to create machine learning models which, in fact, are running Spark code in the background.

So, just another quick summary on this Hadoop integration product that I have briefly shown here today. I think the biggest value-add is that even if you know some of these Hadoop components, it’s quite impossible to be an expert in all of them, and if you don’t know them, it is quite a learning curve. So the product will just speak Hadoop for you. It will translate all these conceptual building blocks into Hadoop code through various technologies and make sure to execute it in an optimal fashion, so you do not need to worry about that much. It gives you complete insight into your big data. You can sample if you wish; you can explore the data if you wish; but you can also run it in a full-blown distributed manner whenever that’s needed. And whenever you feel that some of these building blocks are too limiting and you do have the programming skills, you can always incorporate your own scripts. So either your own scripts, if you can write them, or some of your colleagues may be more technical and may be able to come up with fancy Spark scripts which use some new machine learning component in Spark; you are free to do that, and you can embed it as one step in your workflow. And, obviously, as most of these big data systems shift towards being more secure and more regulated, this seamlessly integrates with those security standards. We integrate with Kerberos authentication, which is pretty common, and it can be integrated with Active Directory as well. And we support the data authorization components like Sentry or Ranger to make sure that we comply with whatever permissions you may or may not have on the Hadoop cluster. So it’s seamless for users and quite easy to administer for IT; we do not introduce additional administration overhead. We natively integrate with those Hadoop components, as you have seen on the previous slide, and that really simplifies your journey into doing Hadoop analytics. I hope that some of the suggestions and examples I have presented earlier in this webinar will help you get started and achieve better success in your Hadoop projects. And hopefully, Gartner next year will report a smaller number than 90% of useless Hadoop clusters.

So these are all the slides I have for today, and I think we can switch to a Q&A session. But, first of all, let me note that all of the products I have shown today are free to download. You can get started with RapidMiner Studio, the client, and you can install the Hadoop integration components for free as well. So feel free to get started, give it a try, and, yeah, engage with us if you have any questions. With that, Yasmina, can you please help with the Q&A?

My pleasure, and we have several questions that have come in. Folks, you heard. We are at the Q&A portion. If you have any questions for Zoltan, please open that group chat widget, type it in, and send it in, and we’ll take as many as we have time for. All right, we’ll take them in the order they came in, and we have a question from Pavan, asking, what are some of the use cases for image and video processing?

So, yeah, some companies may have huge archives of image and video data. If you just consider healthcare, where some of the imaging technologies have developed greatly in the past few years, there is a tremendous opportunity to process them automatically. And I’m not saying we need to replace doctors or anything like that, but there is a huge potential in identifying common patterns and, in some cases, root causes. So I think that’s really an untapped opportunity. Obviously, for video, security comes up: all those security tapes where you could identify patterns. There are quite a few deep learning algorithms which are pretty good at recognizing objects and other things in images and video. So I think we will hear more and more about those use cases. They are not very common today, but there is huge potential.

Excellent. Thank you so much. The next question here is asking– this was from Dominic. Which algorithms are currently available for Spark? Just the ones Mahout contains or other ones?

No. So we are tapping into not just the Mahout algorithms but the whole machine learning library of Spark. You have different types of regressions, support vector machines, random forests. All of those algorithms are available in a code-free way, so you can just take those pre-built building blocks and pull them into your workflow. But if you have some latest-and-greatest Spark package, maybe even one you developed yourselves, you can also call those algorithms from PySpark or SparkR. So it’s pretty easy to embed new algorithms if you wish, but all the standard ones are already included.

Great. Thank you. Next question here asking, is RapidMiner still open source?

Yes. So we still have an open core. The RapidMiner Studio client is available on GitHub. We do have a couple of components which are closed source, but, currently, all of our products are freely available. So you can just get started with those and start using them, and whenever you feel like it, you can upgrade to some of our commercial packages.

Thank you. Next question is asking, which is the data available for downloading?

I’m not sure which data it refers to. If it’s the flight data set that I’ve been showing, if you just search for “flight data set” in Google, this will probably be the first hit. So it includes data for different flights, for multiple decades, in the US. So you can download those.

Question asking, does RapidMiner support scheduling? For example, how to productionize the workflow to run on a daily basis at a scheduled time.

Yep. That’s an excellent question. I think that’s really one of the benefits of the RapidMiner platform, that you have all kinds of components covering these different needs and different use cases. So with RapidMiner Server, which is also available and which you can get started with, you can actually schedule your processes for regular execution. You can run them daily or every hour if you wish. You can also trigger execution by calling web services, so it’s very easy to integrate into IT environments where you may have a trigger event when you need to execute some analytics process. So, yeah, it’s pretty flexible how you can set up the scheduling and the production-ready aspect of it.

Thank you. A question asking, do you have any plan to implement automatic analytics feature similar to SAP predictive analytics server?

Yep. That’s an interesting question. So, personally, I’m not convinced that everything can be kind of automated. Obviously, we can guide the user, and we do so. But I think we do need a human in the loop, at least for now. We can provide templates. We can have all kinds of ways to get you started. And we also have different extensions. So RapidMiner is extensible, and we have a nice community building all kinds of extensions and new building blocks. And, for example, some of those projects are approaching this and trying to automate some of the model-building aspect or maybe even the data-preparation aspect. So, yep, I think this whole space is moving into that direction, but I think what is currently available as kind of automatic analytics today is not that convincing, at least from my perspective.

Thank you. A question here asking, how are you using H2O.ai in distributed Hadoop machine learning, and how are you finding it?

Yep. So we integrated the H2O machine learning platform in our latest release in August. That is currently only available for the client application, but we have plans to add it to our Hadoop integration. Obviously, the integration approach there is through Spark; H2O has a project called Sparkling Water, and that’s how you can integrate Spark and H2O. We already have a really strong Spark integration supporting various Spark versions and flavors, so we will hook H2O into that. That’s probably coming in the next six months.

A question here from George, asking what ML and DL algorithms are employed in RapidMiner? How do we evaluate which algorithms are best for a particular prediction or pattern analysis?

Yep. This is a great question that comes up a lot: which machine learning model should I use? The RapidMiner platform is really rich in the types of machine learning algorithms we support; we have more than 200 machine learning algorithms, and as great as it is to have that many, it is also quite challenging to pick the right one. So we have this “Wisdom of Crowds” feature, where we analyze the usage behavior of different data scientists, of our users who agree to that. We are basically collecting best practices from their use cases: knowing what type of data and what type of pre-processing would result in this or that machine learning model being efficient. When should I use a decision tree instead of a k-NN? These kinds of questions are addressed by guided recommendations within the product, so as you’re building out your workflow, you get recommendations on what the next steps should be. We also recently started a kind of side project where we evaluated many of those machine learning algorithms from various aspects. For example, which ones support nominal attributes? When you have country as one of your columns, that is not a number, and many algorithms only work with numbers. So which are the ones that support that, or which are the ones that can easily convert them to some reasonable numbers? And all kinds of things like that. So you can filter down to a reasonable set of machine learning algorithms that you can pick from. And then we have lots of optimization and validation capabilities where you can verify which one works best, so you can use building blocks to decide which algorithm delivers the best performance.

Thank you so much. A few more questions here. This one’s from Dominic, asking which algorithm and steps do you recommend for time series normalization and prediction on Spark?

Yeah, that’s a good question. I’m not aware of any standard way to do that. You can, obviously, start writing your own Spark code in PySpark or SparkR, which are supported within RapidMiner. There are some Spark packages available, but I’m not aware of any of them becoming a standard. So I think the core Spark project doesn’t have anything for time series; that’s something you either need to get from individual developers or develop yourself.

Thank you. A question asking, does Radoop have algorithms for processing LIDAR data?

I need to pass on that because I’m not sure what that means. So let’s jump to the next question. Sorry about that.

Okay. Next question is, what deep learning algorithms do you support for business use cases?

So, as I said, we recently introduced the H2O integration into RapidMiner Studio, and that has a deep learning component. It’s one of those boxes that you can pull in, and it’s a quite optimized algorithm, so it will tune its parameters most of the time. If you’re an expert in deep learning, you are obviously free to tune it yourself, but by default it will try to find a good set of parameters that deliver a good model and perform some internal optimizations. Deep learning can be used for all kinds of problems. I think it’s most useful when you have wide data sets, so lots of attributes, lots of columns in your data; typically, that’s where deep learning shines. For example, in the case of images, each pixel is an attribute. In video, each pixel in each frame is an attribute, so there are lots of attributes. If you have a very wide dataset and there are complex patterns in it, then deep learning can find those. But for regular data sets with only a few attributes, deep learning is typically not that performant and not that great. In those cases, I would recommend going for something like random forests or gradient boosted trees; those typically deliver better results, faster.
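For anyone curious what the underlying H2O deep learning building block looks like outside RapidMiner, here is a minimal sketch using the h2o Python package directly; the file name and the choice of a binary target are assumptions, and RapidMiner exposes the same functionality as a visual operator.

```python
# Minimal H2O deep learning sketch: train a small network on a wide CSV
# and check validation performance. Data file and columns are placeholders.
import h2o
from h2o.estimators import H2ODeepLearningEstimator

h2o.init()
frame = h2o.import_file("wide_dataset.csv")        # placeholder wide dataset

target = frame.columns[-1]                         # assume the last column is the label
frame[target] = frame[target].asfactor()           # make it categorical for classification
train, valid = frame.split_frame(ratios=[0.8], seed=42)

model = H2ODeepLearningEstimator(hidden=[64, 64], epochs=10)
model.train(x=frame.columns[:-1], y=target,
            training_frame=train, validation_frame=valid)

print(model.auc(valid=True))                       # held-out performance of the network
```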

All right. Thank you. A couple more questions here. This one’s from Joe, asking, do you support export of the models created from RapidMiner to deploy to applications?

Yes, we do. Although, in many cases, I see that exporting just the model may not be enough. Obviously, in those analytics projects, you may do some data pre-processing to get the data into shape, some data cleaning, and so on. If you do not do that for the records you want to score, where you want to apply your model, then the model itself is not that useful. But, yeah, in some cases that is the requirement. So we can export models, and you can reuse them even as Java code, or as PMML for some models. But I think the preferred way is really, if you have a process workflow that you created and you want to apply it to new, unseen data, to just add it to RapidMiner Server, and then you can call a service that triggers that process and delivers the score, the prediction of the model. That is a much easier way to integrate these predictive models into production than just taking a model and somehow hacking it into some other system.

Thank you. A question asking, can models be evaluated using standard metrics like RMSE, ROC?

Yeah. Absolutely. I think one of the great things about RapidMiner is what we typically call “honest validation”. Many similar products tend to overfit models and show you optimistic estimates of model performance, so we are very focused on providing honest, real metrics that you would also expect to see in production. We have all kinds of validation metrics, including RMSE, and we have dozens more for regression and for classification. So you can probably find everything that you would imagine in terms of metrics.
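As a generic, tool-agnostic illustration of honest held-out metrics (not RapidMiner-specific), here is a small scikit-learn sketch computing cross-validated ROC AUC for a classifier and RMSE for a regressor on synthetic data.

```python
# Honest validation in miniature: metrics computed on held-out folds rather
# than on the data the model was trained on.
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X_c, y_c = make_classification(n_samples=2000, random_state=0)
auc = cross_val_score(RandomForestClassifier(random_state=0), X_c, y_c,
                      cv=5, scoring="roc_auc")

X_r, y_r = make_regression(n_samples=2000, noise=10.0, random_state=0)
rmse = -cross_val_score(RandomForestRegressor(random_state=0), X_r, y_r,
                        cv=5, scoring="neg_root_mean_squared_error")

print("ROC AUC per fold:", np.round(auc, 3))
print("RMSE per fold:   ", np.round(rmse, 1))
```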

Excellent. And let’s take a couple more questions. It just came in. This one is asking, how do we use it in vibration analysis?

I have to admit, I’m not an expert in vibration analysis. I assume it’s mostly time-series based. So if you just install RapidMiner Studio, there’s a marketplace associated with RapidMiner where you can search for extensions, additional functionality that either we or third-party developers provide, and there you will find a time series extension that provides you with all kinds of functions and building blocks to analyze time series. So I would probably recommend that as a starting point.

Okay. Thank you. Another question’s asking, is there a MarkLogic connector for RapidMiner or a MongoDB connector besides Apache Drill?

Yep. So I’m not that familiar with MarkLogic and its integration. The way to integrate is, if they have a JDBC or ODBC connector, then it’s very easy to connect with RapidMiner. For MongoDB, we have a special connector developed by ourselves. Obviously, MongoDB has a different data model than typical SQL databases, so, in that case, you get specialized building-block operators to connect to MongoDB and even to write results back to MongoDB. That is also part of an extension; if you just search for the NoSQL extension on the marketplace, you can add it to your RapidMiner Studio installation.

Excellent. And with that folks, we are going to say a big thank you to you, Zoltan, for being with us today, and for presenting an incredible webcast and for sharing all your knowledge and expertise.

Yeah. Thank you very much. It was a pleasure.

To everyone who attended today, thank you for attending. And folks, I know we still have a lot of questions, and we will be getting to those. RapidMiner will follow up with you all, so don’t be alarmed; your questions will get answered shortly after the webinar. I know some of you are asking about the recording. I see your messages here. Yes, we did record the webcast, and within 48 hours, maybe a little earlier, you’ll get an email from O’Reilly letting you know it’s ready, and we’ll give you the link to access it. Slides are available right now; you can click your green resource widget, the green one at the bottom of the screen there. Click it, and you can download them, because I know a lot of you are asking about that as well. All right, folks. And with that, we are also going to say a big thank you to RapidMiner for sponsoring today’s webcast. And as we close out, we do want to let you know that, as you learned today, RapidMiner is the industry’s number one open source predictive analytics platform. It’s empowering organizations to include predictive analytics in any business process, closing the loop between insight and action. RapidMiner’s effortless solution makes predictive analytics lightning-fast for today’s modern analysts, radically reducing the time to unearth opportunities and risks. RapidMiner delivers game-changing expertise from the largest worldwide predictive analytics community. A big thank you to RapidMiner for sponsoring our webcast. This will conclude today’s webcast. Goodbye, everyone.