Faster AI Deployment on Hadoop and Spark with RapidMiner and Microsoft Azure HDInsight

RapidMiner and Azure HDInsight work together to deliver a complete data science and machine learning platform for massive amounts of data using popular open source frameworks such as Hadoop, Hive, MapReduce, and Spark.

Hear from RapidMiner Product Manager Jesus Puente and Microsoft Cloud Architect Beth Zeranski in this 60-minute webinar, where they discuss:

  • The advantages of using RapidMiner with Azure HDInsight to process massive amounts of data
  • How RapidMiner and Azure HDInsight enable a broad range of scenarios including ETL, Data Warehousing, and Machine Learning
  • Further integration plans between RapidMiner and Microsoft

Looking to connect Radoop to an HDInsight cluster? Here’s how.

Hello, everyone, and thank you for joining us for today’s webinar: Faster AI Deployment on Hadoop and Spark with RapidMiner and Microsoft Azure HDInsight. I’m Hayley Matusow with RapidMiner, and I’ll be your moderator for today’s session. I’m joined today by Jesus Puente, product manager here at RapidMiner. We’re also joined by Beth Zeranski, cloud architect at Microsoft, with a focus on generating insight from business intelligence, analytics, machine learning, and deep learning. Beth has a wide breadth of experience in hardware design, BIOS development, and software, and is known for effective product execution, having enabled reliable and repeatable software product delivery at a number of companies. Most recently, Beth was at VMware, and prior to that, she worked in open source for a number of years. Jesus and Beth will get started in just a few minutes. But first, a few quick housekeeping items for those on the line: today’s webinar is being recorded, and you’ll receive a link to the on-demand version via email within one to two business days. You’re free to share that link with colleagues who were not able to attend today’s live session. Second, if you have any trouble with audio or video today, your best bet is to try logging out and logging back in, which should resolve the issue in most cases. Finally, we’ll have a question and answer session at the end of today’s presentation. Please feel free to ask questions at any time via the questions panel on the right-hand side of your screen. We’ll leave time at the end to get to everyone’s questions. As for today’s agenda: we’ll give an introduction to Microsoft Azure, we’ll give an overview of how to leverage RapidMiner and Azure HDInsight to process massive amounts of data, we’ll go over an integration demo, and then we’ll leave time at the end for questions and answers. So I’ll now go ahead and pass it over to Beth.

All right. Thank you, Hayley. So, yes, today we’re here to discuss how to do faster AI deployment on Hadoop and Spark with RapidMiner and Azure HDInsight. So if you could go to the next slide, please. So Azure HDInsight provides many advantages. It makes it easy, fast, and cost-effective to process massive amounts of data, and it’s cloud native, global, secure and compliant, productive for you and your developers, low cost, and extensible. Next slide, please. Currently, it’s the only service in the industry to provide an end-to-end SLA on your production workloads. It creates optimized clusters for Hadoop, Spark, Hive, Kafka, Storm, HBase, and R, and we continue to add additional features to it. Okay. Next slide. And it’s also global. So if you go to the next slide, we can see that we have many data centers around the world, and Microsoft provides more regions than any other cloud provider. Every week and every month, it seems like they’re adding additional data centers and growing the build-out so that we have more worldwide locations, and HDInsight can run in all of them. It’s also available in the Azure Government cloud in the US, China, and Germany, so that we meet sovereign compliance requirements.

Okay. Next slide, please. It’s also secure and compliant. So there’s enterprise-grade protection of your data, provided along with monitoring, virtual networks, encryption, authentication, and RBAC, role-based access control. And we support most of the popular industry and government compliance standards. Next slide, please. It also makes you and your developers productive. If you can go to the next slide, please? You can see we provide support for all the popular IDEs. So we have Visual Studio, Eclipse, and IntelliJ for Scala, Python, R, Java, and .NET. Data scientists can also collaborate on the most popular notebooks like Jupyter and Zeppelin, and we continue to add additional functionality. Next slide, please. It’s also low cost. HDInsight is cost-effective because you only pay for what you actually use. You create clusters on demand, then you can scale them up or down as needed. And because compute and storage are decoupled, we provide both performance and flexibility. Next slide, please. And it’s extensible. You can seamlessly integrate with the most popular big data solutions with a single click. You can easily extend your cluster capabilities with additional applications or edge nodes, or you can customize with script actions. Next slide, please. And HDInsight provides value in all of these verticals. So manufacturing, banking, healthcare, government, and retail have all begun to adopt HDInsight because of the convenience of using it. Next slide, please. And in summary, HDInsight is a fully-managed cloud service, which makes it easy, fast, and cost-effective for you to process massive amounts of data. And with that, I would like to hand this over to Jesus, so he can show you how RapidMiner and HDInsight work well together and provide a very compelling solution. All right. Thank you. Over to you, Jesus.

Okay. Thanks, Beth. So yes, now I would like to talk a little bit about how RapidMiner and Azure can work together to provide real benefit for analytics processes and projects. So first, well, RapidMiner is really a platform that’s made up of three products. One is Studio, which we’ll see here in the demo; it’s the platform where you design your processes. Then there is Server, which is the execution backend, but it’s also the platform that allows for collaboration between users, data scientists, teams of data scientists, and so on. And there is Radoop, which is the connection between the Hadoop world, the big data world, and RapidMiner. For RapidMiner Server, for example, there are several ways in which Azure and RapidMiner can work together. With Server being the backend of the platform, it can be run within the Azure platform: there is a specific VM already configured with the database and everything ready to run. So in very few minutes, you can create your own RapidMiner environment and use it for testing, for your own processes, for collaboration, and so on. It has a bring-your-own-license model, so any license that you’ve got, even the free one, works, and you can use and try RapidMiner Server there.

There are several ways in which an analytics project can grow in size. One, and I’m talking now about collaboration, would be more people joining the project, maybe with more use cases; then the way to grow would be to add RapidMiner Server. Each data scientist could have their own design environment, their own Studio, and then RapidMiner Server provides that kind of collaboration layer. Also, if there are more use cases or the processes get bigger and bigger, then this kind of environment can grow to form a RapidMiner Server cluster. So these are ways in which an initial setup can grow. There is also another dimension, let’s say, which is data. Right? So there can be a time when more and more data needs to be analyzed in many use cases. And then when a certain threshold is crossed, in the many gigabytes or maybe terabytes of data, the memory of a single VM or even a single physical machine is not enough, and then we need to go to Hadoop environments. And that’s when Radoop comes in, and also Azure HDInsight.

So I would say that’s the way big data can be analyzed. And this combination between RapidMiner, Hadoop, and Azure HDInsight is really, really useful for data scientists. The complexity that comes with big data is really that Hadoop environments are not a single product. Sometimes it’s hardly a single platform. It’s a collection of services, sometimes very heterogeneous. You have Spark for execution. You have Hive for data transformation and all that data handling. You have YARN as a scheduler. You have very different services, each one with its own language or its own way to work. And that’s where both RapidMiner and Azure create a great simplification, right? On Azure’s side, HDInsight is really a way to make infrastructure setup very easy. So in a few minutes, you have a fully-managed, fully set-up environment with all the services that you need, just ready to start analyzing the data. And then on RapidMiner’s side, Radoop is the way to connect to those services. And really, it’s a way to make all the complexity related to these multiple services transparent to the user.

So we say, we speak Hadoop so you don’t have to. That means a data scientist who wants to analyze data within a Hadoop environment would otherwise need to know about Spark and how it works and how you run services or jobs there. You would have to know about HiveQL. You would have to know about the way everything is organized, which is really complex. And you don’t really want your data scientists to focus on the infrastructure. You want them to focus on the business use cases, right, on the data, on what they know best, and that’s really what Radoop does. It works exactly like RapidMiner Studio, as we’ll see in the demo a few minutes later. But really, we have these boxes, called operators, that encapsulate functionality, and that functionality is not really related to Spark or Hive or any of these services. That happens under the hood, and the only thing that the user has to handle is the data-related functionality. So it’s about aggregating data, about joining tables, about how do I train a logistic regression or a deep-learning model or anything like that. So it’s about “How do I solve my business problem,” and not “How do I compile this particular code,” or anything like that.
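(For readers following along at home, here is a minimal sketch of the kind of code a data scientist would otherwise have to write by hand against Spark and Hive; the table and column names are illustrative assumptions, not the demo’s actual schema.)

```python
# A rough sketch of the manual Hadoop work that Radoop operators encapsulate;
# table and column names here are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("manual-hadoop-etl")
         .enableHiveSupport()
         .getOrCreate())

# Without Radoop, even a simple read means hand-written HiveQL ...
credit = spark.sql("SELECT id, amount, duration FROM credit_data")

# ... plus knowing the DataFrame API for every further transformation step.
avg_by_duration = credit.groupBy("duration").avg("amount")
avg_by_duration.show()
```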

Right. So everything is code-free. You don’t really need to code anything. Everything is done for you. And especially, you don’t need to care about the underlying technologies. That’s done by Radoop. On the other hand, we are aware that many people use code here. There are scripts in R, in Python, and so on. You can incorporate those into Radoop’s workflows, into Radoop’s processes, so you can have your R scripts, your Python scripts, your HiveQL scripts, and incorporate them into the global process. But in any case, you can do everything else just code-free with the operators. And that includes not only the typical ETL operators, but also modeling, also validating these models, calculating accuracies, and everything that a data scientist really needs to do with the data to get information out of it, and behavior and predictions and so on, which is really what the business needs, right? That’s solving the use case, right? And everything, as it says here, safe and sound, because we not only integrate all the services, but also all the security layers: Kerberos, Apache Ranger for authorization, and everything else, right? All this is covered.
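(As a hedged illustration of that script-embedding idea: RapidMiner’s Execute Python convention uses an rm_main entry point that receives and returns pandas DataFrames, but treat the exact signature and the column names below as assumptions.)

```python
# Sketch of a Python script body that could be embedded in a process;
# the rm_main convention and the column names are assumptions.
import pandas as pd

def rm_main(data: pd.DataFrame) -> pd.DataFrame:
    # Derive a simple feature from two existing columns.
    data["amount_per_month"] = data["amount"] / data["duration"]
    return data
```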

This is how we do it. So I don’t really want to get into the details. This is the architecture. But basically, you have Studio here to the left, so that’s what a user would actually use, Studio, for the design. But I have also added here a third-party application as a way to run your processes if you have a different client that wants to extract information from there. And then RapidMiner Server acts as a kind of proxy: on one hand, if there are non-Hadoop tasks, you can run them there, and then Hadoop tasks are run on the most appropriate engine. So usually, data-related work or data transformations use Hive, sometimes Hive on Spark or on Tez. Some of the tasks run on Spark, reading from HDFS, using Kerberos, and so on. And for HDFS, by the way, specifically for HDInsight, we support reading from HDFS or, for example, from a data lake or blob storage. So that’s also available as a data source. That’s the architecture. And then this picture to the right is basically what a typical process in Radoop looks like. We’ll see it now in the demo, but that’s basically how it works. We add these boxes with encapsulated functionality, and the typical use cases are, on one side, on the ETL side. Usually, transforming big data is something that takes a lot of time and requires lots of resources. This is a way to do it in a very clear way, again, in a purely visual way. And then of course, the main strength is also in the modeling, right? Modeling is based in part on the Spark MLlib library, but also on any other operator that we have in Studio, including extensions. So the typical use cases could include text analytics or time series; a very typical business case would be predictive maintenance, and many others, right?
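(To make the storage options concrete, here is a rough sketch of what reading from HDFS, blob storage, or a data lake looks like at the Spark level on HDInsight; the account, container, and file names are illustrative assumptions.)

```python
# Illustrative URI schemes for the storage backends HDInsight can use;
# account, container, and path names are placeholder assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-sources").getOrCreate()

df_hdfs = spark.read.csv("hdfs:///data/credit.csv", header=True)  # cluster HDFS
df_blob = spark.read.csv(
    "wasbs://data@myaccount.blob.core.windows.net/credit.csv",
    header=True)  # Azure Blob storage
df_lake = spark.read.csv(
    "adl://myaccount.azuredatalakestore.net/credit.csv",
    header=True)  # Azure Data Lake
```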

So that’s basically a summary, but I wanted to spend more time on the demo because I think that makes things clearer, and it’s a better way to show, really, how this works. So first, maybe I want to show a little bit of the environment that I am going to be using during the demo. It’s a live demo with a real-life environment. This is Microsoft’s Azure Console. So here, I have first created an HDInsight environment, right? It’s a small one. It took a bit more than 10 minutes to really deploy, so it’s here. It is working. So this is Ambari. This is the manager of the environment, and I can see here all the services that I mentioned in this complex infrastructure that we will be able to use within RapidMiner: YARN as the scheduler, Hive, Tez, and so on. I have my list of hosts, and I can also monitor things here. And then I have my RapidMiner Server, which is here. I’ll be able to have my queues for it, specifically for tasks that do not require big data, so tasks that are not forwarded to HDInsight. But also, it will be a kind of proxy to that HDInsight environment, right? And then, this will be like the backend, and this is Studio. So this is really what a typical user, a data scientist, will see and will work with.

This is the canvas, where all these operators will be dragged and dropped, and here we’ll create these analytics processes. And for those who are not familiar with Studio, this is the repository. So we have local repositories, but also remote repositories on the server where different users can collaborate. And here I will be able to keep my models, the results of my processes, and so on. And here I have the operators, which are really the boxes that I will be dragging and dropping to create my workflows, my processes, right? These operators have encapsulated functionality related to data access to different things like databases, files, cloud sources, lots of things related to ETL like filtering or pivoting, and then a lot related to modeling. And specifically in the case of Radoop, Radoop is the extension. It replicates, basically, the same thing, but with operators that under the hood call Hive scripts or Spark jobs or any other Hadoop service, in this case, HDInsight services, right? I can search here to have a view of everything that’s available, which is quite a lot, basically, in terms of ETL, modeling, and model validation, calculating the accuracies, and so on.

One of the main operators, or basically the main operator here for Radoop, is this one, the Radoop Nest, which is the one that encapsulates the only part that’s related to the infrastructure, which is the connection. Right here I have, or my administrator has, created lots of different connections to different environments. And here I have the one I’m going to use now, the HDInsight one, and I can go here and also edit this connection. The connection can be created by just importing the data from the manager, in this case, the Ambari that HDInsight provides. And then this is also a box that I can get inside. So everything inside here is going to happen in Radoop. Everything outside is happening outside. And I have these ports here. So for example, I can take some things out of it and also pass some input into and out of the Hadoop environment. Usually, you wouldn’t want to move a lot of data in and out of the big data environment, because the rationale for having such an environment is to keep the data there and move the jobs to where the data is, and not the other way around. But I can move things like configuration or models to keep them in the repository. I can move anything very easily just by connecting with these lines from outside or inside the Nest, right? Then here, I can do things like, for example, retrieve some data from Hive, right?

Let me even show you. So the way to get some data out of the Hadoop environment, of the HDInsight environment, is, for example, in this case, reading some data from here, like this Credit Data, right? I could filter it, but maybe before doing that, I can show you that there is a way to really navigate all the data rows, all the data that I have in my connection here. For example, I have these two tables: Credit Data and Customer Data. I can see the data; this is just a sample. Obviously, I don’t want to bring terabytes of data from here. So this is a configurable amount of data, in this case, 1,000 samples, 1,000 rows. And here I have some information. I’m going to use this table as an example to show the functionality. So this is credit data, so I have some information about the credit history, these credits, the amount. So it’s a finance example. And I also know if this credit was paid back or not, right? That’s the kind of table it is. And I have another table related to the customer: so the customer’s age, the job they have, whether they own a house, whether they’re employed or not, and so on. And the goal of this example would be to mix the data from both tables, and then eventually come up with a prediction that I can use for different customers, and say, “Will these other customers in these different situations pay the credit back or not?” Right? That’s what I want to see here.
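(For reference, the table browser’s sampling corresponds roughly to the following; a minimal sketch, assuming the Hive tables are named credit_data and customer_data.)

```python
# Pull only a small, configurable sample instead of terabytes of rows;
# the table names follow the demo but are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("peek-at-tables")
         .enableHiveSupport()
         .getOrCreate())

credit_sample = spark.sql("SELECT * FROM credit_data LIMIT 1000")
customer_sample = spark.sql("SELECT * FROM customer_data LIMIT 1000")
credit_sample.show(5)
```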

So as you can see here, I can very easily read this data from Hive. I can also, by the way, read from external databases. I can read from files, and in this case, I would read directly from HDFS, right? I could read, eventually, from the local file system, from blob storage, from Data Lake, or directly from HDFS. But in my example, I’m going to use just the tables. Here, I used Credit Data and Customer Data. And ideally, I want to join them, right? So I would do something like this. I have lots of things that I can do, like renaming attributes, changing the types, selecting them, or generating new attributes. There is a lot that I can do. In this case, by way of example, I will do just a simple join. And the join is, in this case, an inner join, so this works just like any typical SQL join. And I have to say which attributes are common so I can join on them. So in this case, it’s ID. And that’s it. So this way, I am doing some very simple ETL. As I’ve said, there is a lot that I could do, but I just want to start by showing how this works. So the next thing I can do, maybe as I work, is to see if I’m doing the right thing. So I just run it in the background. I can see basically what it is doing, and the output will be this joined table, right? It takes some time because it’s a big data environment, but I can still work here, right?
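(Expressed in code, the same inner join would look roughly like this; the join key "id" follows the demo, while the table names are assumptions.)

```python
# Sketch of the inner join the Join operator performs under the hood;
# table names are assumptions, the "id" key follows the demo.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("join-etl")
         .enableHiveSupport()
         .getOrCreate())

credit = spark.table("credit_data")
customer = spark.table("customer_data")

# Works just like a SQL INNER JOIN on the shared "id" column.
joined = credit.join(customer, on="id", how="inner")
joined.show(5)
```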

So I’ve done my first initial ETL. Nothing is happening on my laptop. This is all happening in the HDInsight environment. This has finished, so I can take a look at the table. And yeah, I now have a single table with everything in it: things related to the customer, things related to the credit itself, right? So what else could I do? Well, let’s say I have finished with my ETL, and then I want to do some prediction, right, as I’ve said? So let’s say I have these predictive models, right? And in this case, these are the ones that come from MLlib, and later on, I’ll show how to do something even beyond that with any operator that we have in Studio. But let’s say I want to use the typical decision tree, which is the easiest to show, so that’s the one I’ll show. You can see here, it’s not really a black box. I can similarly adjust its settings; for example, tweak it a bit here so it gives a bit better result. So it’s exactly the same as if I were using a coding library, right? Only here, it’s easier because I can see everything, I can see the changes, and so on. And for example, I see here there is an error, right? Why? Because it says here I need a label. I have to say, if I want to make a prediction, what I want to know or what I want to predict, right? So I’m already being guided here by the system, saying, “Okay, you need to select what you want to predict,” in this case, whether this customer is creditworthy or not. And then immediately, the right operator appears, and it’s already there, right?

Again, this would create a decision tree, so maybe I can run the process in the background to see if I’m doing the right thing. But of course, there’s something important here, and this is, I think, very interesting for data scientists. It’s not just that I want a prediction, or that this is the model. This model might be very bad. I don’t know. I need to see the accuracy of it, and I want to really validate how it works. So to do that, the best option is to use a validation, right? In this case, I can do a split validation. What this operator does is split the data into two parts, so that I use one for the training of the model and the other one just for the test, so I can make sure that there is no bias there. So what I do is use the validation here. I want to see the model here, and I also want to know the results, right? And here, then, I need the training. I train my decision tree. And I also want to test, so I want to calculate the performance. So I will really apply the model here to the test. I use the test data to check whether the model is good enough or not, right?
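(For comparison, here is a hedged sketch of the same split-validation idea written against Spark MLlib directly; the table name and the columns, including the "creditworthy" label, are illustrative assumptions.)

```python
# Split validation in raw MLlib terms: one split, train on one part,
# test on the other. All names here are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = (SparkSession.builder
         .appName("split-validation")
         .enableHiveSupport()
         .getOrCreate())
data = spark.table("credit_joined")  # assumed result of the earlier join

# One split: 70% for training, 30% held out for testing.
train, test = data.randomSplit([0.7, 0.3], seed=42)

label = StringIndexer(inputCol="creditworthy", outputCol="label")
features = VectorAssembler(inputCols=["amount", "duration", "age"],
                           outputCol="features")
tree = DecisionTreeClassifier(labelCol="label", featuresCol="features")

model = Pipeline(stages=[label, features, tree]).fit(train)
predictions = model.transform(test)
```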

So again, very easily, I have finished this one. So I can check again: oh, this is my decision tree. I can check it; well, it’s quite complex, but here it is. I can follow it, for example, if this is the case. It has split on people having more or less than €200 in their account, or depending on the duration of the job, and so on. So I can see exactly the paths within the decision tree, right? But the decision tree alone doesn’t tell me much, so I will only really find out once I have this validation. So let me again run the process in the background. Right. And one thing that’s very interesting to notice here is, as you can see, in the 10 minutes that this demo has taken, I have read data, I have done some very simple ETL, and I’m running a validation. This validation really contains a loop that is splitting the data. It’s doing things in a certain way, and part of this is using Hive as a technology. This decision tree is really using MLlib, so it’s being run as a Spark job. And if you think of how much time this would take in any coding environment or in any other kind of environment, I think it’s really very valuable to see that this has taken, really, only a few minutes, right? So obviously, a more realistic case would have many more operators here. The ETL would be more complex. The modeling could be different. We could try different models, so it’s not that every business problem is going to be solved in 10 minutes. We have to be honest. But the productivity that one gains by using a graphical interface like this one is really amazing. And in this case, as I’ve said, all the technology is transparent and we have a wealth of functionality here.

So as you can see, we support pivoting, generating new attributes from the other ones, calculated by any mathematical formula; we support text, and so on. So one thing that I wanted to point out here: this is basically the list of the predictive models that we integrate from MLlib, and that includes segmentation. It also includes some calculation of outliers, predictions, and so on. But this is a limited subset of what someone might want to do. This one has finished, by the way. So again, I have my decision tree and I have this performance vector. So I have here this 92%, so I know how good my model is, my class precision, my class recall, how much I can really rely on this model in a production case, right? So I can make sure that this is good enough, right? This would be, by the way, a training and validation process. It’s also possible, using this Apply Model, to have more of a scoring environment. And by the way, scoring takes a shorter time, so one can mix the two worlds, the Hadoop world and the in-memory world. So any model that is created here within HDInsight, within this Radoop environment, can be exported and scored, or applied, in a fast in-memory web service, for example, within RapidMiner Server. That’s possible. And also the other way around: if I have a model that I have trained in-memory, I can use it here inside HDInsight. And all these combinations can be moved around. For example, I have moved this one out. So in this case, my model was here, created just now. So I can again take a look at the process here and everything.
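(Continuing that sketch, the performance vector corresponds roughly to computing accuracy, class precision, and class recall over the held-out predictions; this assumes the predictions DataFrame from the split-validation sketch above.)

```python
# Rough code equivalent of the performance vector; assumes the
# `predictions` DataFrame from the earlier split-validation sketch.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction")
accuracy = evaluator.setMetricName("accuracy").evaluate(predictions)
precision = evaluator.setMetricName("weightedPrecision").evaluate(predictions)
recall = evaluator.setMetricName("weightedRecall").evaluate(predictions)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```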

So that’s within, let’s say, the Spark MLlib world. But there is another option which is even more interesting. Outside of the Radoop extension, we also have, as I’ve mentioned, lots and lots of operators related to modeling. In this case, for example, I can count more than a hundred here. And there are many predictive ones, like different neural networks including a version of deep learning, and support vector machines of different flavors. And also for ETL, there are lots of options, like in this case not only segmentation, but cleansing, dealing with missing attributes, binning, and so on. For all of this that in principle cannot be used inside Hadoop, we have created a different operator we call SparkRM. Let me show you the process itself. So, in a similar Radoop Nest, I go here. And if you see here, it’s basically the same sort of process as done before. And by the way, you saw there were some red signs here. It was validating. So first, it checks all the metadata of the tables to see if everything is correct and matches. So it’s not only checking that the whole thing makes sense; it’s also checking that the data makes sense within this process, right?

So in this case, instead of the initial validation, I’m using this SparkRM. And SparkRM, what it does is, again, this is an operator where I can double-click and get inside. And inside here, I am using gradient boosted trees, which is another model which I cannot use from MLlib but which I can use because it’s right there in RapidMiner. So I can use this one, I can use the different neural networks that we have, the different versions of Naive Bayes or SVMs, and so on. Everything can be run, and what this does is encapsulate the whole RapidMiner process and run it inside this box. So basically, everything can be run there. If I run it now, this will take a bit longer. But basically, it will again do the same process and run these other modeling capabilities within a Spark job, right? And then what I get is a different model, and I can get a different performance. So what I’m doing with this is extending the capabilities of Hadoop, so I can not only use the typical models that are available within Spark MLlib, or even within R and Python for Spark. Everything that’s in Studio is available, and that includes, by the way, extensions.
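(SparkRM runs whole RapidMiner subprocesses inside Spark rather than generating library code, but purely for comparison, the Spark-native analogue of gradient boosted trees would look something like this; the table and column names are assumptions carried over from the earlier sketches.)

```python
# Spark-native gradient boosted trees, shown only as a point of
# comparison with SparkRM; all names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import GBTClassifier

spark = (SparkSession.builder
         .appName("gbt-comparison")
         .enableHiveSupport()
         .getOrCreate())
data = spark.table("credit_joined")  # assumed table from the earlier join

train, test = data.randomSplit([0.7, 0.3], seed=42)
label = StringIndexer(inputCol="creditworthy", outputCol="label")
features = VectorAssembler(inputCols=["amount", "duration", "age"],
                           outputCol="features")
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=20)

model = Pipeline(stages=[label, features, gbt]).fit(train)
```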

So in particular, I would mention two of them, which are very, very useful and provide their own use cases: time series and text analytics. In the case of time series, we provide a full extension with options for windowing and for dealing with the different aspects of time predictions. This is very useful for predictive maintenance, for example, or for predicting the behavior of customers. Anything like that is very, very useful, and everything can be done encapsulated as a SparkRM job. And similarly with text analytics, we have our text extension. And that means that any text data that’s included, for example, logs that are kept within the Hadoop environment, let’s say an HDInsight environment, can be read and analyzed internally, extracting the words, calculating any aggregations there, and also creating predictions for, let’s say, sentiment analysis. We’re reaching other use cases with that, right? Those are some of the things that you can do.
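(As a rough illustration of what windowing means for a time series, here is a sketch that derives lagged features with Spark window functions, the kind of preparation used for predictive maintenance; the table and column names are hypothetical.)

```python
# Windowing turns a raw series into lagged features a model can learn
# from; the sensor_readings table and its columns are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = (SparkSession.builder
         .appName("ts-windowing")
         .enableHiveSupport()
         .getOrCreate())
readings = spark.table("sensor_readings")  # machine_id, ts, value (assumed)

w = Window.partitionBy("machine_id").orderBy("ts")
features = (readings
            .withColumn("lag_1", F.lag("value", 1).over(w))
            .withColumn("lag_2", F.lag("value", 2).over(w)))
```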

And so that’s basically the demo that I wanted to show, and I hope I have conveyed the main message, which is that I have done lots of things here: created ETL, validated processes, run different models, and analyzed, in this case, decision trees. I have been able to view my data and everything without a single line of code, based on basically two technologies. One is HDInsight, which provided the infrastructure; we didn’t need a huge IT department for that. I just needed to click a few times, and then this whole environment was created for me. And the other is RapidMiner as the layer providing the analysis of the data; analysis that goes end-to-end, meaning from reading the data from various sources. We have talked here about Hive and HDFS, but there are also alternative sources. I mean, we have specific operators for reading from obviously different databases, SQL Server, and others; but also more alternative things like, let’s say, Twitter or Salesforce and the like. Then we have all these ETL operators. And then obviously what’s strongest here is everything that’s related to modeling: modeling, validation, calculation of accuracies, and then obviously the operationalization of all of this through RapidMiner Server. So that’s basically what I wanted to show today. And yeah, I’m done. Maybe we can have the Q&A session now.

Great. So thanks, Jesus, and thank you, Beth as well. So as a reminder to those on the line, we will be sending a recording of today’s presentation within the next few business days via email. And I see a bunch of questions coming in. We’ll go ahead and try and address those now. But if you guys have any additional questions, please feel free to enter those in the questions panel as mentioned earlier in the presentation. So I’ll go ahead and start asking some of these questions now. The first question I see here, it looks like it’s for you, Jesus. This person is asking, “Is there a cross validation operator for use within Hadoop?”

So the operator that I have used is Split Validation, which is a single split. It’s not really cross validation. There is the option to use the standard cross validation that we have within SparkRM, so it’s possible to do that. But usually, the reasoning behind using Split Validation is that, on one hand, big data processes are usually long-running. So that means that if we had to run them 10 times, or a number of times, it would really take a long time. And on the other hand, with big data, if we split it in two, we can say that there is probably not that much bias, compared to small data. So that’s the main reasoning why the main operator there is Split Validation, but you can also use Cross Validation from the standard Studio. Yeah.
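(To see why k-fold cross validation is costly on big data, here is a hedged MLlib sketch; each of the 10 folds triggers another full training run over the data. The table name is an assumption, and the "features"/"label" columns are assumed to be assembled as in the earlier sketches.)

```python
# k-fold cross validation re-fits the model once per fold (and per
# parameter combination), which is expensive on long-running big data
# jobs; all names here are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = (SparkSession.builder
         .appName("cross-validation")
         .enableHiveSupport()
         .getOrCreate())
data = spark.table("credit_features")  # assumed pre-assembled features/label

tree = DecisionTreeClassifier(labelCol="label", featuresCol="features")
grid = ParamGridBuilder().addGrid(tree.maxDepth, [3, 5]).build()
cv = CrossValidator(estimator=tree,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(labelCol="label"),
                    numFolds=10)  # 10 folds => roughly 10x the training work
model = cv.fit(data)
```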

Great. Thanks. A question for you, Beth. This person’s asking, “How do I get support for HDInsight?”

Ah. Thank you, Hayley. So yes, there are several ways that you can get support for HDInsight. We have a service health dashboard that tells you about the health of the service. We also have community support, and there are several options for engineering and billing support. And we can provide URLs so that people have those available.

Great. Thanks. This question’s for you, Jesus, I believe. “Do we have to use any specific version of Hadoop and SparkRM to work with RapidMiner 8.1, our latest version? And do we have to go with any specific Hadoop distribution?”

So regarding the versions of Hadoop, we support a number of them. Obviously, today, we are talking about Microsoft, but we also support other distributions. I mean, our idea is always to support the latest and also keep support for everything that’s widely used, and not only for Hadoop itself, basically the distributions, but also the components. So when I create a connection, one of the things I have to specify is the Spark version. In fact, Spark is usually a bit more challenging than other components because it evolves very quickly, so there are several versions every year. It also happens sometimes with Hive, with every component. So we always try to stay updated, so there is no hard requirement for a specific version.

Great. Thanks. Another question here. This person is asking, “What if the training set and the testing sets are separate files? How do you deploy both in RapidMiner?”

Yeah. So the single operator that I showed does that split for you. If you already have that split made, or if you want to do it with different operations, you can do that and then train on the data, using the models. So let me maybe go back here. This is basically the two parts, so this is the part for the training and this is the part for the testing. So here, I’m using the two sides of a single operator. But in any other case, I could just use these two things in parallel, or even have two different processes for it, or mix them in different ways here. That’s also possible. Yeah.
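(In code terms, the separate-files case is simply two reads instead of one split; a minimal sketch with illustrative paths.)

```python
# When train and test already live in separate files, read each one and
# skip the split step; the paths are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("separate-files").getOrCreate()

train = spark.read.csv("hdfs:///data/credit_train.csv",
                       header=True, inferSchema=True)
test = spark.read.csv("hdfs:///data/credit_test.csv",
                      header=True, inferSchema=True)
# Fit the pipeline on `train`, then call model.transform(test) as before.
```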

Thanks. Another question for you, Jesus. This person is asking, “Is it possible to combine this with RapidMiner Auto Model?”

So Auto Model basically had its first release just a few weeks ago, and for now it’s just in-memory, so it’s not really for big data. What it provides is really a way to understand the options that you have for modeling. I think, in fact, the greatest value of Auto Model is really to assess the different models, so you don’t really need, in general, big data for that. You can have a sample of the data, run Auto Model, and then once you have the right model, once you have decided that these two models work and these other three or four don’t, because the accuracy is not good enough or they take too long, then you can move all that into Hadoop, right? So there is no automatic translation, if that was the question. But I think it’s something independent that you can use even if your use case is within big data.

Great. Thanks for the explanation there. Another question for you, Beth. This person is asking, “What are my options for moving data into a Windows Azure Storage Blob account?”

Okay. So there are a number of options for uploading data into Blob storage, depending upon the volume that you have. We have ExpressRoute as well as many different pieces of software for doing the uploading. So what I will do is add some URLs to the email that goes out after this that demonstrate different ways you can do that.
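(As one hedged software example: the azure-storage-blob v12 Python SDK can upload a local file to a container; the connection string, container, and file names below are placeholders.)

```python
# Minimal upload sketch using the azure-storage-blob (v12) SDK;
# connection string, container, and blob names are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<your-connection-string>")
blob = service.get_blob_client(container="data", blob="credit.csv")

with open("credit.csv", "rb") as f:
    blob.upload_blob(f, overwrite=True)
```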

Great. Thanks. Yeah. As a reminder, we’ll be sending a recording to everyone, so we’ll make sure to include some of those links that you’re asking for. One more question. This person is asking about HDInsight HBase: can I connect to Phoenix?

So at the moment, from RapidMiner Radoop, you can basically connect to or read data from Hive or directly from HDFS, which in HDInsight would also be a blob or a data lake. But no, not HBase, no.

Great. Thanks for the clarification. Yeah, so it looks like we’re just about time here. So if we weren’t able to address any questions that you had here on the line, we will make sure to follow up with you via email within the next few business days. So thanks again. Great presentation, Jesus and Beth, and thanks again to everyone for joining us for today’s presentation. We hope you have a great day.

All right. Thank you, Hayley and Jesus.

Yep. Thank you. Goodbye.