Operationalizing Artificial Intelligence with RapidMiner and Talend

RapidMiner and Talend are helping global enterprises operationalize data science and AI. With RapidMiner’s automated data science platform and Talend’s enterprise data management platform, data scientists can develop machine learning models that rapidly scale to meet business needs, provide best-in-class customer experiences, boost business outcomes, and outsmart the competition.

Watch this 60-minute webinar with RapidMiner and Talend where we detail how this partnership is helping organizations leverage the two platforms to operationalize predictive models for use cases such as real-time customer experience, predictive maintenance, and fraud detection.

Hello, and welcome to today’s webinar: Operationalizing AI with RapidMiner and Talend. Just a quick thanks to the members of the Talend and RapidMiner communities who joined us today, as well as newcomers. I’m Jeff Bashaw, and I’ll be your host. I run channels here at RapidMiner. And I’m joined by my colleague Bhupendra Patil, who runs all solutions. We’re thrilled today to have Mark Balkenende from Talend with us as well. He’s in technical product marketing to support the event. Just three points of order before we dive in. First, today’s webinar is being recorded, and you will receive an email in the next one or two business days with a link to the on-demand version. Feel free to share that with any colleagues who weren’t able to attend today. Second, if you’re having any AV issues with today’s broadcast, your best bet is to log out and log back in, as this resolves the situation in most cases. And third, we are going to have a Q&A portion at the end of this event, so if you have questions as we go, feel free to enter them into the Questions panel on the right side of the screen. With that, we’re ready to dive in. A quick look at our agenda today: I’ll have a few words on RapidMiner, and then I’ll frame two timeworn data scientist dilemmas and how collaborating with a data engineer can help solve them. Then a quick peek at our partnership with Talend and the integration. Then I’ll pass the ball to Mark, who’ll cover Talend and then frame the flow of a collaborative demo that he and BP will run for us. We’ll come back to me for a quick case study, and then we’ll wrap up and adjourn with your questions.

The best performing companies in the world are winning with AI, and it’s not just the digital giants like Netflix, Amazon, Facebook, and Google. Great companies with great brands that we all know and love, like Domino’s Pizza, have undertaken significant digital transformations and are seeing the appreciation in their stock grow accordingly. I would just point out here that Domino’s is a client of both RapidMiner and Talend. Just a few quick words on RapidMiner, the world’s number one data science platform: over 500,000 community members, over 30,000 global organizations, over 1,000 universities, and over 350 global clients. We’ve been a Gartner leader in the Magic Quadrant for Data Science and Machine Learning Platforms for six years running, and in the last two years, a leader in the Forrester Wave on Predictive Analytics and Machine Learning. If you’re familiar with the digital watering holes that serve the data science community, you know KDnuggets, and if you’re there, you know that we’re the number one open-source platform for data mining and analytics, six years running. And G2 Crowd consistently rates us highest in ease of use. I’d now just call your attention to some of the logos of great companies that RapidMiner counts as global clients; we’re proud to serve them.

As machine learning and artificial intelligence become more of a central fixture in the analytics factory, some old ways are going to need to change, and we’re going to need a new paradigm, a RapidMiner paradigm, if you will, which is characterized by lightning-fast business impact of data science projects, right? An environment that’s got depth for data scientists but is simplified, and in many cases automated, for everyone else. No black boxes. Any model that RapidMiner automatically generates can be cracked open, interrogated, collaborated upon, right, truly understood before it’s put in the path of the business and a high-value use case. It’s a unified collaboration platform, which I’ll touch on briefly in the next slide. And like Talend, we’re open source and extensible. The leading companies believe that’s an important element of success. Again, a unified platform that moves the needle for analytics teams from data prep through machine learning all the way out through model deployment. You can see the many personas that we serve with our platform, but today we’re going to focus on two, the data engineer and the data scientist, and how they can collaborate. Right. So let’s talk about the first dilemma faced by data scientists. This is actually from a Ventana Research study that surveyed enterprise clients about machine learning challenges, and the number one is accessing and preparing data, right? A data scientist who works for RapidMiner told me that without clean data, she’s just another source of noise. The second dilemma for the data scientist is what we call the final mile dilemma, and this is Gartner from 2018: over 50% of data science projects are never fully deployed. Great models that can impact the business don’t see the light of day, or their day in the sun, because data scientists have difficulty deploying them into the business processes and applications where the rubber meets the road.
So imagine a world now where analytics teams can collaborate and operationalize data pipelines with machine learning in a common, governed platform across the lifecycle, prototype, test, production, with access control and versioning, right? Two personas truly collaborating in the analytics factory.

And in terms of innovation, what we wanted to introduce today, and some of the work that we’re doing together as companies, is embedding machine learning right inside of data pipelines, really trying to tackle that final mile problem I just described. First, the data engineer will go ahead and build a Talend flow, which includes a connector to RapidMiner. The data scientist will collaborate with the data engineer on these data prep steps. But then as Talend is moving the data, it will make a service call to RapidMiner, and RapidMiner will get the data from the Talend node, load the process and the model, execute them, and deliver that enriched data back to the next node. Talend then can ship it up to a business system for consumption. Right? This is a very simple design pattern that is fast, but we also feel is quite durable. And how neat to have the flow of data right into the data warehouse where analysts can take a look at it and take action, right, with the machine learning already onboard. Just quickly, there’s a consensus on collaboration – this is Ventana from 2018 again – most of the survey respondents, over 80%, considered the collaboration between data scientists and data engineers important. Just a few words on our partnership: Talend and RapidMiner have a technology and innovation alliance. We’re two category-leading open-source companies with open-source web integrations, and we’ve got more to come. But we’re excited to share with you what we have today. With that, I’m going to hand the ball to Mark. He’s going to make a few remarks on Talend and frame the collaborative demo that he and BP will run. Mark, over to you.
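The design pattern Jeff describes can be sketched in a few lines: a pipeline step hands a batch of rows to a scoring service and receives the same rows back, enriched with prediction columns. This is a minimal illustration, not the actual connector; the field names and the stand-in scoring function are made up for the example, and in the real integration the scorer would be an HTTP call to the RapidMiner endpoint.

```python
# Sketch of the "enrich in the pipeline" pattern: rows flow through a step
# that calls a scoring function and merges its output into each row.

def enrich_with_predictions(rows, score_fn):
    """Return a copy of each row with the scorer's prediction fields merged in."""
    enriched = []
    for row in rows:
        result = score_fn(row)      # in the real pattern, an HTTP call to the model endpoint
        merged = dict(row)
        merged.update(result)       # add prediction / confidence columns
        enriched.append(merged)
    return enriched

# Stand-in scorer: flag customers with many dropped calls as churn risks.
def dummy_scorer(row):
    churn = row["dropped_calls"] > 5
    return {"prediction": "churn" if churn else "loyal",
            "confidence": 0.9 if churn else 0.8}

scored = enrich_with_predictions(
    [{"customer_id": 1, "dropped_calls": 7},
     {"customer_id": 2, "dropped_calls": 1}],
    dummy_scorer,
)
print(scored[0]["prediction"])  # → churn
```

The point of the pattern is that the original columns travel through untouched, so the downstream warehouse load needs no special handling for scored versus unscored data.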

All right. Thanks, Jeff. Looks like I’m ready to go now. So, Talend at a glance. I’ll talk a little bit about Talend, and then we’re going to dive right into the demo. So Talend is a leader in data integration and integrity as well as big data. A little bit about Talend: we’re over a thousand employees now, 1,100 employees, and highly recognized as a leader in both Gartner and Forrester: the Gartner Magic Quadrant for Data Integration and Data Quality, and the Forrester Waves for Big Data and Data Quality. We’re at 100% growth in our cloud products, and I’ll show you a little bit about our cloud products today. Revenue continues to grow massively every year, and we have over 3,000 customers today. And just a glance at our customers: like RapidMiner, we’re across many different industries and verticals, with some very well-known names. 88 of the top Fortune 100 companies are Talend customers today. So a very impressive list of customers and customer stories that you can find, using Talend today for many different use cases out there. And the data value chain is really what Talend’s mission is about. It’s about collecting the data, being able to collect data from all sources, building the governance around that data, governing the data that’s coming in and how it’s coming in, and then transforming and sharing it. And it’s really about the speed and trust at scale that turns the raw data you’re pulling in into experiences and insights. And according to Gartner, 60% of our enterprise users’ time, for data engineers and data scientists, is spent collecting, finding, or looking for the data. And with Talend, we really work hard to provide tools and platforms that allow everybody in the organization to see the data and trust the data that they’re using.

And we’re going to talk a little bit today about the use case of industrializing machine learning and AI with Talend and RapidMiner. And again, it’s about taking that raw data and bringing it into the pipeline. And it doesn’t matter what user, what persona, like Jeff said. We have data engineer personas, integration specialists, data scientists. We provide tools for data stewards to resolve data issues in the process. And all the way through to citizen integrators, we have great tools like our Stitch Data Loader and our newly introduced Pipeline Designer that give even easier, web-based ways for people to ingest and transform data on any platform. And we really work across all the different platforms that are out there, with Spark and Python; we support Databricks and Qubole as advanced analytics platforms, all with RapidMiner doing the machine learning for us. And it’s really about building that data value chain and optimizing that machine learning, so really being able to make it easy to deploy and easy to use, just like Jeff was saying. We’re working on the pipeline process of getting that data in and out of your machine learning operations. And scaling intelligent insights and experiences is really, as Jeff talked about and just to reiterate, what we’re going to show you in the demo today. We’ll use Talend Studio, which is one of our tools for data integration, the one that we use for the much more complex examples like we have today. We’re going to do an example with customer churn data and predict when customers are going to churn using RapidMiner. And with RapidMiner Server, which BP will show you, once the model’s created and been published, it creates APIs. And we’ll show you briefly with Talend API Services how you can make your applications smarter, and build, deploy, and test APIs quickly, all with the entire Talend and RapidMiner platforms tied together.

So the demo, as we said: the first thing we’re going to do is collect call records, and I’m going to show you Talend Studio as well as Talend Cloud, how you can build the integrations with our Studio-based environment, an Eclipse-based environment. We’re going to get those different records, drop rates, loyalty flags, from different sources and combine them all into one table within Snowflake, where BP will then take the data, do some massaging and preparation, train different models and see the best final results, and then publish that out for consumption with a REST endpoint from RapidMiner. And we’ll show how Talend’s API Designer and Tester really help you accelerate once you have those endpoints from RapidMiner Server, so you can start embedding them in your applications and share and use those insights much faster.

So with that, I’m going to jump into the Talend Studio. All right. For those that haven’t seen it, this is the Talend development environment you’re seeing on the screen, where we provide a graphical design environment for you to do your development. Just to walk through this quickly: over on the left-hand side is what we call the repository, and I’m using a project that’s in a Git repository in our Talend environment, shared through Talend Cloud, so I can share projects and integration designs with my teammates. And I have different types of jobs available to me. I have what are called standard jobs that can run on really any type of Java platform where you can run our small agents. We also have big data batch jobs, where we provide Spark batch processing, and then big data streaming jobs, which are also Spark, but Spark streaming processes using big data environments. And below, we can have metadata stored about different database connections or files that we may use. In this case, I’m using some files with some Snowflake database connections. And over on the far right is what we call our Palette. These are hundreds of components that we provide. Some are for connections to different systems like business intelligence and cloud platforms; others are for things such as data quality processing, doing standardization and masking and shuffling and things like that. And you just drag these from the Palette onto the canvas, or you can do a text search as you design and develop. It’s very quick and easy. And as you drag and drop the components on the screen, you start building your integration.

So here, I’m pulling a file off S3, reading the file, and writing it into Snowflake. I’m doing some transformations, some very simple ones, within what’s called a tMap. And you can see the configuration of the components below: what is the user, what is the table, how do I want to interact with the table, and things like that. And the next flow is just doing a lookup for the customer detail to make sure I know all the international details, and I’m providing that into another table called Customer Details. Now, BP needs the data combined, so I have a second job that takes the two different data sets – the call data records and the customer information – and joins them into a third table that ultimately, BP, you’re going to use as the data scientist. So again, I’m acting as the data engineer, and I’m providing some customer data for BP in this example. Now, this particular process, by the way, is what is called ELT, or pushdown, on Snowflake. All of this runs by generating SQL that’s then executed on our Snowflake instance. So instead of running it through the Studio, I’ve published it up to Talend Cloud, which is a fully-hosted environment. And I’ve already logged in, and I’m in what’s called the Management Console. And I want to drill in and run those two jobs together, so I’ve created what’s called an execution plan. I’m going to go to my plans, and then my production environment, and I’m going to find the one that says Process Call Records. I’m going to go ahead and run this.
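The ELT/pushdown step Mark describes can be illustrated with a tiny stand-in: the join of call records and customer details is expressed as SQL and executed inside the database itself, rather than pulling rows out through the integration tool. Here sqlite3 stands in for Snowflake, and the table and column names are invented for the example.

```python
import sqlite3

# In-memory database standing in for the Snowflake warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE call_records (customer_id INTEGER, dropped_calls INTEGER);
    CREATE TABLE customer_details (customer_id INTEGER, loyalty_flag TEXT);
    INSERT INTO call_records VALUES (1, 7), (2, 1);
    INSERT INTO customer_details VALUES (1, 'gold'), (2, 'silver');
""")

# Pushdown-style ELT: the combined table is built by SQL running in the
# database; the tool only ships the generated statement, not the data.
conn.execute("""
    CREATE TABLE combined AS
    SELECT c.customer_id, c.dropped_calls, d.loyalty_flag
    FROM call_records c
    JOIN customer_details d ON c.customer_id = d.customer_id
""")
rows = conn.execute("SELECT * FROM combined ORDER BY customer_id").fetchall()
print(rows)  # → [(1, 7, 'gold'), (2, 1, 'silver')]
```

The design point is that for large tables, leaving the join to the warehouse avoids a round trip of the full data through the integration engine.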

So again, Talend Cloud is a fully-hosted, managed iPaaS environment. We manage all the management processes; the jobs that you build and develop you can publish up to the cloud, and we give you different ways of running those processes. If it’s not a process that requires a big data or advanced analytics environment like Databricks or Qubole or Cloudera, you can run it on what’s called a remote engine if you want to run it inside your own environment, or we also host cloud engines where you can run things, which is what this happens to be running on because I don’t have a lot of security concerns. So what I’ve done is created an execution plan of the two jobs that you saw in my Studio: first to load the data into my cloud environment and then into Snowflake, and then use the ELT pushdown to combine the records together. Both of the jobs ran successfully, and I can dig into the logs, view the logs from here, and drill into the actual jobs themselves. But ultimately, I want to make sure that we provide the right data to BP. And so here is the data. BP, as the data scientist, I’m going to hand this over to you now so you can take over and show us how you can help predict customer churn. Over to you, BP.

Excellent. Thank you, Mark. So far, obviously, Mark has done the heavy lifting in terms of getting me access to the data. I then only have to focus on building the models. He’s made sure that the data is joined, unioned, or whatever we need to do to create a profile for me to start working with. This is exactly the data set that Mark was showing earlier. I’m connected to the same Snowflake instance that Mark connected to, so pretty much that’s the hand-off at this point. To get started, I’m going to quickly right-click on the table here and switch to Auto Model. Now, some of you may be familiar with this part of our product, but quickly, to highlight what’s going on here: RapidMiner is showing me the raw data. Obviously, I pulled this from the Snowflake instance that Mark loaded the data into. And the problem I’m trying to solve today is a prediction problem, so I’m selecting, obviously, Prediction here, and the system is asking me to figure out which column I want to focus on. So I’m immediately able to jump into my data science problem rather than having to worry about the data acquisition and wrangling and all that. Once I’ve selected a column, I hit Next here. At this point, I can see the distribution of the values, as I would expect. The churners are a smaller fraction of our dataset, which pretty much validates my hypothesis here. I click Next. Now, what RapidMiner is doing on this particular screen is showing me the quality of the data that Mark has prepared for me. And again, nothing wrong with what Mark did, but it’s showing me if there are any statistical trends or patterns in this data. It will highlight things like whether there is correlation between what you’re trying to predict and a specific column, whether something looks like an ID, whether columns have stable values or missing values, and so on, right? And these are going to be important for understanding whether a certain column should be included or not.
RapidMiner actually makes it even easier by quickly giving you a red/orange/green indicator.

From business knowledge, I understand that having an international plan, or maybe not having one, can affect my bills, especially if I’m making international calls. And maybe it’s a good factor to consider for churn cases. Maybe we are losing customers when they don’t have an international plan but are still making international calls. So I’m definitely going to override the system’s recommendation here and keep this column. On the other side, customer phone number: most of the time, obviously, this will not and should not have any predictive power, so I’ll just leave it out of the equation for now. On the next screen, what RapidMiner is offering me is a selection of particular algorithms that I can try on the data. And you notice it picks somewhat ideal families of algorithms, so it’s not just a hundred variations of a decision tree; different types of algorithms, different types of techniques, are applied to the data set. And beyond that, it actually also does a few more additional things for us. For example, I can ask it to extract information from dates. Sometimes it might be important to see how long someone has been a customer, so rather than just using the start date, I want it to automatically calculate the time between now and the start date, or the day of the week, month of the year, and those kinds of things. You may or may not have a lot of extra data. In this particular case, I don’t have any, but if I had, let’s say, customer call transcripts or support tickets, that’s textual information. That’s a lot of rich information, but obviously, many times that gets ignored in machine learning problems. Whereas with RapidMiner, I could simply turn on extracting text information, and RapidMiner would start leveraging any textual column in that input data set. Again, I don’t have any for now, so I’m going to leave it off.

Along with that, what Mark prepared for me had about 18 or 19 different columns. In many situations, I might have hundreds of columns. RapidMiner will allow us to automatically find the right combination of columns. And in fact, if I switch over to Automatic Feature Selection, it will also automatically generate new columns for me. A quick example: I may have a total bill, and I may have the total number of calls. Those two columns by themselves are meaningful, but if I do something like divide the number of dollars paid per month by the number of calls per month, I get what the customer is spending per call. And that is a meaningful number in the telecom world: understanding whether a particular customer makes a lot of long calls or short calls, and those kinds of patterns, might help the model learn faster or better. With this one switch, RapidMiner will automatically find those relationships between columns using various mathematical expressions. All of this could result in hundreds and hundreds of combinations of columns, and Feature Selection makes sure you don’t end up with an overwhelming model. For now, I’ll just leave it at No and hit Next. On this screen, all I have to decide is, “Hey, do I want to run this locally, or run this on a server?” Especially as the scale of your problems grows, you might want to run things on the RapidMiner Server so that you can leverage its distributed computing power. But for now, I’m going to run it locally and hit Next. And the system is going to start building models for me here; in the interest of time, I’ll switch over to a run that’s already done. Generally, the models are going to be built in a few seconds, as you see in the runtime here. But this is what the output looks like.
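The spend-per-call example above can be written out by hand to make the idea concrete. This is just an illustration of the kind of ratio feature automatic feature generation can derive; the column names are hypothetical, and a real implementation would also need the divide-by-zero guard shown here.

```python
# Derive a "spend per call" feature from total bill and call count, the
# ratio feature BP describes. Rows are plain dicts for the example.

def add_spend_per_call(rows):
    for row in rows:
        calls = row["total_calls"]
        # Guard: customers with zero calls would otherwise divide by zero.
        row["spend_per_call"] = row["total_bill"] / calls if calls else 0.0
    return rows

customers = add_spend_per_call([
    {"total_bill": 120.0, "total_calls": 60},  # many short calls
    {"total_bill": 120.0, "total_calls": 12},  # fewer, longer calls
    {"total_bill": 15.0,  "total_calls": 0},   # inactive customer
])
print(customers[0]["spend_per_call"])  # → 2.0
```

Note how the two customers with identical bills become distinguishable once the ratio is added; that separation is exactly what can help a model learn faster.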

At the end of all the iterations and all the training exercises, RapidMiner has highlighted the best model so far. I can then drill down into the models in terms of accuracy, classification error, precision, recall, f-measure, whatever, right? And you’ll notice that in the data scientist world, I’m still not talking about putting this model into production, and that is where a lot of problems happen, right? I am able to find models with really great accuracy here; I want to make sure I’m picking the right model for my business. To drill down further, I can look at individual models. I can look at their performance. I can look at the lift measures. But I can also collaborate with my business by working in the Predictions tab here to show what the output looks like on the test dataset, and not just the output, but the reasons behind the predictions. You’ll notice we color-code from a shade of red to green, showing what is influential and not influential in the prediction of the output here. But I can also do simulations on the model. For example, if I switch over to the simulator here, with this set of inputs on the left, my model, or at least this near-best model, predicts a 97% chance that this particular combination is loyal. It will also highlight which factors are influencing that, so I can quickly say, hey, let’s switch this to Yes, and you’ll notice my likelihood of loyal actually just dropped. I can obviously tweak other parameters and validate whether my model is behaving the right way. And obviously, this is something I will do in collaboration with the business folks. They understand the factors, they can run hypotheses on this, and so on.

At the end of the day, looking at all the various models, the various performance measures and whatnot, as a data scientist, I should be able to pick the right model to deploy. Now, keep the problem in mind here. What we really wanted to do was help the business put this model into production. Finding the best model is one part of this journey. I want to take this further and actually deploy this model. Now, one of the things RapidMiner does for you as it runs Auto Model is save those models in a location that you prefer. So you’ll notice I have actually saved all the models and all the iterations that I’ve built so far. I can simply take one of the models, let’s say the decision tree model, and deploy it. And the deployment process looks something like this. In this case, I’m going to deploy my decision tree model. This is my input data set. I’m going to apply the model, and then clean up a few names, etc., so that I can push it back to my Snowflake, which is my official data store. So I have now built a workflow around the model I decided on, the decision tree, that can take some input and place the output back into the output table. Now, once we have built this workflow, this is where the data scientist says, “Hey, I’ve built a model. I’ve selected the right model along with the business. Mark, please go ahead and deploy this for me.” Now, the beauty of the integration between RapidMiner and Talend is we can take this solution as you see here and make it available as a REST endpoint for Mark to consume. And it looks something as simple as this. The workflow is saved here. I simply right-click and hit Browse, and that opens up the workflow in the RapidMiner Server interface. In the very bottom-right corner, you’ll notice there is an option that says Export as a Web Service. I simply click on that, and if I have to pass certain parameters on the URL, I could do that.
But in this case, we’re going to pass the data points as a request body. If I hit Submit here, that will create an endpoint for me, and the endpoint will look something like this: I will have a URL that I can take.
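From client code, calling an endpoint like the one just created might look as follows. This is a sketch only: the URL, the payload shape, and the authentication requirements are assumptions for illustration, so check your own RapidMiner Server endpoint for the exact details.

```python
import json
import urllib.request

# Hypothetical endpoint URL for the exported churn-scoring web service.
ENDPOINT = "https://rapidminer-server.example.com/api/rest/process/score-churn"

def build_scoring_request(url, rows):
    """Package the rows to score as a JSON request body, as described above."""
    body = json.dumps({"data": rows}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_scoring_request(ENDPOINT, [{"customer_id": 1, "dropped_calls": 7}])
# To actually score, you would send it (with credentials configured):
#   response = urllib.request.urlopen(req)
print(req.get_method())  # → POST
```

Separating request construction from sending, as above, also makes it easy to unit-test the payload without a live server.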

So far, I have built a model and made it available as a web endpoint, but we’re still not yet in a place where it is operationalized, right? I mean, having an endpoint on one side is useless until somebody starts consuming it. And that is where our component comes into play, in terms of what we are doing with Talend here. What we have jointly built as a partnership is this new component in Talend. So remember, Mark mentioned the Palette earlier, right? The Palette allows you to connect to various systems, data sources, and data outputs, and to do a bunch of data transformations, and so on. We are now introducing a new component, as we call it here, that allows us to start communicating between Talend and the RapidMiner Server. The URL of the model endpoint that we just created is simply pasted here. Obviously, RapidMiner will secure everything, so you’ll need to provide a username and password. And depending on the data volumes, you can ask the system to automatically figure out a batch size, or you can provide a number. And what does this do for us? I’m going to quickly run a sample here. Imagine the data is coming in from Snowflake. I want to score that data with the model that we just deployed, and then I want to show you the output here. I’ll simply hit Run. The Talend job that we just built is compiling, and once it’s deployed, this is what the output looks like. And again, I’m logging it here to show you the quick, immediate effect, but what’s really happening here is, you’ll notice, each of my rows has three or four new columns. It’s actually telling me the prediction and the confidence of the prediction. Now obviously, in this case, I’m just logging it, but in the real world, this would be going back into my Snowflake system or whatever the next operational system has to be.
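The batch-size setting mentioned above comes down to a simple chunking idea: rows are grouped into fixed-size batches so each service call scores many rows at once, instead of one round trip per row. A minimal sketch (the component's internals are not public, so this only illustrates the concept):

```python
# Group rows into fixed-size batches before each call to the scoring service.

def batched(rows, batch_size):
    """Yield successive batches of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

rows = [{"customer_id": i} for i in range(10)]
batches = list(batched(rows, 4))
print([len(b) for b in batches])  # → [4, 4, 2]
```

Larger batches mean fewer HTTP round trips but bigger payloads and more memory per call, which is why the component lets you tune the number for your data volumes.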
So literally, within a matter of a few clicks, I can turn a RapidMiner model into a web service, and because I use Talend for my deployments and my ETL workflows, I can take this model and place it into the Talend workflows that we saw here. However, obviously, the journey between the endpoint that we created and the process that is happening in this workflow needs to be smoother, and for that we actually use a little bit more of Talend’s help. To discuss more of that, I’ll switch over to Mark to present several other components of the Talend ecosystem that we used here. So Mark, back to you.

Yeah. Thanks, BP. So what you’re seeing right now is what we call the API Designer. BP showed you a great way, with the new component for RapidMiner, to integrate at a batch level within the Talend Studio and deploy it out. Another way is to actually take that API and build out API specifications, and with Talend’s cloud platform, we give you a way, with what’s called the API Designer, to start building out those specifications. And with the Talend platform you can implement that, or in this case we’re just going to be calling the API itself through the API Tester, and that’s really all we need to do. But this helps us define what the API should look like: the response, the request, and the specifications around those, in OAS 3 format. And we can start integrating this into applications within our enterprise much faster and easier. So we’ve defined this; now we want to test it. So BP, if you want to just click on Test API. And because it’s a shared repository between the Designer and the Tester, it takes the information that I provided for the endpoint and puts it right into a test method for me. It has the JSON that I defined as the body, and all I have to do is hit Send, as long as I don’t need any authentication, which we didn’t set up for the demo. And once I hit Send, it’s going to send out that request, and I can immediately start seeing the results and how the API works. I can test it. And now, we can start embedding these analytics and the results of our predictions right into applications such as your call center application and things like that. So it’s just another way of Talend helping you and RapidMiner bring trust and speed to your data, and share it that much faster by helping you build out the full end-to-end cycle. So thanks, BP.

Excellent. Thank you, Mark. With that, I’ll bring it back to Jeff, who will summarize our demo for the day.

Thank you, BP. As we know, customer analytics are near and dear to us. We’ll talk a little bit more about that on the next slide. But it’s important to note that for any subscription business, whether you’re a telco or something else, a 1% impact on churn can get quite meaningful quickly. We see AI case studies everywhere, in terms of high-value cases across industries. We called out Domino’s on the front end, right? But we see them in manufacturing and retail, like Domino’s, healthcare, government, transport, and other services, and it’s a pretty even mix for us. What we do see is that these projects across these industries really fall into three main lanes: the customer analytics around driving revenue, which I just mentioned; the operational analytics around reducing costs, and that can be in all manner of projects; and then the risk analytics around avoiding risks, and that can be things like finding outliers, detecting fraud, and so forth. But here’s a case study with RapidMiner and Talend and a steel manufacturer. There are several factors that affect the yield of high-quality steel, right? And by placing sensors at different stages in the manufacturing process, the manufacturer can capture data, right, around process, quality, and yield. But when you apply RapidMiner predictive models to that data, now the manufacturer can make inline decisions, right, on adjusting downstream production, or even interrupting a process that would not yield desired results at a later stage, based on the chemistry that the sensors are indicating. And then, as we’ve talked about, using Talend, we can now put this right into an operationalized mode in a business workflow, right? So this can become something that is really embedding machine learning right in the flow of the data, and right in the flow of the metal.

Now it’s time to wrap up and take your questions. So, the joint value proposition and the call to action. In terms of RapidMiner and Talend, we talked about improving collaboration between data scientists and data engineers, and tackling those two dilemmas we called out around data access and model deployment and operationalization. But the end game is really about rapidly developing and deploying machine learning models to drive economic impact and digital transformation in your business. In terms of the call to action, please download the RapidMiner products and get started. They’re great. Likewise, please go to Talend and download their products. They’re great. And we would love to hear from you; contact us for more information. That’s Mark and BP and me, and there’s our contact info with our emails. Just a quick reminder: today’s webinar is being recorded, so you will receive an email with a link to the on-demand version in the next one or two business days. And with that, I think we’re going to move to some of your questions. In this question and answer period, we’ll alternate between Mark and Bhupendra picking the questions. So Mark, over to you.

Yeah. We have lots of good questions in the Q&A already, so we’ll get started. I’m going to take the first easy one for Talend. Question: is Talend an ETL tool or a database? We are definitely not a database company at all. We are a data management, data governance, and cloud company. So yes, you can do ETL, ELT, and all types of processing, real-time, streaming, and batch, with Talend. And we’re fully hosted in the cloud; we provide support for all the major cloud vendors: Azure, AWS, Google, and so on. So we are definitely a data management cloud platform tool. That’s my first one. BP, you’re up next.

Excellent. I’ll take another easy win for RapidMiner. The question is, can only Auto Model-built models be deployed, or can we deploy other sorts of models also? The short answer is you can deploy any model using this mechanism. That could be a RapidMiner model built using the regular process or Auto Model. You can also build models using Python or any other language of your choice, and they can still be deployed as web services and then integrated with Talend as you saw in the webinar today. Hopefully that answers your question. Back to you, Mark.

All right. So another one: is the Talend Cloud Management Console available in the open-source free part of Talend? It is not, I hate to say. The Management Console and Talend Cloud itself are a hosted service provided by Talend. You take the processes and deploy them out to Talend Cloud. So that is a subscription service that we provide from Talend.


Over to you, BP.

Thanks. I’ll take the next question for RapidMiner. The question is, once a model has been deployed, how do you monitor it on a regular basis? For today’s webinar, we focused purely on the integration part, where we showed how you can deploy models and consume them in your enterprise ETL workflows powered by Talend. But as a platform, RapidMiner solves problems along the entire data science project lifecycle, so we definitely do have model management and model tracking tools. They can be set up for alerting, for retraining, and a bunch of other options. We will definitely reach out to you for a private tour and probably a deep dive, but in the meantime, check out the rapidminer.com resource pages and you’ll find some existing examples of that. Hopefully that answers your question. Back to you, Mark.

Sure. So, does Talend Cloud run on Azure or AWS? Is it integrated within an existing Azure or AWS subscription? Talend Cloud today runs on AWS, but in the very near future, in the second part of this year, hopefully this summer or fall, we will be announcing that it will also be available running on Azure. As far as being part of a marketplace subscription from Azure or AWS, we’re also working on parts of that coming out this year within what’s called Pipeline Designer, one of our data integration tools. So more to come on Azure and on the subscription side through that. But today, it is fully available on AWS in four different major regions, and Azure is coming very soon. Back to you, BP.

Excellent. I will actually combine a couple of questions here. Somebody asked, can I keep enhancing the models that were built behind the scenes? The short answer is yes. Remember, we created a deployment as a web service that was powered by a RapidMiner process. You just have to change or update the model that is referenced in that process; the rest of the downstream pieces, the web service, the workflows, the integration, will continue to work. So in fact, the way the integration works now, your data scientists can keep improving the models, and absolutely nothing needs to change on the web service or the integration with Talend. Both of these enterprise tools will continue to work through the changes. Another follow-up question on that was, can we allow the user to select which model to use? Absolutely, yes. In this particular context, as the data scientist, I jumped ahead and said I’m going to use the decision tree. But let’s say you wanted to do A/B testing, or you’re not sure which model to use. Both of those models could be exposed by the same process via a parameter, and when calling from Talend, we could have passed parameter A, parameter B, or more, or something of that nature, to pick one of the models dynamically. So between the two platforms, you can set up a very complex, enterprise-grade model execution framework that allows you not only to retrain, but also to parameterize which models are used. Hopefully that answers that couple of questions. Back to you, Mark.
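As a sketch of the model-selection idea BP describes, the snippet below builds a scoring URL with a model parameter that a calling job could pass. The parameter name, the model identifiers, and the URL are all illustrative assumptions; in RapidMiner, this kind of choice would typically be wired to a process macro rather than hard-coded.

```python
from urllib.parse import urlencode

# Hypothetical set of models the process exposes; the names are
# illustrative, not the actual identifiers from the demo.
ALLOWED_MODELS = {"decision_tree", "gradient_boost"}

def scoring_url(base: str, model: str) -> str:
    """Build the scoring URL with the chosen model passed as a parameter."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unknown model: {model}")
    return f"{base}?{urlencode({'model': model})}"

# A caller (e.g. a Talend job) picks the model dynamically per request.
print(scoring_url("https://rm-server.example.com/api/process/score",
                  "decision_tree"))
```

Validating the parameter up front, as here, keeps an A/B test from silently falling through to an undefined model.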

Excellent. Just reading through for the next good one for Talend. There’s one asking whether RapidMiner integrates with Talend Data Fabric solution components, and whether any particular RapidMiner offering, like Auto ML, is compatible with Talend. So a little bit for both of us, I think, BP. So yes: what BP showed you was how we’re integrating the components into the Talend Data Fabric and being able to run the models once they’re built, and get the outputs from those within a Talend job in Talend Studio. The next step might be figuring out how we can integrate with the automated machine learning piece of it. Maybe, BP?

Absolutely. I think this is part of a journey that we’ve just started on. Deployment was one of the biggest pain points, as Jeff highlighted, so we wanted to start there. But as the two platforms help across the data spectrum in an organization, there are many, many opportunities for us to work together. You saw today how we can use the API management from Talend; the data catalog and other offerings from Talend are something that we’ll be looking into next year. Absolutely. With the open-source background and the power in terms of users and use cases, I think we can jointly bring a lot of good solutions to the table here.


Hi. Another quick question: can RapidMiner be integrated with other data visualization tools, like Power BI, to represent data and use their own visualizations? The short answer is yes. We can work not just with Power BI, but with Tableau, Qlik, and a bunch of other platforms. I think the answer is the same for our Talend counterparts. So both of these platforms will work with visualization platforms, but we also each have an in-built visualization layer, if that is what you wish to work with. Visualization is a key part of the whole data story, and we work with all the major players out there.

All right. Next question: can Talend access data via source APIs, or is it limited to existing libraries and connectors? We’re very extensible; that’s where our open-source heritage and background come in at Talend. It’s very easy, especially with APIs, to call an API within a Talend process and retrieve the data in many different formats, whether it’s JSON, Avro, Parquet, whatever format the data is coming back from the API in. We also have a very extensible library and toolkit to build components, if you want to build a custom component and have it work within the Talend platform, which is exactly how the RapidMiner component was built with BP. So again, very extensible open-source capabilities within Talend give you the ability to connect to things that we may not provide out of the box.

I guess I’ll take the next one here, which probably applies to both RapidMiner and Talend, though the question was focused on RapidMiner. The question is, what is the max size of data RapidMiner can handle, and what’s the best configuration to handle big data volumes? The short answer is that both RapidMiner and Talend, I think, have been designed from day one to solve enterprise needs. RapidMiner Studio, which is where you saw me using Auto Model today, leverages the computing power of not just your laptop; you can also offload things to the server, as you saw. And as things start going beyond the scale of a traditional server, the RapidMiner platform can work with your big data infrastructure as well. So basically, scaling is a factor of designing the right workflows and using the right infrastructure behind the scenes. But definitely, scaling is not a big concern; we have customers using RapidMiner on terabytes and petabytes of data on a very regular basis. I assume the answer is the same for Talend, but Mark, I’ll let you add to it.

Yeah, definitely. It’s a very similar concept. We can definitely process data at huge volumes and huge scales because we run natively in the environment that you’re running on. So if you do start getting into massive terabytes and petabytes and things like that, you definitely want to switch over to some type of Hadoop or Spark environment and do the processing there, where you can scale even further out. The process that you develop with Talend runs natively within the Spark cluster and doesn’t depend or rely on anything from Talend once you submit it to the YARN cluster or, for example, to Databricks, the serverless Spark cluster, which autoscales for you. So definitely, it’s a design pattern, and we definitely have customers going after huge amounts of data and processing it with Talend as well. So, very much the same answer.

Excellent. I think one more for me here before I give it back to Mark. Are all models available in RapidMiner compared to R or Python? The short answer is yes. RapidMiner is built with Java as a backend, and it’s a scalable platform. All in all, the platform has 200-plus algorithms and support for deep-learning frameworks like Keras, TensorFlow, and DL4J. But the beauty of the platform is, again, that we integrate very well with R, Python, and other languages of choice out there, so you are definitely not limited to the algorithms that we provide. You can expand the palette using R or Python as you wish. Hopefully that answers your question. Thank you.

All right. So, another question. Looks like it might be the last Talend one for now: can Talend work with on-premise environments in terms of performance, meaning on-premise networks, cloud, and Talend Cloud? We absolutely can work in on-premise environments, both with Talend Cloud and, if you don’t want to use the Talend Cloud environment or its control and management subscription, with our on-premise version. Our cloud environment also works in hybrid mode, where you have both Talend Cloud and on-premise: the engine that runs the processes, whether that’s on your Spark clusters or just in a Java-type processing framework, can run entirely inside your on-premise hardware and networks, or even in a cloud VPC environment, so that the data never leaves your environment and stays within your private networks. As for performance, just like the last comment about performance, it all depends on the workload and where you want to process it. If it’s a huge workload, you may want to think about using a Spark platform or a Hadoop platform for it. But it really comes down to what task you’re doing and how you want to process it.

Thanks for that. Okay. I think I’ll —

Mark and BP, you’ve got another question, BP?

I think this is the last one we can answer for now. Some of you have seen RapidMiner and have now seen Talend. The question is, won’t the two overlap on data preparation, and so on? What’s the difference, and how can you use the two tools? To quickly differentiate: Talend is obviously the enterprise workhorse for big data. As you noticed in today’s webinar, Mark was moving data from a transactional system and putting it into a destination that was useful for the machine learning. At some point, RapidMiner will be used for some of the data prep work, but that is the last-mile data prep work: I might be doing things like recoding, or setting up rules for handling missing values. Those are data preparations very specific to machine learning. So there’s a little bit of overlap there, but I think the key differentiator is that for your workhorse, heavy ETL workloads, Talend is best suited, while RapidMiner is more for the downstream, model-focused data prep. And obviously, we are great at building models and making them available, and Talend will be the workhorse putting them back into your applications.
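As an illustration of the last-mile prep BP mentions, here is a plain-Python sketch of two such rules: recoding a categorical field and imputing missing numeric values. The field names, the code mapping, and the mean-imputation rule are hypothetical examples; in the webinar, this kind of prep happened inside RapidMiner rather than in hand-written code.

```python
# Hypothetical customer rows with a categorical field and a missing value.
rows = [
    {"contract": "month-to-month", "monthly_charges": 79.5},
    {"contract": "two-year", "monthly_charges": None},
    {"contract": None, "monthly_charges": 55.0},
]

# Recoding rule: map contract types to integer codes; None becomes "unknown".
codes = {"month-to-month": 0, "one-year": 1, "two-year": 2, "unknown": -1}

# Missing-value rule: impute charges with the mean of the observed values.
observed = [r["monthly_charges"] for r in rows if r["monthly_charges"] is not None]
mean_charge = sum(observed) / len(observed)

for r in rows:
    r["contract"] = codes[r["contract"] or "unknown"]
    if r["monthly_charges"] is None:
        r["monthly_charges"] = mean_charge

print([r["monthly_charges"] for r in rows])  # [79.5, 67.25, 55.0]
```

The heavy lifting of extracting and landing the raw data would still belong in the upstream ETL tool; these model-specific rules are what sit just before training or scoring.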

Yeah, definitely. And as BP says, we do have a data preparation tool, as well as data stewardship and data governance with a catalog. So really, our platform, our Data Fabric as we call it, gives customers the ability to have a full lifecycle of data governance. That includes data preparation: giving users who know the data, but aren’t as technical, the ability to go into a tool such as our data preparation tool and modify, format, and fix data the way they need to, and then integrate that back into the pipeline I showed you. So we can take our full data preparation tools, take what we call recipes that a business user may write on the data, embed that into a full pipeline, and automate it, so that every time the data gets pushed out to, say, RapidMiner, it has those preparations done on the data the way the business user needs them, going into the machine-learning process for the data scientist. But it is part of the overall platform, and of the integration, speed, and trust of your data within your enterprise with data management and Talend. All right.

Mark and BP, are there more questions to cover? Are we finishing up here?

I think we’ve covered most of them, from what I can see.


I don’t know. BP, is there–

All right. There are a few questions which obviously we cannot answer in a public forum like this, but again, we’ll get back to you on that. But I think we have covered the questions that we could cover in a public forum here.

All right. Right.

Right. Well, thank you, guys. If we did not answer your questions today, we’ll endeavor to do that via email in the next several business days. But again, thank you for your questions. First, in wrapping up, I’d like to thank my co-presenters, Bhupendra and Mark, especially Mark from Talend.

Thank you very much for your support today. I also want to thank everyone at Talend and RapidMiner who made today’s event possible. And finally, I want to again give a shout-out to the Talend and RapidMiner communities who joined us today, as well as the newcomers. Thanks for your interest; we look forward to hearing from you, and we look forward to bringing more innovations to market together as partners. There’s more to come. Thanks again for attending today’s webinar: Operationalizing AI with RapidMiner and Talend. We’re going to adjourn now. Have a great day.
