Businesses today rely heavily on REST APIs to create and enrich their data sets, and to improve text mining model performance. Yet working with REST APIs in a data science workflow can be cumbersome and challenging. Plus creating topics that best describe natural text from chats or elsewhere adds insight, but with more complexity.
In this webinar, we explore how to enrich and analyze RapidMiner’s own website chat conversations in RapidMiner Studio by:
- Retrieving chat conversation and message data using the Drift API
- Analyzing chat message data using two different text mining techniques (TF-IDF word vectors and LDA topic creation)
- Deploying predictive NLP models on RapidMiner AI Hub (formerly RapidMiner Server) via an on-demand webservice
If you’d like to access the webinar processes please feel free to download them here.
Hi, my name is Scott Genzer from RapidMiner, and today we’re going to do a webinar on consuming REST APIs and text mining. This is a very important topic these days. REST APIs are being used more and more, both to enrich our data sets and as part of our pre-processing stages, so that some of the calculations we would normally do on our own machine or on-premise are done on external web servers. Text mining is obviously very popular these days, and I’d like to show you some of the techniques that I use in my day-to-day work; I hope you find them useful. Today, the use case is online chat. Online chat is really one of the most exciting things going on in the web space, and the analysis of these conversations is both very challenging and very interesting. You can glean a lot of information from these online chat sessions. What we’re going to do is take RapidMiner’s own online chat data and analyze it in this webinar.
To go through the agenda quickly: we’re going to overview text mining in general and how APIs work. We’re going to retrieve online chat data from an API, from the service that we use at RapidMiner for our chat bot. That company is Drift, one of the most popular chat companies out there. I’m going to show you how we get the credentials and how we start pinging those servers to get the information we need in order to do some text mining. Then I’m going to show you some text mining of these chat conversations. I want to show you a new operator, the LDA operator, which creates topics almost organically, as opposed to the traditional way where you categorize the topics yourself, and I’m going to show you both ways to do that.
And lastly, I’m going to show you how you can take these models and bring them right into operationalization on a RapidMiner server. So that’s today’s agenda. There are very few slides. I’m going to get right to the demonstration part because I find that most useful. Just some housekeeping before I get started. If you want to run these processes yourself there will be a zip file containing all the processes and the data sets for you to use in practice.
There are some authentication tokens that you’re going to see on my screen. Don’t worry. These tokens are going to be deactivated immediately after this webinar, so obviously you won’t be able to use the tokens you see. If you’re wondering about security because you’re seeing tokens, don’t worry: those tokens are live right now, but they’re going to be deactivated almost immediately.
Just a little bit about online chat. Online chat, again, is really becoming the go-to way for companies to interact with their customers: customer support, product sales, churn, and this kind of thing. What’s most interesting if you look at this survey here is that the most common use case is getting a quick answer in an emergency, and the second most popular is resolving a complaint or problem. From RapidMiner’s point of view, as a software company, we see the exact same thing. People go to rapidminer.com, they see the chat, they immediately go on, and most of the time the question is about a problem. They can’t install the software, or they don’t understand a piece of data science, or they got stuck somewhere. So being able to very quickly identify what the problem is, and then give the person the best solution, is a challenge unless of course you have a very highly trained human being on the other side. We don’t necessarily use what I’m about to show you as a truly autonomous thing; what we do use it for is to help the person on the other side of the chat identify what the issues are. The way we deployed this is on the back end of a Drift chat, where the person on the chat can quickly run this data mining process to better ascertain what exactly the issue is, even if that person is not highly trained in data science. So that’s the use case here. Again, we’re going to pull actual Drift conversations that people have had on rapidminer.com.
We’re going to do some text mining and hopefully generate topics in order to address the items here: resolving a complaint or a problem, and getting a quick answer in an emergency. This is what we’re going to try to do. From this point on, we’re really going to go into demonstration mode. I’m going to show you these items here: retrieving data from REST APIs, text mining, categorization, and then operationalization using RapidMiner Server.
I’m just going to give you a quick moment: if you want to do this with me, you would do it in RapidMiner Studio. I’m using RapidMiner Studio version 8.2. You will need two extensions to run these processes yourself. Actually, sorry, you will need three extensions. You will need the text processing extension. You will need the web mining extension. And to run this new LDA operator, you will need the Operator Toolbox extension. So make sure you have all three extensions installed by going to the marketplace; I’m going to show that to you in a minute. Make sure that you have text processing, web mining, and Operator Toolbox.
Let’s go right into RapidMiner Studio. I’m just going to open a new process here so you can orient yourself. This webinar is not intended for those who are new to RapidMiner Studio. If you are new and are just getting started, I would really encourage you to look at the training videos on how to use RapidMiner Studio. There’s a lot to this software. This is going to be specifically focused on people who are already familiar with RapidMiner Studio and want to know how to consume REST APIs and how to do some text mining using both traditional techniques and new techniques. Just a quick overview here. You’ll see in my repository I have some pre-built processes, I have a couple of models, and I have some data sets. All three of these folders will be available as one complete zip file when I’m finished, so if you simply create a new repository and load them into that folder, you’ll be able to run everything that I’m showing here, with the understanding that you will need your own access tokens, and I’ll show you a little bit about that. And again, just a reminder: you will need the extensions in order to run these processes. The most important one is, of course, the text processing extension. That’s this one right here.
You can see mine is up to date. The most recent version is 8.1. You will need this; it is the most important extension for doing text mining. The other extension is also here.
This is the Operator Toolbox. A lot of these operators are not used for text mining, but there are some that are very, very useful. In particular, if you scroll down to the bottom here, you will see that there are these text processing operators. They’re not in the text processing extension because they’re still being worked out, but the one I’m going to use today is this first one here, “extract topics from document (LDA)”. It is probably the most powerful one I have seen in RapidMiner in a long time. Later on, when I move to operationalization, I’m going to use the apply model operator here to push that to the server. And then again, you will need the web mining extension, the one here. Just some notes about this web mining extension: it has not been updated in quite some time. You can see it has only been updated up to version 7.3, not since 2016. I’m going to talk about some of the tools in here and how you get around some issues that have come up since 2016.
So I’d like to get started. Before I get going with RapidMiner, I really want to show you the beginning, which is: how do you find out where you’re going to get the information for REST APIs, and how do you get that information into RapidMiner? I think the first logical step is actually to go to your browser and go into some kind of API documentation section. Almost all APIs that I have ever found, if not all of them, have fairly good documentation. That documentation really consists of two pieces. There is always an authentication section, where they tell you how to authenticate yourself with the server so they know who you are. The most common way to authenticate these days is OAuth 2. It is the standard that people are using, and it involves a series of handshakes between you and the server. The end result of OAuth 2 is one of these tokens.
And this is very, very, very important: you will get an access token or authentication token. Most of the time these are called bearer tokens. This is not a webinar about API security and OAuth 2; there’s plenty of information about that on the internet. But the important point is that you will most likely need a token. It’s like a password, really, that you need in order to talk to this external API. Almost all of the time these tokens expire and you will need to refresh them. It’s part of the security. Sometimes they’re one-time use. Sometimes you can use them permanently. That really depends on the application you’re using. You can see here that this one expires at some point. I don’t know if these are seconds or minutes or hours or something. You will normally put these things in a header in your GET request. And again, this is not a webinar on how to do GET and so on. I’m hoping that if you’re watching this you know a little bit about API requests; if you don’t, simply go on the internet, there’s lots of information about GET and API requests.
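As a small illustration of token expiry, here is a hedged Python sketch. It assumes the token response follows the OAuth 2.0 convention of an `expires_in` field expressed in seconds; the field names, the placeholder token, and the helper functions `token_expiry` and `needs_refresh` are illustrative and are not taken from the Drift documentation, which should be checked for the actual unit.

```python
from datetime import datetime, timedelta, timezone


def token_expiry(token_response: dict, issued_at: datetime) -> datetime:
    """Return when the token stops being valid (assumes expires_in is in seconds)."""
    return issued_at + timedelta(seconds=int(token_response["expires_in"]))


def needs_refresh(token_response: dict, issued_at: datetime, now: datetime,
                  margin_seconds: int = 60) -> bool:
    """True once we are within `margin_seconds` of the expiry time."""
    return now >= token_expiry(token_response, issued_at) - timedelta(seconds=margin_seconds)


# Hypothetical token response; every value here is a placeholder.
sample = {"access_token": "PLACEHOLDER_TOKEN", "token_type": "bearer", "expires_in": 7200}
issued = datetime(2018, 6, 1, 12, 0, tzinfo=timezone.utc)

print(token_expiry(sample, issued).isoformat())
print(needs_refresh(sample, issued, issued + timedelta(minutes=5)))
```

Checking for expiry a little early (the `margin_seconds` argument) is one simple way to avoid sending a request with a token that expires in flight.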
But the important piece here is that you’re going to need a token. That’s really the most important thing. You’re going to need an OAuth 2 token or some other kind of authentication token. I’m going to show you how to do this in Drift. Like I said, we use Drift here at RapidMiner for our own chat bot. We really like it. And Drift, like many, many companies, has a back end, a developer side, which allows us to consume REST APIs. So here’s the token. In Drift, these are called applications. I created an application called my app, and here is the token I’m going to use. And again, don’t get too excited that all of a sudden you have the authentication to get into RapidMiner’s data. Right after this webinar, I’m going to remove it from Drift, and it will be rendered completely useless. It’s usually a long string of characters like this. Sometimes you can get them directly from the website. Sometimes you need to create some other processes to get the token. It really depends on the API. But the most important point, before you start anything, is this: if you’re going to do API work, you’re going to need some way to authenticate with the server, assuming authentication is needed. There are some publicly available APIs that don’t require authentication.
I’m going to show you another example besides Drift, just to show you the variety. I think the best example is Google Cloud. So this is Google Cloud. This is my account here. And again, don’t get too excited; I’m going to delete these credentials very shortly. Google Cloud works in the same way. If you’re interested in where I am here, I’m in APIs and Services, in the credentials section, and you can see this is a key that I created a long time ago. Again, you can see the long string of characters. You would simply copy this API key and then use it in RapidMiner. Sometimes you need to use what are called client IDs. It really depends on what you’re doing. In Google Cloud, they actually give you a lot of nice descriptions. So, for example, say that instead of the Drift API you’re using the Google Maps Distance Matrix API on Google Cloud. Just like I explained, Google Cloud has very good documentation. They explain how to get an API key. They go through this whole process here. You add the API key and then off you go. Just read the documentation.
Once you start using these APIs, you’ll see they’re all very similar. And then once you do that, you can go and read the instructions here. There are two other pieces that you need to get from these websites besides the authentication key or token. I’m going to show them to you on Google Cloud, then I’m going to go back to Drift and show them to you on Drift, and then I’m going to take that information and put it right into RapidMiner. When you do a query to a REST API, it works very much like a URL in a browser. As a matter of fact, you can do queries right in your browser if you wish. It’s exactly the same idea. What you do is basically send it a URL: where am I going to go? And then you give it some information, some parameters. If you’re doing a simple GET request, as it’s called, you often put the parameters right in the URL, or you can put them in a header, as they’re sometimes called. You should make sure you’re using APIs that use HTTPS, obviously, so it’s secure back and forth. And you send out this request: “I want to know this. I want to know the directions from point A to point B. I want to know all the conversations in Drift.” It depends on what the information is. You send out a request, or a query as they’re usually called, and then you get an answer back. This answer usually comes in the form of an array of information, which is often in JSON or XML format. XML is sort of an older format. I would say most of the time now people are using JSON. There are reasons for that; it’s not worth getting into here. Again, HTTPS is extremely important. Google does a particularly good job, I would say, of explaining how to consume their APIs. They go through step by step how to fill in these parameters in the URL right here, and they’re talking about how to ask for directions through the API as opposed to doing it on their website.
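To make the idea of "parameters right in the URL" concrete, here is a minimal Python sketch that only builds the query URL and does not send anything. The endpoint is written from memory of the Google Maps Distance Matrix documentation and should be treated as an assumption, as should the parameter names; the key is a placeholder and `build_get_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, urlsplit


def build_get_url(base_url: str, params: dict) -> str:
    """Append URL-encoded key/value pairs to a base URL as a query string."""
    return f"{base_url}?{urlencode(params)}"


url = build_get_url(
    "https://maps.googleapis.com/maps/api/distancematrix/json",  # assumed endpoint
    {
        "origins": "Toledo, Spain",
        "destinations": "Madrid, Spain",
        "key": "YOUR_API_KEY",  # placeholder, not a real key
    },
)

print(url)
print(urlsplit(url).scheme)  # confirms the request would go over HTTPS
```

Pasting a URL like this into a browser (with a real key) is the "query right in your browser" case mentioned above; `urlencode` takes care of escaping spaces and commas.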
The other part that’s important in all of these API websites, after you scroll down past all the different ways to query the server, is how to interpret the JSON results. If you scroll down, you’ll see a sample. So here’s the query: I’m going to query how to get from Toledo to Madrid and so on, and then there’s some kind of answer. This is a JSON response. How do I know it’s a JSON response? It starts with curly brackets. JSON is always a curly bracket, then a quoted key, a colon, a value, commas between the pairs, nesting, and so on. I want to show you a more interesting one. This is another JSON response. You can see here it starts with a curly bracket. This is very common.
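The structure described here (curly brackets for objects, square brackets for lists, nesting) can be seen by parsing a small response in Python. The JSON text below is made up purely for illustration and is not an actual Google or Drift response.

```python
import json

# Made-up response text, compact like most servers send it.
raw = '{"status":"OK","data":[{"id":1,"tags":["a","b"]},{"id":2,"tags":[]}]}'

parsed = json.loads(raw)

# Curly brackets become a dict, square brackets become a list.
print(type(parsed).__name__)
print(type(parsed["data"]).__name__)

# Pretty printing: the same content, indented so the nesting is visible.
print(json.dumps(parsed, indent=2))
```

Pretty printing changes only whitespace, so the indented text parses back to exactly the same structure.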
You will get a response saying, yes. Sometimes this will be a 400 response or a 300 response. Sometimes it will just say, okay. And then you will see a nested response. Sometimes you will see square brackets as well. Sometimes the JSON consists purely of curly brackets. This is the way that almost all APIs that I know will give you an answer back, in the form of an array that looks like this. I’m going to show this to you now on Drift. It looks very, very similar. So once again I’m going to go into Drift, to the conversations API part, which is the part that I’m going to show in this demonstration. The query for conversations is very, very easy. There are no parameters. You simply send a query to this URL, https://driftapi.com/conversations, and the response is the list of conversations that you have had with your customers. Very nice, very clean. I’m going to show this to you in RapidMiner in a minute, but I want to show you here first. This is an example of what you’re going to get back from Drift. Again, notice that it’s in JSON format, right here. It’s not always this easy to read, so I’m going to show you a tool in a minute to help you see it. This is what’s called pretty printed. It looks very, very nice. It doesn’t always look this way. I use this particular online tool myself, the JSON viewer at jsonviewer.stack.hu, and it allows me to pretty print JSON responses. I’ll show this to you in a minute.
Okay, let’s get started with this. I’m now going to go back to RapidMiner. I just want to show you that I’ve already installed the web mining extension, and you can see all the tools for the web mining extension right here in the lower left-hand corner. You can also see the text processing extension already installed. If you don’t see these operators, please make sure you install the extensions before you move forward.
In particular, I’m going to be using operators like create document, read document, process documents, and so on. Lastly, I want to make sure you have the Operator Toolbox; in particular, if you scroll down to the bottom here, these are the operators we’re going to use, the LDA operator and the apply model operator from the Operator Toolbox. I’d like to load my first process, if you’re following along with me. This is process number 001, “Drift API, get conversations”. I’m going to walk you through this process step by step. I think the first thing that is most useful is to store the authentication tokens that you get from the website as a macro in RapidMiner. I find this to be the most helpful because, again, these tokens change. They expire. You’re going to want to be able to swap them in and out easily. So I simply created a macro using the set macro operator and pasted the authentication token in. I’m just calling it Auth token. And this is very important: make sure that’s the first operator that runs in this process. You’ll notice it’s the first one. Obviously, if it isn’t the first one, your GET request will try to find that macro and won’t find it. I did set another macro here, and that’s for testing purposes: getting all the messages takes a long time, and I don’t want to show that to you in a demo. So I’m choosing a random conversation ID. That’s what that macro is. It’s nothing more than storing a particular number for me. I’m not going to go into these subprocesses.
I’m now going to go through step by step how I query Drift, ask it for a list of all the conversations, get that list back, and then understand it. I have three different methods that I normally use to consume REST APIs. The first one is the Get Page operator. To be honest, in this current day, with this current version of RapidMiner, this is now my go-to way of doing it. Get Page is actually not a REST API tool; it simply queries the internet. That’s all it’s doing. It’s just as if you took this URL and put it in your browser. However, it has some additional tools which I find very useful. In particular, if you have the advanced parameters setting on, you will see it has the ability to accept cookies. It will show you the request method. I’m not going to talk about POST requests in this webinar; that’s a whole other level. And it allows you to add request properties, like a header, which I’m going to show you in a minute. This is my number one tool now for doing API queries in RapidMiner. When I get the answer back, I’m going to use an operator from the text processing extension, as opposed to the web mining extension where Get Page lives. In the text processing extension there is JSON to Data. That takes the JSON array response, which RapidMiner thinks is a document, and converts it to a proper example set. Once I do that, I’m going to do a whole lot of ETL to clean that all up, and then I’m going to port this out.
So I want to show that to you right now. Again, step by step. I’m using the Get Page operator. Here’s my URL. I’m going to leave these all alone. I tend to turn on all of these things, and I’ll show you why in a minute. This is a GET request. I have no query parameters for this, but if I did, as I showed you with Google Maps, I would enter them here. Key and value, key and value; it’s very easy.
And then here in the request properties, I don’t know if you remember, but the Drift documentation showed a property they call Authorization. The value is going to be the word “Bearer” followed by this Auth token. This is standard OAuth 2 authentication. Sometimes APIs will call this authentication as opposed to authorization. Sometimes it will take other forms. You need to read the documentation very, very carefully. That’s it: the URL and the authentication, and that is it. I’m going to put a breakpoint right here just so you can see what it looks like as it comes out of the server directly.
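Outside RapidMiner, the same request (a URL plus an Authorization request property) can be sketched in Python with the standard library. This is only an illustration under assumptions: the URL is the conversations endpoint as spoken in the talk, the token is a placeholder, and `build_conversations_request` is a hypothetical helper. The request is built but not sent.

```python
import urllib.request


def build_conversations_request(auth_token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request carrying a bearer token header."""
    return urllib.request.Request(
        "https://driftapi.com/conversations",
        headers={"Authorization": f"Bearer {auth_token}"},
        method="GET",
    )


request = build_conversations_request("PLACEHOLDER_TOKEN")
print(request.get_method(), request.full_url)
print(request.get_header("Authorization"))

# To actually send it (needs a valid token and network access):
# with urllib.request.urlopen(request) as response:
#     body = response.read().decode("utf-8")
```

Keeping the token as a function argument plays the same role as the macro in the process: when the token expires, only one value has to be swapped.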
So all I’m going to do is simply run this. I’m going to run Get Page three. You can see the response comes back very quickly, and you’ll notice it is a JSON array. You can see it starts with a curly bracket, and it’s long and very, very clumsy. If you remember, a moment ago I said not all JSON arrays look nice and pretty. If I select this and copy it, go back to my browser, go to jsonviewer.stack.hu, and paste it in, that already starts to look better. And then it has a really nice button here called Format. You just press this Format button. That looks very nice. I’ll just move this over a little bit here. You can see it has now pretty printed this array, and we can see very clearly that this is an array that contains data. This is very standard. Each conversation that I got from Drift has a status, a contact, a created time, an ID number, and so on. Very, very helpful.
Okay, once I do that, I’m going to convert this to an example set. I’m going to put another breakpoint here so you can see it step by step. You can see that what this does is simply unfold the JSON array into a RapidMiner example set. However, it does it horizontally. This is not always the most useful way to see it, but this is the way it unfolds. The reason is that if you had multiple queries, you’d have multiple rows, and so on. So it is highly likely you’re going to want to do some ETL. The first thing you’re most likely going to want to do, and I’ll put a breakpoint here to show you, is transpose this example set from a horizontal example set to a vertical one. You can see that here. This is much easier to deal with. You’ll see that it does some strange things; for example, the timestamp is in exponential scientific notation, all this stuff.
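The horizontal unfolding and the reshaping into one row per conversation can be imitated in a few lines of Python. This is a rough sketch of the idea, not a description of how the JSON to Data operator is implemented: the path-style column names, the `flatten` helper, and the sample field names and values are all assumptions made for illustration.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into a single {path: scalar} mapping."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten(child, path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}[{index}]"))
    else:
        flat[prefix] = value
    return flat


# Made-up response with hypothetical field names and values.
response = {
    "data": [
        {"id": 101, "status": "closed", "contactId": 5001, "createdAt": 1528000000000},
        {"id": 102, "status": "closed", "contactId": 5002, "createdAt": 1528000600000},
    ]
}

wide = flatten(response)                          # one very wide row
tall = [flatten(item) for item in response["data"]]  # one row per conversation

print(sorted(wide))
print(tall)
```

Here the tall form is produced by flattening each conversation separately rather than by a literal transpose, which is usually the more convenient route when the response is a list of similar records.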
And then you’re going to want to go through a whole bunch of small ETL steps to process this information; I end up doing this every single time. I’m going to stop this for a minute. I’m going to remove these breakpoints. And I’m going to ignore those Facebook posts, and I’m going to show you two other things here. These are two other methods for consuming REST APIs. I’m not going to run them; I’m just going to show you what they look like. The second method starts with generate data by user specification, which is actually just empty. You might say, “Why on earth would you do this?” The answer is that this operator here, the Enrich Data by Webservice operator, requires an input. You cannot run it without one. If I deactivate this, you’ll see you get a red highlight. The way around that is to simply put a blank generate data operator in front of it, and then it’s very happy. The way to use this operator is very similar to Get Page. You put the URL right here. You select GET here. And you put your request properties here, just like before. The great thing about this operator is that you can parse the JSON directly. You don’t have to use the JSON to Data operator. You can simply enter JSON queries; for example, if I want to name this attribute status and I know how to query it, I would simply write the JSON query like that. That’s the good news. The bad news is that this does not work for this particular Drift API, and for many APIs right now it will not work. Why? Security. It does not comply with OAuth 2. It doesn’t have the proper handshaking, and so it will fail. It does work for older APIs or APIs that don’t have any security protocols. Another way to do it is simply to shell script it. I do this every once in a while. Sometimes it’s just easier.
So what do you do? You would simply do a curl request. You put in your curl request here, then use the read document operator, then take that document out and feed it into JSON to Data and the ETL, just like before. So again, what you would do is break this, move these two down here, connect this up, and keep going. I’m not going to do that. I’m going to get those out of the way, select my ETL, and do the rest of this. Just a reminder of where we are in the agenda: I’ve now talked about retrieving online chat conversations from REST APIs. That’s what I’ve done so far. So, just to run this process: the only thing left is that I’m going to store this conversation list somewhere else. It doesn’t really matter where. But I want to run this now very briefly. And there it is. These are all the conversations in a nice clean list, all nicely cleaned up. You can see these are closed conversations: the contact, the conversation, and the timestamp. Let me get rid of this transpose breakpoint that I put in. So then what? Well, if you’re using a chat bot API, all this is giving you is the conversation ID. It doesn’t give you the text. The reason is that the text usually requires a second query. Once you have these IDs, you loop through them one by one and get the conversation, the messages as they call them in Drift, one by one from Drift, and it can take a while depending on how much data you have.
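The loop over conversation IDs can be sketched in Python as follows. This is a hedged illustration, not the RapidMiner process itself: the URL pattern follows the messages endpoint as spoken in the talk, `fetch_all_messages` is a hypothetical helper, and the actual HTTP call is passed in as a function so the example runs without a token or network access. A short random pause between requests is included as a courtesy to the server; the 500 to 1500 millisecond range is an arbitrary choice.

```python
import random
import time


def fetch_all_messages(conversation_ids, fetch, min_delay_ms=500, max_delay_ms=1500,
                       sleep=time.sleep):
    """Call `fetch(url)` once per conversation ID, pausing between requests."""
    messages = {}
    for index, conversation_id in enumerate(conversation_ids):
        if index > 0:
            # Throttle ourselves: wait a random amount of time before the next query.
            sleep(random.uniform(min_delay_ms, max_delay_ms) / 1000.0)
        url = f"https://driftapi.com/conversations/{conversation_id}/messages"
        messages[conversation_id] = fetch(url)
    return messages


def fake_fetch(url):
    """Stand-in for a real HTTP call; returns an empty message list."""
    return {"url": url, "messages": []}


# Dry run with made-up IDs and no real waiting.
result = fetch_all_messages([101, 102, 103], fake_fetch, sleep=lambda seconds: None)
print(list(result))
```

For a real run, `fake_fetch` would be replaced by a function that sends the authenticated GET request and parses the JSON body, and the default `time.sleep` would be left in place.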
So the next thing I’m going to do is deactivate this and activate this operator, which is Get all the messages. This is where I’m going to use this macro. I’m only going to do this with one conversation ID. You can see here it’s the same idea. I’m going to use the Get Page operator. I’m going to use this conversation ID macro right here, and I’m using a slightly different URL from Drift: it’s the Drift API, slash conversations, slash the conversation ID, slash messages. Everything else is exactly the same. I’m going to do some ETL work in here. You can see the JSON to Data operator. I want to spend a very quick moment talking about delays. When you’re doing API queries, you’re asking a question of a server somewhere on the internet and then getting a response. You can imagine that if you have 500,000 queries to make and you send them all at once, that server may not be very happy with you. As a matter of fact, it could consider it a denial-of-service attack, and you may not want to do that to that server. Personally, I just think it’s not courteous. You’re not being respectful of their servers. You’re trying to consume way too much information. You’re trying to hog that server from other people. So unless you have a special arrangement with that API, I would highly advise you to put a delay in each one. These are in milliseconds. Why do I do random delays? Just because the option is there. You can do a fixed amount or whatever. But I would recommend you do a delay of some kind. You purposely throttle yourself down. A good API server will actually lock you out if you try to do too many queries in too short a period of time. Just a quick side note. One of the most common questions I get on the RapidMiner community is from people using the Twitter operator and saying it doesn’t work. Well, the Twitter operator does work.
It’s just that Twitter limits the number of API queries that you can do in any given day and in any given hour. There’s a certain amount per hour and a certain amount per day. If you just run a random process, you’re going to run out of queries very quickly, and Twitter, rightly so, is going to lock you down. So as a matter of course I put delays in my API processes. Now, this doesn’t matter here because I’m only going to do one. But I’m going to run this process, and now I’m going to get one conversation, as it says. The most important part is really the body. You can see here, “Hi there.” You can see this conversation is between RapidMiner and somebody else. You can see it right here: user is us, RapidMiner, and the contact is the person. “Hello. I would like to inform myself about web crawling with RapidMiner. Good idea.” And then there’s a series of questions and so on and so forth. This is very typical. We know a little bit about who that person is. We have timestamps. But the most important piece we want to look at here is not actually who this is; we want to start looking at this information. We want to take, for example, these words “web crawling”, and we want to be able to automate that, so that if a person manning the chat bot on Drift gets that and doesn’t know anything about web crawling, he or she will be able to help that user better than by sitting there trying to look through documentation and so on.
So it’s almost an augmented reality kind of thing. Going back to here, that’s the general idea. Now, I’ve done all that work prior to this webinar, just because it saves a lot of time. If I go now to my data, I’m going to go to Drift messages final, and you can see a large number of Drift messages. There are about 2,700 of them that I’ve loaded. They’re all here, and it just makes it easier for us to conduct this thing. The rest of the information over on the right we’re just going to strip off. Okay. I’m going to move now to the next thing. Once we have these conversations, we loop through them as I told you, get the messages, and create this master set. That’s what I showed you a moment ago, and this is how I created this Drift messages final.
And now I come to the analysis part. Again, just to make sure you’re clear on the agenda for today: I am now on this next part here, which is text mining of online chat conversations. We have the data. We have the messages. We looped through. We now have 2,700 different messages. I want to do some text mining. I’m going to go back to RapidMiner. What I’ve done prior to this webinar, in addition to collecting those messages, is create a training set, and I would like to show that to you. So here’s the training set. Here’s the conversation ID. And here’s the body. I have cleaned this up in a few ways. One of the first things you will notice is that there are no RapidMiner responses here, only the ones from the contact. I have stripped all of the others out. And they are responses from the contact at a certain point in the script where I think they’re going to be most useful. You may want to do it another way. You may want to concatenate all of these messages together, and I’ve tried those techniques.
It depends on what you’re doing. And then this part here is the class that I’m eventually going to want to predict and give back to that chat bot person on our side. These classifications were done the old-fashioned way: we look at each message and manually label it to build a training set. I’ll show you with the LDA operator how you don’t have to do that. But in traditional text mining, you’re going to need a set of classifications. So here they are. I have simply four classes. And how did I arrive at those classes? I will tell you very simply. I went to the founder of RapidMiner, Ingo, and I asked Ingo, “How do you create these classes?” And he said, “Scott, it’s very simple. You take all those conversations, you print them out on paper, you take scissors, you cut out the little strips for each one of these lines, and you start placing them in piles.” And he said, “If you do that long enough, the categories will emerge.” I thought that was rather surprising coming from the founder of RapidMiner, the father of data science in our world. But sure enough, he was absolutely right. After cutting out little strips of paper, I quickly identified four classes. One of them I simply called No, which was my way of saying, “This is not a useful conversation. This person doesn’t really need anything.” Some of them want information, general information on RapidMiner. Some of them want to learn RapidMiner, and some of them are having trouble either downloading or getting a license. These are common software queries. And I created only four of them. This is a relatively small example, but you’ll see it’s enough to start doing predictive analytics. So this is the data set that you’re going to see. Once I do that, I’m going to do some other cleanup. Most importantly, I’m going to do some replacements. Then I’m going to convert these to documents one by one. So I’m just going to put a breakpoint here.
I want to show that to you. You can see that, as opposed to an example set, I now have a collection of documents, and you can see them one by one.
Each line is now a document. And that’s very important because in text mining, using the Text Processing extension and the text processing operators in the Operator Toolbox, you are usually going to want to deal with a collection of documents, not an example set. You will see that indicated in RapidMiner by the brown double line. The double lines in RapidMiner mean you have a collection. The brown color means you have documents. Now I can do the text mining. This operator here, the Process Documents operator, is probably the most important one in the Text Processing extension. I’m just going to show this to you down in the operator tree, so you know exactly where to find it. This is the Process Documents operator. I could have actually skipped some steps if I had wanted to. The Process Documents from Data operator is very, very similar to this one. The only difference is that Process Documents takes a collection of documents, while Process Documents from Data takes an example set. So you may wonder, “Scott, why didn’t you just simply connect this with this?” First of all, because I’m going to be using the LDA operator later, which requires a collection of documents. And second of all, I wanted to do this ETL work first, and then it’s easy enough to simply convert the data to documents. You can even do this from files. You can even do this from a mail store. These operators are all virtually the same. The only thing that changes is the input. I’m going to use the most standard one, Process Documents. Process Documents is a nested operator, as you can see from the double lines here. It takes your documents and applies whatever text processing operators you want to those documents, one at a time.
The output creates a word vector. We’re going to use TF-IDF. There are some other word vectors you can use. If you want to know more about TF-IDF, you can go to the RapidMiner community. I created a knowledge base article not that long ago with a very detailed explanation of what TF-IDF is and what these word vectors are. I’m not going to do that here. Inside this operator, I’m going to use a series of processing tools to take this text and put it into the form that I think is going to be most useful for creating these word vectors. You’re going to find these operators here in the transformation panel in the Text Processing extension. The first one, which I think is extremely important, is to convert all of these words to one case, uppercase or lowercase. Otherwise tokens like “And” with a capital A and “and” in all lowercase will be treated differently. You don’t want that. There’s no point. Next, I’m going to create tokens. These are segments of the text. The most common way to tokenize is what’s called non-letters, which basically means that wherever a word is followed by a space or a period or a semicolon, that’s where it will create the break. I then did a whole series of token filtering. I actually put these inside a subprocess here because I do like to filter a lot of these tokens, and I’m filtering in a lot of different ways. The first thing I’m going to do is filter stop words in the English language, because this is English text. This takes a fixed corpus of words and removes them from the documents. These are the traditional stop words. If you go down, you can read a little bit of documentation there: words like a, and, or, if. Very, very simple words that we use in English that don’t really have a whole lot of meaning. I’m also going to filter out tokens that are short. I do this purely from experience, because I’m looking for words like web mining.
And I’m trying to look for words that are a little bit more complex. So I set a minimum number of characters; you can play with this as you want. And then I filter three more times here.
I filter stop words that are used in HTML and fonts, sentiment analysis words, and miscellaneous words. These are text dictionaries that I created on my hard drive here. I’m going to show them to you a little bit later, but they’re nothing more than word lists, words that I want to remove. I want to pause for one quick moment to talk about sentiment. A lot of times, what you’re trying to do is sentiment analysis. So you might be wondering why I am removing sentiment. The answer is because I’m not looking for it. I’m looking for topic analysis. I don’t really care what mood the person is in when they write the query in the chat bot. I just want to know what they want to know. Whether they’re angry or positive or negative might be something for the marketing department, but it’s not going to be for customer support. It’s just not. So I actually want to remove those words from the chat. Obviously, if you’re doing sentiment analysis, you would not remove them. And then these are miscellaneous words that I found just by looking through and seeing that there were some words that shouldn’t be in there. Lastly, I am going to generate some n-grams because of terms like text mining and predictive analytics. There are some pairs or trios of words that tend to come in handy. So that’s all I’m going to do here. I’m going to port both the example set and the word list out and show you the result. And there it is. The first thing I want to show you is the word list. You can see here the word and the attribute name. You’ll see these are exactly the same. And then what’s actually quite important is the total number of occurrences and the document occurrences. These are usually very similar. You’ll notice the total occurrences are greater than or equal to the document occurrences. That means, for example, that here there were probably two documents that had the word analysis twice. It’s not that difficult.
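The steps described above (one case, non-letter tokenization, stop-word and length filtering, n-grams, and the two occurrence counts) can be sketched in plain Python. This is only an illustration of the idea, not what the RapidMiner operators do internally: the stop-word set is a tiny made-up sample, and the `min_chars` and `max_ngram` defaults and the underscore used to join n-grams are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption). A real run would use a full
# English list plus custom dictionaries like the ones described in the webinar.
STOP_WORDS = {"a", "an", "and", "or", "if", "the", "to", "in", "is", "do", "how"}

def preprocess(text, min_chars=3, max_ngram=2):
    """Lowercase, split on non-letters, filter tokens, then append n-grams."""
    tokens = re.split(r"[^a-z]+", text.lower())
    tokens = [t for t in tokens if len(t) >= min_chars and t not in STOP_WORDS]
    grams = list(tokens)
    for n in range(2, max_ngram + 1):
        grams += ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def occurrences(docs):
    """Return (total occurrences, document occurrences) for every token."""
    total, in_docs = Counter(), Counter()
    for doc in docs:
        tokens = preprocess(doc)
        total.update(tokens)         # counts every appearance
        in_docs.update(set(tokens))  # counts each document at most once
    return total, in_docs
```

Because a word is counted once per document in the second counter, total occurrences can never be smaller than document occurrences, which is the relationship pointed out in the word list.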
If I sort from the top down, you will see the words that are most common in these particular chat conversations: data, education, want, company, free (everybody’s looking for free), student, no, looking, and so on. These words make sense. Help, license. You’re going to want to tweak some of the settings in Process Documents until you get a word list that makes sense. I think that’s very important. I didn’t show it to you before, so I’m going to show it to you now. How do you do that tweaking? You do it by playing with these settings here, by pruning these word lists. If you turn pruning off completely, you’re going to get a massive number of words, every word that ever existed in this entire data set. It’s going to be too big.
Percentual is the way I normally do it: pruning below a certain percent and pruning above a certain percent. Below the lower percentage are the very, very rare words, and above the upper percentage are the very, very common words. You want the words that are in between these two, and there’s no magic to this. There are default settings here. I forget what the default is; it might be 1%. The truth of the matter is you’re going to want to play with these. I played with these for a while. I like this word list here. And here are the word vectors.
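As a rough sketch of what percentage pruning and TF-IDF word vectors amount to, assuming the textbook formula (term frequency times log of inverse document frequency, then length normalization) rather than whatever RapidMiner computes internally, and with made-up default thresholds:

```python
import math
from collections import Counter

def prune_vocabulary(token_docs, below_pct=1.0, above_pct=90.0):
    """Keep words whose document frequency, in percent, lies within the bounds."""
    n_docs = len(token_docs)
    doc_freq = Counter(w for doc in token_docs for w in set(doc))
    return sorted(w for w, df in doc_freq.items()
                  if below_pct <= 100.0 * df / n_docs <= above_pct)

def tfidf_vectors(token_docs, vocab):
    """One length-normalized TF-IDF vector per document, in vocab order.

    Assumes vocab was built from token_docs, so every word has df > 0.
    """
    n_docs = len(token_docs)
    doc_freq = Counter(w for doc in token_docs for w in set(doc))
    vectors = []
    for doc in token_docs:
        counts = Counter(doc)
        raw = [(counts[w] / len(doc)) * math.log(n_docs / doc_freq[w]) if doc else 0.0
               for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors.append([x / norm if norm else 0.0 for x in raw])
    return vectors
```

Tightening the two thresholds shrinks the word list and therefore the number of columns in the word vectors, which is the tweaking described above.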
Here’s the conversation ID and the original text, but tokenized. You can see these n-grams here. The class from before. And here are the word vectors. I have about 357 of these, which from my experience is a good number to start with. Those of you more familiar with text mining might wonder why I didn’t do any stemming, and the answer is I could if I wanted to. For example, analytics, analyze, analyst, and analysis could all be stemmed together using the stemming operators. I certainly could if I wanted to. And there are the results of text mining. I’m going to turn off this breakpoint. I’m just going to pause for a second because, again, I want to show you where we are on the agenda. That was the text mining of online chat conversations. I now want to show you the categorization of unstructured data using LDA. It’s a very similar technique, but it doesn’t require a training set. In some sense it gives you its own training set. So I’m going to go now to the next process, which is number four, and I’m going to show you a slightly different way to do this, but I think it’s really incredible. I’m going to retrieve the same example set. I’m going to do our ETL. I’m going to exclude our responses if they are there. I do want to do some text processing, but I don’t need to use the Process Documents operator. This is very, very important. I see this mistake all the time. You might think, “Oh, if I want to tokenize, filter stop words, and this kind of thing, I should be using the Process Documents operator.” No. And the reason is that Process Documents is used to create word vectors. I don’t want word vectors. I simply want to process these documents. It’s confusing because you put all of those things inside here and you would think that this is the only operator to use, but you don’t do it that way. If you want to use the LDA operator, simply use Loop Collection. It’s much simpler.
You simply take the collection of documents, put it in a Loop Collection operator, and then port that out to the LDA operator. In here is exactly the same technique you saw before: Transform Cases, Tokenize, some filtering, and so on. Once you do that, that’s all you need to do. You put it into this, I think, magical operator, Extract Topics from Document (LDA). This uses an algorithm to find topics based on the words in the documents, and it tries to organize them for you. And the amazing thing is that it doesn’t just give you one topic per document. It actually gives you a predicted probability of how likely it is for each document to be in each topic, which is incredible. I want to show you the result and then I want to show you how to tweak this a little bit.
There are three outputs: the example set, the topic list, and the model itself. I’m going to show this to you. This will take a moment, but you can imagine what it’s doing in the background. The first result you get is simply telling you that you have 10 topics. Why do you have 10 topics? Well, it’s because I told it to create 10 topics. You could tell it to create any number of topics you want. That’s somewhat of a judgment call. You’ll notice from before, when I did the classification manually, I only had four classes. Here I just decided to do 10. Here is the word list, similar to what you saw from the Text Processing extension. However, here’s what makes this really quite interesting. It has created 10 topics, each with what it calls a topic ID. Here’s topic 0, topic 1, all the way down to topic 9. And it’s showing you the top words that it used to create each topic. It did this on its own, following its own heuristic, and it assigns weights to the words in each topic. You can see here in topic 4 that the words school, professor, additional, and researchers were weighted very highly. Isn’t that interesting? If you look at this, topic 4 is all about students asking RapidMiner chat people for homework help. I have to tell you, this happens all the time, and the LDA operator automatically identified those conversations as the ones coming from students. We love students. We have no problem with it, but maybe you should do your own homework. Here you’ll notice a similar thing, but these people are looking for clustering work. You can look down here and find all sorts of other conversations. Here is somebody possibly looking for more marketing material. Schools and professors, but also sales information. Here’s another one about Tableau and training. Here’s another one about installation and this kind of thing. So you can play with the number of topics any way you like. I want to go back to how you would tweak this.
The LDA algorithm has two different heuristics, or variables if you like, that you use to decide how you want these topics to be created. They are called alpha and beta. The standard is 50 for alpha, and the standard is 50 as well for beta, I believe. Again, it’s not the purpose of this webinar to really go through how you use those. If you simply do a Google search for alpha and beta heuristics with latent Dirichlet allocation, you will find it. The important piece here is that we are going to have all of these topics already created, as opposed to trying to manually create a training set. This creates, in a sense, a set of classes on its own. It’s really quite amazing.
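To give a feel for where alpha and beta enter the algorithm, here is a minimal collapsed Gibbs sampler for LDA in plain Python. It is a teaching sketch only: the webinar does not describe how the RapidMiner operator is implemented, and the default values for `alpha`, `beta`, and `n_iter` below are illustrative assumptions, not the operator’s settings. Alpha smooths how topics are spread within a document; beta smooths how words are spread within a topic.

```python
import random

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns (doc_topic, topic_word): per-document topic probabilities and
    per-topic word probabilities.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    n_dk = [[0] * n_topics for _ in docs]            # topic counts per document
    n_kw = [[0] * n_words for _ in range(n_topics)]  # word counts per topic
    n_k = [0] * n_topics                             # total tokens per topic
    z = []                                           # topic assigned to each token
    for d, doc in enumerate(docs):                   # random initial assignment
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], z[d][i]
                # Remove the token, then resample its topic given all the others.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                weights = [(n_dk[d][t] + alpha) * (n_kw[t][v] + beta)
                           / (n_k[t] + n_words * beta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1
    doc_topic = [[(n_dk[d][t] + alpha) / (len(doc) + n_topics * alpha)
                  for t in range(n_topics)] for d, doc in enumerate(docs)]
    topic_word = [{w: (n_kw[t][word_id[w]] + beta) / (n_k[t] + n_words * beta)
                   for w in vocab} for t in range(n_topics)]
    return doc_topic, topic_word
```

Each row of `doc_topic` plays the role of the per-document confidences, and sorting a `topic_word` entry by weight gives the top words for that topic.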
One other thing in the results is that it also creates, as I said, confidence levels of each document for each topic. So, for example, you can take any one of these. Let’s just take this one here. And you can see that the predicted class is topic 2.
Now, it doesn’t really tell you what topic 2 is. If you want to get a sense of what topic 2 is, you’re going to go to your word list, look at the words that it found most useful, and try to get the gist of it. This one is obviously about downloading problems and versions and people having trouble with how that works. Once you’re here, you’ll see that the predicted class is topic 2, and that’s because, probabilistically, topic 2 has the highest probability in this row. That’s the most likely one, and that’s where we get the predicted class. The next most likely one here is topic 1 at .132. All these other probabilities are quite low, .026, which is probably just random chance. You can see for the row below that the predicted class is topic 1, and that’s because the probability there is highest. And you can see, again, that not only does it try to classify each conversation into a topic, but it gives you confidence values for the probability that it’s in that topic and the probability that it’s in the next most likely topic, which I think is amazing. Lastly, it’s going to give you the model itself, which was actually this first tab. This is a model very similar to what you get from TF-IDF or any kind of predictive analytics model, and we’re going to use that when we move to operationalization. So again, the point is we’re going to create a model from these training sets. One way is basically supervised learning, which is TF-IDF. The other I would almost call unsupervised learning, like a clustering algorithm, where it’s going to find its own topics and we’re going to use those as a training set. Then the user of that chat bot can send the query to the server, get a response, and better classify the gist of this person’s query.
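Picking the predicted topic and the runner-up from a row of confidences, as described above, is just a sort by probability. A small sketch, where the function name and the example row are hypothetical and the numbers are made up for illustration:

```python
def rank_topics(confidences):
    """Return (topic, probability) pairs, most likely topic first."""
    return sorted(confidences.items(), key=lambda item: item[1], reverse=True)

# Made-up confidence row for one chat message (not taken from the webinar data).
row = {"topic_0": 0.026, "topic_1": 0.132, "topic_2": 0.5, "topic_3": 0.026}
ranked = rank_topics(row)
predicted_topic, runner_up = ranked[0][0], ranked[1][0]
```

The first entry corresponds to the predicted class column; the second is the next most likely topic.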
I’m just going to go back to the agenda briefly. I’ve now done text mining of online chat conversations and I’ve done the categorization of unstructured data using LDA. I have one more here, which is operationalization, and I’m going to show that to you right now. So the first thing to do is apply these models. I’m going to go back to TF-IDF and show you the technique to use if you have a new document and you want to classify it. How do I do this? Well, the first thing I’m going to do is simply get that document here. So, for example, I have here a document from the chat bot: “Hello, I’m a university student. How do I download RapidMiner and get an academic license?” Very common query. I’m going to take that and run it through the same ETL processes I did before. Then I’m going to take the TF-IDF word list that I generated earlier, and I’m going to create a word vector of this one query using the Process Documents operator. Where do I go from here? I want to apply a predictive analytics model to that word vector in order to predict which class that query is going to be in. I need a model. So where do I get that model? To me the best way to get that model is from Auto Model. So I’m going to take a quick break here and show you how we do this in Auto Model. The reason I use Auto Model is that it’s fast and simple, and I get a model that I can use to operationalize. I’m going to go into this webinar folder and grab this message classification data set here. And then I’m going to run it through TF-IDF as I showed you earlier. This is the TF-IDF set for Auto Model, and if I just click next you will recognize this. This is the conversation ID, the text, the class, and the word vectors that you saw earlier.
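The scoring idea described above, projecting one new query onto the word list saved from training and then applying a model to that vector, can be sketched as follows. Everything here is an assumption for illustration: the `idf` weights would come from the training run, the linear model with a softmax stands in for whatever model Auto Model produces, and the coefficients in any usage would be placeholders.

```python
import math
import re

def vectorize_query(text, word_list, idf):
    """Build a length-normalized TF-IDF vector over the saved training word list.

    Words in the query that are not in word_list are simply ignored.
    """
    tokens = [t for t in re.split(r"[^a-z]+", text.lower()) if t]
    if not tokens:
        return [0.0] * len(word_list)
    raw = [tokens.count(w) / len(tokens) * idf[w] for w in word_list]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]

def predict(vector, coefficients, intercepts):
    """Softmax over per-class linear scores; returns {class: confidence}."""
    scores = {c: intercepts[c] + sum(w * x for w, x in zip(coefficients[c], vector))
              for c in coefficients}
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}
```

The key design point is that the new document must be vectorized with the same word list as the training data, so that each position in the vector lines up with the column the model was trained on.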
I’m going to select the class as the column that I want to predict. I’m going to have it do its pre-processing, and I’m just going to run through. At this point I do want to point out that Auto Model may not consider these particular word vectors to be useful. As you can see here, it’s saying that these columns are very similar, which makes sense. I am going to select all of them anyway. The only ones I’m going to deselect, and I think it’s important that you override when needed, are this conversation ID, which is obviously not useful, and the text itself, because that’s not useful either. Everything else is useful. I’m going to run all these models and automatically optimize. One of the many amazing things about Auto Model, while this is running, is that it not only gives you the accuracy of each model, but it also gives you the runtime, the time it took to run this model. This is an important trade-off when you start moving from development to deployment. You can see here, for example, that GLM, which is a very, very basic model, was very fast and very accurate. You can look at random forest here, which is a very popular algorithm: only 41% with TF-IDF and very, very slow. GBT, another very popular algorithm, isn’t even finished yet. It is a very common misconception that the more sophisticated the algorithm, the more accurate it’s going to be for your data set, which is not always true. Naive Bayes often works very, very well. Deep Learning is a more sophisticated algorithm, but it did not have the performance in this situation that GLM had. You’ll notice that GBT is taking a long time. It’s running my processor very hard right now. You can see here all my CPUs are running like crazy, and it’s still taking a while. I’m going to let that run for a minute. Meanwhile, I’m going to look at the GLM because it does seem to be the most promising.
And when you click here you will see the model, a very, very basic linear model with coefficients. If you want to simulate what this is going to look like with different word vectors, you can. And you can also see a confusion matrix.
This is not a great model. It’s only 63%, but it has a very good precision for No. It’ll probably over-classify No, which is actually what’s going to happen. And it does give a classification error. We’re still waiting on GBT. I’m just going to wait another moment. I think it will finish momentarily. Meanwhile I can show you very quickly what the decision tree looks like. And here’s what’s interesting. If the word educational license is there, that is the most likely path, obviously, and then the word click, license, and so on. So the decision tree is useful because it does show you some insight. You can see it’s not a very sophisticated model. It’s basically discriminating between No and download license. It doesn’t give me any predictions for information or learning, and you can see from the confusion matrix that it’s not as good a model as the GLM. You go to random forest: not as good a model, but you can see it does use some other word vectors here. So more is not always better. It’s still going, the GBT, and there it is. So the GBT is finished. It took a lot longer than any other method here, with only a 57% accuracy. To me, the clear winner is GLM. So that’s the model I’m going to go with. I’m going to go to GLM. I’m going to take that model here, and then here’s the magic button with RapidMiner Auto Model. This is the reason that RapidMiner Auto Model is by far the best auto modeler on the market right now. It’s this button right here, open process. This is not a black box. Auto Model builds your model in the background. It built this entire process while we were sitting there. And if you wanted to, you could simply take this model and push it to operationalization. I’m not going to do that. I simply used Auto Model to prototype the most tuned model I could get without a lot of work. So I’m actually going to go here and simply grab that model and put it in my model here.
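The accuracy and per-class precision read off the confusion matrix above come down to simple counting. A minimal sketch with made-up labels (the function name and the data in any usage are illustrative, not the webinar’s results):

```python
from collections import Counter

def evaluate(actual, predicted):
    """Return overall accuracy and precision for each predicted class."""
    pairs = Counter(zip(actual, predicted))  # (actual, predicted) -> count
    accuracy = sum(n for (a, p), n in pairs.items() if a == p) / len(actual)
    precision = {}
    for cls in set(predicted):
        n_predicted = sum(n for (a, p), n in pairs.items() if p == cls)
        precision[cls] = pairs[(cls, cls)] / n_predicted
    return accuracy, precision
```

A class the model never predicts, like information or learning in the decision tree described above, has no precision entry at all, which is one way such a model shows up as weaker than its accuracy alone suggests.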
Or another way to do that, if I just back up here, go back to Auto Model, and open this process, is to run this process and port the model out to a saved version. Either way will give you the model that you need to score new incoming data sets. I did it the other way. I ran this earlier, took the GLM model from here, ported it out, and saved it. If you go to the data sets here you will see auto model GLM for TFIDF; that is simply this model taken from this port and saved. So now let’s go here. That’s what this is right here. Then I can simply apply the model. So now I can take this text and run it through TF-IDF. Here’s the model I just created with Auto Model. I’m going to apply it and get a score. Again, just a reminder in case you’ve lost track, this is the text I’m going to score right here. I press Go. Super fast. And here’s what’s interesting. The text is, “I’m a university student. How do I download RapidMiner and get an academic license?” in plain natural language. The prediction? Download license. Thank you very much. With a confidence level of .522. The second highest confidence is No, and that’s because it wasn’t clear that any of the others would be very good. And you can see the high numbers in the word vectors for words like academic and so on. So there’s our score. There’s our prediction. Just to show you that you can also do this using LDA: I’m not going to do this when I push it to server, but just so you know, you would do the same process here in my ETL. You create this LDA model using the operator in the Operator Toolbox, and then you use this one right down here, Apply Model. So you would run LDA and then use the Apply Model operator, which is the one I’m using right here. Take your document, put in the model, and get the output. One special thing you must note.
You must create again a collection of documents, even if you only have one document. All the LDA operators require a collection of documents as an input.
I’m going to do this and score it, and it simply gives me topic 6, which you’ll remember from earlier was the one about downloading and licenses and so on, with a confidence level of .416. The beauty of this is that I didn’t need to create a training set. This was an unsupervised methodology to create these topics, rather than a supervised version where I had to manually classify. Last thing. Let’s get this onto something we can actually use in production. So here’s the last part. I’m going to push it to server. What I’m going to do now is go to my RapidMiner Server and duplicate what I’ve done here on the server. I’m going to save this in the server. The reason it’s saying it’s not allowed is because I’m running RapidMiner Studio A2 and my server is on A1, but I’m going to show this to you on the server side right now. I’m going to go here and show you on the server side how this works. You’re going to want to create a web service here. This is my process, number 7, and I’m going to show you what that looks like. The process is as follows. I’m going to give it a macro for the text, and I’m going to call it body. I actually don’t even need this Multiply operator; I was doing something else. Then I’m going to apply that scoring process from earlier. Then I’m going to select only the prediction, and then this is how you operationalize to a server app. You’re going to use the Publish to App operator. I’m going to call it the TFIDF response. You can call it whatever you want. This is how you push the response back to the server. So again, I go here. I’m simply saying, “Run this process.” And then, if you remember from before, in some sense you are creating your own API, so it wants to know what the response should look like.
Well, I’m going to choose JSON, or I could choose XML, just like what Drift and Google Cloud gave us.
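From the client side, consuming a JSON web service like this one might look like the sketch below. The endpoint URL, the shape of the JSON response, and the `prediction(class)` field name are all assumptions made for illustration; only the `body` parameter name comes from the macro described above. Authentication is omitted and would need to be added for a real deployment.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_url(base_url, body):
    """Attach the chat text as the 'body' query parameter."""
    return f"{base_url}?{urlencode({'body': body})}"

def parse_prediction(payload, field="prediction(class)"):
    """Read the predicted class from a JSON array of result rows (assumed shape)."""
    rows = json.loads(payload)
    return rows[0][field]

def classify(base_url, body):
    """Call the web service and return the predicted class for one message."""
    with urlopen(build_url(base_url, body)) as response:
        return parse_prediction(response.read().decode("utf-8"))
```

A chat front end would call something like `classify("https://example.com/hypothetical-service", message_text)` for each incoming message and show the returned class to the agent.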
This has been pulled from my RapidMiner process. And if I click submit (I’ve already submitted it), I can then test it. And it says, well, “Download license.” That’s not very useful. Let’s actually test it with something else. So let’s say, “Hello, I would like to learn about how to use RapidMiner decision trees and clustering algorithms.” And it predicts “learn.” You can then take this result and put it in any front-facing web service. It’s really that simple. Here’s the URL you would use. You would set up your security protocols and off you go. Operationalization is often the hardest part in enterprise solutions, but in RapidMiner it’s actually probably the easiest part. So that’s the end of this webinar. Just in summary, this webinar was about how to use REST APIs in RapidMiner, how to do text mining, how to use the new LDA topic operator, and how to push those solutions into deployment. That’s the end of today’s webinar. Thank you very much for joining me. If you ever want to reach out to me directly, I am the manager of the RapidMiner user community. And if you have questions, you can post them in this GoToMeeting and we will reply as best we can. Thank you so much. Have a great day.