08 September 2022


A Deep Dive Into RapidMiner’s NLP Extension

Now more than ever, enterprises rely on understanding their brand perception—both from their customers and their employees. Text analytics allows them to tap into structured and unstructured data so they can proactively respond to trends in feedback, whether they’re positive or negative. 

To support the growing use of natural language processing (NLP), we launched our NLP extension in October 2021, and it is accessible to all RapidMiner users. Since then, we’ve introduced and continued to enhance capabilities that make it easier for customers to create cutting-edge solutions in a variety of fields, including text analytics, sentiment analysis, and predictive text. 

In this post, we’ll walk you through the new NLP extension and show you how it blends the intuitive RapidMiner process architecture with the speed and versatility of text analytics tools like Stanford CoreNLP. 

The Importance of Text Analytics 

Text analytics describes the process of using NLP to convert unstructured text data into structured data that can be interpreted by machine learning models.
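
To make the idea of turning unstructured text into structured data concrete, here is a minimal sketch in plain Python. It is not how the extension works internally; the function name `to_term_counts` and the sample reviews are our own illustrative assumptions. It simply turns each document into a row of term counts over a shared vocabulary, which is the kind of table a machine learning model can consume.

```python
import re
from collections import Counter


def to_term_counts(documents):
    """Turn raw strings into count rows over a shared, sorted vocabulary.

    Hypothetical helper for illustration only; not part of any RapidMiner API.
    """
    # Lowercase each document and keep runs of letters, digits, and apostrophes.
    tokenized = [re.findall(r"[a-z0-9']+", doc.lower()) for doc in documents]
    # The vocabulary becomes the column headers of the structured table.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts.get(term, 0) for term in vocabulary])
    return vocabulary, rows


reviews = ["Great phone, great battery", "Battery died fast"]
vocabulary, rows = to_term_counts(reviews)
print(vocabulary)
for row in rows:
    print(row)
```

Each row now has the same fixed set of columns, one per term, regardless of how long the original text was.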

In knowledge-driven sectors, business-critical decisions depend on finding pertinent information quickly among a vast amount of text, which makes text analytics an essential tool in an organization’s toolbox. 

Use Case Highlight for Text Analytics & NLP: Sentiment Analysis 

Customer sentiment analysis is one of the most common ways NLP is leveraged to create value in organizations. 

RapidMiner’s New NLP Extension 

RapidMiner’s NLP extension enables users to efficiently create pipelines for text analytics by offering a simple design that meets the needs of users with varying data science skill levels. Here’s how it works. 

One of our main design goals in making this extension widely applicable was that an annotation pipeline can be applied to any text, from a single sentence to a paragraph or an entire paper. Processing pipelines are also relatively straightforward to set up and execute.  

Below, we describe the available annotators, concentrating on the English models. Note that while some of the models underpinning the annotators are trained via supervised machine learning on annotated corpora, others are rule-based components, which frequently require their own language resources. 

Using RapidMiner’s NLP Tagger

To use this pipeline operator, you need to specify which single text column the text processing should be performed on. This is selected via the ‘text attribute’ parameter.

The language can be selected either by changing the advanced language parameter or by changing the default language.

Enabling named-entity tagging via the ‘ner tagging’ checkbox adds a new column to the output containing the identified named-entity tags. It also provides the advanced option of reading in a user-provided list of custom NER tags and their corresponding text snippets.
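
To illustrate how a list of custom tags and text snippets can drive tagging, here is a small sketch in plain Python. The file format of the user-provided list is not specified in this post, so the tab-separated `snippet<TAB>TAG` layout, the function names `load_custom_tags` and `tag_tokens`, and the `O` label for untagged tokens are all assumptions made for illustration; this is not the extension’s implementation.

```python
def load_custom_tags(lines):
    """Parse 'snippet<TAB>TAG' lines into {tuple_of_lowercased_words: TAG}.

    The tab-separated layout is an assumed format, chosen for illustration.
    """
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        snippet, tag = line.split("\t")
        entries[tuple(snippet.lower().split())] = tag
    return entries


def tag_tokens(tokens, entries):
    """Label tokens with custom tags, preferring the longest matching snippet."""
    tags = ["O"] * len(tokens)  # "O" marks tokens outside any entity
    max_len = max((len(key) for key in entries), default=0)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in entries:
                tags[i:i + n] = [entries[key]] * n
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


custom = load_custom_tags(["RapidMiner Studio\tPRODUCT", "Stanford\tORGANIZATION"])
tokens = ["We", "run", "RapidMiner", "Studio", "with", "Stanford", "CoreNLP"]
tags = tag_tokens(tokens, custom)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Matching the longest snippet first keeps multi-word entries such as “RapidMiner Studio” from being split by shorter entries that share a word.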

Enabling the dependency parsing option adds two more columns with the incoming and outgoing dependencies for each token. Likewise, each additional option you enable adds a new column containing the results of that operation. 
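
The sketch below shows one way to picture per-token incoming and outgoing dependency columns. The column names (`incoming`, `outgoing`), the cell format, and the hand-written edges are hypothetical; the actual output of the NLP Tagger may be laid out differently.

```python
def dependency_columns(tokens, edges):
    """Build one row per token with its incoming and outgoing dependencies.

    edges: (head_index, relation, dependent_index) triples; a head_index of -1
    marks the root of the sentence. Layout is illustrative, not the operator's.
    """
    rows = [{"token": tok, "incoming": [], "outgoing": []} for tok in tokens]
    for head, relation, dependent in edges:
        head_word = "ROOT" if head < 0 else tokens[head]
        # The dependent token receives the edge ...
        rows[dependent]["incoming"].append(f"{relation}<-{head_word}")
        # ... and the head token (if any) emits it.
        if head >= 0:
            rows[head]["outgoing"].append(f"{relation}->{tokens[dependent]}")
    return rows


tokens = ["Customers", "love", "the", "app"]
# Hand-written parse for illustration; a real parser would produce these edges.
edges = [(-1, "root", 1), (1, "nsubj", 0), (1, "obj", 3), (3, "det", 2)]
rows = dependency_columns(tokens, edges)
for row in rows:
    print(row)
```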

Building NLP Processes in RapidMiner 

Building an NLP process relies on the operator called NLP Tagger, which executes multiple operations ranging from tokenization to sentiment analysis. Here are some more details. 

NLP system architecture

Tokenization (tokenize): Splits the text into a sequence of tokens (words, numbers, and punctuation marks) so that your unstructured string becomes a list of discrete units that later steps can process.  

Sentence splitting (ssplit): Splits a sequence of tokens into sentences so that each sentence, and the sentiment it expresses, can be analyzed on its own. 

Truecasing (truecase): Determines and restores the proper capitalization of text where it has been lost (for example, in all-caps or all-lowercase input), making it easier to detect proper nouns. This is especially useful for tasks requiring text translation. 

Part-of-speech tagging (pos): Labels each token with its part-of-speech (POS) tag; the tags can usually distinguish verb tense as well. POS tags show the grammatical role each word plays, making it easier to determine the intent of particular words across a large volume of text. 

Morphological analysis (lemma): Reduces each token to its base form, or lemma. For example, “leafs” and “leaves” (used as a noun) would both become “leaf.” Grouping inflected forms this way helps you understand how a word is being used in a particular context, say, when a customer converses with a chatbot online.  

Named Entity Recognition (ner): Recognizes words or phrases that are the names of people, places, organizations, etc. using a combination of CRF sequence taggers trained on various corpora. NER can help identify pertinent terms across a large body of text and is particularly useful in document classification. 

Syntactic parsing (parse): Provides full syntactic analysis, including both constituent and dependency representations, based on a probabilistic parser. The parse shows how text is structured and how words relate to one another, which is essential for accurate text analytics. 

Sentiment analysis (sentiment): Uses deep learning to assign a sentiment score to each node of the binarized parse tree of each sentence. This can be performed on product reviews, for example, to help monitor how customers are responding to a new product. 
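
To show how the steps above chain together, here is a deliberately simplified sketch in plain Python. It does not use the extension or Stanford CoreNLP: the tokenizer is a regular expression, the lemmatizer is a tiny lookup table, and sentiment comes from a toy word list rather than a deep learning model over parse trees. All names and data in it (`LEMMAS`, `POLARITY`, `annotate`, and so on) are illustrative assumptions.

```python
import re

# Toy resources for illustration only; real annotators use trained models.
LEMMAS = {"leaves": "leaf", "leafs": "leaf", "was": "be", "loved": "love"}
POLARITY = {"love": 1, "great": 1, "slow": -1, "broken": -1}


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def ssplit(tokens):
    """Group tokens into sentences, closing a sentence at '.', '!' or '?'."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack end punctuation
        sentences.append(current)
    return sentences


def lemma(token):
    """Look up the base form; fall back to the lowercased token."""
    return LEMMAS.get(token.lower(), token.lower())


def sentiment(sentence):
    """Sum word polarities; a lexicon stand-in for a tree-based model."""
    score = sum(POLARITY.get(lemma(tok), 0) for tok in sentence)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


def annotate(text):
    """Return one row per token, with one column per annotation step."""
    rows = []
    for index, sentence in enumerate(ssplit(tokenize(text)), start=1):
        label = sentiment(sentence)
        for tok in sentence:
            rows.append({"sentence": index, "token": tok,
                         "lemma": lemma(tok), "sentiment": label})
    return rows


rows = annotate("I loved the app. Checkout was slow!")
for row in rows:
    print(row)
```

Note the limits of the shortcuts: this sentence splitter would break on abbreviations such as “Dr.”, and the lookup lemmatizer ignores part of speech, which is why production pipelines rely on trained models for these steps.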

Applications for the NLP Extension 

The NLP extension can be used across industries, notably in:


Financial Services 

When financial advisors don’t make the proper disclosures in “client advice” documents, they run the risk of breaking the law. These disclosures might include information about potential conflicts of interest, commissions, and credit costs. Text analytics can flag documents with missing disclosures, converting hours of human review into minutes of algorithmic work and freeing financial institutions to focus on personalizing the client experience and increasing profitability.  


Banking 

Today’s banking customers rely more on mobile banking, automated tellers, and paperless statements; many banks don’t even have physical locations anymore. By utilizing text analytics, banks can improve their risk management efforts, remove bias from their offerings, and make data-driven decisions about their products and services.  

Retail and E-commerce 

Text analysis can help e-commerce businesses understand their consumers’ behavior, which can be leveraged to increase sales. They can do this by monitoring the buzz around new products across channels, identifying problem areas and recommendations for improvement, and enhancing their services with this feedback. 


Healthcare 

Studying vast amounts of unstructured text-based data, such as nursing notes, clinical agreements, prescriptions, and medical publications, is one of the biggest challenges in healthcare analytics. Text analytics is increasingly recognized for its ability to help healthcare organizations improve patient-doctor relationships, save time interpreting medical reports, and detect fraudulent activity involving inappropriate prescriptions, referrals, counterfeit insurance claims, or medical bills.  

Wrapping Up 

We built RapidMiner’s NLP extension to provide an easy-to-use approach that enables users to run NLP processes that can, and will, make an impact at their organization. The extension is based on the Stanford CoreNLP suite, a collection of pre-trained state-of-the-art models, and it comes with flexible configurations to best suit your use case. 

If you’re already a RapidMiner user and ready to use cutting-edge features that are not necessarily easy to find on industrial NLP platforms, download the extension from the Marketplace. You’re also welcome to start a discussion in our Community to share your experience or ask any questions about how to use it. 

If you’re not a RapidMiner user yet, you can request a demo today. This extension is just one example of the ways RapidMiner can streamline your work and integrate with your current enterprise analytics landscape. 
