AI has practically limitless applications, especially for organizations grappling with supply chain delays, fraud, and shifting customer expectations, but working with textual data is one of the most intriguing.
We encounter textual data every day; our lives are filled with continuous reading, parsing, and understanding of text. But working with text is also one of the most challenging data science tasks: texts can have multiple meanings, are open to interpretation, and are full of contextual and cultural nuances.
Based on the findings of a recently completed project by the Research Team at RapidMiner, we want to highlight common challenges of large NLP (Natural Language Processing) projects and share some best practices.
A Bit of Background on the TechRad Project
The goal of the TechRad project was to create an autonomous technology scouting engine that helps engineers and CTOs discover and monitor emerging technology trends within their area of expertise.
Technology scouting helps businesses stay on the cutting edge and understand what’s going on in their space right now. But it’s also time- and resource-intensive when done internally, and expensive when outsourced to professional agencies.
TechRad’s vision was to collect and analyze a large volume of documents and use that textual data to build an evaluation engine based on state-of-the-art NLP algorithms. This engine would be able to find new technologies, determine their application areas, and evaluate their technology readiness level. The whole project was supervised by legal experts who assessed the potential repercussions the project might have in practice.
Overcoming Key NLP Project Challenges
Text mining projects are typically pretty nuanced (due to the interpretation challenges we mentioned before). Here are the two principal challenges we encountered during the TechRad project and what we did to overcome them.
Good Data: The Wings Every Data Science Project Needs to Fly
When we started the project, we didn’t have an existing dataset, so we first had to collect data. The plan was to have a continuously updated system that could detect emerging trends, as a static document collection would become outdated quite fast.
We used two methods to execute this:
- Web crawling to automatically discover interesting websites, follow links to further articles, and download the texts we found along the way. The issue with web crawling is that there’s no guarantee the downloaded documents are relevant (a data science challenge in its own right). Also, the data is rarely in a uniform format and can include all kinds of unwanted noise (HTML tags, navigational frames, etc.).
- API access to directly query selected sources that return data in a structured way. The benefits of this approach are that you have more control over the kind of data you retrieve, and the fixed format makes it easier to organize the results. The drawbacks are that APIs often restrict usage, can be expensive, and, in contrast to web crawling, only return what you’re already explicitly looking for.
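To illustrate the cleanup step that crawled pages require, here’s a minimal sketch (not the project’s actual pipeline) that strips tags and navigational noise from a downloaded page using only Python’s standard library; the sample page and the set of skipped tags are illustrative:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/nav blocks."""
    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep text only when we're not inside a skipped block
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def clean_html(raw: str) -> str:
    parser = TextExtractor()
    parser.feed(raw)
    return " ".join(parser.parts)

page = ("<html><head><style>p{color:red}</style></head>"
        "<body><nav>Home</nav><p>Deep learning for NLP</p></body></html>")
print(clean_html(page))  # → Deep learning for NLP
```

In practice you’d layer relevance filtering on top of this, but even a simple pass like the one above removes most of the markup noise mentioned earlier.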
For the training and evaluation phase, we used research papers from the arXiv.org repository of over 1.7 million documents and Wikipedia articles about selected technologies.
Asking the Right Questions and Defining the Correct Scope
This is one of the greatest challenges in any data science project, but the ambiguous goal of interpreting texts and coming to meaningful conclusions makes it even more difficult.
To start with, the seemingly simple question “What actually is a technology?” led to many long discussions within our group. Even without considering edge cases, this is much harder to define than, for example, determining whether there’s a cat in a picture.
In the end, the definition we settled on was: “A technology is a distinguishable item, such as an algorithm, a commercial product, a software library, or a framework.” The next step was to create a training dataset so we could train a machine learning algorithm to follow our definition and recognize new technologies based on textual context.
For labeling, we used the open-source tool doccano, which makes it extremely easy to label text snippets. We set up a doccano instance, fed it several thousand research papers publicly available on arXiv.org, and in a distributed effort labeled over two thousand text snippets in a few days. The model trained with this data was able to successfully identify new technologies, like specific algorithms or production procedures.
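To give a flavor of what happens between labeling and training: doccano can export span annotations as JSONL, which then need to be converted into token-level tags for an NER model. Here’s a minimal sketch under that assumption; the exact export key (`label` here) depends on your doccano version, and the sample record and `TECH` tag are illustrative, not from the project:

```python
import json

def doccano_to_bio(line: str):
    """Convert one doccano-style JSONL record with [start, end, tag]
    character spans into whitespace tokens with BIO tags."""
    record = json.loads(line)
    text, spans = record["text"], record.get("label", [])
    tokens_with_tags = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"
        for s, e, name in spans:
            # token fully inside an annotated span gets B-/I- prefix
            if start >= s and end <= e:
                tag = ("B-" if start == s else "I-") + name
                break
        tokens_with_tags.append((token, tag))
    return tokens_with_tags

sample = '{"text": "We fine-tuned BERT embeddings", "label": [[14, 18, "TECH"]]}'
print(doccano_to_bio(sample))
# → [('We', 'O'), ('fine-tuned', 'O'), ('BERT', 'B-TECH'), ('embeddings', 'O')]
```

The resulting token/tag pairs are the standard input format for most sequence-labeling trainers.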
With the labeled data, we could complete our pipeline and build our demonstration of the technology scouting engine. It included:
- Automated keyword extraction to highlight topics and common themes
- Technology detection and extraction via Named Entity Recognition (NER) with fine-tuned algorithms
- Acronym extraction and mapping (as you know, industry people love their acronyms 😊)
- Classification of the technical readiness level (TRL) of a found technology
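To give a taste of one of these components, here’s a simple heuristic sketch of acronym extraction (not the engine’s actual implementation): it pairs an acronym with its spelled-out form by matching the common “Long Form (ACRO)” pattern and checking the initials of the preceding words:

```python
import re

def extract_acronyms(text: str) -> dict:
    """Map acronyms to spelled-out forms via the 'Long Form (ACRO)'
    pattern; a heuristic, not a full acronym-resolution algorithm."""
    mapping = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acro = match.group(1)
        # take as many preceding words as there are letters in the acronym
        words = text[:match.start()].rstrip().split()
        candidate = words[-len(acro):]
        if candidate and "".join(w[0].upper() for w in candidate) == acro:
            mapping[acro] = " ".join(candidate)
    return mapping

text = ("We detect technologies via Named Entity Recognition (NER) "
        "and track the Technology Readiness Level (TRL).")
print(extract_acronyms(text))
# → {'NER': 'Named Entity Recognition', 'TRL': 'Technology Readiness Level'}
```

Real-world text needs fuzzier matching (lowercase fillers, hyphenation, reordered letters), but this initial-letter check already resolves the most common cases.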
When working through an NLP project, having properly labeled data and a clear understanding of the problem you’re trying to solve are the keys to success.
With TechRad in particular, we were interested to see how powerful modern deep learning models are at extracting knowledge from large text collections. We’ll take these findings and distill them into our upcoming work on integrating specialized deep learning models into our product suite.
For the technically inclined, we also plan to provide more deep dives and overviews of the workings of transformer-based deep neural networks, such as BERT and its successors. Stay tuned!
Want to learn more about what’s new at RapidMiner? Check out the blog post from our founder, Ingo, on our next gen platform launch.