Data mining is the process of uncovering patterns in large sets of structured data in order to predict future outcomes. Structured data is data organized into rows and columns so that it can be accessed and modified efficiently. Using a wide range of machine learning algorithms, data mining can support use cases such as increasing revenue, reducing costs, and avoiding risk.

If you are looking to analyze unstructured data (e.g., data from essays, articles, computer log files, etc.), see text mining.

Data mining process and tools

The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a widely adopted framework that describes a standard approach to data mining. The process outlines six phases:

  1. Business understanding  
  2. Data understanding  
  3. Data preparation  
  4. Modeling  
  5. Evaluation  
  6. Deployment 

The first two phases, business understanding and data understanding, are both preliminary activities. It is important to first define what you would like to know and what questions you would like to answer and then make sure that your data is centralized, reliable, accurate, and complete.  

Once you’ve defined what you want to know and gathered your data, it’s time to prepare it – this is where you can start to use data mining tools. Data mining software can assist in data preparation, modeling, evaluation, and deployment. Data preparation includes activities such as joining or reducing data sets and handling missing data.
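As a rough illustration of what data preparation looks like in code (RapidMiner itself offers these operations visually, without coding), here is a small sketch using the pandas library. The column names and values are hypothetical:

```python
import pandas as pd

# Two hypothetical data sets: customer records and order totals
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34.0, None, 52.0],      # one missing value
})
orders = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "total_spent": [120.0, 80.5, 310.0],
})

# Join the two data sets on their shared key
data = customers.merge(orders, on="customer_id", how="inner")

# Handle missing data by imputing the column mean
data["age"] = data["age"].fillna(data["age"].mean())

print(data)
```

Joining combines related tables into one analysis-ready set, while imputation keeps rows with gaps usable instead of discarding them.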

The modeling phase is when you use a mathematical algorithm to find patterns that may be present in the data. Such a pattern is a model that can be applied to new data. At a high level, data mining algorithms fall into two categories – supervised learning algorithms and unsupervised learning algorithms. Supervised learning algorithms require a known output, sometimes called a label or target; examples include naïve Bayes, decision trees, neural networks, support vector machines (SVMs), and logistic regression. Unsupervised learning algorithms do not require a predefined set of outputs but instead look for patterns or trends without any label or target; examples include k-means clustering, anomaly detection, and association mining.
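To make the supervised/unsupervised distinction concrete, here is a minimal sketch using scikit-learn and its bundled iris data set (an illustrative stand-in for your own data, not part of the RapidMiner workflow itself):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the known label y is required to train the decision tree
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Unsupervised: k-means looks for groupings using only X, no label
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(clf.predict(X[:1]), km.labels_[:5])
```

The decision tree cannot be trained without `y`; k-means never sees `y` at all, which is exactly the difference the two categories describe.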

The evaluation phase tells you how well (or poorly) your model performs. Cross-validation and testing for false positives are examples of evaluation techniques available in data mining tools. The deployment phase is the point at which you start using the model's results.
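Cross-validation, for example, repeatedly holds out part of the data for testing so the score reflects performance on data the model has not seen. A brief sketch with scikit-learn (again purely illustrative; RapidMiner provides this as a visual operator):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4 folds, evaluate on the held-out fold,
# and repeat so every fold serves as the test set once
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print(scores.mean())
```

Averaging the five held-out scores gives a more honest estimate of accuracy than scoring the model on its own training data.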

Introducing RapidMiner: Data science for the enterprise

RapidMiner Studio is a powerful data mining tool for rapidly building predictive analytic workflows. This all-in-one tool features hundreds of data preparation and machine learning algorithms to support all your data mining projects.

Lightning Fast Data Science

RapidMiner Studio is a visual workflow designer that lets data scientists use machine learning to produce insights on any data at any scale.

Unified Platform

Replace multiple IBM products with one. Prep and blend data, create predictive models, and deploy into production—all in a single tool. No coding required.

Open to Any Data, Any Machine Learning Algorithm

RapidMiner Studio is a future-proof open source data science platform. Add new functionality from the RapidMiner Marketplace, and re-use existing R and Python code if you have it.
