RapidMiner Radoop extends RapidMiner's core in-memory functionality with sophisticated operators implemented for in-Hadoop execution. Radoop includes more than 60 operators for data transformation as well as advanced and predictive modeling, all running on a Hadoop cluster in a distributed fashion.

  • Visual programming environment that makes processes easy to develop and maintain

  • Integration of SparkR scripts running in your own environment within the visual processes

  • Integration of PySpark scripts running in your own environment within the visual processes (see the sketch after this list)

  • Automatic execution of analytic workflows in Hadoop (run the process where the data is)

  • Purely functional operators for data access, data preparation, and modeling; the underlying technology remains transparent to the user

  • Supports Cloudera, Hortonworks, MapR, Amazon EMR, vanilla Apache Hadoop, and Microsoft Azure HDInsight (other Hadoop distributions may be integrated by specifying the proper libraries and dependencies)

  • Supports Kerberos authentication

  • Supports data access authorization via Apache Sentry and Apache Ranger

  • Supports HDFS encryption to seamlessly integrate with data security policies

  • Supports Hadoop impersonation

  • Transparent data exchange between local memory and the cluster

  • Push any RapidMiner operator or subprocess (including extensions) down to Hadoop and execute it in parallel

  • Supports Hive on Spark and Hive on Tez

  • Smart optimization of processes by grouping requests and reusing Spark containers as much as possible

  • Visualization of sampled Hadoop data within Studio

  • Easier-than-ever configuration: all Hadoop and Spark settings and variables can be imported automatically by Radoop
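
For illustration, the following is a minimal sketch of a standalone PySpark job of the kind Radoop can push to the cluster. It is not Radoop's exact script-operator contract, and the table and column names (sales, region, revenue) are assumptions for the sketch.

    # Minimal PySpark sketch; table and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("radoop-pyspark-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Read a Hive table that lives on the cluster
    sales = spark.table("sales")

    # Aggregate revenue per region, entirely inside the cluster
    by_region = sales.groupBy("region").agg(F.sum("revenue").alias("total_revenue"))

    # Persist the result back to Hive
    by_region.write.mode("overwrite").saveAsTable("sales_by_region")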

ETL capabilities

  • Read from, store to, and append to Hive tables

  • Read CSV (from HDFS, Azure Blob Storage or Data Lake, Amazon S3, or the local filesystem)

  • TEXTFILE, ORC, SEQUENCEFILE, PARQUET and RCFILE formats supported

  • Select Attributes, Sample, Filter Examples and Ranges: select a subset of the data according to various criteria and drop non-matching records and attributes (several of these transformations are sketched in PySpark after this list)

  • Generate Attributes, Generate ID, Generate Rank: define new attributes with more than a hundred functions including mathematical and string operations

  • Aggregate: calculate aggregate values like averages and counts

  • Join: combine multiple data sets based on simple or complex keys

  • Sort: order data sets according to different attributes

  • Normalize: transform numeric values to fixed ranges or variances

  • Pivot Table: summarize data and change the table representation

  • Replace: replace specific values and fix wrong data formats

  • Replace and Declare Missing Values: handle missing values in various ways

  • Remove Duplicates: remove records that were duplicated by error

  • Split Data, Multiply: branch the process or partition the data

  • Store, Materialize, Append, Union: store and combine data sets in Hive or Impala

  • Drop, Rename, Copy Table: manage Hive or Impala tables

  • Loop and Loop Attributes: organize loops with a fixed number of iterations or over the attributes

  • Hive Script and Pig Script: implement custom data transformations in HiveQL or Pig
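
To make a few of these operators concrete, here is a hedged PySpark sketch of the equivalent cluster-side logic for Filter Examples, Generate Attributes, Aggregate, Join, and Sort. Radoop executes these operators natively via Hive or Spark rather than through user code; the tables (orders, customers) and all column names below are assumptions for illustration only.

    # Illustrative PySpark equivalents of several ETL operators above;
    # table and column names are assumptions, not Radoop API.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    orders = spark.table("orders")          # hypothetical Hive tables
    customers = spark.table("customers")

    prepared = (orders
        .filter(F.col("amount") > 0)                     # Filter Examples
        .withColumn("year", F.year("order_date"))        # Generate Attributes
        .groupBy("customer_id", "year")                  # Aggregate
        .agg(F.sum("amount").alias("total_amount"))
        .join(customers, on="customer_id", how="inner")  # Join
        .orderBy(F.col("total_amount").desc()))          # Sort

    # Store the combined result as a new Hive table
    prepared.write.mode("overwrite").saveAsTable("customer_yearly_totals")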

Modeling

  • K-Means clustering (see the MLlib sketch after this list)

  • Principal Component Analysis

  • Correlation and Covariance Matrix

  • Naive Bayes

  • Logistic Regression

  • Decision Tree

  • Split Validation: evaluate model performance
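
As an illustration of the kind of distributed computation behind these modeling operators, the sketch below runs K-Means with Spark MLlib. It shows the general technique only; the input table (measurements) and its numeric columns are assumptions, not Radoop's implementation.

    # Hedged K-Means sketch with Spark MLlib; names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    df = spark.table("measurements")  # hypothetical input table

    # MLlib expects the inputs assembled into a single vector column
    assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
    features = assembler.transform(df)

    # Fit k=3 clusters in a distributed fashion and label each record
    model = KMeans(k=3, seed=1).fit(features)
    clustered = model.transform(features)  # adds a "prediction" cluster column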