RapidMiner Radoop extends RapidMiner's familiar in-memory functionality with sophisticated operators implemented for in-Hadoop execution. Radoop includes more than 60 operators for data transformation as well as advanced and predictive modeling, all of which run on a Hadoop cluster in a distributed fashion.

  • Visual programming environment that is easy to maintain and develop in

  • Integration of SparkR scripts running in your own environment within visual processes

  • Integration of PySpark scripts running in your own environment within visual processes

  • Automatic execution of analytic workflows in Hadoop (run the process where the data is)

  • Purely functional operators for data access, data preparation, and modeling; the underlying technology stays transparent

  • Supports Cloudera, Hortonworks, Amazon EMR, Apache Hadoop, and Microsoft Azure HDInsight (other Hadoop distributions can be integrated by specifying the proper libraries and dependencies)

  • Supports Kerberos authentication

  • Supports data access authorization via Apache Sentry and Apache Ranger

  • Supports HDFS encryption to integrate seamlessly with data security policies

  • Supports Hadoop impersonation

  • Transparent data exchange between local memory and the cluster

  • Push any RapidMiner operator or subprocess (including extensions) down to Hadoop and execute it in parallel

  • Supports Hive on Spark and Hive on Tez

  • Smart process optimization that groups requests and reuses Spark containers as much as possible

  • Visualization of sampled Hadoop data within Studio
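The push-down model above — running an arbitrary operator over every partition of the data in parallel and unioning the results — can be sketched in plain Python. This is a conceptual stand-in with hypothetical names (`push_down`, `double`); Radoop schedules the real equivalent as distributed tasks inside Hadoop, not in a local thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def push_down(operator, partitions):
    """Apply an operator to every partition in parallel, then union the
    per-partition results. Conceptual sketch only: Radoop runs the real
    equivalent as distributed tasks on the Hadoop cluster."""
    with ThreadPoolExecutor() as pool:
        # map() preserves partition order, so the union is deterministic
        results = list(pool.map(operator, partitions))
    return [row for part in results for row in part]

# An "operator" here is just a function over one partition of rows.
def double(partition):
    return [value * 2 for value in partition]

combined = push_down(double, [[1, 2, 3], [4, 5], [6]])
```

Because `map` preserves input order, the unioned output matches what a single sequential pass over the data would produce.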

ETL capabilities

  • Read, store, and append from and to Hive tables

  • Read CSV files (from HDFS, Azure Blob or Data Lake storage, Amazon S3, or the local filesystem)

  • TEXTFILE, ORC, SEQUENCEFILE, PARQUET, and RCFILE formats supported

  • Select Attributes, Sample, Filter Examples and Ranges: select a subset of the data according to various criteria and drop non-matching records and attributes

  • Generate Attributes, Generate ID, Generate Rank: define new attributes with more than a hundred functions, including mathematical and string operations

  • Aggregate: calculate aggregate values such as averages and counts

  • Join: combine multiple data sets based on simple or complex keys

  • Sort: order data sets according to different attributes

  • Normalize: transform numeric values to fixed ranges or variances

  • Pivot Table: summarize data and change the table representation

  • Replace: replace specific values and fix wrong data formats

  • Replace and Declare Missing Values: handle missing values in various ways

  • Remove Duplicates: remove records that are accidental duplicates

  • Split Data, Multiply: branch the process or partition the data

  • Store, Materialize, Append, Union: store and combine data results in Hive or Impala

  • Drop, Rename, Copy Table: manage Hive or Impala tables

  • Loop and Loop Attributes: organize loops over a fixed number of iterations or over the attributes

  • Hive Script and Pig Script: implement custom data transformations in HiveQL or Pig
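Several of the operators above have simple relational semantics. The following pure-Python sketch (hypothetical helper names, rows modeled as dictionaries) shows what Aggregate, Join, and Remove Duplicates compute; Radoop itself translates the real operators into distributed Hive or Spark jobs rather than looping locally:

```python
from collections import defaultdict

def aggregate(rows, group_by, value, fn):
    """Aggregate: compute fn over `value` per group (cf. the Aggregate operator)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[value])
    return {key: fn(vals) for key, vals in groups.items()}

def join(left, right, key):
    """Join: inner join on a simple key (cf. the Join operator)."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

def remove_duplicates(rows, key):
    """Remove Duplicates: keep only the first row seen per key value."""
    seen, out = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row)
    return out
```

For example, `aggregate(sales, "dept", "amount", sum)` returns one total per department, exactly the shape a `GROUP BY` in HiveQL would produce.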

Modeling

  • K-Means clustering

  • Principal Component Analysis

  • Correlation and Covariance Matrix

  • Naive Bayes

  • Logistic Regression

  • Decision Tree

  • Split Validation: evaluate model performance
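K-Means, for instance, is the standard Lloyd's algorithm. A minimal local sketch follows (a hypothetical helper, not Radoop's API; Radoop executes the distributed equivalent on the Hadoop cluster):

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm for K-Means on tuples of floats (local sketch)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as initial centroids
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids
```

On two well-separated groups of points, the centroids converge to the group means after a few iterations.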