RapidMiner Radoop
Feature List
RapidMiner Radoop extends RapidMiner's core in-memory functionality with sophisticated operators implemented for in-Hadoop execution. Radoop includes more than 60 operators for data transformation as well as advanced and predictive modeling, all of which run on a Hadoop cluster in a distributed fashion.
Easy-to-maintain, easy-to-develop visual programming environment
Integration of SparkR scripts, running in your own environment, within the visual processes
Integration of PySpark scripts, running in your own environment, within the visual processes (see the PySpark sketch after this list)
Automatic execution of analytic workflows in Hadoop (run the process where the data is)
Purely functional operators for data access, data preparation, and modeling make the underlying technology transparent
Supports Cloudera, Hortonworks, Amazon EMR, Apache Hadoop, and Microsoft Azure HDInsight (other Hadoop distributions may be integrated by specifying the proper libraries and dependencies)
Supports Kerberos authentication
Supports data access authorization via Apache Sentry and Apache Ranger
Supports HDFS encryption to seamlessly integrate with data security policies
Supports Hadoop impersonation
Transparent data exchange between local memory and cluster
Push any RapidMiner operator or subprocess (including extensions) down to Hadoop and execute it in parallel
Supports Hive on Spark and Hive on Tez
Smart optimization of processes by grouping requests and reusing Spark containers as much as possible
Visualization of sampled Hadoop data within Studio
Easier-than-ever configuration: all Hadoop and Spark settings and variables can be imported automatically by Radoop
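As a rough illustration of the PySpark integration above, the following sketch shows the kind of script such an operator could push to the cluster. The entry point name rm_main, its signature, and the column names are illustrative assumptions, not Radoop's actual scripting contract.

    # Minimal PySpark sketch: a transformation a script operator could
    # run in-cluster. rm_main and its signature are assumed names.
    from pyspark.sql import functions as F

    def rm_main(spark, input_df):
        # Keep completed orders only and derive a per-row net amount.
        cleaned = (input_df
                   .filter(F.col("status") == "completed")
                   .withColumn("net_amount",
                               F.col("amount") - F.col("discount")))
        # The returned DataFrame would appear on the operator's output port.
        return cleaned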
ETL Capabilities
Read from, store to, and append to Hive tables
Read CSV (from HDFS, Azure Blob Storage or Data Lake, Amazon S3, or the local filesystem)
TEXTFILE, ORC, SEQUENCEFILE, PARQUET and RCFILE formats supported
Select Attributes, Sample, Filter Examples and Ranges: select a subset of the data according to various criteria and drop non-matching records and attributes
Generate Attributes, Generate ID, Generate Rank: define new attributes with more than a hundred functions including mathematical and string operations
Aggregate: calculate aggregate values like averages and counts
Join: combine multiple data sets based on simple or complex keys
Sort: order data sets according to different attributes
Normalize: transform numeric values to fixed ranges or variances
Pivot Table: summarize data and change table representation
Replace: replace specific values and fix wrong data formats
Replace and Declare Missing Values: handle missing values in various ways
Remove Duplicates: remove records that were duplicated by error
Split Data, Multiply: branch the process or partition the data
Store, Materialize, Append, Union: store and combine data results in Hive or Impala
Drop, Rename, Copy Table: manage Hive or Impala tables
Loop and Loop Attributes: organize loops over a fixed number of iterations or over the attributes
Hive Script and Pig Script: implement custom data transformations in HiveQL or Pig (see the sketch after this list)
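To give a concrete feel for the Hive Script operator above, the sketch below submits a HiveQL statement through a Spark session; it mirrors the Join and Aggregate operators in this list. The table and column names (sales, customers, region, amount) are hypothetical.

    # Sketch: the kind of HiveQL a Hive Script operator might submit.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Join two Hive tables on a simple key, then aggregate per region.
    result = spark.sql("""
        SELECT c.region,
               COUNT(*)      AS order_count,
               AVG(s.amount) AS avg_amount
        FROM sales s
        JOIN customers c ON s.customer_id = c.customer_id
        GROUP BY c.region
    """)
    result.write.mode("overwrite").saveAsTable("region_summary")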
Modeling
K-Means clustering (see the sketch after this list)
Principal Component Analysis
Correlation and Covariance Matrix
Naive Bayes
Logistic Regression
Decision Tree
Split Validation: evaluate model performance
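Radoop exposes these algorithms as visual operators; for a rough idea of the distributed computation behind the K-Means operator, here is a minimal Spark MLlib sketch. The table name sensor_readings, the feature columns x1 and x2, and k = 3 are assumptions for illustration.

    # Minimal sketch of distributed K-Means with Spark MLlib.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("sensor_readings")  # hypothetical Hive table

    # MLlib expects the features packed into a single vector column.
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    model = KMeans(k=3, seed=1).fit(assembler.transform(df))
    print(model.clusterCenters())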