With the holiday season upon us, we wanted to update you about three new features available today in RapidMiner Studio. Two are updates for our Operator Toolbox extension—one makes it easier to understand your time-series data around the holidays, and one lets you wrap up tiny little gifts for your future self.
We’ve also released a new extension that lets you access Parquet data in RapidMiner.
Enhance your models with holiday information
When doing a time-series analysis, you’ll quickly discover that not every day is the same. For example, weekend patterns tend to be different than weekday patterns. As you continue to explore the differences between different kinds of days, you’ll also discover that holidays are an important factor in modeling time-series data. Demand on Christmas is obviously going to be different than demand on a random weekday in July.
Previously, you needed to use external tables in order to get this information into your models. But with our latest update to RapidMiner Studio, that’s no longer the case. Instead, you can use the new operator Get Holidays to easily load a list of all holidays in a given year for a given country. The operator also enables you to identify state holidays where they differ from national ones.
Unboxing the details
Let’s take a quick look at a use case for this Get Holidays operator. A while ago, I blogged about gas price forecasting using data that the German government releases on gas station prices. RapidMiner Studio has a time-series data set with one and a half years of data on the gas station right next to our Dortmund office in Germany; if you’d like to follow along with the example below, you can find it at //Samples/Time Series/data sets/Prices of Gas Station.
First, we need to do a little bit of prep. We’ll use a Windowing operator so that the last 24 hours of prices describe each data point, and set the horizon so that the label is the price exactly one day ahead. We then generate a new attribute containing tomorrow’s date and convert it into a nominal (we’ll explain why below).
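RapidMiner does this visually, but the same windowing logic can be sketched in a few lines of pandas. Everything here is illustrative: the random-walk series stands in for the gas-price sample data, and the column names (`price-0` … `price-23`, `label`, `TargetDate`) are our own choices, not the operator’s output schema.

```python
import numpy as np
import pandas as pd

np.random.seed(0)

# Hypothetical hourly price series standing in for the gas-station sample data.
idx = pd.date_range("2018-01-01", periods=200, freq="h")
prices = pd.Series(1.40 + np.random.randn(200).cumsum() * 0.001, index=idx)

window, horizon = 24, 24

# Windowing: the last 24 hourly prices describe each example.
frame = pd.DataFrame({f"price-{i}": prices.shift(i) for i in range(window)})

# The label is the price exactly one day ahead.
frame["label"] = prices.shift(-horizon)

# Tomorrow's date as a nominal (string) attribute -- useful for joins later.
frame["TargetDate"] = (frame.index + pd.Timedelta(hours=horizon)).strftime("%Y-%m-%d")

# Drop the rows at the edges that lack a full window or a label.
frame = frame.dropna()
```

Each resulting row holds the last 24 hourly prices, the price one day ahead as the label, and tomorrow’s date as a string attribute.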
Then we use the new Get Holidays operator to load all holidays in Germany between 2017 and 2019. We filter the result to keep national holidays as well as holidays whose State attribute contains NW, the abbreviation for North-Rhine Westphalia, the German state where Dortmund is located. Finally, we join the two data sets.
We don’t want to learn from the State or TargetDate attributes in this case, which is why we’ve assigned them a special role that excludes them from modeling.
Next, we need to ask ourselves how we want to use the holidays in our analysis and modeling. In this case, I decided to use an indicator “Holiday” for, well, holidays, and “NoHoliday” elsewhere. The reason I’m using simply “Holiday” instead of specific holiday names is that, depending on how much data you have, specific holidays are likely to be too infrequent to extract real insights—for example, models won’t be able to learn the difference between gas prices on Labour Day in May versus German Unity Day in October.
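As a rough pandas equivalent of the filter–join–indicator steps above, here is a sketch using a small hand-made holiday table in place of the Get Holidays output (the dates, names, and State values are illustrative, not a complete holiday list):

```python
import pandas as pd

# Hand-made stand-in for the Get Holidays output (illustrative, not complete).
holidays = pd.DataFrame({
    "date": pd.to_datetime(["2018-05-01", "2018-10-03", "2018-05-31"]),
    "name": ["Labour Day", "German Unity Day", "Corpus Christi"],
    "State": ["national", "national", "NW"],
})

# Keep national holidays plus those for North-Rhine Westphalia (NW).
holidays = holidays[holidays["State"].isin(["national", "NW"])]

# Daily data we want to enrich.
days = pd.DataFrame({"date": pd.date_range("2018-05-01", "2018-05-03")})

# Left join, then collapse the match flag into a Holiday/NoHoliday indicator.
enriched = days.merge(holidays[["date"]], on="date", how="left", indicator=True)
enriched["holiday"] = enriched["_merge"].astype(str).map(
    {"both": "Holiday", "left_only": "NoHoliday"}
)
enriched = enriched.drop(columns="_merge")
```

Note that we drop the holiday names entirely and keep only the generic Holiday/NoHoliday flag, for the infrequency reason described above.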
The result of applying these new categorical labels is a data set with an additional column indicating whether or not the day in question was a holiday. This can then be used in Auto Model or your favorite ML pipeline. The full process is available as a tutorial process in the help panel of the operator.
Use caution with joins in time-series data
While we’re talking about time-series data, I’d like to take the opportunity to highlight how joins on date-time attributes work. Dates are stored internally as milliseconds since January 1, 1970 UTC (the Unix epoch). By default, the Get Holidays operator sets each holiday’s time to midnight in your system’s time zone. This can have two confusing effects.
First, if I’m in Germany and want to query the holidays for the US, I might assume that midnight on July 4th will get me US Independence Day. But because I’m in the Central European time zone, that timestamp actually corresponds to 6 PM on July 3rd in the US Eastern time zone, where the data may have been collected. Since this can be misleading, you can change the time zone in the operator.
The second issue can come up when you’re joining data. Keep in mind that joins are also performed on the underlying Unix timestamp. If I, with my German settings, join my data set with data retrieved in Eastern time, rows that appear to share a date may not match at all. I recommend using one time zone per analysis and, when joining, potentially converting dates to nominals first.
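A minimal Python sketch of the pitfall, using the standard library’s `zoneinfo`: midnight on the same calendar date is a different epoch timestamp in each zone, so a timestamp-based join misses, while a nominal (string) comparison matches.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Midnight on July 4th in two different time zones.
berlin = datetime(2018, 7, 4, tzinfo=ZoneInfo("Europe/Berlin"))
new_york = datetime(2018, 7, 4, tzinfo=ZoneInfo("America/New_York"))

# The epoch timestamps differ, so a timestamp-based join would not match.
same_instant = berlin.timestamp() == new_york.timestamp()
hours_apart = (new_york - berlin).total_seconds() / 3600  # CEST vs. EDT in July

# Comparing formatted date strings instead does match -- the
# "convert dates to nominals" trick.
same_day_string = berlin.strftime("%Y-%m-%d") == new_york.strftime("%Y-%m-%d")
```

Here `same_instant` is false and the two midnights are six hours apart, while `same_day_string` is true.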
Give your future self a gift by caching intermediate results
When designing your processes, you often end up with processing steps that take a long time to run. Previously, if you didn’t want everything to run every time, you needed to dump the data into your repository to speed up process execution, and whenever the preparation process changed, you had to recreate that dump.
But that’s now just a ghost of Christmas Past. The new operator Subprocess (Caching) allows for a very user-friendly and efficient method to store the results of a subprocess. How does it work?
In our example, the subprocess contains an Automatic Feature Engineering operator, which you may know from Auto Model’s feature generation. It uses an evolutionary optimization strategy to find new features, which can take quite a bit of time, often more than ten minutes. I may want to iterate, alternating between changing the Automatic Feature Engineering settings and the later postprocessing of the features.
The Subprocess (Caching) operator allows me to do this. After an initial execution, it caches its output. If you run the same process again, it simply returns the cached results rather than re-executing the subprocess. It only executes again if either the subprocess changes or different data arrives at its input. It’s like a little gift you wrap up for your future self to save them time when trying out different things.
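The invalidation idea behind such a cache can be sketched in Python: key the stored result on both the subprocess definition and its inputs, so a change to either one triggers a re-run. This is only an illustration of the logic, not how the operator is implemented internally.

```python
import hashlib
import pickle

_cache = {}

def cached_subprocess(fn):
    """Re-run fn only when its code or its inputs change (illustrative)."""
    def wrapper(*args):
        # Key on the function's bytecode plus a hash of the inputs, so a
        # change to either one invalidates the cached result.
        key = hashlib.sha256(fn.__code__.co_code + pickle.dumps(args)).hexdigest()
        if key not in _cache:
            _cache[key] = fn(*args)
        return _cache[key]
    return wrapper

calls = []

@cached_subprocess
def expensive_feature_engineering(data):
    calls.append(1)  # count how often the body actually runs
    return [x * 2 for x in data]

first = expensive_feature_engineering([1, 2, 3])   # executes
second = expensive_feature_engineering([1, 2, 3])  # served from cache
third = expensive_feature_engineering([4, 5, 6])   # new input, executes again
```

After these three calls the body has only run twice: the repeated input was answered from the cache.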
Accessing live sensor data through Parquet
As if those two features weren’t enough, we’ve also released a new Parquet extension along with this update. The extension contains an operator that allows you to read files in the popular Parquet format. Parquet is a columnar format often used for large amounts of data; for example, you’ll encounter Parquet files in Hive environments or when storing live sensor data from Internet of Things (IoT) devices in the cloud.
In the following example, we’ll look at sensor data that is pushed live from a manufacturing machine to an Amazon S3 bucket as a Parquet file. Using RapidMiner’s connectors, we can quickly access the S3 bucket and forward the file to the new Read Parquet operator.
Let’s have a look at the resulting data set after fetching it from S3 (after the breakpoint) and reading it with the Read Parquet operator.
As you can see, it’s a sparse data set with many missing values. When storing sensor data, it’s often the case that only changes are logged, so the resulting data set has missing values everywhere else. Using the Replace Missing Value (Series) operator, we can quickly fill in each missing value with the last non-missing value, which represents the previous machine state.
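This "carry the last observation forward" strategy is the same idea as a forward fill in pandas, sketched here with a tiny hand-made sensor table (the columns and values are illustrative):

```python
import pandas as pd

# Sparse sensor log: only changes are recorded, everything else is missing.
readings = pd.DataFrame({
    "temperature": [21.5, None, None, 22.0, None],
    "state": ["idle", None, "running", None, None],
})

# Forward fill: each missing cell takes the last observed value,
# i.e. the machine state at that moment.
filled = readings.ffill()
```

Every gap is now filled with the most recent reading, turning the sparse change log into a dense state table.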
And we’re good to go. This data is now in a nice format and can be used for a machine learning task.
Keep in mind that we have many different operators that allow you to easily access files from other cloud storage systems like Azure Data Lake, Google Cloud Storage, and more.
In addition to these big gifts up above, we also have a few small additions and bug fixes for you with this update.
- Extract Macro (Format) can now also extract the number of items in a collection.
- Random Forest and GBT in the SMILE extension can now do both regression and classification.
- Compare Distribution in the SMILE extension now supports Kullback-Leibler divergence and Jensen-Shannon divergence.
- Optimize Threshold and Optimize Threshold (Subprocess) now log their results.
- Several bugfixes in Compare Distribution.
- Fixed a bug in Smote Upsampling which treated dates as nominal attributes and caused crashes. Dates are now treated like any other numerical attribute.
- Fixed a bug where Sample (Collection) would mostly take items from the beginning of a collection in Bootstrap mode.
- Removed the check in Sample (Collection) that required the sample size to be smaller than the collection size when bootstrapping. This allows you to oversample.