Cherry blossoms in Japan

15 September 2022


How KSK Analytics Used Machine Learning to Predict Cherry Blossom Season in Japan

You’ve probably heard of cherry blossom season in Japan, but you may not know how much the season affects the Japanese economy: businesses from grocery stores to railroad operators see an influx of customers.

Accurately forecasting when peak cherry blossom season will occur lets suppliers prepare for the increased demand. Kanako Shibata, a data analyst at KSK Analytics, Inc., used RapidMiner to develop a machine learning model that predicts the cherry blossom trees’ blooming date with an error of roughly three days.

In this post, we’ll give you an overview of how they built the model, its resulting impact, and future use cases. 

Use Case Highlight: Predicting Sakura Blooming with RapidMiner 

Please note that this post is based on a Solution submitted to RapidMiner Solution Goldmine. To see the full documentation, click here

Data Sources 

Creating a viable model always starts with collecting the right data. In this case, KSK Analytics gathered weather data from the Japan Meteorological Agency and used relevant attributes such as the number of hours of sunshine, elevation, the number of days from March 1st until the peak bloom date, and latitude and longitude for 48 major cities across Japan.
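To make the "days from March 1st" attribute concrete, here is a minimal Python sketch of how such a value could be derived from a calendar date. The original solution was built in RapidMiner, so the function name and the example date are illustrative assumptions, not part of KSK's process.

```python
from datetime import date


def days_from_march_1(bloom_date: date) -> int:
    """Number of days between March 1 of the same year and the bloom date."""
    return (bloom_date - date(bloom_date.year, 3, 1)).days


# Hypothetical date, for illustration only (not an observed bloom date).
example = date(2022, 3, 30)
print(days_from_march_1(example))  # 29
```

Expressing the bloom date as a day count turns the problem into ordinary regression on a numeric value.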

Alongside the obtained data, they also aggregated the daily temperature data to determine the mean, maximum, minimum, and standard deviation of daily temperatures on a monthly basis. 
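A monthly aggregation like this can be sketched in a few lines of Python. This is only an illustration under assumptions: the readings below are made up, the function name is hypothetical, and the sample standard deviation is our choice, since the post does not say which variant was used.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, stdev


def monthly_temperature_stats(daily):
    """Group (date, temperature) pairs by (year, month) and summarize each group."""
    by_month = defaultdict(list)
    for day, temp in daily:
        by_month[(day.year, day.month)].append(temp)

    stats = {}
    for key, temps in sorted(by_month.items()):
        stats[key] = {
            "mean": mean(temps),
            "max": max(temps),
            "min": min(temps),
            # Sample standard deviation needs at least two readings.
            "std": stdev(temps) if len(temps) > 1 else 0.0,
        }
    return stats


# Made-up readings in degrees Celsius, for illustration only.
readings = [
    (date(2022, 2, 1), 4.0),
    (date(2022, 2, 2), 6.0),
    (date(2022, 3, 1), 9.0),
    (date(2022, 3, 2), 11.0),
    (date(2022, 3, 3), 13.0),
]

for month, summary in monthly_temperature_stats(readings).items():
    print(month, summary)
```

Each month then contributes four numeric attributes per city, which is how a long series of daily readings becomes a compact row of model inputs.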

Finally, they joined the two datasets and removed unnecessary attributes before training the model.

Methodology & Modeling 

With the dataset prepared, the next step was to build a model that could predict the cherry blossoms’ peak bloom date with a high degree of accuracy.

To do so, KSK Analytics built regression models from the constructed dataset (using GLM, linear regression, k-NN, SVM, decision tree, random forest, neural net, and deep learning) and compared their results under 10-fold cross-validation.
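For readers new to the idea, the sketch below shows how a 10-fold cross-validation comparison works in principle. It is not the RapidMiner process KSK built: the data is synthetic, and the two candidate models (a mean baseline and a one-attribute linear regression) are simple stand-ins we chose for the eight learners listed above.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_rmse(fit, xs, ys, k=10):
    """Average RMSE over k folds; fit(train_x, train_y) returns a predict function."""
    fold_scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = fit(train_x, train_y)
        sq_err = [(predict(xs[i]) - ys[i]) ** 2 for i in fold]
        fold_scores.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(fold_scores) / len(fold_scores)


def fit_mean(train_x, train_y):
    """Baseline: always predict the mean of the training targets."""
    m = sum(train_y) / len(train_y)
    return lambda x: m


def fit_line(train_x, train_y):
    """Ordinary least squares with a single input attribute."""
    mx = sum(train_x) / len(train_x)
    my = sum(train_y) / len(train_y)
    sxx = sum((x - mx) ** 2 for x in train_x)
    slope = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y)) / sxx
    return lambda x: my + slope * (x - mx)


# Synthetic data, for illustration only: warmer values map to earlier bloom days.
rng = random.Random(1)
xs = [rng.uniform(5.0, 15.0) for _ in range(50)]
ys = [40.0 - 1.5 * x + rng.gauss(0.0, 1.0) for x in xs]

candidates = [("mean baseline", fit_mean), ("linear regression", fit_line)]
scores = {name: cross_val_rmse(fit, xs, ys) for name, fit in candidates}
best = min(scores, key=scores.get)
print(scores, "-> best:", best)
```

Because every candidate is scored on data it did not see during fitting, the comparison favors models that generalize rather than ones that merely memorize the training rows.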

The models were trained and validated on data from 2018-2021, then the best-performing model was evaluated on the 2022 data. Three performance metrics were used to evaluate the model’s effectiveness: root mean squared error (RMSE), the correlation coefficient, and the coefficient of determination (squared correlation, R²).
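The three metrics are straightforward to compute by hand. The Python sketch below uses the standard definitions of RMSE and the Pearson correlation coefficient; the actual and predicted values are made up for illustration and are not results from the KSK model.

```python
import math


def regression_metrics(actual, predicted):
    """Return RMSE, Pearson correlation, and squared correlation for two sequences."""
    n = len(actual)
    rmse = math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)

    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r = cov / math.sqrt(var_a * var_p)

    return {"rmse": rmse, "correlation": r, "squared_correlation": r ** 2}


# Made-up bloom days (days after March 1), for illustration only.
actual = [25, 30, 35, 40]
predicted = [27, 29, 36, 38]
print(regression_metrics(actual, predicted))
```

RMSE is expressed in the same unit as the target (days here), which makes it the easiest of the three to relate to a business tolerance such as "within two or three days".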

Before modeling, the number of attributes was reduced from 53 to 36 by first removing attributes with a weight of less than 0.1, then by removing attributes that exhibited multicollinearity, which can cause the regression equation to become numerically unstable.
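The two-step reduction can be sketched as follows. Several things here are assumptions on our part: the post does not say how attribute weights were calculated, so this example uses the absolute correlation with the target as the weight, and the 0.9 pairwise-correlation cutoff, the attribute names, and the values are all hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom else 0.0


def select_attributes(columns, target, min_weight=0.1, max_pairwise=0.9):
    """Drop low-weight attributes, then drop attributes that duplicate a stronger one."""
    # Step 1 (assumed weighting): weight = |correlation with the target|.
    weights = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    kept = [name for name in columns if weights[name] >= min_weight]

    # Step 2: visit attributes from highest to lowest weight and skip any that is
    # strongly correlated with an attribute that has already been selected.
    selected = []
    for name in sorted(kept, key=weights.get, reverse=True):
        if all(abs(pearson(columns[name], columns[s])) <= max_pairwise for s in selected):
            selected.append(name)
    return selected


# Hypothetical attributes and values, for illustration only.
target = [1, 2, 3, 4, 5, 6]
mean_temp = [2.0, 4.1, 5.9, 8.2, 9.9, 12.1]
columns = {
    "mean_temp": mean_temp,
    "max_temp": [t + 5.0 for t in mean_temp],  # moves in lockstep with mean_temp
    "sunshine_hours": [3, 1, 4, 2, 6, 5],
    "noise": [1, -1, 0, 0, -1, 1],             # uncorrelated with the target
}
print(select_attributes(columns, target))
```

In this toy data, the uncorrelated attribute falls below the weight threshold, and only one of the two lockstep temperature attributes survives the second step.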

The project’s goal was to create a model that not only fit the actual 2022 values, but could also correctly predict future values and would be sustainable for the future. 

Model process displaying attribute selection 

Results & Impact 

After rigorous testing, the random forest model produced the most accurate results, with an RMSE of 3.058 days, very close to the initial target error margin of 2-3 days. However, as the figure below shows, predictions are more accurate for some cities than for others. For the less accurate locations, it might be useful to consider new attributes that correlate more strongly with sakura blooming.

Absolute prediction error for each city in 2022 

Despite its limitations, the resulting model can make its prediction at least three weeks before the blooms occur. This allows businesses to forecast demand more accurately and make smarter decisions, such as when retailers should stock up on cherry blossom-related snacks and when train operators should alter their schedules to accommodate crowds.

To Wrap Up 

For future cherry blossom seasons, KSK’s model can offer businesses a distinct advantage: better-informed sales planning and inventory management can increase profitability and efficiency for organizations across the board.

To get an in-depth look at the specific attributes and processes used, please refer to the full solution documentation here

Our Solution Goldmine offers a great opportunity to get your process peer-reviewed and published so you can share your work with the broader data science community. If you have your own RapidMiner-powered use case, we’d love to see how you built it! 

This blog post is guest written by Kanako Shibata of KSK Analytics, Inc., a Japan-based business consulting firm specializing in AI & BI strategic consulting, training, and solutions. Kanako is a strong advocate of learning and upskilling in the delivery of machine learning programs, and develops and delivers education programs.
