This blog post was co-written by RapidMiner founder and CTO Ingo Mierswa and Ivan Tenger, an engineer here at RapidMiner.
Have you ever bought or sold a house? If you have, you know how stressful it can be. Buying a home is the single largest investment that most people ever make. And although many of us spend tons of time researching product reviews online or chasing down the lowest price when we buy a new TV, the stress of making a house purchase sometimes leads to a less informed decision. We see a house a couple of times, pay somebody for some superficial assessments, and then make an above-market offer anyway because we can see our families in it.
One of RapidMiner’s engineers is in the market for a new house, and we’ve been talking internally about the factors that influence the price of a home. And as a data science software company, we couldn’t help but look for a data set that we could feed into our machine learning models to help us to understand what factors have the biggest impact on home purchase prices here in Boston.
We also wanted to explore some new functionality of RapidMiner Go that lets you build a model and then embed a shareable simulator to easily communicate results with stakeholders. If you’d like to skip ahead, you can jump to the bottom of this post and play around with the simulator a bit. Otherwise, read along about how we built it using Go!
Understanding Home Prices
Not too long ago, we suffered a financial crisis which was largely driven by inflated real estate prices. But only a couple of years later, it seems that prices are hitting new highs again. Some claim that there’s another bubble coming, but that’s always true—we just don’t know when the next bubble will burst.
And while we haven’t tried to predict when the current market will pop, we did want to understand what’s driving current housing prices. Is it the quality of local schools? The clean air? The distance to employment options?
Well, we will find out next. And you can play around with the results yourself, thanks to our new shareable model simulator, which is part of RapidMiner Go.
Step 1: The Data
Step one in answering our question is finding a good data set. For our purposes, we didn’t spend a lot of time on this search, so we just picked the first one we found. It is called the Boston Housing data set, and although it’s from the 1970s, we still figured it would be able to help us understand some of the factors that influence housing costs.
The data consists of 506 rows describing different neighborhoods of Boston and some of its suburbs. The original data set has 14 different columns of data, but we excluded the one about race from our analysis, because there has been a lot of discussion about whether the way race was calculated in this data set is accurate and equitable.
The remaining 13 columns describing Boston neighborhoods are:
- Crime rate per capita: the average crime rate per capita
- Residential proportion: the proportion of residential buildings
- Business proportion: the proportion of business buildings
- Charles River: if the neighborhood is close to the Charles River
- Nitrogen oxides concentration: this is, well, the concentration of nitrogen oxides
- Rooms per dwelling: the average number of rooms per dwelling
- Pre-1940 proportion: the proportion of buildings built before 1940
- Distance to employment centers: the average distance to the five main employment regions in Boston
- Highway accessibility: how accessible highways are from a given neighborhood
- Property tax rate: the property tax rate
- Pupil-teacher ratio: how many pupils are taught by a single teacher on average
- Lower income proportion: the proportion of lower income households
- Price median: the median selling prices for houses in the neighborhood
The last one, the median price for houses sold in a given neighborhood, is what we are going to predict with RapidMiner Go. Because the data are from the 1970s, the median prices are mostly below US$50,000 (the numbers in the data are in thousands of dollars). Too bad those aren’t the current prices…
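If you want to follow along outside of Go, here is a minimal sketch of the preparation described above. It assumes your copy of the data uses the raw column names of the commonly distributed Boston Housing file (CRIM, ZN, and so on); the `prepare_row` and `price_in_dollars` helpers are hypothetical names of our own, and the single sample row is only there to show the file's format.

```python
# Raw column names (an assumption about your copy of the file),
# mapped to the readable names used in this post.
COLUMN_NAMES = {
    "CRIM": "Crime rate per capita",
    "ZN": "Residential proportion",
    "INDUS": "Business proportion",
    "CHAS": "Charles River",
    "NOX": "Nitrogen oxides concentration",
    "RM": "Rooms per dwelling",
    "AGE": "Pre-1940 proportion",
    "DIS": "Distance to employment centers",
    "RAD": "Highway accessibility",
    "TAX": "Property tax rate",
    "PTRATIO": "Pupil-teacher ratio",
    "LSTAT": "Lower income proportion",
    "MEDV": "Price median",
}

# The race-based column that we left out of the analysis.
EXCLUDED = {"B"}


def prepare_row(raw_row):
    """Drop the excluded column, rename the rest, and convert values to floats."""
    return {
        COLUMN_NAMES[name]: float(value)
        for name, value in raw_row.items()
        if name not in EXCLUDED
    }


def price_in_dollars(row):
    """The data stores the median price in thousands of US dollars."""
    return row["Price median"] * 1000


# One sample row in the file's format, for illustration only.
sample = {
    "CRIM": "0.00632", "ZN": "18.0", "INDUS": "2.31", "CHAS": "0",
    "NOX": "0.538", "RM": "6.575", "AGE": "65.2", "DIS": "4.0900",
    "RAD": "1", "TAX": "296", "PTRATIO": "15.3", "B": "396.90",
    "LSTAT": "4.98", "MEDV": "24.0",
}

row = prepare_row(sample)
print(len(row), "columns kept")
print("Median price: US$%.0f" % price_in_dollars(row))
```

The same helper could be applied to every row read with `csv.DictReader`, which yields one dictionary of strings per line of the file.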
It is also worth noting that the nitrogen oxides concentration was the original motivation of the researchers. They wanted to analyze the impact of clean air on house prices. Well, let’s see how important this really is.
Step 2: Build models
To quickly build a model, we can load this data into RapidMiner Go. After selecting the Price Median column as target, we could see the distribution of prices:
Most houses were sold in the US$20,000 range in the 70s. Ok, sounds fine. Let’s inspect the columns to see what else we have happening:
Go is actually suggesting that we remove the column for the Charles River adjacency. It is pretty stable and will probably not help much. Although Go is right with that assessment, we decided to keep it in, just to see for ourselves what it does to the final model.
Go also points out that both the Rooms per Dwelling column and the Lower Income Proportion column have a high correlation with the Price Median column.
Both of these make intuitive sense: more rooms means bigger houses, which of course means higher prices. Likewise, you should expect people with lower incomes to own homes that are less expensive.
You may decide to remove the Lower Income Proportion column because it is unclear whether lower incomes are the reason for lower house prices or the other way around. If this were a model for a real project, we would probably be glad about the warning from Go and remove the column, but because we want to be able to explore more in our simulator, we left everything in and clicked through!
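To make the idea of a "high correlation" with the target a bit more concrete, here is a small sketch of the Pearson correlation coefficient. We do not know exactly how Go computes its warning, so treat this as an illustration of the concept rather than Go's method. The numbers are made-up toy values, not rows from the Boston data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up toy values, NOT taken from the Boston Housing data set.
rooms = [4.5, 5.0, 5.8, 6.2, 6.9, 7.5]            # rooms per dwelling
lower_income = [30.0, 24.0, 17.0, 12.0, 8.0, 4.0]  # lower income proportion (%)
price = [12.0, 15.0, 20.0, 23.0, 31.0, 40.0]       # median price (thousand US$)

print("rooms vs. price:       ", round(pearson(rooms, price), 2))
print("lower income vs. price:", round(pearson(lower_income, price), 2))
```

A value close to +1 or -1 means the column moves almost in lockstep with the price. That is useful for prediction, but as discussed above, it says nothing about which one causes the other.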
Let’s check out the results next.
Step 3: Model evaluation
After a few minutes, Go is done with its magic and we’re presented with a couple of models to look at:
Random forest, gradient-boosted trees, deep learning, and support vector machines all performed very well. The decision tree did not fully grasp the patterns in the data. And as we will soon see, there are some non-linear relationships in the data, which account for why the linear model performed a bit worse.
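Why would non-linear relationships hurt a linear model? The sketch below illustrates the effect on a synthetic U-shaped relationship that we invented for this purpose. It compares a straight-line fit with a simple piecewise-constant model, which is only a crude stand-in for tree-based learners and not what Go actually trains.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for a single input; returns a predict function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_bins(xs, ys, width, n_bins):
    """Piecewise-constant model: predict the mean target value of each bin."""
    def bin_of(x):
        return min(int(x // width), n_bins - 1)

    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(bin_of(x), []).append(y)
    means = {b: sum(values) / len(values) for b, values in groups.items()}
    return lambda x: means[bin_of(x)]


def rmse(model, xs, ys):
    """Root mean squared error of a model on the given points."""
    return sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Synthetic U-shaped relationship, invented for illustration.
xs = [i * 0.5 for i in range(21)]   # 0.0, 0.5, ..., 10.0
ys = [(x - 5.0) ** 2 for x in xs]

linear = fit_line(xs, ys)
binned = fit_bins(xs, ys, width=2.0, n_bins=5)

print("linear RMSE:", round(rmse(linear, xs, ys), 2))
print("binned RMSE:", round(rmse(binned, xs, ys), 2))
```

Both errors are measured on the same points the models were fitted to, so this is not a proper evaluation. It only shows that a single straight line cannot follow a curve that goes down and then up again.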
Anyway, we actually do not care that much about the models themselves but want to understand what is driving real estate prices up or down. We simply selected the best model of the bunch (gradient-boosted trees) and inspected the weights for the different influence factors:
As we expected, the number of rooms ended up having the greatest influence on house prices. It is not much of a surprise that larger houses are more expensive.
But the second and third most important factors were a surprise: highway accessibility, followed by the nitrogen oxides concentration. It would thus appear that clean air is an important factor influencing house prices, although much less important than the first two factors. It ends up almost on the same level as access to shops or good schools, but those two are rather obvious and shouldn’t be a surprise for anybody living in a city.
While this is a great overview that gives us a good idea of what is generally important, it does not tell us what drives the price in specific situations. It also hides any non-linear effects. And finally, if you would like to know how much your house would have been worth in the 70s, you cannot see that from those weights either.
Those are exactly the reasons why we invented and built the model simulator. It allows you to play around with the different influence factors and see the reaction of the model in real time. And you can now even share and embed the simulator outside of Go, too, just as we’ve done below!
On the left side you can see all the inputs to our model. And on the right side you can see the prediction of the house price for the settings on the left. Below the prediction you can see what drives the value up and down in that specific situation.
Try, for example, moving the slider for Rooms per Dwelling to the right. After you stop dragging, you’ll get a new house price that’s higher than before. If you move the number back down again, the price will adapt and drop again. Nice.
Move the slider for the rooms to the far left. This means we will focus on neighborhoods with smaller houses. If you now move the slider for highway accessibility to the far left and right, you can see that better accessibility does indeed drive higher prices, but not by a lot. This effect is even smaller for the largest houses, which you can easily check yourself. On the other hand, you will notice that very high crime rates have a much bigger negative impact than we expected simply based on the global model weights above.
Finally, let’s inspect the clean air factor. Move the slider for the nitrogen oxides concentration to the far left and see that the prices for houses in areas with clean air are indeed a bit higher. In most situations, you may even notice a non-linear effect here: both the lowest and the highest concentrations lead to higher prices. This is easily explained, since the highest concentrations of nitrogen oxides are observed in the center of Boston, where prices are generally higher.
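If you are curious what moving a slider amounts to, here is a minimal what-if sketch: hold all inputs at a baseline, vary one of them, and ask the model for a new prediction each time. The `toy_price` function is a hypothetical stand-in with invented coefficients, not our trained model, and `sweep` is our own helper name rather than anything from Go.

```python
def toy_price(inputs):
    """Hypothetical stand-in for a trained model; all coefficients are invented.

    Returns a median price in thousands of US dollars.
    """
    rooms = inputs["Rooms per dwelling"]
    nox = inputs["Nitrogen oxides concentration"]
    crime = inputs["Crime rate per capita"]
    # The squared NOx term is U-shaped: the lowest and the highest
    # concentrations both push the price up, the middle values do not.
    return 5.0 + 3.0 * rooms + 40.0 * (nox - 0.6) ** 2 - 0.2 * crime


def sweep(model, baseline, name, values):
    """Vary one input while holding all others at their baseline values."""
    results = []
    for value in values:
        scenario = {**baseline, name: value}  # copy, so the baseline stays untouched
        results.append((value, model(scenario)))
    return results


# Invented baseline settings, playing the role of the slider positions.
baseline = {
    "Rooms per dwelling": 6.0,
    "Nitrogen oxides concentration": 0.6,
    "Crime rate per capita": 1.0,
}

room_sweep = sweep(toy_price, baseline, "Rooms per dwelling", [4, 5, 6, 7, 8])
nox_sweep = sweep(toy_price, baseline, "Nitrogen oxides concentration",
                  [0.4, 0.5, 0.6, 0.7, 0.8])

for value, price in room_sweep:
    print("rooms =", value, "-> price", round(price, 2))
for value, price in nox_sweep:
    print("NOx =", value, "-> price", round(price, 2))
```

With a real model you would pass its prediction function instead of `toy_price`. The point of the sweep is that effects which a single global weight cannot express, such as the U-shape in the NOx term, become visible.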
And this is why we love the simulator so much! Just based on the model itself and the weights for the model, you might have expected a much bigger or much smaller impact of these various factors. But with the simulator, you can get a better “feeling” about the importance of different factors by playing around with the different inputs.
Try to define your own house if you like, or just play around with the simulator. It’s a great tool to understand the underlying patterns in your data and can help you to build trust in your models. And when you’re done with this, sign up for Go, upload your own data, and start modeling and building embeddable simulators.