Predicting Housing Prices

Sachin Naik
Jun 25, 2021

Prediction and Analysis of Properties in King County, Washington

For most Americans, their house is their biggest financial investment. So, when it’s time to buy, a potential buyer researches location, size, number of bedrooms, number of bathrooms…and the list goes on. All of these features affect how desirable a house is to buyers and therefore affect its price. The goal of this analysis is to predict house prices from these features. To do this, I will be using a Multiple Linear Regression model along with the housing data from King County in the state of Washington, which is available here.

Data and its features: The downloaded data did not require any major cleaning, but we added extra features such as income (census.gov) and distances to various landmarks, such as parks and major attractions (see visuals 5 and 6). The distances were calculated using pgeocode, a Python library for postal code geocoding and distance calculations. You can find more information on this here.
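
To make the distance features concrete, here is a minimal sketch of a straight-line distance calculation. It is not the notebook's code: it swaps the pgeocode lookup for a plain haversine formula so it runs without downloading postal code data, and the coordinates below are approximate values assumed for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Approximate coordinates, assumed for illustration only.
SEATTLE_CENTER = (47.6205, -122.3493)
SEATAC_AIRPORT = (47.4502, -122.3088)

dist = haversine_miles(*SEATTLE_CENTER, *SEATAC_AIRPORT)
print(f"Seattle Center to Sea-Tac airport: {dist:.1f} miles")
```

In a real pipeline, each house's coordinates (or its ZIP code centroid) would take the place of one of the two points, giving one distance column per landmark.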

There was a reasonable amount of feature scaling, preprocessing, and cleaning of the data (you can refer to my Jupyter notebook for further details). After cleaning the data, one of the first things we needed to do was remove redundant features without giving up too much accuracy in the price prediction. For this we can use a heat map for visualization and a metric such as the variance inflation factor (VIF). The goal is to remove features that are highly correlated with one another, which is especially important when you are making predictions with a lot of features. Having too many features is like having too many chefs in the kitchen: the outcome can get messy. In this project, we used 0.7 as the threshold for high correlation. This means that whenever the pairwise correlation between two features is greater than 0.7 or smaller than -0.7, one feature of the pair needs to be removed. For instance, most of the distances are collinear with population density (calc_pd), as shown below in our heat map in darker colors:

1. Heat map to visualize collinearity of the independent variables (features)
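
The 0.7 threshold check can be sketched as follows. This is an illustrative example on synthetic data, not the King County dataset: the column names other than calc_pd and all of the generated values are assumptions.

```python
import numpy as np


def high_corr_pairs(X, names, threshold=0.7):
    """Return (name_a, name_b, r) for feature pairs whose |r| exceeds threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns of X are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


# Synthetic data (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
n = 500
dist_seattle = rng.uniform(0, 30, n)
# Population density is made to fall with distance, so the two are collinear.
calc_pd = 5000 - 150 * dist_seattle + rng.normal(0, 300, n)
sqft_above = rng.normal(1800, 400, n)

X = np.column_stack([dist_seattle, calc_pd, sqft_above])
names = ["dist_seattle", "calc_pd", "sqft_above"]

pairs = high_corr_pairs(X, names)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:.2f}")
```

Each flagged pair is a candidate for dropping one of its two features. Which one to drop is a judgment call that the pairwise check alone does not settle.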

Removing these features manually can be cumbersome because any given feature can be collinear with more than one other feature. This is where we use VIF, as it takes into account more than just pairwise correlations. In the interest of keeping this post short: VIF quantifies how much the variance of an estimated regression coefficient is inflated because that feature is correlated with the other features. We removed features that had a high VIF (above about 10). You can read more about this in the Jupyter Notebook in my GitHub repo (link given below). For this model the top 3 most important features are:

  • Square footage above ground
  • Distance to the Seattle center
  • Median household income
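
The VIF screening described above can be sketched as follows. For feature j, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. Dropping one feature at a time and recomputing is an assumption made for this sketch, not necessarily the procedure used in the notebook, and the data is synthetic.

```python
import numpy as np


def vif(X):
    """VIF per column of X: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    scores = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other features
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        scores[j] = 1.0 / (1.0 - r2)
    return scores


def drop_high_vif(X, names, limit=10.0):
    """Repeatedly drop the feature with the largest VIF until all are <= limit."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        scores = vif(X)
        worst = int(np.argmax(scores))
        if scores[worst] <= limit:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return names


# Synthetic data (hypothetical): c is almost exactly a + b, so it is redundant.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)
X = np.column_stack([a, b, c])

initial_vif = vif(X)
kept = drop_high_vif(X, ["a", "b", "c"])
print("initial VIF:", np.round(initial_vif, 1))
print("kept features:", kept)
```

Recomputing after each removal matters because dropping one redundant feature usually lowers the VIF of the features it was entangled with.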

Visuals and analysis: After exploring all the features, here are some prominent characteristics of the top 3 features:

Overall, southern King County is cheaper than the northern parts, with the area around Seattle being the most expensive, as seen below.

2. Price per square foot

As expected, prices go up as we move closer to the Seattle center (shown below).

3. Average price based on distance to Seattle center

Contrary to expectation, median income was higher towards the northeast, which is among the cheaper areas in terms of price per square foot.

4. Median household income

Similarly, we looked at housing prices near popular points of interest, such as landmarks and the Metro stations, in the following two graphs. We observed that houses near landmarks in Seattle are more expensive than houses near landmarks outside of Seattle. Additionally, houses near the Metro stations get more expensive as we get closer to Seattle.

5. House prices close to popular places

Here is pricing close to the Metro:

6. House prices close to the Metro

When considering local attractions or transportation, if you plan on living within a 5 mile radius of any of the points of interest mentioned above, you should strongly consider Seattle Center, Kerry Park, or Discovery Park: homes within 5 miles of these three have the highest average prices. The Link System serves parts of King County with both higher and lower priced homes; however, homes within 20 miles of a Link Station tend to have higher values. Note that the Link System only serves the western part of King County, so if public transportation is important, it will limit the buyer to the western part of the county. If access to an airport is an important buying or selling point, you should look at homes that are between 10 and 15 miles from the airport. These homes are far enough away to avoid noise from air traffic while still close enough to the airport to allow for a short commute.

Modeling: We were able to predict housing prices using a Multiple Linear Regression model with an accuracy of about 80%. The scatter plot below shows how closely the predicted prices track the actual prices:

7. Actual / Predicted scatter plot
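
For readers who want to see the mechanics, here is a minimal sketch of fitting a multiple linear regression and scoring it on held-out data. Everything in it is an assumption for illustration: the data is synthetic, the feature names and coefficients are made up, and R² is used as the score because it is a common choice for regression (the post does not say which metric the 80% figure refers to). The number it prints says nothing about the King County model.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, coefs...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return coef[0] + X @ coef[1:]


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by y_pred."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic data (hypothetical features and coefficients).
rng = np.random.default_rng(42)
n = 2000
sqft_above = rng.normal(1800, 400, n)
dist_seattle = rng.uniform(0, 30, n)
median_income = rng.normal(80_000, 15_000, n)
X = np.column_stack([sqft_above, dist_seattle, median_income])
price = (100_000 + 200 * sqft_above - 8_000 * dist_seattle
         + 3 * median_income + rng.normal(0, 40_000, n))

# Simple 80/20 train/test split.
split = int(0.8 * n)
coef = fit_linear(X[:split], price[:split])
pred = predict(coef, X[split:])
r2 = r_squared(price[split:], pred)
print(f"held-out R^2: {r2:.2f}")
```

Plotting pred against price[split:] gives an actual versus predicted scatter plot of the same kind as visual 7: the closer the points sit to the diagonal, the better the fit.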

Thanks for reading! I hope this blog gives the reader an overview of the main steps involved in house price prediction. The GitHub repo for this project is being updated and will be posted soon!
