Machine Learning using RandomForestRegression

Introduction

In this project I use Airbnb data to analyze the price distribution of listings and to build a machine learning model that predicts the price of a given listing. Airbnb data for major cities and towns around the world is publicly available from the Inside Airbnb project. For this analysis, I work with the data for Berlin, Germany.

1. Overview

First, we download the data from the Inside Airbnb project. The file listings.csv.gz has all the data that we need, but the other files are also useful for further analysis and visualization. Once we have the data, we can start our exploratory analysis.
Let’s import the modules and load the data. We import:

  • Pandas and Numpy for data processing
  • Matplotlib & Seaborn for visualizing our data
  • train_test_split for splitting our data into sets for training and testing our model
  • RandomForestRegressor for creating a regression model. This is a supervised machine learning task and RandomForest is a good candidate for it. Based on the performance of our model in predicting apartment prices, we might also try other algorithms for comparison
  • mean_squared_error and mean_absolute_error for measuring the performance of our model
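The imports listed above, together with the loading step, could look roughly like the sketch below. It assumes listings.csv.gz sits in the working directory; the helper name load_listings is hypothetical and only used for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error


def load_listings(path="listings.csv.gz"):
    """Load the listings file (hypothetical helper; the path is an assumption).

    pandas infers gzip compression from the .gz extension.
    """
    return pd.read_csv(path)
```

Calling load_listings() and then inspecting .shape gives the number of listings and columns.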

Let’s take a quick look at the shape and overview of our data.
We have 20432 listings and 74 columns (features).
[Figure: data overview]

2. Data Processing

Let’s remove the currency notation from the prices so that they can be treated as numbers. Once the prices are cleaned up, we can look at their distribution.
[Figure: price summary]
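The clean-up step described above could be sketched as follows. It assumes prices are stored as strings such as "$1,200.00" (a dollar sign and thousands separators); the function name clean_price and the tiny example frame are made up for illustration.

```python
import pandas as pd


def clean_price(price):
    """Convert a price string such as '$1,200.00' to a float (assumed format)."""
    if pd.isna(price):
        return float("nan")
    # Drop the currency symbol and the thousands separator before converting
    return float(str(price).replace("$", "").replace(",", ""))


# Tiny illustrative frame, not the real data
listings = pd.DataFrame({"price": ["$50.00", "$1,200.00", None]})
listings["price"] = listings["price"].apply(clean_price)
```

Missing prices stay missing, so they can be handled later with the other missing values.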

We can make the following observations:

  • The average cost of renting an apartment is $68 and the median price is $50
  • Prices range from $0 to $8000
  • 75% of the listings cost $80 or less per night
  • The 95th percentile price is $155, so only 4.98% of the listings cost more than that
  • Only 0.4% of the listings cost $500 or more
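Figures like the ones in the list above can be derived from the cleaned price column. The sketch below is one way to do it; the function name price_summary and the threshold parameter are assumptions for illustration.

```python
import pandas as pd


def price_summary(prices, threshold=500):
    """Return summary statistics and the share of expensive listings (in %)."""
    # Mean, min, max and the 25th/50th/75th/95th percentiles
    stats = prices.describe(percentiles=[0.25, 0.5, 0.75, 0.95])
    # Share of listings priced above the 95th percentile
    share_above_p95 = (prices > prices.quantile(0.95)).mean() * 100
    # Share of listings priced at or above the threshold
    share_at_or_above = (prices >= threshold).mean() * 100
    return stats, share_above_p95, share_at_or_above
```

The returned stats Series is indexed by labels such as "mean", "50%" and "95%".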

3. Exploratory Data Analysis

Let’s plot the price distribution.
[Figure: price distribution | price less than $500]

  • In the second graph we zoom in on apartments that cost less than $500
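The two graphs described above (the full distribution and the zoomed-in view) could be produced with a sketch like this one. It uses plain Matplotlib histograms; the function name, bin count and figure size are arbitrary choices, not taken from the original analysis.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_price_distribution(prices, cutoff=500):
    """Plot all prices next to the prices below `cutoff`."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    axes[0].hist(prices, bins=50)
    axes[0].set_title("Price distribution")
    # Zoom in on listings below the cutoff
    axes[1].hist(prices[prices < cutoff], bins=50)
    axes[1].set_title(f"Price less than ${cutoff}")
    for ax in axes:
        ax.set_xlabel("Price per night")
        ax.set_ylabel("Number of listings")
    return fig
```

Calling plt.show() after the function displays both panels.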

3.1 Price Comparison

Let’s take a look at how some of the features, such as distance from the city center, number of bedrooms, room type and neighbourhood, affect prices.

a) Distance from city center

Let’s calculate the distance from the city center. For Berlin we will use Alexanderplatz, a square in the central Mitte district, as the center. We calculate the great-circle distance in kilometers from this point to every apartment listing using the haversine formula.
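A minimal sketch of the haversine calculation is shown below. The coordinates used for Alexanderplatz are approximate and are an assumption, as are the mean Earth radius of 6371 km and the function name haversine_km.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption)
# Approximate coordinates of Alexanderplatz (assumption, not from the dataset)
CENTER_LAT, CENTER_LON = 52.5219, 13.4132


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees.

    Works on scalars as well as NumPy arrays or pandas Series.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
```

Assuming the listings have latitude and longitude columns, the distance feature would then be haversine_km(CENTER_LAT, CENTER_LON, listings["latitude"], listings["longitude"]).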

  • Based on the following distribution, we can see that prices generally decrease as you move away from the city center
    [Figure: price v distance]

b) Number of bedrooms

  • One-bedroom apartments are the cheapest on average
  • The average price increases roughly linearly for apartments with 1-5 bedrooms but fluctuates for apartments with 6 or more bedrooms. This could be a sign of anomalies or outlier pricing among the apartments with the most bedrooms
    [Figure: price v bedrooms]

c) Maximum number of guests

  • As expected, the price increases with the number of guests
    [Figure: price v guests]

d) Room Type

  • It’s cheapest to share the apartment with the host
  • Hotel rooms are the most expensive on average
    [Figure: price v room type]

e) Neighbourhood

  • Apartments in Charlottenburg ($81) are the most expensive on average, followed by those in Mitte ($71)
  • Reinickendorf has the cheapest apartments
    [Figure: price v neighbourhood]
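The per-category averages discussed in this section can be computed with a simple group-by. The sketch below assumes a cleaned numeric price column; the helper name mean_price_by is hypothetical.

```python
import pandas as pd


def mean_price_by(df, column):
    """Average price per category of `column`, most expensive first."""
    return df.groupby(column)["price"].mean().sort_values(ascending=False)
```

For example, mean_price_by(listings, "room_type") would rank the room types by average price, assuming the dataset has a column with that name.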

4. Feature Selection

We have looked at how some of the features affect prices. In this section, we are going to determine which other features in our dataset have the greatest effect on the price of renting an Airbnb apartment.

i) Missing Values

First, let’s check which columns have the highest number of missing values.

This table shows the top ten columns with the highest number of missing values:
[Figure: missing values table]

  • We have another column named bathrooms_text, so we don’t need to worry about the missing values in bathrooms
  • We have 9 columns with more than 30% of their values missing. We’ll drop these columns
  • We also have 14 columns with between 5% and 30% of their values missing
  • For integer columns with missing values, we impute the missing values using the median strategy. In this case, we fill the missing values in a column with the calculated median for that column
  • For categorical columns, we will add an ‘unknown’ class to the data
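The missing-value handling outlined above could be sketched as follows. The function names are hypothetical, the 30% threshold comes from the text, and the sketch treats every numeric column (not only integer ones) with the median strategy, which is an assumption.

```python
import pandas as pd


def missing_share(df):
    """Percentage of missing values per column, highest first."""
    return df.isna().mean().sort_values(ascending=False) * 100


def handle_missing(df, drop_threshold=30.0):
    """Drop sparse columns, then impute the remaining missing values."""
    share = missing_share(df)
    # Drop columns with more than `drop_threshold` percent missing
    df = df.drop(columns=share[share > drop_threshold].index)
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric columns: fill with the column median
            df[col] = df[col].fillna(df[col].median())
        else:
            # Categorical/text columns: add an explicit 'unknown' class
            df[col] = df[col].fillna("unknown")
    return df
```

missing_share(listings).head(10) would give the ten columns with the most missing values.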
ii) Feature Importance

Now we need to determine the best features to use in our model. We will fit a RandomForestRegressor on our features and the target variable, and then read the feature importances from the fitted model.
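A minimal sketch of this step is shown below. It assumes X is a DataFrame of numeric features (categorical columns already encoded) and y is the price; the function name rank_features and the hyperparameters are illustrative choices only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def rank_features(X, y, n_estimators=100, random_state=42):
    """Fit a random forest and return the feature importances, highest first."""
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    model.fit(X, y)
    # feature_importances_ holds one impurity-based score per column; scores sum to 1
    scores = pd.Series(model.feature_importances_, index=X.columns)
    return scores.sort_values(ascending=False)
```

rank_features(X, y).head(15) would then list the top 15 features and their scores.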

Here are the top 15 features and their scores:

[Figure: top 15 features and their scores]

5. Model Selection

a) Fit the model

Using the top features, we can train our final model. We can either select the top n features or select the features with scores above a certain threshold, and then train several models to compare their performance.
We selected the top 50 features for the model.
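Training on the selected features could look roughly like this. The sketch assumes importances is a Series of scores sorted from highest to lowest and indexed by column name; the function name, the 80/20 split and the hyperparameters are assumptions.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def fit_on_top_features(X, y, importances, n_features=50,
                        test_size=0.2, n_estimators=100, random_state=42):
    """Train a random forest on the `n_features` highest-scoring columns."""
    top = importances.head(n_features).index
    # Hold out part of the data for testing the model
    X_train, X_test, y_train, y_test = train_test_split(
        X[top], y, test_size=test_size, random_state=random_state
    )
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    model.fit(X_train, y_train)
    return model, X_test, y_test
```

Selecting by threshold instead would only change the first line, for example importances[importances > 0.01].index, where 0.01 is an arbitrary example value.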

b) Make predictions

Once our model is trained, we use it to make predictions.

  • Then we score the performance of our model using both the mean squared error and the mean absolute error
  • The model returned an R^2 of 5.3%, which means it explains very little of the variance in the prices
  • We also have an RMSE (Root Mean Squared Error) of 147.5. RMSE squares the errors before averaging, so a few very large misses can dominate it; it should not be read as the model missing every prediction by $147
  • Our MAE (Mean Absolute Error) is much lower at 25.5, so on average a prediction is off by about $25. The gap between RMSE and MAE suggests that a small number of badly predicted listings drive most of the error
  • This model needs further tuning in order to improve our prediction results.
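The three scores above can be computed from the actual and predicted prices as sketched below. RMSE is taken as the square root of the mean squared error; using r2_score for R^2 is an assumption, since it is not among the imports listed earlier, and the function name evaluate is hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate(y_true, y_pred):
    """Return R^2, RMSE and MAE for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "r2": r2_score(y_true, y_pred),
        "rmse": float(np.sqrt(mse)),  # root of the mean squared error
        "mae": mean_absolute_error(y_true, y_pred),
    }
```

With a fitted model, this would be called as evaluate(y_test, model.predict(X_test)).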

Looking at the results, however, we can see that for most listings the predicted prices are close to the actual prices, which is consistent with the low MAE.
[Figure: final results prediction]

Conclusion

In this analysis, we looked at how a RandomForestRegressor model performs given a set of features. We definitely need to further tweak the model parameters in order to improve performance. We can also try other regression models like XGBoost for comparison.