Machine Learning using RandomForestRegression

Introduction

In this project I use Airbnb data to analyze their price distribution and to create a machine learning model that can be used to predict the price of a given listing. Airbnb data for major cities and towns around the world is publicly available at Inside Airbnb project. For this analysis, I work with the data for Berlin, Germany.

1. Overview

First, we download the data from this link. The file listings.csv.gz has all the data that we need but the other files are also useful for further analysis and visualization. Once we have the data we need, we can start our exploratory analysis.
Let’s import the modules and load the data as shown here. In the code above, we imported:

Pandas and Numpy for data processing
Matplotlib & Seaborn for visualizing our data
Train_test_split for spliting our data into sets for training and testing our model
RandomForestRegressor for creating a regression model. This is a supervised machine learning task and RandomForest is a good candidate for it. Based on the performance of our model in predicting apartment prices, we might also try other algorithms for comparison
Mean_squared_error and mean_absolute_error for measuring the performance of our model

Let’s take a quick look at the shape and overview of our data.
As shown below, we have 20432 listings and 74 columns (features). data overview

2. Data Processing

Let’s remove the currency notation from the prices using this function Once prices are cleaned up we can look at their distribution price summary

We can make the following observations:

The average cost cost of renting an apartment is $68 and the median price is $50
Prices range from $0 to $8000
75% of the listings cost $80 or less per night
Only 4.98% of the listings cost more than the 95th percentile price ($155)
Only 0.4% of the listings cost $500 or more

3. Exploratory Data Analysis

Let’s plot the price distribution | price less than $500

In the second graph we zoom in on apartments that cost less than $500

3.1 Price Comparison

Let’s take a look at how some of the features such as distance from the city center, number of bedrooms, room type and neighbourhood affect prices

a) Distance from city center

Let’s calculate the distance from the city center. For Berlin we will use Alexendarplatz square in Central Mitte District as the center. Using the following function, we calculate the Great-circle Distance in kilometers for all the apartment listings using the harvesine formula.

Based on the following distribution, we can see that prices generally decrease as you move away from the city center

b) Number of bedrooms

One bedrooms apartments are the cheapest on average
The average prices linearly increases for apartments with 1-5 bedrooms but fluctuates for apartments with 6 or more bedrooms. This could be a sign of anormalies or outliers pricing for apartments with the highest number of bedrooms

c) Maximum number of guests

As expected, price increases with the number of guests

d) Room Type

It’s cheaper if you shared the apartment with the host
Hotel rooms are the most expensive on average

e) Neighbourhood

Apartments in Charlottenburg ($81) are the most expensive followed by Mitte boroughs ($71)
Reinickendorf has the the cheapest apartments

4. Feature Selection

We have looked at how some of the features affect pices. In this section, we are going to determine which other features in our dataset greatly affect the price of renting an Airbnb apartment.

i) Missing Values

First let’s check which columns have the highest number of values missing

This table shows the top ten columns with the highest number of missing values
missing values table

We have another column named bathrooms_text so we shouldn’t worry about bathrooms
We have 9 columns with more than 30% values missing. We’ll drop these columns
We also have 14 columns with between 5% and 30% values missing
For integer columns with missing values, we impute the missing values using the median strategy. In this case, we fill the missing values in a column with the calculated median for that column
For categorical columns, we will add an ‘uknown’ class to the data

ii) Feature Importance

Now we need to determine the best features to use in our model. We will use a RandomForestRegressor to fit our feaures and the target variable then we calculate the feature importances. The code below shows how to find important features

Here are the top 15 features and their scores

missing values table

5. Model Selection

a) Fit the model

Using the top features, we can train our final model. We either select top n features or we select features with scores above a certain threshold then we train several models to compare performances
As shown below, we selected 50 features to the model

b) Make predictions

Once our model is trained, we use it to make predictions

Then we score the performance of our model using both the mean squared errors and the mean absolute error
The model returned an R^2 of 5.3% which means it’s not doing a great job at explaining the variance in our data
We also have a RMSE (Root Mean Squared Error) of 147.5 which indicates that in a given prediction, our model is missing the prediction by $147
Similarly our MAE (Mean Absolute Error) is 25.5
This model needs further tuning in order to improve our prediction results.

Looking at the results however, we can see that our predicted prices closely match the actual prices. final results prediction

Conclusion

In this analysis, we looked at how a RandomForestRegession model performs given a set of features. We definitely need to further tweak the model parameters in order to improve performance. We can also try other regression models like XGBoost for comparison.

Wesley Kirui