1 Introduction

1.1 Overview and Motivation

Airbnb is an online marketplace, created in 2008, that lets people rent out their properties or spare rooms to guests.

Today the platform has more than 7 million listed properties and is present in more than 192 countries.

For property owners, the difficulty is to set the price that will maximize their earnings. Knowing the internal and external factors that influence the price is therefore crucial.

Real estate and the sharing economy are topics that particularly interest us. This project is therefore an opportunity both to deepen our knowledge of this field and to provide answers to essential questions. We thought it would be interesting to estimate the impact of external factors, criminality in our case, on the rental price of Airbnb properties.

We decided to work on the United States, more precisely on Los Angeles. Given the large amount of available data and the city's geographical scope, Los Angeles seemed ideal for a robust analysis.

Understanding the impact of crime on Airbnb rental prices in the city of Los Angeles has several benefits. It allows us to know:
* What types of crimes drive the value of rental property down
* Which neighbourhoods are the most promising and which ones to avoid

We planned a three-step regression in order to determine the value of a good:
1. Determine the general value of a good by using its characteristics (e.g. number of rooms, amenities, number of bathrooms,…)
2. Determine a score for each neighbourhood based on its criminality
3. Estimate how this score affects the value of a good

1.3 Initial Questions

Can we assess the impact of the crime on the rental price of Airbnb in the city of Los Angeles?
For this question, which is probably one of the most important of our project, we planned to determine the effect that criminality has on price (by using a criminality score), controlling for the following characteristics of the Airbnb listings: property type, room_type, number of people it can accommodate, number of bathrooms, number of bedrooms, number of beds, certain amenities and bed_type. More generally, we used the neighbourhood to match the Airbnb data with the criminality data.

Can we determine the types of crimes having the most impact?
To answer this question, we can use the interpretation of the coefficients in our conclusion.

Can we anticipate the future evolution of prices according to the evolution of crime in the city of Los Angeles?
When formulating the questions, we were a bit too optimistic. When looking at the Airbnb data we found for different years, we saw very little, if any, variation over time for the very few properties present throughout the whole period. Moreover, as we have a very small time frame, our analysis would have been significantly biased. For those reasons, we decided to discard this question.

What are the most profitable types of property according to their characteristics?
We will try to answer this question in the conclusion of our report by analyzing the different coefficients of the model predicting the price from the good’s characteristics.

Moreover, we found it interesting to look at the distribution of certain characteristics and their relation with the median price. This information can be found in the exploratory data analysis of the Airbnb dataset.

What is the evolution of crime in the city of Los Angeles?
We decided to use the data ranging from January 1, 2010 to December 31, 2019. This gives us a view over a relatively long period and lets us disregard possible “good” and “bad” years. We will develop this question further in the exploratory data analysis.

Which areas are the most affected by criminality?
To answer this question, we decided to rely mainly on interactive maps. You will find three maps in the exploratory data analysis: the first one shows the number of crimes in each neighbourhood, the second one displays the number of crimes of each type in each neighbourhood, and the final one displays the “score” we created for each type of crime.
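The counts behind the first two maps can be sketched as a simple tally. The records below are hypothetical toy values, and the sketch assumes each crime has already been assigned a neighbourhood; it does not cover the “score”.

```python
from collections import Counter

# Hypothetical records: (neighbourhood, crime description).
records = [
    ("Hollywood", "BURGLARY"),
    ("Hollywood", "VEHICLE - STOLEN"),
    ("Venice", "BURGLARY"),
    ("Hollywood", "BURGLARY"),
]

# First map: number of crimes in each neighbourhood.
crimes_per_neighbourhood = Counter(name for name, _ in records)

# Second map: number of crimes of each type in each neighbourhood.
crimes_per_type = Counter(records)

print(crimes_per_neighbourhood.most_common())
print(crimes_per_type[("Hollywood", "BURGLARY")])
```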

1.4 Data

1.4.1 Where you can get it

We collected the crime data from the open data website of the city of Los Angeles.
https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-2019/63jg-8b9z

Concerning the Airbnb dataset, we decided to use the following dataset from Inside Airbnb, which contains all the characteristics of the listings. We took the Detailed Listings data for Los Angeles scraped in May 2019.
http://insideairbnb.com/get-the-data.html

Moreover, we used the following data to define each neighbourhood. We renamed the map ‘la-county-neighborhoods-current.geojson’.
https://usc.data.socrata.com/dataset/Los-Angeles-Neighborhood-Map/r8qd-yxsr

The population density for each neighbourhood has been extracted from the latimes website. The table on the website has simply been replicated in a csv file. (Note that the website is not available from Europe. We had to use a VPN to scrape this data.) We named this file ‘Population.csv’.
https://maps.latimes.com/neighborhoods/population/density/neighborhood/list/
We added to this list the population density of “Chatsworth” and “Diamond bar”, using the population found on niche.com and the square mileage indicated in the GeoJson file previously mentioned. For “Whittier narrows”, we used the same technique but with the population from the website point2homes.com. Finally, “Universal City”, “Griffith Park”, “Hansen Dam” and “Sepulveda Basin” have been added with a population density of 0, as they have no population because of their characteristics (theme park, park, dam and basin). (You can access the file already modified with this link:
https://drive.google.com/open?id=1JvTGY9B338VdMDQBQPhAcd9zWOwRlFuY)

1.4.2 Data description and methods

Police crime: crimes from 2010 to 2019, with 2.11 million observations and 28 characteristics per observation. The main variables we are using are:
* Date rptd: the date the crime was reported
* LON and LAT: the longitude and latitude, allowing us to assign a neighbourhood to each crime
* CRM CD (desc): the police code (description) of the crime
* Premis Cd: the code for the type of structure, vehicle, or location where the crime took place
* CRM Cd 2,3,4: code for an additional crime, less serious than Crime Code

Since we have data from 2010 to 2019, we are using the entire dataset.

Airbnb listings scraped in May 2019: 38851 properties around Los Angeles with 106 characteristics per property as of February 16, 2020. The main variables we planned on using are:
* Price: price per night of the Airbnb
* Property_type: the type of property (e.g. apartment, guesthouse, house,…)
* Room_type: the type of location (entire home, private room, shared room or hotel room)
* Accommodates: number of people that this place can accommodate
* Bathrooms: number of bathrooms
* Bedrooms: number of bedrooms
* Beds: number of beds
* Bed_type: the type of bed (real bed, couch, futon, air bed or pull-out sofa)
* longitude and latitude: the longitude and latitude, allowing us to assign a neighbourhood to each Airbnb listing

We decided to focus our analysis on the dataset from May 2019.

To join both datasets, we used the latitude and longitude available in each of them and matched them with the GPS coordinates of a neighbourhood layout.
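Matching a coordinate pair to a neighbourhood amounts to a point-in-polygon test. The sketch below uses the classic ray-casting method on two hypothetical square neighbourhoods. The function names and the toy coordinates are illustrative assumptions, and it ignores holes and multi-part shapes, which real neighbourhood geometries may have.

```python
def point_in_ring(lon, lat, ring):
    """Ray casting: count the ring edges crossed by a ray going east from the point."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def find_neighbourhood(lon, lat, neighbourhoods):
    """Return the name of the first neighbourhood containing the point, or None."""
    for name, ring in neighbourhoods.items():
        if point_in_ring(lon, lat, ring):
            return name
    return None


# Hypothetical layout: two adjacent unit squares given as (lon, lat) vertices.
neighbourhoods = {
    "West": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "East": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}

print(find_neighbourhood(0.5, 0.5, neighbourhoods))
print(find_neighbourhood(1.5, 0.5, neighbourhoods))
print(find_neighbourhood(3.0, 3.0, neighbourhoods))
```

Applying the same lookup to both the crime records and the Airbnb listings gives each row a neighbourhood name, which can then serve as the join key. With millions of points, a geospatial library with a spatial index would likely be preferable to this plain loop.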

la-county-neighborhoods-current: this GeoJson file contains the 272 neighbourhoods used in our analysis. It allowed us not only to join the datasets together but also to extract the name and population of each neighbourhood. The variables we are using are:
* name: the name of the neighbourhood
* geometry & location: map of the neighbourhood
* sqmi: surface in square miles of each neighbourhood

Population: this dataset, scraped from the internet, contains 265 neighbourhoods and their matching population density. We are using the following variables:
* NEIGHBOURHOOD: the name of the neighbourhood
* POPULATION PER SQMI: the population per square mile in this neighbourhood

Using this dataset and the map, we are able to extract the population per neighbourhood by doing POPULATION PER SQMI * sqmi.
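As a minimal sketch of this computation, with made-up neighbourhood names and values standing in for the contents of the two files:

```python
# Hypothetical values standing in for POPULATION PER SQMI (Population.csv)
# and sqmi (GeoJson map), keyed by neighbourhood name.
density_per_sqmi = {"Neighbourhood A": 12_000.0, "Neighbourhood B": 0.0}
sqmi = {"Neighbourhood A": 2.5, "Neighbourhood B": 4.0}

# population = POPULATION PER SQMI * sqmi
population = {name: density_per_sqmi[name] * sqmi[name] for name in sqmi}

print(population)
```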

Regarding the missing data for certain neighbourhoods, we fixed it by looking up the population online and dividing it manually by the sqmi from the map to obtain the population density of those neighbourhoods. “Chatsworth” and “Diamond bar” are from niche.com, “Whittier narrows” is from point2homes.com, and “Universal City”, “Griffith Park”, “Hansen Dam” and “Sepulveda Basin” have no population because they are either a theme park or a location with no inhabitants.
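The gap-filling step can be sketched as a set difference followed by a division. The names and figures below are hypothetical placeholders rather than the actual values used, and the variable names are illustrative.

```python
# Hypothetical inputs: surface from the map, densities from the population table,
# and populations looked up online for the neighbourhoods missing from the table.
map_sqmi = {"A": 2.0, "B": 4.0, "Park": 6.0}
density = {"A": 5_000.0}
looked_up_population = {"B": 10_000, "Park": 0}

# Neighbourhoods present in the map but absent from the population table.
missing = sorted(set(map_sqmi) - set(density))

# density = population / sqmi; uninhabited places end up with a density of 0.
for name in missing:
    density[name] = looked_up_population[name] / map_sqmi[name]

print(missing)
print(density)
```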