1 Introduction

1.1 Why

Why did I choose “Air Pollution in the city of Gijón” as my final project for my Master’s in Data Science?

I was born in Gijón and worked in Madrid for 12 years. Both cities suffer from poor air quality, and in recent years air pollution has become a major source of public concern.

So the idea came up very naturally. I then found that the Town Hall of Gijón had an open data website where it had published 18 years of hourly data from its six air quality monitoring stations. That seemed like a sign.

I had a topic. And, now, I had the data.

1.2 What

The main objectives of this Data Science project are:

  • 1. To build and publish a web dashboard to visualize the evolution of the main air pollution indicators in the city of Gijón.

It might seem trivial and hardly useful, but after searching the web I did not find any service where this kind of evolution could be easily consulted.

This lack of public information may be contributing to a very pessimistic view of the air quality in the city of Gijón.

It is true that, for some pollutants, air quality in Gijón falls short of the WHO recommendations. But the fact is that air quality in Gijón has improved drastically over the last two decades.

I am probably being too optimistic, but I would like to think that through this work I’m helping to improve public understanding of the matter.

To accomplish this objective I ultimately chose Tableau and its free web publishing service, Tableau Public, as my main tools.

If you want to see the final result, you can go to chapter 4, “Visualizations”.

  • 2. To build forecasting models in order to predict pollution levels.

My second objective is to forecast hourly pollution levels one to twenty-four hours in advance.

Initially, my intention was to create models for every pollutant and every monitoring station. In the end, I focused my efforts on a single station (Constitución) and two pollutants, PM10 and NO2.
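
As a rough illustration of what forecasting one to twenty-four hours ahead involves, the sketch below turns an hourly series into lagged inputs and a target for each forecast horizon. It is a minimal sketch and not the code used in this project: the function name `make_supervised`, the number of lags and the PM10 readings are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def make_supervised(
    series: Sequence[float], n_lags: int, horizon: int
) -> Tuple[List[List[float]], List[float]]:
    """Turn an hourly series into (lagged inputs, target) pairs.

    Each input row holds the n_lags most recent observations up to hour t,
    and the matching target is the observation `horizon` hours after t.
    """
    if n_lags < 1 or horizon < 1:
        raise ValueError("n_lags and horizon must be positive")
    X: List[List[float]] = []
    y: List[float] = []
    # t is the index of the last observed hour in each input window
    for t in range(n_lags - 1, len(series) - horizon):
        X.append(list(series[t - n_lags + 1 : t + 1]))
        y.append(series[t + horizon])
    return X, y


# Hypothetical hourly PM10 readings (made-up numbers, for illustration only)
pm10 = [20.0, 22.0, 25.0, 30.0, 28.0, 26.0, 24.0, 23.0]

# One dataset per horizon; 1 to 3 hours here, where the project targets 1 to 24
datasets: Dict[int, Tuple[List[List[float]], List[float]]] = {
    h: make_supervised(pm10, n_lags=3, horizon=h) for h in range(1, 4)
}
X1, y1 = datasets[1]
```

Building one dataset per horizon is only one possible design: each horizon then gets its own model, at the cost of training several models instead of one.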

1.3 How

The following is a list of the software tools and services I used to develop this project:

  • R for importing, cleaning, exploring and preparing the data for the prediction models.
  • Python for building the models (except the ARIMA models, built with R and the fantastic forecast package).
  • R Markdown to write this document and the bookdown package to build the HTML book. I also took advantage of the free publishing service at Bookdown.org to publish the final document.
  • Git and GitHub for version control of my work.
  • The Kaggle platform to publish the final dataset with all the data.
  • RStudio on a Windows laptop (with Windows Subsystem for Linux) to manage all the R code, and Google Colab notebooks to write the Python code.
  • Tableau Desktop (student license) to create the dashboards and Tableau Public to publish the final stories.
  • Microsoft Excel to prepare the CSV file with the Gijón holiday dates and the table compiling all the model scores.

All the files and code needed to replicate the entire work are saved in this GitHub repository, with the exception of the final CSV file with all the data used by the Tableau Public visualizations (it exceeds the GitHub file size limits). This file can be downloaded from Kaggle.

To reproduce the project, all you need to do is execute the following scripts in order. The R Markdown files can be run locally, but the Python notebooks were created and executed in Google Colab, so to make sure they run correctly my advice is to use the same platform:

  • 10_Gathering_and_Cleaning_Data.rmd
  • 11_Data_Exploration.rmd
  • _10_1_train_test_PM10_AR_datasets.rmd
  • _10_2_train_test_PM10_NO2_MULTIVAR_datasets.rmd
  • 12_Forecasting_Models_ARIMA_PM10.rmd
  • 21_Forecasting_Models_ML_PM10_AR.ipynb
  • 22_Forecasting_Models_ML_PM10_MULTIVAR.ipynb
  • 23_Forecasting_Models_ML_NO2_MULTIVAR.ipynb
    (The ipynb notebooks are meant to be executed on Google Colab. The links to these notebooks are available in section 4 of this document, “Forecasting models”.)

To render this HTML book, you would have to install the bookdown package in R and follow these instructions.