2 Wrangling

Our data are composed by four databases:

  1. electricity_state_info brings us all the information about electricity in each state of US. Those information are dated from 2002 to 2018.

  2. generation_stateis a database containing the type of producer, the energy source and the quantity of electricity generation (in Megawatthours) from 1990 to 2018.

  3. state_info is giving us all the econonomic aspects and other characteristics of each American state.

  4. all_breakdownprovides all information about the power production from various power sources such as geothermal, biomass, biogas, hydro, wind and solar energy.

First, we need to clean those different databases.

In state_info and in electricity_state_info, we remove the dot at the beginning of each observation of the variable “state”. Then, we create a new variable in electricity_state_infothat will tell us if the state is energy self-sufficient.

In the table generation_state, there are only state abbreviations so we transform it into names and we add the region relative to the state.

We also add the region relative to each state in the electricity_state_info table.

Again, we do the same operation to the state_info database.

We create a new dataframe which is the jointure between state_info and electricity_state_info in order to have all economic, social and energy information per state.

Inall_breakdown, we add new columns such as the year, the month, the day and the hour of the observations.