2 Wrangling
Our data are composed by four databases:
electricity_state_info
brings us all the information about electricity in each state of US. Those information are dated from 2002 to 2018.generation_state
is a database containing the type of producer, the energy source and the quantity of electricity generation (in Megawatthours) from 1990 to 2018.state_info
is giving us all the econonomic aspects and other characteristics of each American state.all_breakdown
provides all information about the power production from various power sources such as geothermal, biomass, biogas, hydro, wind and solar energy.
electricity_state_info <- read_excel(here("data/electricity_state_info.xlsx"))
state_info <- read_excel(here("data/STATE_info.xls"))
generation_state <- read_excel(here("data/annual_generation_state.xls"))
all_breakdown <- read_csv(here("data/all_breakdown.csv"))
First, we need to clean those different databases.
In state_info
and in electricity_state_info
, we remove the dot at the beginning of each observation of the variable “state”. Then, we create a new variable in electricity_state_info
that will tell us if the state is energy self-sufficient.
In the table generation_state
, there are only state abbreviations so we transform it into names and we add the region relative to the state.
We also add the region relative to each state in the electricity_state_info
table.
Again, we do the same operation to the state_info
database.
We create a new dataframe which is the jointure between state_info
and electricity_state_info
in order to have all economic, social and energy information per state.
Inall_breakdown
, we add new columns such as the year, the month, the day and the hour of the observations.