3 Data processing

Installation/loading of used libraries

if (!require(openxlsx)) install.packages('openxlsx') 
library(openxlsx)
if (!require(pxR)) install.packages('pxR') 
library(pxR)
if (!require(tidyverse)) install.packages('tidyverse') 
library(tidyverse)

Usually, we do not find data organized in the way we need for the purpose of our analysis. In fact, often the most complex job in our data exploration is locating, extracting, and understanding how the data is organized. Therefore, the first task of any data analysis is to process the data until we have it in the form and structure that we want. For this reason, we are going to dedicate some attention to data manipulation in this course.

First of all we will talk about the main databases that we will use in the course. As will be seen below, current databases on topics of general interest will mainly be used, which may be more attractive than the databases included by default in ‘R’ that are usually used in data analysis courses.

3.1 Databases

The main sources of information that we are going to use to illustrate the course contents are the following databases of general interest:

  • OWID: global database on current affairs

  • UNdata: database of the United Nations organization.

  • INE: database of the Spanish National Statistics Institute.

  • ISTAC : database of the Canarian Institute of Statistics

  • UNICEF : UNICEF database

  • World Health Organization. : World Health Organization database

  • Kaggle : Kaggle is a very popular data sharing platform. It has the advantage that it has a wide variety of all types of data, but caution must be taken because anyone can upload data and this can compromise its reliability. Kaggle assigns to each dataset a “Usability” value which can be interpreted as a data quality factor that includes the quality of the data sources. It is important to use data with a high value of this quality factor.

In this book we will also use some databases stored in a local data repository called data. The files referenced in this book to this repository can be downloaded from https://ctim.es/AEDV/data/. These local databases have generally been extracted from the above databases and for convenience have been placed in a local repository.

In addition to these databases, R has a large collection of directly accessible example databases. For example, the car information database mtcars is widely used to illustrate the functionalities of R.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We will not use these databases incorporated in R in this course.

3.2 Data file formats

To analyze data, the first thing you have to do is access it, which is generally done by reading the data from a file where it is stored as a table. Below we will see the most common data file formats and how to read them using the corresponding libraries. The reading libraries we use allow us to read the data directly from a WEB server, which is an added value. As a summary of this section it may be useful to look at the following summary sheet.

CSV format

The CSV format consists of plain text, that is, it contains the table values without any other information whether they are dates, numbers, characters, etc. It can be edited with any text editor (such as NOTEPAD) and the variables are separated from each other by a comma and decimals are expressed with points. To read CSV files we will use the read_csv function of the readl library included in the tidyverse library. If the variables in the table are separated by the ; symbol and also the decimals of the numbers are expressed with commas, we will use the read function read_csv2. If the delimiter between variables is neither a comma nor a period, we can use the read_delim function that allows us to declare any delimiter between variables. A more detailed description of these functions appears in the aforementioned summary sheet.

Next we will read a CSV file with information on the world population from the United Nations (UN) database. As often happens, data tables require manipulation to correct errors or select the variables that interest us. In this case, the first line of the table contains comments and in the second line, which contains the variable names, there is a variable that comes without a name. In later sections of this chapter we will see how to manipulate/transform the data. This database can be read directly from the United Nations server using the script:

UN_population <- read_csv('https://data.un.org/_Docs/SYB/CSV/SYB65_1_202209_Population,%20Surface%20Area%20and%20Density.csv',skip = 1)%>%
   as_tibble()

For convenience, we have saved a copy of the file in our local repository and read it from that source. We add skip=1 to the read instruction to ignore the first line of the table that contains a comment, once we read the table we use the str function to analyze its structure:

UN_population <- read_csv("https://ctim.es/AEDV/data/UN_population.csv",skip = 1)%>%
   as_tibble()
str(UN_population)
## tibble [7,873 × 7] (S3: tbl_df/tbl/data.frame)
##  $ Region/Country/Area: num [1:7873] 1 1 1 1 1 1 1 1 1 1 ...
##  $ ...2               : chr [1:7873] "Total, all countries or areas" "Total, all countries or areas" "Total, all countries or areas" "Total, all countries or areas" ...
##  $ Year               : num [1:7873] 2010 2010 2010 2010 2010 ...
##  $ Series             : chr [1:7873] "Population mid-year estimates (millions)" "Population mid-year estimates for males (millions)" "Population mid-year estimates for females (millions)" "Sex ratio (males per 100 females)" ...
##  $ Value              : num [1:7873] 6985.6 3514.4 3471.2 101.2 27.1 ...
##  $ Footnotes          : chr [1:7873] NA NA NA NA ...
##  $ Source             : chr [1:7873] "United Nations Population Division, New York, World Population Prospects: The 2022 Revision, last accessed July 2022." "United Nations Population Division, New York, World Population Prospects: The 2022 Revision, last accessed July 2022." "United Nations Population Division, New York, World Population Prospects: The 2022 Revision, last accessed July 2022." "United Nations Population Division, New York, World Population Prospects: The 2022 Revision; supplemented by da"| __truncated__ ...

The actual content of some of the variables is often not very clear at first glance. Therefore, in exploratory data analysis, the first step is to analyze the content of the tables in some detail. To analyze the range of possible values of the variables, especially in the case of categorical variables (factor type), we can see the content of some of the variables using levels and taking the variable as factor so that we get the range of values. the possible values without repeating. For numerical variables we can use the `summary’ function to get a first idea of their range of values. For this first table we present an exhaustive analysis of the variables. In general, in this course, in order not to make the presentation of the tables used too cumbersome, we will not analyze the tables used in such detail, but it is essential that students acquire the habit of analyzing the tables in detail before starting to work with them. .

levels(as.factor(UN_population$`Region/Country/Area`))
##   [1] "1"   "2"   "4"   "5"   "8"   "9"   "11"  "12"  "13"  "14"  "15"  "16" 
##  [13] "17"  "18"  "19"  "20"  "21"  "24"  "28"  "29"  "30"  "31"  "32"  "34" 
##  [25] "35"  "36"  "39"  "40"  "44"  "48"  "50"  "51"  "52"  "53"  "54"  "56" 
##  [37] "57"  "60"  "61"  "62"  "64"  "68"  "70"  "72"  "76"  "84"  "90"  "92" 
##  [49] "96"  "100" "104" "108" "112" "116" "120" "124" "132" "136" "140" "142"
##  [61] "143" "144" "145" "148" "150" "151" "152" "154" "155" "156" "158" "170"
##  [73] "174" "175" "178" "180" "184" "188" "191" "192" "196" "202" "203" "204"
##  [85] "208" "212" "214" "218" "222" "226" "231" "232" "233" "234" "238" "242"
##  [97] "246" "250" "254" "258" "262" "266" "268" "270" "275" "276" "288" "292"
## [109] "296" "300" "304" "308" "312" "316" "320" "324" "328" "332" "336" "340"
## [121] "344" "348" "352" "356" "360" "364" "368" "372" "376" "380" "384" "388"
## [133] "392" "398" "400" "404" "408" "410" "414" "417" "418" "419" "422" "426"
## [145] "428" "430" "434" "438" "440" "442" "446" "450" "454" "458" "462" "466"
## [157] "470" "474" "478" "480" "484" "492" "496" "498" "499" "500" "504" "508"
## [169] "512" "516" "520" "524" "528" "531" "533" "534" "535" "540" "548" "554"
## [181] "558" "562" "566" "570" "578" "580" "583" "584" "585" "586" "591" "598"
## [193] "600" "604" "608" "616" "620" "624" "626" "630" "634" "638" "642" "643"
## [205] "646" "652" "654" "659" "660" "662" "663" "666" "670" "674" "678" "682"
## [217] "686" "688" "690" "694" "702" "703" "704" "705" "706" "710" "716" "724"
## [229] "728" "729" "732" "740" "748" "752" "756" "760" "762" "764" "768" "772"
## [241] "776" "780" "784" "788" "792" "795" "796" "798" "800" "804" "807" "818"
## [253] "826" "830" "833" "834" "840" "850" "854" "858" "860" "862" "876" "882"
## [265] "887" "894"

The variable Region/Country/Area is a numerical code that identifies the country. This code allows us to combine the values in this table with other country data tables that use the same code. We can manage this variable as numeric or as factor, since it is used to classify which country the data of each record belongs to.

levels(as.factor(UN_population$`Region/Country/Area`))
##   [1] "1"   "2"   "4"   "5"   "8"   "9"   "11"  "12"  "13"  "14"  "15"  "16" 
##  [13] "17"  "18"  "19"  "20"  "21"  "24"  "28"  "29"  "30"  "31"  "32"  "34" 
##  [25] "35"  "36"  "39"  "40"  "44"  "48"  "50"  "51"  "52"  "53"  "54"  "56" 
##  [37] "57"  "60"  "61"  "62"  "64"  "68"  "70"  "72"  "76"  "84"  "90"  "92" 
##  [49] "96"  "100" "104" "108" "112" "116" "120" "124" "132" "136" "140" "142"
##  [61] "143" "144" "145" "148" "150" "151" "152" "154" "155" "156" "158" "170"
##  [73] "174" "175" "178" "180" "184" "188" "191" "192" "196" "202" "203" "204"
##  [85] "208" "212" "214" "218" "222" "226" "231" "232" "233" "234" "238" "242"
##  [97] "246" "250" "254" "258" "262" "266" "268" "270" "275" "276" "288" "292"
## [109] "296" "300" "304" "308" "312" "316" "320" "324" "328" "332" "336" "340"
## [121] "344" "348" "352" "356" "360" "364" "368" "372" "376" "380" "384" "388"
## [133] "392" "398" "400" "404" "408" "410" "414" "417" "418" "419" "422" "426"
## [145] "428" "430" "434" "438" "440" "442" "446" "450" "454" "458" "462" "466"
## [157] "470" "474" "478" "480" "484" "492" "496" "498" "499" "500" "504" "508"
## [169] "512" "516" "520" "524" "528" "531" "533" "534" "535" "540" "548" "554"
## [181] "558" "562" "566" "570" "578" "580" "583" "584" "585" "586" "591" "598"
## [193] "600" "604" "608" "616" "620" "624" "626" "630" "634" "638" "642" "643"
## [205] "646" "652" "654" "659" "660" "662" "663" "666" "670" "674" "678" "682"
## [217] "686" "688" "690" "694" "702" "703" "704" "705" "706" "710" "716" "724"
## [229] "728" "729" "732" "740" "748" "752" "756" "760" "762" "764" "768" "772"
## [241] "776" "780" "784" "788" "792" "795" "796" "798" "800" "804" "807" "818"
## [253] "826" "830" "833" "834" "840" "850" "854" "858" "860" "862" "876" "882"
## [265] "887" "894"
levels(as.factor(UN_population$`...2`))
##   [1] "Afghanistan"                   "Africa"                       
##   [3] "Albania"                       "Algeria"                      
##   [5] "American Samoa"                "Americas"                     
##   [7] "Andorra"                       "Angola"                       
##   [9] "Anguilla"                      "Antigua and Barbuda"          
##  [11] "Argentina"                     "Armenia"                      
##  [13] "Aruba"                         "Asia"                         
##  [15] "Australia"                     "Australia and New Zealand"    
##  [17] "Austria"                       "Azerbaijan"                   
##  [19] "Bahamas"                       "Bahrain"                      
##  [21] "Bangladesh"                    "Barbados"                     
##  [23] "Belarus"                       "Belgium"                      
##  [25] "Belize"                        "Benin"                        
##  [27] "Bermuda"                       "Bhutan"                       
##  [29] "Bolivia (Plurin. State of)"    "Bonaire, St. Eustatius & Saba"
##  [31] "Bosnia and Herzegovina"        "Botswana"                     
##  [33] "Brazil"                        "British Virgin Islands"       
##  [35] "Brunei Darussalam"             "Bulgaria"                     
##  [37] "Burkina Faso"                  "Burundi"                      
##  [39] "Cabo Verde"                    "Cambodia"                     
##  [41] "Cameroon"                      "Canada"                       
##  [43] "Caribbean"                     "Cayman Islands"               
##  [45] "Central African Republic"      "Central America"              
##  [47] "Central Asia"                  "Chad"                         
##  [49] "Channel Islands"               "Chile"                        
##  [51] "China"                         "China, Hong Kong SAR"         
##  [53] "China, Macao SAR"              "Colombia"                     
##  [55] "Comoros"                       "Congo"                        
##  [57] "Cook Islands"                  "Costa Rica"                   
##  [59] "Côte d’Ivoire"                 "Croatia"                      
##  [61] "Cuba"                          "Curaçao"                      
##  [63] "Cyprus"                        "Czechia"                      
##  [65] "Dem. People's Rep. Korea"      "Dem. Rep. of the Congo"       
##  [67] "Denmark"                       "Djibouti"                     
##  [69] "Dominica"                      "Dominican Republic"           
##  [71] "Eastern Africa"                "Eastern Asia"                 
##  [73] "Eastern Europe"                "Ecuador"                      
##  [75] "Egypt"                         "El Salvador"                  
##  [77] "Equatorial Guinea"             "Eritrea"                      
##  [79] "Estonia"                       "Eswatini"                     
##  [81] "Ethiopia"                      "Europe"                       
##  [83] "Falkland Islands (Malvinas)"   "Faroe Islands"                
##  [85] "Fiji"                          "Finland"                      
##  [87] "France"                        "French Guiana"                
##  [89] "French Polynesia"              "Gabon"                        
##  [91] "Gambia"                        "Georgia"                      
##  [93] "Germany"                       "Ghana"                        
##  [95] "Gibraltar"                     "Greece"                       
##  [97] "Greenland"                     "Grenada"                      
##  [99] "Guadeloupe"                    "Guam"                         
## [101] "Guatemala"                     "Guinea"                       
## [103] "Guinea-Bissau"                 "Guyana"                       
## [105] "Haiti"                         "Holy See"                     
## [107] "Honduras"                      "Hungary"                      
## [109] "Iceland"                       "India"                        
## [111] "Indonesia"                     "Iran (Islamic Republic of)"   
## [113] "Iraq"                          "Ireland"                      
## [115] "Isle of Man"                   "Israel"                       
## [117] "Italy"                         "Jamaica"                      
## [119] "Japan"                         "Jordan"                       
## [121] "Kazakhstan"                    "Kenya"                        
## [123] "Kiribati"                      "Kuwait"                       
## [125] "Kyrgyzstan"                    "Lao People's Dem. Rep."       
## [127] "Latin America & the Caribbean" "Latvia"                       
## [129] "Lebanon"                       "Lesotho"                      
## [131] "Liberia"                       "Libya"                        
## [133] "Liechtenstein"                 "Lithuania"                    
## [135] "Luxembourg"                    "Madagascar"                   
## [137] "Malawi"                        "Malaysia"                     
## [139] "Maldives"                      "Mali"                         
## [141] "Malta"                         "Marshall Islands"             
## [143] "Martinique"                    "Mauritania"                   
## [145] "Mauritius"                     "Mayotte"                      
## [147] "Melanesia"                     "Mexico"                       
## [149] "Micronesia"                    "Micronesia (Fed. States of)"  
## [151] "Middle Africa"                 "Monaco"                       
## [153] "Mongolia"                      "Montenegro"                   
## [155] "Montserrat"                    "Morocco"                      
## [157] "Mozambique"                    "Myanmar"                      
## [159] "Namibia"                       "Nauru"                        
## [161] "Nepal"                         "Netherlands"                  
## [163] "New Caledonia"                 "New Zealand"                  
## [165] "Nicaragua"                     "Niger"                        
## [167] "Nigeria"                       "Niue"                         
## [169] "North Macedonia"               "Northern Africa"              
## [171] "Northern America"              "Northern Europe"              
## [173] "Northern Mariana Islands"      "Norway"                       
## [175] "Oceania"                       "Oman"                         
## [177] "Other non-specified areas"     "Pakistan"                     
## [179] "Palau"                         "Panama"                       
## [181] "Papua New Guinea"              "Paraguay"                     
## [183] "Peru"                          "Philippines"                  
## [185] "Poland"                        "Polynesia"                    
## [187] "Portugal"                      "Puerto Rico"                  
## [189] "Qatar"                         "Republic of Korea"            
## [191] "Republic of Moldova"           "Réunion"                      
## [193] "Romania"                       "Russian Federation"           
## [195] "Rwanda"                        "Saint Barthélemy"             
## [197] "Saint Helena"                  "Saint Kitts and Nevis"        
## [199] "Saint Lucia"                   "Saint Martin (French part)"   
## [201] "Saint Pierre and Miquelon"     "Saint Vincent & Grenadines"   
## [203] "Samoa"                         "San Marino"                   
## [205] "Sao Tome and Principe"         "Saudi Arabia"                 
## [207] "Senegal"                       "Serbia"                       
## [209] "Seychelles"                    "Sierra Leone"                 
## [211] "Singapore"                     "Sint Maarten (Dutch part)"    
## [213] "Slovakia"                      "Slovenia"                     
## [215] "Solomon Islands"               "Somalia"                      
## [217] "South-central Asia"            "South-eastern Asia"           
## [219] "South Africa"                  "South America"                
## [221] "South Sudan"                   "Southern Africa"              
## [223] "Southern Asia"                 "Southern Europe"              
## [225] "Spain"                         "Sri Lanka"                    
## [227] "State of Palestine"            "Sub-Saharan Africa"           
## [229] "Sudan"                         "Suriname"                     
## [231] "Sweden"                        "Switzerland"                  
## [233] "Syrian Arab Republic"          "Tajikistan"                   
## [235] "Thailand"                      "Timor-Leste"                  
## [237] "Togo"                          "Tokelau"                      
## [239] "Tonga"                         "Total, all countries or areas"
## [241] "Trinidad and Tobago"           "Tunisia"                      
## [243] "Türkiye"                       "Turkmenistan"                 
## [245] "Turks and Caicos Islands"      "Tuvalu"                       
## [247] "Uganda"                        "Ukraine"                      
## [249] "United Arab Emirates"          "United Kingdom"               
## [251] "United Rep. of Tanzania"       "United States of America"     
## [253] "United States Virgin Islands"  "Uruguay"                      
## [255] "Uzbekistan"                    "Vanuatu"                      
## [257] "Venezuela (Boliv. Rep. of)"    "Viet Nam"                     
## [259] "Wallis and Futuna Islands"     "Western Africa"               
## [261] "Western Asia"                  "Western Europe"               
## [263] "Western Sahara"                "Yemen"                        
## [265] "Zambia"                        "Zimbabwe"

The variable ...2 is the name of the country and is of type character.

levels(as.factor(UN_population$Series))
## [1] "Population aged 0 to 14 years old (percentage)"      
## [2] "Population aged 60+ years old (percentage)"          
## [3] "Population density"                                  
## [4] "Population mid-year estimates (millions)"            
## [5] "Population mid-year estimates for females (millions)"
## [6] "Population mid-year estimates for males (millions)"  
## [7] "Sex ratio (males per 100 females)"                   
## [8] "Surface area (thousand km2)"

The variable ‘Series’ describes the name of the indicator associated with the data in each record, that is, it indicates what is being measured in each record. We can manage it as character or as factor since it is also used to classify records. It is often not clear from the name of the indicator what its exact meaning is. In that case, you must look at additional information in the table where the meaning of the indicators used is explained in more detail. This is very important to be able to interpret and understand the data we handle.

levels(as.factor(UN_population$Year))
## [1] "2010" "2015" "2017" "2020" "2022"

The Year variable determines the year associated with the data in each record. Usually we will convert temporary data into Date type, in this case, for example, we can associate the last day of the year.

summary(UN_population$Value)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      0.00      5.00     21.52    255.80     94.70 136162.00

The variable Value, of type numeric, supplies the numerical value of the data in each record.

levels(as.factor(UN_population$Footnotes))
##   [1] "A dispute exists between the Governments of Argentina and the United Kingdom of Great Britain and Northern Ireland concerning sovereignty over the Falkland Islands (Malvinas)."                                                                                                                                                                                             
##   [2] "A dispute exists between the Governments of Argentina and the United Kingdom of Great Britain and Northern Ireland concerning sovereignty over the Falkland Islands (Malvinas). For statistical purposes, the data for United Kingdom do not include this area."                                                                                                             
##   [3] "Calculated by the UN Statistics Division."                                                                                                                                                                                                                                                                                                                                   
##   [4] "Calculated by the UN Statistics Division.;Projected estimate (medium fertility variant)."                                                                                                                                                                                                                                                                                    
##   [5] "Changes in total area per year are the result of new measuring and correcting of the administrative borders between former Yugoslavian countries."                                                                                                                                                                                                                           
##   [6] "Comprising the Northern Region (former Saguia el Hamra) and Southern Region (former Rio de Oro)."                                                                                                                                                                                                                                                                            
##   [7] "Data refer to 1 October 2007."                                                                                                                                                                                                                                                                                                                                               
##   [8] "Data refer to the Vatican City State."                                                                                                                                                                                                                                                                                                                                       
##   [9] "Data updated according to \"Superintendencia Agraria\". Interior waters correspond to natural or artificial bodies of water or snow."                                                                                                                                                                                                                                        
##  [10] "Excluding Åland Islands."                                                                                                                                                                                                                                                                                                                                                    
##  [11] "Excluding Channel Islands (Guernsey and Jersey) and Isle of Man, shown separately, if available."                                                                                                                                                                                                                                                                            
##  [12] "Excluding Faeroe Islands and Greenland shown separately, if available."                                                                                                                                                                                                                                                                                                      
##  [13] "Excluding inland water."                                                                                                                                                                                                                                                                                                                                                     
##  [14] "Excluding Kosovo."                                                                                                                                                                                                                                                                                                                                                           
##  [15] "Excluding Niue, shown separately, which is part of Cook Islands, but because of remoteness is administered separately."                                                                                                                                                                                                                                                      
##  [16] "Excluding the islands of Saint Brandon and Agalega."                                                                                                                                                                                                                                                                                                                         
##  [17] "For statistical purposes, the data for China do not include this area."                                                                                                                                                                                                                                                                                                      
##  [18] "For statistical purposes, the data for China do not include this area.;Projected estimate (medium fertility variant)."                                                                                                                                                                                                                                                       
##  [19] "For statistical purposes, the data for China do not include those for the Hong Kong Special Administrative Region (Hong Kong SAR), Macao Special Administrative Region (Macao SAR) and Taiwan Province of China."                                                                                                                                                            
##  [20] "For statistical purposes, the data for Denmark do not include Faroe Islands, and Greenland."                                                                                                                                                                                                                                                                                 
##  [21] "For statistical purposes, the data for Denmark do not include this area."                                                                                                                                                                                                                                                                                                    
##  [22] "For statistical purposes, the data for France do not include French Guiana, French Polynesia, Guadeloupe, Martinique, Mayotte, New Caledonia, Réunion, Saint Pierre and Miquelon, Saint Barthélemy, Saint Martin (French part), Wallis and Futuna Islands."                                                                                                                  
##  [23] "For statistical purposes, the data for France do not include this area."                                                                                                                                                                                                                                                                                                     
##  [24] "For statistical purposes, the data for France do not include this area.;Projected estimate (medium fertility variant)."                                                                                                                                                                                                                                                      
##  [25] "For statistical purposes, the data for Netherlands do not include Aruba, Bonaire, Sint Eustatius and Saba, Curaçao, and Sint Maarten (Dutch part)."                                                                                                                                                                                                                          
##  [26] "For statistical purposes, the data for Netherlands do not include this area."                                                                                                                                                                                                                                                                                                
##  [27] "For statistical purposes, the data for Netherlands do not include this area.;Projected estimate (medium fertility variant)."                                                                                                                                                                                                                                                 
##  [28] "For statistical purposes, the data for New Zealand do not include Cook Islands, Niue, and Tokelau."                                                                                                                                                                                                                                                                          
##  [29] "For statistical purposes, the data for New Zealand do not include this area."                                                                                                                                                                                                                                                                                                
##  [30] "For statistical purposes, the data for United Kingdom do not include this area."                                                                                                                                                                                                                                                                                             
##  [31] "For statistical purposes, the data for United States of America do not include American Samoa, Guam, Northern Mariana Islands, Puerto Rico, and United States Virgin Islands."                                                                                                                                                                                               
##  [32] "For statistical purposes, the data for United States of America do not include this area."                                                                                                                                                                                                                                                                                   
##  [33] "Including Abkhazia and South Ossetia."                                                                                                                                                                                                                                                                                                                                       
##  [34] "Including Agalega, Rodrigues and Saint Brandon."                                                                                                                                                                                                                                                                                                                             
##  [35] "Including Åland Islands."                                                                                                                                                                                                                                                                                                                                                    
##  [36] "Including Andorra, Gibraltar, Holy See, and San Marino."                                                                                                                                                                                                                                                                                                                     
##  [37] "Including Anguilla, Bonaire, Sint Eustatius and Saba, British Virgin Islands, Cayman Islands, Dominica, Montserrat, Saint Kitts and Nevis, Sint Maarten (Dutch part) and Turks and Caicos Islands."                                                                                                                                                                          
##  [38] "Including Ascension and Tristan da Cunha."                                                                                                                                                                                                                                                                                                                                   
##  [39] "Including Ascension and Tristan da Cunha. For statistical purposes, the data for United Kingdom do not include this area."                                                                                                                                                                                                                                                   
##  [40] "Including Ascension and Tristan da Cunha.;St. Helena Island has no substantial natural inland waters however there are 15 reservoirs and similar open water storage features on island."                                                                                                                                                                                     
##  [41] "Including Bermuda, Greenland, and Saint Pierre and Miquelon."                                                                                                                                                                                                                                                                                                                
##  [42] "Including Canary Islands, Ceuta and Melilla."                                                                                                                                                                                                                                                                                                                                
##  [43] "Including Christmas Island, Cocos (Keeling) Islands and Norfolk Island."                                                                                                                                                                                                                                                                                                     
##  [44] "Including Crimea."                                                                                                                                                                                                                                                                                                                                                           
##  [45] "Including East Jerusalem."                                                                                                                                                                                                                                                                                                                                                   
##  [46] "Including Falkland Islands (Malvinas)."                                                                                                                                                                                                                                                                                                                                      
##  [47] "Including Liechtenstein and Monaco."                                                                                                                                                                                                                                                                                                                                         
##  [48] "Including low water level for all islands (area to shoreline)."                                                                                                                                                                                                                                                                                                              
##  [49] "Including Marshall Islands, Nauru, Northern Mariana Islands and Palau."                                                                                                                                                                                                                                                                                                      
##  [50] "Including Nagorno-Karabakh."                                                                                                                                                                                                                                                                                                                                                 
##  [51] "Including Norfolk Island."                                                                                                                                                                                                                                                                                                                                                   
##  [52] "Including Pitcairn."                                                                                                                                                                                                                                                                                                                                                         
##  [53] "Including Sabah and Sarawak."                                                                                                                                                                                                                                                                                                                                                
##  [54] "Including Saint Helena."                                                                                                                                                                                                                                                                                                                                                     
##  [55] "Including Svalbard and Jan Mayen Islands."                                                                                                                                                                                                                                                                                                                                   
##  [56] "Including the Faroe Islands and the Isle of Man."                                                                                                                                                                                                                                                                                                                            
##  [57] "Including the Transnistria region."                                                                                                                                                                                                                                                                                                                                          
##  [58] "Including water bodies of lake Vaihiria, lake Temae and the Maiao lagoons, but not lake Maeva and the lagoons of Raiatea and Tahaa from the upper islands. No lagoons from lower islands are included."                                                                                                                                                                      
##  [59] "Including Zanzibar."                                                                                                                                                                                                                                                                                                                                                         
##  [60] "Inland waters include the reservoirs."                                                                                                                                                                                                                                                                                                                                       
##  [61] "Land area only."                                                                                                                                                                                                                                                                                                                                                             
##  [62] "Land area only. Excluding 84 square km of uninhabited islands."                                                                                                                                                                                                                                                                                                              
##  [63] "Projected estimate (medium fertility variant)."                                                                                                                                                                                                                                                                                                                              
##  [64] "Projected estimate (medium fertility variant).;A dispute exists between the Governments of Argentina and the United Kingdom of Great Britain and Northern Ireland concerning sovereignty over the Falkland Islands (Malvinas). For statistical purposes, the data for United Kingdom do not include this area."                                                              
##  [65] "Projected estimate (medium fertility variant).;Data refer to the Vatican City State."                                                                                                                                                                                                                                                                                        
##  [66] "Projected estimate (medium fertility variant).;Excluding Kosovo."                                                                                                                                                                                                                                                                                                            
##  [67] "Projected estimate (medium fertility variant).;For statistical purposes, the data for China do not include those for the Hong Kong Special Administrative Region (Hong Kong SAR), Macao Special Administrative Region (Macao SAR) and Taiwan Province of China."                                                                                                             
##  [68] "Projected estimate (medium fertility variant).;For statistical purposes, the data for Denmark do not include Faroe Islands, and Greenland."                                                                                                                                                                                                                                  
##  [69] "Projected estimate (medium fertility variant).;For statistical purposes, the data for Denmark do not include this area."                                                                                                                                                                                                                                                     
##  [70] "Projected estimate (medium fertility variant).;For statistical purposes, the data for France do not include French Guiana, French Polynesia, Guadeloupe, Martinique, Mayotte, New Caledonia, Réunion, Saint Pierre and Miquelon, Saint Barthélemy, Saint Martin (French part), Wallis and Futuna Islands."                                                                   
##  [71] "Projected estimate (medium fertility variant).;For statistical purposes, the data for France do not include this area."                                                                                                                                                                                                                                                      
##  [72] "Projected estimate (medium fertility variant).;For statistical purposes, the data for Netherlands do not include Aruba, Bonaire, Sint Eustatius and Saba, Curaçao, and Sint Maarten (Dutch part)."                                                                                                                                                                           
##  [73] "Projected estimate (medium fertility variant).;For statistical purposes, the data for Netherlands do not include this area."                                                                                                                                                                                                                                                 
##  [74] "Projected estimate (medium fertility variant).;For statistical purposes, the data for New Zealand do not include Cook Islands, Niue, and Tokelau."                                                                                                                                                                                                                           
##  [75] "Projected estimate (medium fertility variant).;For statistical purposes, the data for New Zealand do not include this area."                                                                                                                                                                                                                                                 
##  [76] "Projected estimate (medium fertility variant).;For statistical purposes, the data for United Kingdom do not include this area."                                                                                                                                                                                                                                              
##  [77] "Projected estimate (medium fertility variant).;For statistical purposes, the data for United States of America do not include American Samoa, Guam, Northern Mariana Islands, Puerto Rico, and United States Virgin Islands."                                                                                                                                                
##  [78] "Projected estimate (medium fertility variant).;For statistical purposes, the data for United States of America do not include this area."                                                                                                                                                                                                                                    
##  [79] "Projected estimate (medium fertility variant).;Including Abkhazia and South Ossetia."                                                                                                                                                                                                                                                                                        
##  [80] "Projected estimate (medium fertility variant).;Including Agalega, Rodrigues and Saint Brandon."                                                                                                                                                                                                                                                                              
##  [81] "Projected estimate (medium fertility variant).;Including Åland Islands."                                                                                                                                                                                                                                                                                                     
##  [82] "Projected estimate (medium fertility variant).;Including Andorra, Gibraltar, Holy See, and San Marino."                                                                                                                                                                                                                                                                      
##  [83] "Projected estimate (medium fertility variant).;Including Anguilla, Bonaire, Sint Eustatius and Saba, British Virgin Islands, Cayman Islands, Dominica, Montserrat, Saint Kitts and Nevis, Sint Maarten (Dutch part) and Turks and Caicos Islands."                                                                                                                           
##  [84] "Projected estimate (medium fertility variant).;Including Ascension and Tristan da Cunha. For statistical purposes, the data for United Kingdom do not include this area."                                                                                                                                                                                                    
##  [85] "Projected estimate (medium fertility variant).;Including Bermuda, Greenland, and Saint Pierre and Miquelon."                                                                                                                                                                                                                                                                 
##  [86] "Projected estimate (medium fertility variant).;Including Canary Islands, Ceuta and Melilla."                                                                                                                                                                                                                                                                                 
##  [87] "Projected estimate (medium fertility variant).;Including Christmas Island, Cocos (Keeling) Islands and Norfolk Island."                                                                                                                                                                                                                                                      
##  [88] "Projected estimate (medium fertility variant).;Including Crimea."                                                                                                                                                                                                                                                                                                            
##  [89] "Projected estimate (medium fertility variant).;Including East Jerusalem."                                                                                                                                                                                                                                                                                                    
##  [90] "Projected estimate (medium fertility variant).;Including Falkland Islands (Malvinas)."                                                                                                                                                                                                                                                                                       
##  [91] "Projected estimate (medium fertility variant).;Including Liechtenstein and Monaco."                                                                                                                                                                                                                                                                                          
##  [92] "Projected estimate (medium fertility variant).;Including Marshall Islands, Nauru, Northern Mariana Islands and Palau."                                                                                                                                                                                                                                                       
##  [93] "Projected estimate (medium fertility variant).;Including Nagorno-Karabakh."                                                                                                                                                                                                                                                                                                  
##  [94] "Projected estimate (medium fertility variant).;Including Pitcairn."                                                                                                                                                                                                                                                                                                          
##  [95] "Projected estimate (medium fertility variant).;Including Sabah and Sarawak."                                                                                                                                                                                                                                                                                                 
##  [96] "Projected estimate (medium fertility variant).;Including Saint Helena."                                                                                                                                                                                                                                                                                                      
##  [97] "Projected estimate (medium fertility variant).;Including Svalbard and Jan Mayen Islands."                                                                                                                                                                                                                                                                                    
##  [98] "Projected estimate (medium fertility variant).;Including the Faroe Islands and the Isle of Man."                                                                                                                                                                                                                                                                             
##  [99] "Projected estimate (medium fertility variant).;Including the Transnistria region."                                                                                                                                                                                                                                                                                           
## [100] "Projected estimate (medium fertility variant).;Including Zanzibar."                                                                                                                                                                                                                                                                                                          
## [101] "Projected estimate (medium fertility variant).;Refers to the United Kingdom of Great Britain and Northern Ireland. For statistical purposes, the data for United Kingdom do not include Anguilla, Bermuda, British Virgin Islands, Cayman Islands, Channel Islands, Falkland Islands (Malvinas), Gibraltar, Isle of Man, Montserrat, Saint Helena, Turks and Caicos Islands."
## [102] "Projected estimate (medium fertility variant).;Refers to the whole country."                                                                                                                                                                                                                                                                                                 
## [103] "Refers to Guernsey and Jersey."                                                                                                                                                                                                                                                                                                                                              
## [104] "Refers to habitable area. Excludes Saint Lucia's Forest Reserve."                                                                                                                                                                                                                                                                                                            
## [105] "Refers to the United Kingdom of Great Britain and Northern Ireland. For statistical purposes, the data for United Kingdom do not include Anguilla, Bermuda, British Virgin Islands, Cayman Islands, Channel Islands, Falkland Islands (Malvinas), Gibraltar, Isle of Man, Montserrat, Saint Helena, Turks and Caicos Islands."                                               
## [106] "Refers to the whole country."                                                                                                                                                                                                                                                                                                                                                
## [107] "Surface area is 0.44 Km2."                                                                                                                                                                                                                                                                                                                                                   
## [108] "Surface area is based on the 2002 population and housing census."                                                                                                                                                                                                                                                                                                            
## [109] "The land area of Singapore comprises the mainland and other islands."                                                                                                                                                                                                                                                                                                        
## [110] "The total area includes continental areas and islands, and excludes Antarctic area."                                                                                                                                                                                                                                                                                         
## [111] "The total surface is 21 040.79 square kilometres, without taking into account the last ruling of The Hague."

The variable Footnotes is of type character and includes possible observations associated with some data.

levels(as.factor(UN_population$Source))
## [1] "United Nations Population Division, New York, World Population Prospects: The 2022 Revision, last accessed July 2022."                                                                                                                                                                                    
## [2] "United Nations Population Division, New York, World Population Prospects: The 2022 Revision; supplemented by data from the United Nations Statistics Division, New York, Demographic Yearbook 2020 and Secretariat for the Pacific Community (SPC) for small countries or areas, last accessed July 2022."
## [3] "United Nations Statistics Division, New York, \"Demographic Yearbook 2020\" and the demographic statistics database, last accessed June 2022."

The source variable determines the name of the source from which the data for each record was obtained.

We conclude that the table gives us numerical data, by country and time period, on some demographic aspects and its surface area. Information is given about 5 specific time periods. Depending on what we want to do, we may have to do a coercion to change the type of some variables such as the year to type Date, or the variables used to clasify to type factor.

EXCEL format

To read Excel spreadsheets we will use the openxlsx library that allows us to read files hosted locally or on a web server. As an example we are going to read data published by OWID. We can read the OWID data directly on your web server using a script like the following:

df <- read.xlsx("https://nyc3.digitaloceanspaces.com/owid-public/data/co2/owid-co2-data.xlsx",sheet=1)

However, as the database is very large and takes a long time to read over the Internet, we have saved some simplified data on population by country in our local data repository.

owid_country <- read.xlsx("https://ctim.es/AEDV/data/owid_country.xlsx",sheet=1) %>%
   as_tibble()

This table is organized in a simpler way than the UN_population table and simply with the str() function we can identify how it is organized and its content:

str(owid_country) # str() prints the table structure
## tibble [237 × 17] (S3: tbl_df/tbl/data.frame)
##  $ iso_code                  : chr [1:237] "ABW" "AFG" "AGO" "AIA" ...
##  $ location                  : chr [1:237] "Aruba" "Afghanistan" "Angola" "Anguilla" ...
##  $ continent                 : chr [1:237] "North America" "Asia" "Africa" "North America" ...
##  $ population                : num [1:237] 106459 41128772 35588996 15877 2842318 ...
##  $ median_age                : num [1:237] 41.2 18.6 16.8 NA 38 NA 34 31.9 35.7 NA ...
##  $ life_expectancy           : num [1:237] 76.3 64.8 61.1 81.9 78.6 ...
##  $ aged_65_older             : num [1:237] 13.09 2.58 2.4 NA 13.19 ...
##  $ aged_70_older             : num [1:237] 7.45 1.34 1.36 NA 8.64 ...
##  $ gdp_per_capit             : num [1:237] 35974 1804 5819 NA 11803 ...
##  $ extreme_poverty           : num [1:237] NA NA NA NA 1.1 NA NA 0.6 1.8 NA ...
##  $ cardiovasc_death_rat      : num [1:237] NA 597 276 NA 304 ...
##  $ diabetes_prevalence       : num [1:237] 11.62 9.59 3.94 NA 10.08 ...
##  $ female_smokers            : num [1:237] NA NA NA NA 7.1 29 1.2 16.2 1.5 NA ...
##  $ male_smokers              : num [1:237] NA NA NA NA 51.2 37.8 37.4 27.7 52.1 NA ...
##  $ handwashing_facilities    : num [1:237] NA 37.7 26.7 NA NA ...
##  $ hospital_beds_per_thousand: num [1:237] NA 0.5 NA NA 2.89 NA 1.2 5 4.2 NA ...
##  $ human_development_index   : num [1:237] NA 0.511 0.581 NA 0.795 0.868 0.89 0.845 0.776 NA ...

We conclude that this table gives us information by country on some indicators. The country identification code (variable iso_code) is 3 letters, instead of 3 numbers as used in the previous UN table. This represents a problem that will have to be solved if we want to combine the two tables. Note that the organization of this table and the previous one is very different. In this one, each row of the table contains the information of all the country’s indicators, each of them corresponds to a different variable, and in the previous table, each row contains the data of a single indicator, and the names of the indicators are in another variable. This last organization (the one in table UN) is called tidy data and we will talk about it in more detail at the end of this chapter.

PC-Axis format

The PC-Axis format, specialized in storing statistical data, is the one used by INE and ISTAD to export data. It can be edited with any plain text editor (such as NOTEPAD). To read this file format we use the read.px function of the pxR library. Next we are going to read population data by municipalities of the Canary Islands previously stored in the local data repository data based on a query made in ISTAD.

istac_population <- read.px("https://ctim.es/AEDV/data/PoblacionMunicipiosCanarios.px")%>%
   as_tibble()
istac_population %>%
   print(n=5)
## # A tibble: 2,112 × 5
##   Sexos       Años  Edades.año.a.año Municipios.por.islas   value
##   <fct>       <fct> <fct>            <fct>                  <dbl>
## 1 AMBOS SEXOS 2021  TOTAL            CANARIAS             2172944
## 2 AMBOS SEXOS 2020  TOTAL            CANARIAS             2175952
## 3 AMBOS SEXOS 2019  TOTAL            CANARIAS             2153389
## 4 AMBOS SEXOS 2018  TOTAL            CANARIAS             2127685
## 5 AMBOS SEXOS 2017  TOTAL            CANARIAS             2108121
## # ℹ 2,107 more rows
levels(as.factor(istac_population$Sexos))
## [1] "AMBOS SEXOS"
levels(as.factor(istac_population$Años))
##  [1] "2021" "2020" "2019" "2018" "2017" "2016" "2015" "2014" "2013" "2012"
## [11] "2011" "2010" "2009" "2008" "2007" "2006" "2005" "2004" "2003" "2002"
## [21] "2001" "2000"
levels(as.factor(istac_population$Edades.año.a.año))
## [1] "TOTAL"
levels(as.factor(istac_population$Municipios.por.islas))
##  [1] "CANARIAS"                     "LANZAROTE"                   
##  [3] "Arrecife"                     "Haría"                       
##  [5] "San Bartolomé"                "Teguise"                     
##  [7] "Tías"                         "Tinajo"                      
##  [9] "Yaiza"                        "FUERTEVENTURA"               
## [11] "Antigua"                      "Betancuria"                  
## [13] "Oliva (La)"                   "Pájara"                      
## [15] "Puerto del Rosario"           "Tuineje"                     
## [17] "GRAN CANARIA"                 "Agaete"                      
## [19] "Agüimes"                      "Artenara"                    
## [21] "Arucas"                       "Firgas"                      
## [23] "Gáldar"                       "Ingenio"                     
## [25] "Mogán"                        "Moya"                        
## [27] "Palmas de Gran Canaria (Las)" "San Bartolomé de Tirajana"   
## [29] "Aldea de San Nicolás (La)"    "Santa Brígida"               
## [31] "Santa Lucía"                  "Santa María de Guía"         
## [33] "Tejeda"                       "Telde"                       
## [35] "Teror"                        "Valsequillo"                 
## [37] "Valleseco"                    "Vega de San Mateo"           
## [39] "TENERIFE"                     "Adeje"                       
## [41] "Arafo"                        "Arico"                       
## [43] "Arona"                        "Buenavista del Norte"        
## [45] "Candelaria"                   "Fasnia"                      
## [47] "Garachico"                    "Granadilla de Abona"         
## [49] "Guancha (La)"                 "Guía de Isora"               
## [51] "Güimar"                       "Icod de Los Vinos"           
## [53] "Laguna (La)"                  "Matanza de Acentejo (La)"    
## [55] "Orotava (La)"                 "Puerto de La Cruz"           
## [57] "Realejos (Los)"               "Rosario (El)"                
## [59] "San Juan de La Rambla"        "San Miguel"                  
## [61] "Santa Cruz de Tenerife"       "Santa Úrsula"                
## [63] "Santiago del Teide"           "Sauzal (El)"                 
## [65] "Silos (Los)"                  "Tacoronte"                   
## [67] "Tanque (El)"                  "Tegueste"                    
## [69] "Victoria de Acentejo (La)"    "Vilaflor"                    
## [71] "LA GOMERA"                    "Agulo"                       
## [73] "Alajeró"                      "Hermigua"                    
## [75] "San Sebastián de La Gomera"   "Valle Gran Rey"              
## [77] "Vallehermoso"                 "LA PALMA"                    
## [79] "Barlovento"                   "Breña Alta"                  
## [81] "Breña Baja"                   "Fuencaliente"                
## [83] "Garafía"                      "Llanos de Aridane (Los)"     
## [85] "Paso (El)"                    "Puntagorda"                  
## [87] "Puntallana"                   "San Andrés y Sauces"         
## [89] "Santa Cruz de La Palma"       "Tazacorte"                   
## [91] "Tijarafe"                     "Villa de Mazo"               
## [93] "EL HIERRO"                    "Frontera"                    
## [95] "Valverde"                     "Pinar de El Hierro (El)"

We conclude that this table gives us information about the population of the municipalities of the Canary Islands between the years 2000 and 2021, including both sexes and adding all ages. No identifying code of the municipality appears.

The INE and ISTAD databases are obtained from a query made on their web server, and it is the query that generates the file with the data. We can edit the file in text mode to see some metadata about the table that is lost when loading it in a tibble. If reading the PC-Axis files fails, or generates problems, we can always download the query in another file format, for example an Excel sheet, and try to load the data from the Excel sheet.

Text format

Sometimes, when the data is provided by a text file, it is not organized as a table and reading the file using the functions previously seen can give rise to errors. In that case we have to correct and update the file before loading it as a table. This can be done by opening the file with a text editor and manipulating it manually. An alternative, and automatic, way to do it from R is to use the read_lines function that loads the text file as a vector containing the lines of the file and does not assume that it has a table format with identifiable variables and records. We can edit this vector and write it back updated using the write_lines function.

UN_population2 <- read_lines("https://ctim.es/AEDV/data/UN_population.csv")
# remove the first line
UN_population2 <- UN_population2[-1]
# replace the string "Region/Country/Area" with "location"
UN_population2 <- gsub('Region/Country/Area','location',UN_population2)
# we eliminate all the lines that contain the string "Total"
UN_population2 <- UN_population2[str_detect(UN_population2,"Total")==FALSE]
# write the updated file
write_lines(UN_population2,"UN_population2.csv")

ZIP format

The readl library allows us to manage ZIP compressed files stored locally (not hosted on a server). Let’s look at an example with data from the file “data/UNCTAD.zip” obtained from UNCTAD, an organization linked to the United Nations dedicated to Trade and Development. First let’s see what documents this zip file includes:

unzip("../data/UNCTAD.zip", list = TRUE)
##                                                                                        Name
## 1                                                  US_GovExpenditures_ST202306281615_v1.csv
## 2 Glossary_Classification of the functions of government (COFOG) - Statistics Explained.pdf
## 3                                              UNCTADstat - Productive Capacities Index.pdf
## 4                                                                      WordlOfDebt_data.csv
## 5                                                           Un mundo endeudado _ UNCTAD.pdf
## 6                                                                 US_PCI_20230704063200.csv
##    Length                Date
## 1 1890332 2023-09-18 11:25:00
## 2  670903 2023-09-01 10:56:00
## 3 2446315 2023-09-01 10:50:00
## 4 4248979 2023-09-01 11:19:00
## 5  659704 2023-09-01 11:25:00
## 6 3717173 2023-07-18 17:38:00

We note that this zip file includes 3 CSV files with data and 3 explanatory PDF documents. Let’s open one of the data files:

WordlOfDebt_data <- read_csv(unzip("../data/UNCTAD.zip", "WordlOfDebt_data.csv")) %>%
   as_tibble()
str(WordlOfDebt_data)
## tibble [28,455 × 10] (S3: tbl_df/tbl/data.frame)
##  $ ID                     : chr [1:28455] "AF" "AL" "DZ" "AO" ...
##  $ Name                   : chr [1:28455] "Afghanistan" "Albania" "Algeria" "Angola" ...
##  $ Region                 : chr [1:28455] "Developing Asia and Oceania" "Europe and Central Asia*" "Africa" "Africa" ...
##  $ World Bank income group: chr [1:28455] "Low income countries" "Upper middle income countries" "Lower middle income countries" "Lower middle income countries" ...
##  $ Development status     : chr [1:28455] "Developing countries" "Developed countries" "Developing countries" "Developing countries" ...
##  $ LDC                    : num [1:28455] 1 0 0 1 0 0 0 0 0 0 ...
##  $ SIDS                   : num [1:28455] 0 0 0 0 1 0 0 0 0 0 ...
##  $ Indicator              : chr [1:28455] "Public debt as a share of GDP" "Public debt as a share of GDP" "Public debt as a share of GDP" "Public debt as a share of GDP" ...
##  $ Year                   : chr [1:28455] "2010" "2010" "2010" "2010" ...
##  $ Value                  : num [1:28455] 0.077 0.577 0.105 0.372 0.897 ...
levels(as.factor(WordlOfDebt_data$Region))
## [1] "Africa"                          "Developing Asia and Oceania"    
## [3] "Europe and Central Asia*"        "Latin America and the Caribbean"
levels(as.factor(WordlOfDebt_data$`World Bank income group`))
## [1] "High income countries"         "Low income countries"         
## [3] "Lower middle income countries" "Upper middle income countries"
levels(as.factor(WordlOfDebt_data$`Development status`))
## [1] "Developed countries"  "Developing countries"
levels(as.factor(WordlOfDebt_data$`Indicator`))
##  [1] "Bilateral creditors as a share of external public debt"     
##  [2] "External public debt as a share of GDP"                     
##  [3] "External public debt in US$ billions"                       
##  [4] "External public debt in US$ per capita"                     
##  [5] "Multilateral creditors as a share of external public debt"  
##  [6] "Private creditors as a share of external public debt"       
##  [7] "Public debt as a share of GDP"                              
##  [8] "Public debt in US$ billions"                                
##  [9] "Public debt in US$ per capita"                              
## [10] "Public debt interest payments as a share of GDP"            
## [11] "Public debt interest payments as a share of revenues"       
## [12] "Public debt interest payments in US$ per capita"            
## [13] "Public education expenditure as a share of GDP"             
## [14] "Public health expenditure as share of GDP"                  
## [15] "Public investment expenditure as a share of GDP"            
## [16] "Ratio of public interest payments to education expenditure" 
## [17] "Ratio of public interest payments to health expenditure"    
## [18] "Ratio of public interest payments to investment expenditure"
levels(as.factor(WordlOfDebt_data$`Year`))
##  [1] "2010"      "2010-2012" "2011"      "2012"      "2013"      "2014"     
##  [7] "2014-2016" "2015"      "2016"      "2017"      "2018"      "2019"     
## [13] "2019-2021" "2020"      "2021"      "2022"

We conclude that this table provides information by country and year (sometimes grouped) on some indicators on their debt. Additionally, countries are classified by their income and level of development.

3.3 Data transformation

The dplyr package

The dplyr package, found within the tidyverse library, is commonly used to manage/transform data. As a summary of this section it may be useful to look at the following summary sheet. We will see, below, the most important functionalities of this package.

filter()

Filters table rows based on conditions. For example, with the following instruction we filter the countries in Europe with a population greater than 40 million inhabitants

owid_country %>%
   filter(continent=="Europe" & population>4e7)
## # A tibble: 6 × 17
##   iso_code location       continent population median_age life_expectancy
##   <chr>    <chr>          <chr>          <dbl>      <dbl>           <dbl>
## 1 DEU      Germany        Europe      83369840       46.6            81.3
## 2 ESP      Spain          Europe      47558632       45.5            83.6
## 3 FRA      France         Europe      67813000       42              82.7
## 4 GBR      United Kingdom Europe      67508936       40.8            81.3
## 5 ITA      Italy          Europe      59037472       47.9            83.5
## 6 RUS      Russia         Europe     144713312       39.6            72.6
## # ℹ 11 more variables: aged_65_older <dbl>, aged_70_older <dbl>,
## #   gdp_per_capit <dbl>, extreme_poverty <dbl>, cardiovasc_death_rat <dbl>,
## #   diabetes_prevalence <dbl>, female_smokers <dbl>, male_smokers <dbl>,
## #   handwashing_facilities <dbl>, hospital_beds_per_thousand <dbl>,
## #   human_development_index <dbl>

With the str_detect function we can filter the table based on the content of a string variable. For example, below we filter the countries in the world whose name contains the string “ma” (case-insensitive)

owid_country %>%
   filter(str_detect(location, regex("ma", ignore_case = TRUE)))
## # A tibble: 29 × 17
##    iso_code location             continent population median_age life_expectancy
##    <chr>    <chr>                <chr>          <dbl>      <dbl>           <dbl>
##  1 BHS      Bahamas              North Am…     409989       34.3            73.9
##  2 CYM      Cayman Islands       North Am…      68722       NA              83.9
##  3 DEU      Germany              Europe      83369840       46.6            81.3
##  4 DNK      Denmark              Europe       5882259       42.3            80.9
##  5 GTM      Guatemala            North Am…   17843914       22.9            74.3
##  6 IMN      Isle of Man          Europe         84534       NA              81.4
##  7 JAM      Jamaica              North Am…    2827382       31.4            74.5
##  8 MAC      Macao                Asia          695180       39.2            84.2
##  9 MAF      Saint Martin (Frenc… North Am…      31816       NA              82.1
## 10 MDG      Madagascar           Africa      29611718       19.6            67.0
## # ℹ 19 more rows
## # ℹ 11 more variables: aged_65_older <dbl>, aged_70_older <dbl>,
## #   gdp_per_capit <dbl>, extreme_poverty <dbl>, cardiovasc_death_rat <dbl>,
## #   diabetes_prevalence <dbl>, female_smokers <dbl>, male_smokers <dbl>,
## #   handwashing_facilities <dbl>, hospital_beds_per_thousand <dbl>,
## #   human_development_index <dbl>

select()

Select some variables from the table. Fields can be selected individually or by groups using the : symbol.

owid_country %>%
   filter(continent=="Europe" & population>4e7) %>%
   arrange(desc(population)) %>%
   select(location,life_expectancy:gdp_per_capit)
## # A tibble: 6 × 5
##   location       life_expectancy aged_65_older aged_70_older gdp_per_capit
##   <chr>                    <dbl>         <dbl>         <dbl>         <dbl>
## 1 Russia                    72.6          14.2          9.39        24766.
## 2 Germany                   81.3          21.5         16.0         45229.
## 3 France                    82.7          19.7         13.1         38606.
## 4 United Kingdom            81.3          18.5         12.5         39753.
## 5 Italy                     83.5          23.0         16.2         35220.
## 6 Spain                     83.6          19.4         13.8         34272.

select is also used to remove variables from the table by putting in front of the variable the - sign

owid_country %>%
   filter(continent=="Europe" & population>4e7) %>%
   arrange(desc(population)) %>%
   select(-iso_code,-continent)
## # A tibble: 6 × 15
##   location     population median_age life_expectancy aged_65_older aged_70_older
##   <chr>             <dbl>      <dbl>           <dbl>         <dbl>         <dbl>
## 1 Russia        144713312       39.6            72.6          14.2          9.39
## 2 Germany        83369840       46.6            81.3          21.5         16.0 
## 3 France         67813000       42              82.7          19.7         13.1 
## 4 United King…   67508936       40.8            81.3          18.5         12.5 
## 5 Italy          59037472       47.9            83.5          23.0         16.2 
## 6 Spain          47558632       45.5            83.6          19.4         13.8 
## # ℹ 9 more variables: gdp_per_capit <dbl>, extreme_poverty <dbl>,
## #   cardiovasc_death_rat <dbl>, diabetes_prevalence <dbl>,
## #   female_smokers <dbl>, male_smokers <dbl>, handwashing_facilities <dbl>,
## #   hospital_beds_per_thousand <dbl>, human_development_index <dbl>

arrange()

Sort the table based on one or more variables

owid_country %>%
   filter(continent=="Europe" & population>4e7) %>%
   arrange(desc(population)) # desc() is used to sort in descending order.
## # A tibble: 6 × 17
##   iso_code location       continent population median_age life_expectancy
##   <chr>    <chr>          <chr>          <dbl>      <dbl>           <dbl>
## 1 RUS      Russia         Europe     144713312       39.6            72.6
## 2 DEU      Germany        Europe      83369840       46.6            81.3
## 3 FRA      France         Europe      67813000       42              82.7
## 4 GBR      United Kingdom Europe      67508936       40.8            81.3
## 5 ITA      Italy          Europe      59037472       47.9            83.5
## 6 ESP      Spain          Europe      47558632       45.5            83.6
## # ℹ 11 more variables: aged_65_older <dbl>, aged_70_older <dbl>,
## #   gdp_per_capit <dbl>, extreme_poverty <dbl>, cardiovasc_death_rat <dbl>,
## #   diabetes_prevalence <dbl>, female_smokers <dbl>, male_smokers <dbl>,
## #   handwashing_facilities <dbl>, hospital_beds_per_thousand <dbl>,
## #   human_development_index <dbl>

mutate()

Calculates new variables or modifies existing one:

owid_country %>%
   mutate( gdp = gdp_per_capit * population) %>%
   select(location,population,gdp) %>%
   arrange(desc(gdp)) %>%
   head(10) %>%
   knitr::kable( caption = "Countries sorted by GDP", format = "html", table.attr = "style='width:60%;'")
Table 3.1: Countries sorted by GDP
location population gdp
China 1425887360 2.182850e+13
United States 338289856 1.834392e+13
India 1417173120 9.107710e+12
Japan 123951696 4.834392e+12
Germany 83369840 3.770755e+12
Russia 144713312 3.583963e+12
Indonesia 275501344 3.082514e+12
Brazil 215313504 3.036664e+12
United Kingdom 67508936 2.683699e+12
France 67813000 2.617966e+12

relocate()

Change the order of the columns

owid_country %>%
   relocate(location) %>% # puts location first
   relocate(continent,.after=location) %>%
   relocate(life_expectancy,.after=continent) %>%
   print(n=5)
## # A tibble: 237 × 17
##   location    continent     life_expectancy iso_code population median_age
##   <chr>       <chr>                   <dbl> <chr>         <dbl>      <dbl>
## 1 Aruba       North America            76.3 ABW          106459       41.2
## 2 Afghanistan Asia                     64.8 AFG        41128772       18.6
## 3 Angola      Africa                   61.2 AGO        35588996       16.8
## 4 Anguilla    North America            81.9 AIA           15877       NA  
## 5 Albania     Europe                   78.6 ALB         2842318       38  
## # ℹ 232 more rows
## # ℹ 11 more variables: aged_65_older <dbl>, aged_70_older <dbl>,
## #   gdp_per_capit <dbl>, extreme_poverty <dbl>, cardiovasc_death_rat <dbl>,
## #   diabetes_prevalence <dbl>, female_smokers <dbl>, male_smokers <dbl>,
## #   handwashing_facilities <dbl>, hospital_beds_per_thousand <dbl>,
## #   human_development_index <dbl>

rename()

Change the name of the variables

owid_country %>%
   filter(continent=="Europe" & population>4e7) %>%
   arrange(desc(population)) %>%
   select(location,population,median_age) %>%
   rename(country = location)%>%
   knitr::kable( caption = "European countries sorted by population", format = "html", table.attr = "style='width:60%;'")
Table 3.2: European countries sorted by population
country population median_age
Russia 144713312 39.6
Germany 83369840 46.6
France 67813000 42.0
United Kingdom 67508936 40.8
Italy 59037472 47.9
Spain 47558632 45.5

str_replace

This function replaces one string with another within a vector of strings

owid_country%>%
   mutate(location = str_replace(location,"United States","USA")) %>%
   tail(20)
## # A tibble: 20 × 17
##    iso_code location             continent population median_age life_expectancy
##    <chr>    <chr>                <chr>          <dbl>      <dbl>           <dbl>
##  1 TWN      Taiwan               Asia        23893396       42.2            80.5
##  2 TZA      Tanzania             Africa      65497752       17.7            65.5
##  3 UGA      Uganda               Africa      47249588       16.4            63.4
##  4 UKR      Ukraine              Europe      39701744       41.4            72.1
##  5 URY      Uruguay              South Am…    3422796       35.6            77.9
##  6 USA      USA                  North Am…  338289856       38.3            78.9
##  7 UZB      Uzbekistan           Asia        34627648       28.2            71.7
##  8 VAT      Vatican              Europe           808       NA              75.1
##  9 VCT      Saint Vincent and t… North Am…     103959       31.8            72.5
## 10 VEN      Venezuela            South Am…   28301700       29              72.1
## 11 VGB      British Virgin Isla… North Am…      31332       NA              79.1
## 12 VIR      USA Virgin Islands   North Am…      99479       42.2            80.6
## 13 VNM      Vietnam              Asia        98186856       32.6            75.4
## 14 VUT      Vanuatu              Oceania       326744       23.1            70.5
## 15 WLF      Wallis and Futuna    Oceania        11596       NA              79.9
## 16 WSM      Samoa                Oceania       222390       22              73.3
## 17 YEM      Yemen                Asia        33696612       20.3            66.1
## 18 ZAF      South Africa         Africa      59893884       27.3            64.1
## 19 ZMB      Zambia               Africa      20017670       17.7            63.9
## 20 ZWE      Zimbabwe             Africa      16320539       19.6            61.5
## # ℹ 11 more variables: aged_65_older <dbl>, aged_70_older <dbl>,
## #   gdp_per_capit <dbl>, extreme_poverty <dbl>, cardiovasc_death_rat <dbl>,
## #   diabetes_prevalence <dbl>, female_smokers <dbl>, male_smokers <dbl>,
## #   handwashing_facilities <dbl>, hospital_beds_per_thousand <dbl>,
## #   human_development_index <dbl>

To change a value in the entire tibble we can do:

owid_country[owid_country=="United States"] <- "USA"
owid_country%>%
   select(iso_code,location)%>%
   filter(location=="USA")
## # A tibble: 1 × 2
##   iso_code location
##   <chr>    <chr>   
## 1 USA      USA

group_by() and summarise()

group_by groups the records by one or more variables and later summarise allows us to perform operations on the groupings obtained. The variable on which the grouping is performed always appears in the result of summarise. In the following example, GDP per inhabitant is calculated by grouping it by continent. That is, the data for each country is associated with the group that shares the same continent.

owid_country %>%
   group_by(continent) %>%
   summarise(
     populationContinent=sum(population),
     gdp_per_capita=sum(na.omit(gdp_per_capit*population))/populationContinent
   ) %>%
   arrange(desc(gdp_per_capita)) %>%
   knitr::kable( caption = "GDP per capita", format = "html", table.attr = "style='width:40%;'")
Table 3.3: GDP per capita
continent populationContinent gdp_per_capita
North America 600323657 38468.345
Europe 745629155 32792.373
Oceania 45038907 31272.576
South America 436816679 14519.141
Asia 4721455390 12148.167
Africa 1426736614 4544.862

In this case the sum function has been used so that the result is the sum of the values of the countries of each continent, other functions can be used such as min, max, mean, median, sd, first, last, prod or n (number of values in the vector).

Combining data from multiple tables

Frequently you have to combine information from several tables. The join type we will use here is the “left_join” which keeps all the records in the first table joining them with those in the second table. As a first example, we are going to combine two OWID tables using iso_code as a common variable, which is a 3-letter code that identifies each country. To do this we will use the left_join function from the dplyr library.

owid_co2 <- read_csv(unzip("../data/owid-co2.zip", "owid-co2-data.csv")) %>%
   as_tibble()
left_join(owid_country,owid_co2,by = "iso_code")
## # A tibble: 41,294 × 95
##    iso_code location continent     population.x median_age life_expectancy
##    <chr>    <chr>    <chr>                <dbl>      <dbl>           <dbl>
##  1 ABW      Aruba    North America       106459       41.2            76.3
##  2 ABW      Aruba    North America       106459       41.2            76.3
##  3 ABW      Aruba    North America       106459       41.2            76.3
##  4 ABW      Aruba    North America       106459       41.2            76.3
##  5 ABW      Aruba    North America       106459       41.2            76.3
##  6 ABW      Aruba    North America       106459       41.2            76.3
##  7 ABW      Aruba    North America       106459       41.2            76.3
##  8 ABW      Aruba    North America       106459       41.2            76.3
##  9 ABW      Aruba    North America       106459       41.2            76.3
## 10 ABW      Aruba    North America       106459       41.2            76.3
## # ℹ 41,284 more rows
## # ℹ 89 more variables: aged_65_older <dbl>, aged_70_older <dbl>,
## #   gdp_per_capit <dbl>, extreme_poverty <dbl>, cardiovasc_death_rat <dbl>,
## #   diabetes_prevalence <dbl>, female_smokers <dbl>, male_smokers <dbl>,
## #   handwashing_facilities <dbl>, hospital_beds_per_thousand <dbl>,
## #   human_development_index <dbl>, country <chr>, year <dbl>,
## #   population.y <dbl>, gdp <dbl>, cement_co2 <dbl>, …

The statement by = "iso_code" is equivalent to join_by(iso_code == iso_code). The second has the advantage that we could use a common variable with different names in both tables.

The left_join function only works when the common variable in both tables is identical, if the common variable is a string (for example the name of a municipality) there may be small differences when writing the municipality in both tables which makes the comparison fail. Ideally, in all tables, each record should be accompanied of a unique and universal identification code. For example, the INE uses an identification code for each municipality, but unfortunately, this code does not accompany systematically to the tables that result from making a query in the INE or in the ISTAC related to municipalities. And what’s worse, the names of the municipalities can change slightly from one query to another, and therefore using the name of the municipality as an identifier can cause problems. To resolve these problems, the first thing to do when reading a INE or ISTAC table is to try to associate the INE universal identification code from the name of the municipality (if we are managing municipalities ). We will take the INE codes of the municipalities from the following table that we filter to only be left with the Canarian municipalities.

uclm_mun <- read.xlsx("https://ctim.es/AEDV/data/uclm-list-mun-2012.xlsx",sheet=1) %>%
   as_tibble() %>% filter(CA=="05")
uclm_mun %>%
   str()
## tibble [88 × 11] (S3: tbl_df/tbl/data.frame)
##  $ codine          : chr [1:88] "35001" "35002" "35003" "35004" ...
##  $ Municipio       : chr [1:88] "Agaete" "Agüimes" "Antigua" "Arrecife" ...
##  $ Superficie      : num [1:88] 45.5 79.3 250.6 22.7 66.7 ...
##  $ Capitalidad     : chr [1:88] "Agaete" "Agüimes" "Antigua" "Arrecife" ...
##  $ Año             : num [1:88] NA NA NA NA NA NA NA NA NA NA ...
##  $ CA              : chr [1:88] "05" "05" "05" "05" ...
##  $ Autonomía       : chr [1:88] "Canarias" "Canarias" "Canarias" "Canarias" ...
##  $ CP              : chr [1:88] "35" "35" "35" "35" ...
##  $ Provincia       : chr [1:88] "Palmas (Las)" "Palmas (Las)" "Palmas (Las)" "Palmas (Las)" ...
##  $ CPJ             : chr [1:88] "3504" "3508" "3503" "3501" ...
##  $ Partido_Judicial: chr [1:88] "Santa María de Guía de Gran Canaria" "Santa Lucía de Tirajana" "Puerto del Rosario" "Arrecife" ...

Next we are going to read a table from ISTAC on municipalities. We have modified the original table by adding variables with the names of some municipalities slightly changed to have several options when comparing the names: :

istac_municipios <- read.xlsx("https://ctim.es/AEDV/data/istac_municipios.xlsx") %>%
   as_tibble()
istac_municipios %>%
   filter(is.na(nombre3)==FALSE) %>%
   print(n=10)
## # A tibble: 19 × 5
##    nombre                       nombre2                 nombre3 isla  superficie
##    <chr>                        <chr>                   <chr>   <chr>      <dbl>
##  1 Aldea de San Nicolás (La)    La Aldea de San Nicolás La Ald… Gran…     124.  
##  2 Palmas de Gran Canaria (Las) Las Palmas de Gran Can… Las Pa… Gran…     101.  
##  3 Llanos de Aridane (Los)      Los Llanos de Aridane   Los Ll… La P…      35.8 
##  4 Puerto de la Cruz            <NA>                    P. de … Tene…       8.73
##  5 Puerto del Rosario           <NA>                    P. del… Fuer…     290.  
##  6 San Andrés y Sauces          <NA>                    S. And… La P…      42.8 
##  7 San Bartolomé                <NA>                    S. Bar… Lanz…      40.9 
##  8 San Bartolomé de Tirajana    San Bartolomé de Tiraj… S. B. … Gran…     333.  
##  9 San Cristóbal de La Laguna   <NA>                    La Lag… Tene…     102.  
## 10 San Juan de la Rambla        <NA>                    S.J. d… Tene…      20.7 
## # ℹ 9 more rows

Since the istac_municipios table does not have the code INE, we are going to add it. To do this, first of all, we are going to check if through the names of the municipalities in both tables it is possible to find a “matching” without errors between the two tables. For this we are going to use the LeftJoinNearestString function implemented by us in the file utilidades.R which makes a “left join” between two tibbles of strings u and v that correspond to the possible names that we are going to compare in the 2 tables using the adist function. This function returns a tibble with the following variables:

  • pos: vector with the index of the record in table v that has the smallest distance from that of table u

  • value1: the string within table u that most closely resembles the string in table v

  • value2: the string within table v that most closely resembles the one in table u

  • dis : the distance between the stringsvalue1andvalue2`

This function has as additional parameters, the parameters ignore.case and partial which by default are assigned as ignore.case=FALSE and partial=FALSE, but which can be modified by adding them in the function call.

Note that this function does not generate the combined tables, but with the pos variable we can match the records from one table to another. Therefore, based on this information we can combine the tables as we wish. The `LeftJoinNearestString function has a high computational cost and is only useful for relatively small databases. If applied to tables with many records it can take a long time.

It must be taken into account that the comparison may fail and therefore it is advisable to verify and correct the results. Next we do the “matching” between the names of the municipalities in both tables and print the results of those with a distance greater than zero.

source("utilidades.R")
res <- LeftJoinNearestString(istac_municipios%>%select(nombre:nombre3),uclm_mun%>%select(Municipio))

res%>%
   arrange(desc(dis)) %>%
   filter(dis>0) %>%
   print(n=100)
## # A tibble: 15 × 4
##      pos value1                       value2                        dis
##    <int> <chr>                        <chr>                       <dbl>
##  1    88 Pinar del Hierro (El)        Pinar de El Hierro, El          5
##  2    61 Paso (El)                    Paso, El                        3
##  3    66 Rosario (El)                 Rosario, El                     3
##  4    75 Sauzal (El)                  Sauzal, El                      3
##  5    78 Tanque (El)                  Tanque, El                      3
##  6    20 Aldea de San Nicolás (La)    Aldea de San Nicolás, La        3
##  7    52 Guancha (La)                 Guancha, La                     3
##  8    59 Matanza de Acentejo (La)     Matanza de Acentejo, La         3
##  9    14 Oliva (La)                   Oliva, La                       3
## 10    60 Orotava (La)                 Orotava, La                     3
## 11    85 Victoria de Acentejo (La)    Victoria de Acentejo, La        3
## 12    16 Palmas de Gran Canaria (Las) Palmas de Gran Canaria, Las     3
## 13    58 Llanos de Aridane (Los)      Llanos de Aridane, Los          3
## 14    65 Realejos (Los)               Realejos, Los                   3
## 15    76 Silos (Los)                  Silos, Los                      3

We observe, inspecting the result, that the matching has no errors and we proceed to add a variable to istac_municipios with the code of each municipality and save the new table on disk. This way, in the future we could directly use this new table with the INE code.

ine_code <- integer(length(istac_municipios$nombre))
for(i in 1:length(ine_code) ){
   ine_code[i] <- uclm_mun$codine[res$pos[i]]
}
istac_municipios$ine_code <- ine_code
istac_municipios %>%
   relocate(ine_code) %>%
   write.xlsx(file="istac_municipios_ine_codes.xlsx")

Now we are going to try to do the same for the istac_population table that we previously filtered to keep the data for the year 2021

istac_population2021 <- istac_population %>%
   filter(Años==2021) %>%
   mutate(Municipalities.by.islands=as.character(Municipios.por.islas))
istac_population2021
## # A tibble: 96 × 6
##    Sexos       Años  Edades.año.a.año Municipios.por.islas   value
##    <fct>       <fct> <fct>            <fct>                  <dbl>
##  1 AMBOS SEXOS 2021  TOTAL            CANARIAS             2172944
##  2 AMBOS SEXOS 2021  TOTAL            LANZAROTE             156189
##  3 AMBOS SEXOS 2021  TOTAL            Arrecife               64497
##  4 AMBOS SEXOS 2021  TOTAL            Haría                   5365
##  5 AMBOS SEXOS 2021  TOTAL            San Bartolomé          19058
##  6 AMBOS SEXOS 2021  TOTAL            Teguise                23044
##  7 AMBOS SEXOS 2021  TOTAL            Tías                   20801
##  8 AMBOS SEXOS 2021  TOTAL            Tinajo                  6447
##  9 AMBOS SEXOS 2021  TOTAL            Yaiza                  16977
## 10 AMBOS SEXOS 2021  TOTAL            FUERTEVENTURA         119662
## # ℹ 86 more rows
## # ℹ 1 more variable: Municipalities.by.islands <chr>
res <- LeftJoinNearestString(istac_municipios%>%select(nombre:nombre3),istac_population2021%>%select(Municipios.por.islas))

res%>%
   arrange(desc(dis)) %>%
   filter(dis>0) %>%
   print(n=100)
## # A tibble: 88 × 4
##      pos value1              value2   dis
##    <int> <chr>               <chr>  <dbl>
##  1     1 Santa María de Guía 1         19
##  2     1 Las Palmas de G.C.  1         18
##  3     1 S. Andrés y Sauces  1         18
##  4     1 S. B. de Tirajana   1         17
##  5     1 S.J. de la Rambla   1         17
##  6     1 S.S. de la Gomera   1         17
##  7     1 S.C. de la Palma    1         16
##  8     1 S.C. de Tenerife    1         16
##  9     1 P. del Rosario      1         14
## 10     1 Valle Gran Rey      1         14
## 11     1 Guía de Isora       1         13
## 12     1 P. de la Cruz       1         13
## 13     1 Fuencaliente        1         12
## 14     1 Los Realejos        1         12
## 15     1 S. Bartolomé        1         12
## 16     1 S. del Teide        1         12
## 17     1 Vallehermoso        1         12
## 18     1 La Victoria         1         11
## 19     1 Santa Lucía         1         11
## 20     1 Valsequillo         1         11
## 21     1 Barlovento          1         10
## 22     1 Betancuria          1         10
## 23     1 Breña Alta          1         10
## 24     1 Breña Baja          1         10
## 25     1 Buenavista          1         10
## 26     1 Candelaria          1         10
## 27     1 El Rosario          1         10
## 28     1 Granadilla          1         10
## 29     1 La Guancha          1         10
## 30     1 La Matanza          1         10
## 31     1 La Orotava          1         10
## 32     1 Los Llanos          1         10
## 33     1 Puntagorda          1         10
## 34     1 Puntallana          1         10
## 35     1 San Miguel          1         10
## 36     1 S. Brígida          1         10
## 37     1 El Sauzal           1          9
## 38     1 El Tanque           1          9
## 39     1 Garachico           1          9
## 40     1 Los Silos           1          9
## 41     1 La Laguna           1          9
## 42     1 S. Úrsula           1          9
## 43     1 Tacoronte           1          9
## 44     1 Tazacorte           1          9
## 45     1 Valleseco           1          9
## 46     1 San Mateo           1          9
## 47     1 Arrecife            1          8
## 48     1 Artenara            1          8
## 49     1 Frontera            1          8
## 50     1 Hermigua            1          8
## 51     1 La Aldea            1          8
## 52     1 La Oliva            1          8
## 53     1 El Pinar            1          8
## 54     1 Tegueste            1          8
## 55     1 Tijarafe            1          8
## 56     1 Valverde            1          8
## 57     1 Vilaflor            1          8
## 58     1 Agüimes             1          7
## 59     1 Alajeró             1          7
## 60     1 Antigua             1          7
## 61     1 El Paso             1          7
## 62     1 Garafía             1          7
## 63     1 Ingenio             1          7
## 64     1 Teguise             1          7
## 65     1 Tuineje             1          7
## 66     1 Agaete              1          6
## 67     1 Arucas              1          6
## 68     1 Fasnia              1          6
## 69     1 Firgas              1          6
## 70     1 Gáldar              1          6
## 71     1 Güímar              1          6
## 72     1 Pájara              1          6
## 73     1 Tejeda              1          6
## 74     1 Tinajo              1          6
## 75     1 Adeje               1          5
## 76     1 Agulo               1          5
## 77     1 Arafo               1          5
## 78     1 Arico               1          5
## 79     1 Arona               1          5
## 80     1 Haría               1          5
## 81     1 Mogán               1          5
## 82     1 Telde               1          5
## 83     1 Teror               1          5
## 84     1 Yaiza               1          5
## 85     1 Icos                1          4
## 86     1 Moya                1          4
## 87     1 Tías                1          4
## 88     1 Mazo                1          4

In this case, we observe that in the municipality of “La Laguna” the matching fails. A simple way to solve this is to edit by hand the file where the istac_municipios table is stored and add as a possible name of the municipality the name with which it appears in the other table and re-launch the R script to check if the problem has been resolved.

Tidy data

The same information can be organized in many different ways. However, good organization requires meeting certain criteria for the orderly organization of tidy data. These criteria are essentially:

  • Each variable (or field) you measure must correspond to a single column. That is, instances of the same variable cannot be assigned to different columns.

  • Each row can only contain data from a single observation. In other word, in each row you can have several categorical variables (including dates) but a single observation value.

  • There must be a different table for each “class” of variable. That is, heterogeneous data cannot be combined in the same table.

  • If you have multiple tables, there must be a column in each table that allows them to be linked.

Let’s illustrate this concept with an example of a tibble of some population data (in millions) from Spain, France and Germany

data <- tibble(
   year=c(1985,2000,2015),
   Spain=c(38.734,40.75,46.122),
   France=c(55.38,59.387,64.395),
   Germany=c(77.57,81.896,80.689))
data%>%
   knitr::kable( caption = "Unsorted data table", format = "html", table.attr = "style='width:50%;'")
Table 3.4: Unsorted data table
year Spain France Germany
1985 38.734 55.380 77.570
2000 40.750 59.387 81.896
2015 46.122 64.395 80.689

This way of organizing a table is frequently used because it is a natural way we would write the data if we were doing it by hand. However, it does not meet the “tidy data” criteria because the variables with the country names are instances of the country variable, and therefore the table columns should be year, country and population. The pivot_longer function automatically converts the previous tibble into a “tidy” tibble by removing columns and adding rows:

data %>%
   pivot_longer(c(`Spain`, `France`, `Germany`), names_to = "country", values_to = "population")%>%
   knitr::kable( caption = "Ordered data table", format = "html", table.attr = "style='width:30%;'")
Table 3.5: Ordered data table
year country population
1985 Spain 38.734
1985 France 55.380
1985 Germany 77.570
2000 Spain 40.750
2000 France 59.387
2000 Germany 81.896
2015 Spain 46.122
2015 France 64.395
2015 Germany 80.689

An important advantage of tidy data is that there are many processes and functions designed to work with this information organization. In particular, data visualization processes require “tidy data.” Sometimes it is interesting to do the opposite operation to pivot_longer, which is pivot_wider that organizes the table by adding columns and removing rows. This allows you to separate the variables and manage them individually:

data %>%
   pivot_longer(c(`Spain`, `France`, `Germany`), names_to = "country", values_to = "population") %>%
   pivot_wider(names_from = country, values_from = population)
## # A tibble: 3 × 4
##    year Spain France Germany
##   <dbl> <dbl>  <dbl>   <dbl>
## 1  1985  38.7   55.4    77.6
## 2  2000  40.8   59.4    81.9
## 3  2015  46.1   64.4    80.7

We note that doing a pivot_longer and then a pivot_wider takes us to the original table. These operations allow us to manage the same table in the format that interests us at all times.

As a summary of this section it may be useful to look at the following summary sheet.

The organization of the public databases that we use in this course differs quite a bit between them. For example, the OWID tables contain many columns with different variables and each line of the table is an observation of all the variables for a country on a specific date. However, the UN tables contain a column that contains the name of the variable and a single column with the numerical value for said variable, in such a way that each line includes only the observation of one variable. The INE and ISTAC bases have a similar organization to that of UN.

Combining data from UN and OWID

OWID databases use a 3-letter ISO (International Organization for Standardization) code to identify countries. However, UN databases use their own encoding. To combine tables from the two databases we will use a table that provides the UN with the equivalences between different country codes that we have stored in the file “data/UN_code.xlsx”

UN_code <- read.xlsx("https://ctim.es/AEDV/data/UN_code.xlsx",sheet=1) %>%
   as_tibble()
UN_code %>%
   str()
## tibble [246 × 9] (S3: tbl_df/tbl/data.frame)
##  $ ctyCode             : num [1:246] 533 4 24 660 8 20 784 32 51 16 ...
##  $ ISO_A2              : chr [1:246] "AW" "AF" "AO" "AI" ...
##  $ ISO_A3              : chr [1:246] "ABW" "AFG" "AGO" "AIA" ...
##  $ cty.Name.English    : chr [1:246] "Aruba" "Afghanistan" "Angola" "Anguilla" ...
##  $ cty.Fullname.English: chr [1:246] "Aruba" "Afghanistan" "Angola" "Anguilla" ...
##  $ Cty.Abbreviation    : chr [1:246] "Aruba" "Afghanistan" "Angola" "Anguilla" ...
##  $ Cty.Comments        : chr [1:246] "NULL" "NULL" "NULL" "NULL" ...
##  $ Start.Valid.Year    : num [1:246] 1988 1962 1962 1981 1962 ...
##  $ End.Valid.Year      : num [1:246] 2061 2061 2061 2061 2061 ...

In this table, 3 codes appear for each country: a numerical code appears in the ctyCode variable, a 2-character code appears in the ISO_A2 variable, and a 3-character code appears in the ISO_A3 variable.

Next we read a table from UN, we change the name of some variables of interest, we select the variables and do a pivot_wider

UN_population <- as_tibble(read_csv("https://ctim.es/AEDV/data/UN_population.csv",skip=1))%>%
   rename(un_code=`Region/Country/Area`,country=...2) %>%
   select(un_code,country,Year,Series,Value) %>%
   pivot_wider(names_from = Series, values_from = Value)
UN_population %>%
   str()
## tibble [1,056 × 11] (S3: tbl_df/tbl/data.frame)
##  $ un_code                                             : num [1:1056] 1 1 1 1 2 2 2 2 15 15 ...
##  $ country                                             : chr [1:1056] "Total, all countries or areas" "Total, all countries or areas" "Total, all countries or areas" "Total, all countries or areas" ...
##  $ Year                                                : num [1:1056] 2010 2015 2020 2022 2010 ...
##  $ Population mid-year estimates (millions)            : num [1:1056] 6986 7427 7841 7975 1055 ...
##  $ Population mid-year estimates for males (millions)  : num [1:1056] 3514 3737 3944 4009 526 ...
##  $ Population mid-year estimates for females (millions): num [1:1056] 3471 3689 3897 3967 529 ...
##  $ Sex ratio (males per 100 females)                   : num [1:1056] 101.2 101.3 101.2 101.1 99.3 ...
##  $ Population aged 0 to 14 years old (percentage)      : num [1:1056] 27.1 26.4 25.7 25.3 41.5 41.3 40.5 40.1 32.4 32.8 ...
##  $ Population aged 60+ years old (percentage)          : num [1:1056] 11.1 12.3 13.5 13.9 5 5.2 5.4 5.5 6.6 7.3 ...
##  $ Population density                                  : num [1:1056] 53.6 57 60.1 61.2 35.7 40.6 46 48.3 26.9 29.7 ...
##  $ Surface area (thousand km2)                         : num [1:1056] NA 136162 130094 NA NA ...

Now, using left_join we will join the tables and sort their variables a bit

owid_country %>%
   select(iso_code:population) %>%
   left_join(UN_code,join_by(iso_code==ISO_A3)) %>%
   left_join(UN_population,join_by (ctyCode==un_code)) %>%
   relocate(Year, .after=iso_code) %>%
   str()
## tibble [882 × 22] (S3: tbl_df/tbl/data.frame)
##  $ iso_code                                            : chr [1:882] "ABW" "ABW" "ABW" "ABW" ...
##  $ Year                                                : num [1:882] 2010 2015 2020 2022 2010 ...
##  $ location                                            : chr [1:882] "Aruba" "Aruba" "Aruba" "Aruba" ...
##  $ continent                                           : chr [1:882] "North America" "North America" "North America" "North America" ...
##  $ population                                          : num [1:882] 106459 106459 106459 106459 41128772 ...
##  $ ctyCode                                             : num [1:882] 533 533 533 533 4 4 4 4 24 24 ...
##  $ ISO_A2                                              : chr [1:882] "AW" "AW" "AW" "AW" ...
##  $ cty.Name.English                                    : chr [1:882] "Aruba" "Aruba" "Aruba" "Aruba" ...
##  $ cty.Fullname.English                                : chr [1:882] "Aruba" "Aruba" "Aruba" "Aruba" ...
##  $ Cty.Abbreviation                                    : chr [1:882] "Aruba" "Aruba" "Aruba" "Aruba" ...
##  $ Cty.Comments                                        : chr [1:882] "NULL" "NULL" "NULL" "NULL" ...
##  $ Start.Valid.Year                                    : num [1:882] 1988 1988 1988 1988 1962 ...
##  $ End.Valid.Year                                      : num [1:882] 2061 2061 2061 2061 2061 ...
##  $ country                                             : chr [1:882] "Aruba" "Aruba" "Aruba" "Aruba" ...
##  $ Population mid-year estimates (millions)            : num [1:882] 0.1 0.1 0.11 0.11 28.19 ...
##  $ Population mid-year estimates for males (millions)  : num [1:882] 0.05 0.05 0.05 0.05 14.24 ...
##  $ Population mid-year estimates for females (millions): num [1:882] 0.05 0.05 0.06 0.06 13.95 ...
##  $ Sex ratio (males per 100 females)                   : num [1:882] 91.2 90.2 89.1 89.2 102.1 ...
##  $ Population aged 0 to 14 years old (percentage)      : num [1:882] 20.7 18.8 17.4 16.2 49 45.8 43.8 43.1 45.7 45.8 ...
##  $ Population aged 60+ years old (percentage)          : num [1:882] 14.7 18.2 22.4 23.9 3.8 3.8 3.8 3.8 4.1 4.1 ...
##  $ Population density                                  : num [1:882] 557.5 579.2 592.1 591.4 43.4 ...
##  $ Surface area (thousand km2)                         : num [1:882] NA 0 0 NA NA ...

3.4 Initial exploratory analysis

Any data analysis must begin by understanding the content and organization of the data to be used. To do this, we will systematically address the following preliminary steps:

  1. Open the files with the data in text format or with Excel (if it is an Excel sheet) and analyze its content.
  2. Locate and understand the possible associated metadata. The metadata that, in general, accompany the data, provide us with very valuable information to understand the content of the data. This metadata may be in other files or included somewhere in the data file. Keep in mind that in this case, the metadata is lost when loading the data as a tibble and you have to look at it in the original data files. This happens, for example, in files in px format.
  3. Identify the types of the variables. If there is a variable with date information, identify the format. Study if it is necessary to coerce variables and transform the variable with date information.
  4. Identify if the data has a ‘tidy’ organization, that is, if each row includes a single observation accompanied by possible categorical variables (used to classify) or, on the contrary, each row contains multiple observations.
  5. Identify the range of categorical variables to understand the possible data classification items. For example, if the categorical variable is sex, identify what possible values this variable can take. To do this we can use the R functionality levels(as.factor(VariableName))
  6. Identify if there is an index variable that will allow us to combine the information from our data table with other tables.
  7. Analyze the absence of data ‘NA’ and identify how the absence of data is coded in the analyzed data. In the case of time series, which is data that changes over time, the absence of data may be due to missing intermediate dates. This in R can be detected using the lag function to compare a date variable with the same variable shifted by one value. Let’s look at an example:
date <-as.Date(c("2023-01-1","2023-01-2","2023-01-4","2023-01-5"))
date-lag(date)
## Time differences in days
## [1] NA  1  2  1

In this way, we can detect that dates are missing when the value of date-lag(date) is greater than 1. All this initial exploratory analysis is essential and must be done as a preliminary step to any subsequent data processing.

References

[Pe15] Roger D. Peng. R Programming for Data Science, Lulu, 2015.

[WiÇeGa23] Wickham, Hadley, Mine Çetinkaya-Rundel and Garrett Grolemund. R for Data Science (2e), O’Reilly Media, 2023.