3 Data processing
Installation/loading of used libraries
if (!require(openxlsx)) install.packages('openxlsx')
library(openxlsx)
if (!require(pxR)) install.packages('pxR')
library(pxR)
if (!require(tidyverse)) install.packages('tidyverse')
library(tidyverse)
Usually, we do not find data organized in the way we need for the purpose of our analysis. In fact, often the most complex job in our data exploration is locating, extracting, and understanding how the data is organized. Therefore, the first task of any data analysis is to process the data until we have it in the form and structure that we want. For this reason, we are going to dedicate some attention to data manipulation in this course.
First of all we will talk about the main databases that we will use in the course. As will be seen below, current databases on topics of general interest will mainly be used, which may be more attractive than the databases included by default in ‘R’ that are usually used in data analysis courses.
3.1 Databases
The main sources of information that we are going to use to illustrate the course contents are the following databases of general interest:
OWID: global database on current affairs
UNdata: database of the United Nations organization.
INE: database of the Spanish National Statistics Institute.
ISTAC : database of the Canarian Institute of Statistics
UNICEF : UNICEF database
World Health Organization. : World Health Organization database
Kaggle : Kaggle is a very popular data sharing platform. It has the advantage that it has a wide variety of all types of data, but caution must be taken because anyone can upload data and this can compromise its reliability. Kaggle assigns to each dataset a “Usability” value which can be interpreted as a data quality factor that includes the quality of the data sources. It is important to use data with a high value of this quality factor.
In this book we will also use some databases stored in a local data repository called data
. The files referenced in this book to this repository can be downloaded from https://ctim.es/AEDV/data/. These local databases have generally been extracted from the above databases and for convenience have been placed in a local repository.
In addition to these databases, R
has a large collection of directly accessible example databases. For example, the car information database mtcars
is widely used to illustrate the functionalities of R
.
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
We will not use these databases incorporated in R
in this course.
3.2 Data file formats
To analyze data, the first thing you have to do is access it, which is generally done by reading the data from a file where it is stored as a table. Below we will see the most common data file formats and how to read them using the corresponding libraries. The reading libraries we use allow us to read the data directly from a WEB server, which is an added value. As a summary of this section it may be useful to look at the following summary sheet.
CSV format
The CSV format consists of plain text, that is, it contains the table values without any other information whether they are dates, numbers, characters, etc. It can be edited with any text editor (such as NOTEPAD) and the variables are separated from each other by a comma and decimals are expressed with points. To read CSV files we will use the read_csv
function of the readl
library included in the tidyverse
library. If the variables in the table are separated by the ;
symbol and also the decimals of the numbers are expressed with commas, we will use the read function read_csv2
. If the delimiter between variables is neither a comma nor a period, we can use the read_delim
function that allows us to declare any delimiter between variables. A more detailed description of these functions appears in the aforementioned summary sheet.
Next we will read a CSV file with information on the world population from the United Nations (UN) database. As often happens, data tables require manipulation to correct errors or select the variables that interest us. In this case, the first line of the table contains comments and in the second line, which contains the variable names, there is a variable that comes without a name. In later sections of this chapter we will see how to manipulate/transform the data. This database can be read directly from the United Nations server using the script:
UN_population <- read_csv('https://data.un.org/_Docs/SYB/CSV/SYB65_1_202209_Population,%20Surface%20Area%20and%20Density.csv',skip = 1)%>%
as_tibble()
For convenience, we have saved a copy of the file in our local repository and read it from that source. We add skip=1
to the read instruction to ignore the first line of the table that contains a comment, once we read the table we use the str
function to analyze its structure:
UN_population <- read_csv("https://ctim.es/AEDV/data/UN_population.csv",skip = 1)%>%
as_tibble()
str(UN_population)
## tibble [7,873 × 7] (S3: tbl_df/tbl/data.frame)
## $ Region/Country/Area: num [1:7873] 1 1 1 1 1 1 1 1 1 1 ...
## $ ...2 : chr [1:7873] "Total, all countries or areas" "Total, all countries or areas" "Total, all countries or areas" "Total, all countries or areas" ...
## $ Year : num [1:7873] 2010 2010 2010 2010 2010 ...
## $ Series : chr [1:7873] "Population mid-year estimates (millions)" "Population mid-year estimates for males (millions)" "Population mid-year estimates for females (millions)" "Sex ratio (males per 100 females)" ...
## $ Value : num [1:7873] 6985.6 3514.4 3471.2 101.2 27.1 ...
## $ Footnotes : chr [1:7873] NA NA NA NA ...
## $ Source : chr [1:7873] "United Nations Population Division, New York, World Population Prospects: The 2022 Revision, last accessed July 2022." "United Nations Population Division, New York, World Population Prospects: The 2022 Revision, last accessed July 2022." "United Nations Population Division, New York, World Population Prospects: The 2022 Revision, last accessed July 2022." "United Nations Population Division, New York, World Population Prospects: The 2022 Revision; supplemented by da"| __truncated__ ...
The actual content of some of the variables is often not very clear at first glance. Therefore, in exploratory data analysis, the first step is to analyze the content of the tables in some detail. To analyze the range of possible values of the variables, especially in the case of categorical variables (factor type), we can see the content of some of the variables using levels
and taking the variable as factor
so that we get the range of values. the possible values without repeating. For numerical variables we can use the `summary’ function to get a first idea of their range of values. For this first table we present an exhaustive analysis of the variables. In general, in this course, in order not to make the presentation of the tables used too cumbersome, we will not analyze the tables used in such detail, but it is essential that students acquire the habit of analyzing the tables in detail before starting to work with them. .
## [1] "1" "2" "4" "5" "8" "9" "11" "12" "13" "14" "15" "16"
## [13] "17" "18" "19" "20" "21" "24" "28" "29" "30" "31" "32" "34"
## [25] "35" "36" "39" "40" "44" "48" "50" "51" "52" "53" "54" "56"
## [37] "57" "60" "61" "62" "64" "68" "70" "72" "76" "84" "90" "92"
## [49] "96" "100" "104" "108" "112" "116" "120" "124" "132" "136" "140" "142"
## [61] "143" "144" "145" "148" "150" "151" "152" "154" "155" "156" "158" "170"
## [73] "174" "175" "178" "180" "184" "188" "191" "192" "196" "202" "203" "204"
## [85] "208" "212" "214" "218" "222" "226" "231" "232" "233" "234" "238" "242"
## [97] "246" "250" "254" "258" "262" "266" "268" "270" "275" "276" "288" "292"
## [109] "296" "300" "304" "308" "312" "316" "320" "324" "328" "332" "336" "340"
## [121] "344" "348" "352" "356" "360" "364" "368" "372" "376" "380" "384" "388"
## [133] "392" "398" "400" "404" "408" "410" "414" "417" "418" "419" "422" "426"
## [145] "428" "430" "434" "438" "440" "442" "446" "450" "454" "458" "462" "466"
## [157] "470" "474" "478" "480" "484" "492" "496" "498" "499" "500" "504" "508"
## [169] "512" "516" "520" "524" "528" "531" "533" "534" "535" "540" "548" "554"
## [181] "558" "562" "566" "570" "578" "580" "583" "584" "585" "586" "591" "598"
## [193] "600" "604" "608" "616" "620" "624" "626" "630" "634" "638" "642" "643"
## [205] "646" "652" "654" "659" "660" "662" "663" "666" "670" "674" "678" "682"
## [217] "686" "688" "690" "694" "702" "703" "704" "705" "706" "710" "716" "724"
## [229] "728" "729" "732" "740" "748" "752" "756" "760" "762" "764" "768" "772"
## [241] "776" "780" "784" "788" "792" "795" "796" "798" "800" "804" "807" "818"
## [253] "826" "830" "833" "834" "840" "850" "854" "858" "860" "862" "876" "882"
## [265] "887" "894"
The variable Region/Country/Area
is a numerical code that identifies the country. This code allows us to combine the values in this table with other country data tables that use the same code. We can manage this variable as numeric
or as factor
, since it is used to classify which country the data of each record belongs to.
## [1] "1" "2" "4" "5" "8" "9" "11" "12" "13" "14" "15" "16"
## [13] "17" "18" "19" "20" "21" "24" "28" "29" "30" "31" "32" "34"
## [25] "35" "36" "39" "40" "44" "48" "50" "51" "52" "53" "54" "56"
## [37] "57" "60" "61" "62" "64" "68" "70" "72" "76" "84" "90" "92"
## [49] "96" "100" "104" "108" "112" "116" "120" "124" "132" "136" "140" "142"
## [61] "143" "144" "145" "148" "150" "151" "152" "154" "155" "156" "158" "170"
## [73] "174" "175" "178" "180" "184" "188" "191" "192" "196" "202" "203" "204"
## [85] "208" "212" "214" "218" "222" "226" "231" "232" "233" "234" "238" "242"
## [97] "246" "250" "254" "258" "262" "266" "268" "270" "275" "276" "288" "292"
## [109] "296" "300" "304" "308" "312" "316" "320" "324" "328" "332" "336" "340"
## [121] "344" "348" "352" "356" "360" "364" "368" "372" "376" "380" "384" "388"
## [133] "392" "398" "400" "404" "408" "410" "414" "417" "418" "419" "422" "426"
## [145] "428" "430" "434" "438" "440" "442" "446" "450" "454" "458" "462" "466"
## [157] "470" "474" "478" "480" "484" "492" "496" "498" "499" "500" "504" "508"
## [169] "512" "516" "520" "524" "528" "531" "533" "534" "535" "540" "548" "554"
## [181] "558" "562" "566" "570" "578" "580" "583" "584" "585" "586" "591" "598"
## [193] "600" "604" "608" "616" "620" "624" "626" "630" "634" "638" "642" "643"
## [205] "646" "652" "654" "659" "660" "662" "663" "666" "670" "674" "678" "682"
## [217] "686" "688" "690" "694" "702" "703" "704" "705" "706" "710" "716" "724"
## [229] "728" "729" "732" "740" "748" "752" "756" "760" "762" "764" "768" "772"
## [241] "776" "780" "784" "788" "792" "795" "796" "798" "800" "804" "807" "818"
## [253] "826" "830" "833" "834" "840" "850" "854" "858" "860" "862" "876" "882"
## [265] "887" "894"
## [1] "Afghanistan" "Africa"
## [3] "Albania" "Algeria"
## [5] "American Samoa" "Americas"
## [7] "Andorra" "Angola"
## [9] "Anguilla" "Antigua and Barbuda"
## [11] "Argentina" "Armenia"
## [13] "Aruba" "Asia"
## [15] "Australia" "Australia and New Zealand"
## [17] "Austria" "Azerbaijan"
## [19] "Bahamas" "Bahrain"
## [21] "Bangladesh" "Barbados"
## [23] "Belarus" "Belgium"
## [25] "Belize" "Benin"
## [27] "Bermuda" "Bhutan"
## [29] "Bolivia (Plurin. State of)" "Bonaire, St. Eustatius & Saba"
## [31] "Bosnia and Herzegovina" "Botswana"
## [33] "Brazil" "British Virgin Islands"
## [35] "Brunei Darussalam" "Bulgaria"
## [37] "Burkina Faso" "Burundi"
## [39] "Cabo Verde" "Cambodia"
## [41] "Cameroon" "Canada"
## [43] "Caribbean" "Cayman Islands"
## [45] "Central African Republic" "Central America"
## [47] "Central Asia" "Chad"
## [49] "Channel Islands" "Chile"
## [51] "China" "China, Hong Kong SAR"
## [53] "China, Macao SAR" "Colombia"
## [55] "Comoros" "Congo"
## [57] "Cook Islands" "Costa Rica"
## [59] "Côte d’Ivoire" "Croatia"
## [61] "Cuba" "Curaçao"
## [63] "Cyprus" "Czechia"
## [65] "Dem. People's Rep. Korea" "Dem. Rep. of the Congo"
## [67] "Denmark" "Djibouti"
## [69] "Dominica" "Dominican Republic"
## [71] "Eastern Africa" "Eastern Asia"
## [73] "Eastern Europe" "Ecuador"
## [75] "Egypt" "El Salvador"
## [77] "Equatorial Guinea" "Eritrea"
## [79] "Estonia" "Eswatini"
## [81] "Ethiopia" "Europe"
## [83] "Falkland Islands (Malvinas)" "Faroe Islands"
## [85] "Fiji" "Finland"
## [87] "France" "French Guiana"
## [89] "French Polynesia" "Gabon"
## [91] "Gambia" "Georgia"
## [93] "Germany" "Ghana"
## [95] "Gibraltar" "Greece"
## [97] "Greenland" "Grenada"
## [99] "Guadeloupe" "Guam"
## [101] "Guatemala" "Guinea"
## [103] "Guinea-Bissau" "Guyana"
## [105] "Haiti" "Holy See"
## [107] "Honduras" "Hungary"
## [109] "Iceland" "India"
## [111] "Indonesia" "Iran (Islamic Republic of)"
## [113] "Iraq" "Ireland"
## [115] "Isle of Man" "Israel"
## [117] "Italy" "Jamaica"
## [119] "Japan" "Jordan"
## [121] "Kazakhstan" "Kenya"
## [123] "Kiribati" "Kuwait"
## [125] "Kyrgyzstan" "Lao People's Dem. Rep."
## [127] "Latin America & the Caribbean" "Latvia"
## [129] "Lebanon" "Lesotho"
## [131] "Liberia" "Libya"
## [133] "Liechtenstein" "Lithuania"
## [135] "Luxembourg" "Madagascar"
## [137] "Malawi" "Malaysia"
## [139] "Maldives" "Mali"
## [141] "Malta" "Marshall Islands"
## [143] "Martinique" "Mauritania"
## [145] "Mauritius" "Mayotte"
## [147] "Melanesia" "Mexico"
## [149] "Micronesia" "Micronesia (Fed. States of)"
## [151] "Middle Africa" "Monaco"
## [153] "Mongolia" "Montenegro"
## [155] "Montserrat" "Morocco"
## [157] "Mozambique" "Myanmar"
## [159] "Namibia" "Nauru"
## [161] "Nepal" "Netherlands"
## [163] "New Caledonia" "New Zealand"
## [165] "Nicaragua" "Niger"
## [167] "Nigeria" "Niue"
## [169] "North Macedonia" "Northern Africa"
## [171] "Northern America" "Northern Europe"
## [173] "Northern Mariana Islands" "Norway"
## [175] "Oceania" "Oman"
## [177] "Other non-specified areas" "Pakistan"
## [179] "Palau" "Panama"
## [181] "Papua New Guinea" "Paraguay"
## [183] "Peru" "Philippines"
## [185] "Poland" "Polynesia"
## [187] "Portugal" "Puerto Rico"
## [189] "Qatar" "Republic of Korea"
## [191] "Republic of Moldova" "Réunion"
## [193] "Romania" "Russian Federation"
## [195] "Rwanda" "Saint Barthélemy"
## [197] "Saint Helena" "Saint Kitts and Nevis"
## [199] "Saint Lucia" "Saint Martin (French part)"
## [201] "Saint Pierre and Miquelon" "Saint Vincent & Grenadines"
## [203] "Samoa" "San Marino"
## [205] "Sao Tome and Principe" "Saudi Arabia"
## [207] "Senegal" "Serbia"
## [209] "Seychelles" "Sierra Leone"
## [211] "Singapore" "Sint Maarten (Dutch part)"
## [213] "Slovakia" "Slovenia"
## [215] "Solomon Islands" "Somalia"
## [217] "South-central Asia" "South-eastern Asia"
## [219] "South Africa" "South America"
## [221] "South Sudan" "Southern Africa"
## [223] "Southern Asia" "Southern Europe"
## [225] "Spain" "Sri Lanka"
## [227] "State of Palestine" "Sub-Saharan Africa"
## [229] "Sudan" "Suriname"
## [231] "Sweden" "Switzerland"
## [233] "Syrian Arab Republic" "Tajikistan"
## [235] "Thailand" "Timor-Leste"
## [237] "Togo" "Tokelau"
## [239] "Tonga" "Total, all countries or areas"
## [241] "Trinidad and Tobago" "Tunisia"
## [243] "Türkiye" "Turkmenistan"
## [245] "Turks and Caicos Islands" "Tuvalu"
## [247] "Uganda" "Ukraine"
## [249] "United Arab Emirates" "United Kingdom"
## [251] "United Rep. of Tanzania" "United States of America"
## [253] "United States Virgin Islands" "Uruguay"
## [255] "Uzbekistan" "Vanuatu"
## [257] "Venezuela (Boliv. Rep. of)" "Viet Nam"
## [259] "Wallis and Futuna Islands" "Western Africa"
## [261] "Western Asia" "Western Europe"
## [263] "Western Sahara" "Yemen"
## [265] "Zambia" "Zimbabwe"
The variable ...2
is the name of the country and is of type character
.
## [1] "Population aged 0 to 14 years old (percentage)"
## [2] "Population aged 60+ years old (percentage)"
## [3] "Population density"
## [4] "Population mid-year estimates (millions)"
## [5] "Population mid-year estimates for females (millions)"
## [6] "Population mid-year estimates for males (millions)"
## [7] "Sex ratio (males per 100 females)"
## [8] "Surface area (thousand km2)"
The variable ‘Series’ describes the name of the indicator associated with the data in each record, that is, it indicates what is being measured in each record. We can manage it as character
or as factor
since it is also used to classify records. It is often not clear from the name of the indicator what its exact meaning is. In that case, you must look at additional information in the table where the meaning of the indicators used is explained in more detail. This is very important to be able to interpret and understand the data we handle.
## [1] "2010" "2015" "2017" "2020" "2022"
The Year
variable determines the year associated with the data in each record. Usually we will convert temporary data into Date
type, in this case, for example, we can associate the last day of the year.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 5.00 21.52 255.80 94.70 136162.00
The variable Value
, of type numeric
, supplies the numerical value of the data in each record.
## [1] "A dispute exists between the Governments of Argentina and the United Kingdom of Great Britain and Northern Ireland concerning sovereignty over the Falkland Islands (Malvinas)."
## [2] "A dispute exists between the Governments of Argentina and the United Kingdom of Great Britain and Northern Ireland concerning sovereignty over the Falkland Islands (Malvinas). For statistical purposes, the data for United Kingdom do not include this area."
## [3] "Calculated by the UN Statistics Division."
## [4] "Calculated by the UN Statistics Division.;Projected estimate (medium fertility variant)."
## [5] "Changes in total area per year are the result of new measuring and correcting of the administrative borders between former Yugoslavian countries."
## [6] "Comprising the Northern Region (former Saguia el Hamra) and Southern Region (former Rio de Oro)."
## [7] "Data refer to 1 October 2007."
## [8] "Data refer to the Vatican City State."
## [9] "Data updated according to \"Superintendencia Agraria\". Interior waters correspond to natural or artificial bodies of water or snow."
## [10] "Excluding Åland Islands."
## [11] "Excluding Channel Islands (Guernsey and Jersey) and Isle of Man, shown separately, if available."
## [12] "Excluding Faeroe Islands and Greenland shown separately, if available."
## [13] "Excluding inland water."
## [14] "Excluding Kosovo."
## [15] "Excluding Niue, shown separately, which is part of Cook Islands, but because of remoteness is administered separately."
## [16] "Excluding the islands of Saint Brandon and Agalega."
## [17] "For statistical purposes, the data for China do not include this area."
## [18] "For statistical purposes, the data for China do not include this area.;Projected estimate (medium fertility variant)."
## [19] "For statistical purposes, the data for China do not include those for the Hong Kong Special Administrative Region (Hong Kong SAR), Macao Special Administrative Region (Macao SAR) and Taiwan Province of China."
## [20] "For statistical purposes, the data for Denmark do not include Faroe Islands, and Greenland."
## [21] "For statistical purposes, the data for Denmark do not include this area."
## [22] "For statistical purposes, the data for France do not include French Guiana, French Polynesia, Guadeloupe, Martinique, Mayotte, New Caledonia, Réunion, Saint Pierre and Miquelon, Saint Barthélemy, Saint Martin (French part), Wallis and Futuna Islands."
## [23] "For statistical purposes, the data for France do not include this area."
## [24] "For statistical purposes, the data for France do not include this area.;Projected estimate (medium fertility variant)."
## [25] "For statistical purposes, the data for Netherlands do not include Aruba, Bonaire, Sint Eustatius and Saba, Curaçao, and Sint Maarten (Dutch part)."
## [26] "For statistical purposes, the data for Netherlands do not include this area."
## [27] "For statistical purposes, the data for Netherlands do not include this area.;Projected estimate (medium fertility variant)."
## [28] "For statistical purposes, the data for New Zealand do not include Cook Islands, Niue, and Tokelau."
## [29] "For statistical purposes, the data for New Zealand do not include this area."
## [30] "For statistical purposes, the data for United Kingdom do not include this area."
## [31] "For statistical purposes, the data for United States of America do not include American Samoa, Guam, Northern Mariana Islands, Puerto Rico, and United States Virgin Islands."
## [32] "For statistical purposes, the data for United States of America do not include this area."
## [33] "Including Abkhazia and South Ossetia."
## [34] "Including Agalega, Rodrigues and Saint Brandon."
## [35] "Including Åland Islands."
## [36] "Including Andorra, Gibraltar, Holy See, and San Marino."
## [37] "Including Anguilla, Bonaire, Sint Eustatius and Saba, British Virgin Islands, Cayman Islands, Dominica, Montserrat, Saint Kitts and Nevis, Sint Maarten (Dutch part) and Turks and Caicos Islands."
## [38] "Including Ascension and Tristan da Cunha."
## [39] "Including Ascension and Tristan da Cunha. For statistical purposes, the data for United Kingdom do not include this area."
## [40] "Including Ascension and Tristan da Cunha.;St. Helena Island has no substantial natural inland waters however there are 15 reservoirs and similar open water storage features on island."
## [41] "Including Bermuda, Greenland, and Saint Pierre and Miquelon."
## [42] "Including Canary Islands, Ceuta and Melilla."
## [43] "Including Christmas Island, Cocos (Keeling) Islands and Norfolk Island."
## [44] "Including Crimea."
## [45] "Including East Jerusalem."
## [46] "Including Falkland Islands (Malvinas)."
## [47] "Including Liechtenstein and Monaco."
## [48] "Including low water level for all islands (area to shoreline)."
## [49] "Including Marshall Islands, Nauru, Northern Mariana Islands and Palau."
## [50] "Including Nagorno-Karabakh."
## [51] "Including Norfolk Island."
## [52] "Including Pitcairn."
## [53] "Including Sabah and Sarawak."
## [54] "Including Saint Helena."
## [55] "Including Svalbard and Jan Mayen Islands."
## [56] "Including the Faroe Islands and the Isle of Man."
## [57] "Including the Transnistria region."
## [58] "Including water bodies of lake Vaihiria, lake Temae and the Maiao lagoons, but not lake Maeva and the lagoons of Raiatea and Tahaa from the upper islands. No lagoons from lower islands are included."
## [59] "Including Zanzibar."
## [60] "Inland waters include the reservoirs."
## [61] "Land area only."
## [62] "Land area only. Excluding 84 square km of uninhabited islands."
## [63] "Projected estimate (medium fertility variant)."
## [64] "Projected estimate (medium fertility variant).;A dispute exists between the Governments of Argentina and the United Kingdom of Great Britain and Northern Ireland concerning sovereignty over the Falkland Islands (Malvinas). For statistical purposes, the data for United Kingdom do not include this area."
## [65] "Projected estimate (medium fertility variant).;Data refer to the Vatican City State."
## [66] "Projected estimate (medium fertility variant).;Excluding Kosovo."
## [67] "Projected estimate (medium fertility variant).;For statistical purposes, the data for China do not include those for the Hong Kong Special Administrative Region (Hong Kong SAR), Macao Special Administrative Region (Macao SAR) and Taiwan Province of China."
## [68] "Projected estimate (medium fertility variant).;For statistical purposes, the data for Denmark do not include Faroe Islands, and Greenland."
## [69] "Projected estimate (medium fertility variant).;For statistical purposes, the data for Denmark do not include this area."
## [70] "Projected estimate (medium fertility variant).;For statistical purposes, the data for France do not include French Guiana, French Polynesia, Guadeloupe, Martinique, Mayotte, New Caledonia, Réunion, Saint Pierre and Miquelon, Saint Barthélemy, Saint Martin (French part), Wallis and Futuna Islands."
## [71] "Projected estimate (medium fertility variant).;For statistical purposes, the data for France do not include this area."
## [72] "Projected estimate (medium fertility variant).;For statistical purposes, the data for Netherlands do not include Aruba, Bonaire, Sint Eustatius and Saba, Curaçao, and Sint Maarten (Dutch part)."
## [73] "Projected estimate (medium fertility variant).;For statistical purposes, the data for Netherlands do not include this area."
## [74] "Projected estimate (medium fertility variant).;For statistical purposes, the data for New Zealand do not include Cook Islands, Niue, and Tokelau."
## [75] "Projected estimate (medium fertility variant).;For statistical purposes, the data for New Zealand do not include this area."
## [76] "Projected estimate (medium fertility variant).;For statistical purposes, the data for United Kingdom do not include this area."
## [77] "Projected estimate (medium fertility variant).;For statistical purposes, the data for United States of America do not include American Samoa, Guam, Northern Mariana Islands, Puerto Rico, and United States Virgin Islands."
## [78] "Projected estimate (medium fertility variant).;For statistical purposes, the data for United States of America do not include this area."
## [79] "Projected estimate (medium fertility variant).;Including Abkhazia and South Ossetia."
## [80] "Projected estimate (medium fertility variant).;Including Agalega, Rodrigues and Saint Brandon."
## [81] "Projected estimate (medium fertility variant).;Including Åland Islands."
## [82] "Projected estimate (medium fertility variant).;Including Andorra, Gibraltar, Holy See, and San Marino."
## [83] "Projected estimate (medium fertility variant).;Including Anguilla, Bonaire, Sint Eustatius and Saba, British Virgin Islands, Cayman Islands, Dominica, Montserrat, Saint Kitts and Nevis, Sint Maarten (Dutch part) and Turks and Caicos Islands."
## [84] "Projected estimate (medium fertility variant).;Including Ascension and Tristan da Cunha. For statistical purposes, the data for United Kingdom do not include this area."
## [85] "Projected estimate (medium fertility variant).;Including Bermuda, Greenland, and Saint Pierre and Miquelon."
## [86] "Projected estimate (medium fertility variant).;Including Canary Islands, Ceuta and Melilla."
## [87] "Projected estimate (medium fertility variant).;Including Christmas Island, Cocos (Keeling) Islands and Norfolk Island."
## [88] "Projected estimate (medium fertility variant).;Including Crimea."
## [89] "Projected estimate (medium fertility variant).;Including East Jerusalem."
## [90] "Projected estimate (medium fertility variant).;Including Falkland Islands (Malvinas)."
## [91] "Projected estimate (medium fertility variant).;Including Liechtenstein and Monaco."
## [92] "Projected estimate (medium fertility variant).;Including Marshall Islands, Nauru, Northern Mariana Islands and Palau."
## [93] "Projected estimate (medium fertility variant).;Including Nagorno-Karabakh."
## [94] "Projected estimate (medium fertility variant).;Including Pitcairn."
## [95] "Projected estimate (medium fertility variant).;Including Sabah and Sarawak."
## [96] "Projected estimate (medium fertility variant).;Including Saint Helena."
## [97] "Projected estimate (medium fertility variant).;Including Svalbard and Jan Mayen Islands."
## [98] "Projected estimate (medium fertility variant).;Including the Faroe Islands and the Isle of Man."
## [99] "Projected estimate (medium fertility variant).;Including the Transnistria region."
## [100] "Projected estimate (medium fertility variant).;Including Zanzibar."
## [101] "Projected estimate (medium fertility variant).;Refers to the United Kingdom of Great Britain and Northern Ireland. For statistical purposes, the data for United Kingdom do not include Anguilla, Bermuda, British Virgin Islands, Cayman Islands, Channel Islands, Falkland Islands (Malvinas), Gibraltar, Isle of Man, Montserrat, Saint Helena, Turks and Caicos Islands."
## [102] "Projected estimate (medium fertility variant).;Refers to the whole country."
## [103] "Refers to Guernsey and Jersey."
## [104] "Refers to habitable area. Excludes Saint Lucia's Forest Reserve."
## [105] "Refers to the United Kingdom of Great Britain and Northern Ireland. For statistical purposes, the data for United Kingdom do not include Anguilla, Bermuda, British Virgin Islands, Cayman Islands, Channel Islands, Falkland Islands (Malvinas), Gibraltar, Isle of Man, Montserrat, Saint Helena, Turks and Caicos Islands."
## [106] "Refers to the whole country."
## [107] "Surface area is 0.44 Km2."
## [108] "Surface area is based on the 2002 population and housing census."
## [109] "The land area of Singapore comprises the mainland and other islands."
## [110] "The total area includes continental areas and islands, and excludes Antarctic area."
## [111] "The total surface is 21 040.79 square kilometres, without taking into account the last ruling of The Hague."
The variable Footnotes
is of type character
and includes possible observations associated with some data.
## [1] "United Nations Population Division, New York, World Population Prospects: The 2022 Revision, last accessed July 2022."
## [2] "United Nations Population Division, New York, World Population Prospects: The 2022 Revision; supplemented by data from the United Nations Statistics Division, New York, Demographic Yearbook 2020 and Secretariat for the Pacific Community (SPC) for small countries or areas, last accessed July 2022."
## [3] "United Nations Statistics Division, New York, \"Demographic Yearbook 2020\" and the demographic statistics database, last accessed June 2022."
The source
variable determines the name of the source from which the data for each record was obtained.
We conclude that the table gives us numerical data, by country and time period, on some demographic aspects and its surface area. Information is given about 5 specific time periods. Depending on what we want to do, we may have to do a coercion to change the type of some variables such as the year to type Date
, or the variables used to clasify to type factor
.
EXCEL format
To read Excel spreadsheets we will use the openxlsx
library that allows us to read files hosted locally or on a web server. As an example we are going to read data published by OWID. We can read the OWID data directly on your web server using a script like the following:
df <- read.xlsx("https://nyc3.digitaloceanspaces.com/owid-public/data/co2/owid-co2-data.xlsx",sheet=1)
However, as the database is very large and takes a long time to read over the Internet, we have saved some simplified data on population by country in our local data
repository.
This table is organized in a simpler way than the UN_population
table and simply with the str()
function we can identify how it is organized and its content:
## tibble [237 × 17] (S3: tbl_df/tbl/data.frame)
## $ iso_code : chr [1:237] "ABW" "AFG" "AGO" "AIA" ...
## $ location : chr [1:237] "Aruba" "Afghanistan" "Angola" "Anguilla" ...
## $ continent : chr [1:237] "North America" "Asia" "Africa" "North America" ...
## $ population : num [1:237] 106459 41128772 35588996 15877 2842318 ...
## $ median_age : num [1:237] 41.2 18.6 16.8 NA 38 NA 34 31.9 35.7 NA ...
## $ life_expectancy : num [1:237] 76.3 64.8 61.1 81.9 78.6 ...
## $ aged_65_older : num [1:237] 13.09 2.58 2.4 NA 13.19 ...
## $ aged_70_older : num [1:237] 7.45 1.34 1.36 NA 8.64 ...
## $ gdp_per_capit : num [1:237] 35974 1804 5819 NA 11803 ...
## $ extreme_poverty : num [1:237] NA NA NA NA 1.1 NA NA 0.6 1.8 NA ...
## $ cardiovasc_death_rat : num [1:237] NA 597 276 NA 304 ...
## $ diabetes_prevalence : num [1:237] 11.62 9.59 3.94 NA 10.08 ...
## $ female_smokers : num [1:237] NA NA NA NA 7.1 29 1.2 16.2 1.5 NA ...
## $ male_smokers : num [1:237] NA NA NA NA 51.2 37.8 37.4 27.7 52.1 NA ...
## $ handwashing_facilities : num [1:237] NA 37.7 26.7 NA NA ...
## $ hospital_beds_per_thousand: num [1:237] NA 0.5 NA NA 2.89 NA 1.2 5 4.2 NA ...
## $ human_development_index : num [1:237] NA 0.511 0.581 NA 0.795 0.868 0.89 0.845 0.776 NA ...
We conclude that this table gives us information by country on some indicators. The country identification code (variable iso_code
) is 3 letters, instead of 3 numbers as used in the previous UN table. This represents a problem that will have to be solved if we want to combine the two tables. Note that the organization of this table and the previous one is very different. In this one, each row of the table contains the information of all the country’s indicators, each of them corresponds to a different variable, and in the previous table, each row contains the data of a single indicator, and the names of the indicators are in another variable. This last organization (the one in table UN) is called tidy data
and we will talk about it in more detail at the end of this chapter.
PC-Axis format
The PC-Axis format, specialized in storing statistical data, is the one used by INE and ISTAD to export data. It can be edited with any plain text editor (such as NOTEPAD). To read this file format we use the read.px
function of the pxR
library. Next we are going to read population data by municipalities of the Canary Islands previously stored in the local data repository data
based on a query made in ISTAD.
istac_population <- read.px("https://ctim.es/AEDV/data/PoblacionMunicipiosCanarios.px")%>%
as_tibble()
istac_population %>%
print(n=5)
## # A tibble: 2,112 × 5
## Sexos Años Edades.año.a.año Municipios.por.islas value
## <fct> <fct> <fct> <fct> <dbl>
## 1 AMBOS SEXOS 2021 TOTAL CANARIAS 2172944
## 2 AMBOS SEXOS 2020 TOTAL CANARIAS 2175952
## 3 AMBOS SEXOS 2019 TOTAL CANARIAS 2153389
## 4 AMBOS SEXOS 2018 TOTAL CANARIAS 2127685
## 5 AMBOS SEXOS 2017 TOTAL CANARIAS 2108121
## # ℹ 2,107 more rows
## [1] "AMBOS SEXOS"
## [1] "2021" "2020" "2019" "2018" "2017" "2016" "2015" "2014" "2013" "2012"
## [11] "2011" "2010" "2009" "2008" "2007" "2006" "2005" "2004" "2003" "2002"
## [21] "2001" "2000"
## [1] "TOTAL"
## [1] "CANARIAS" "LANZAROTE"
## [3] "Arrecife" "Haría"
## [5] "San Bartolomé" "Teguise"
## [7] "Tías" "Tinajo"
## [9] "Yaiza" "FUERTEVENTURA"
## [11] "Antigua" "Betancuria"
## [13] "Oliva (La)" "Pájara"
## [15] "Puerto del Rosario" "Tuineje"
## [17] "GRAN CANARIA" "Agaete"
## [19] "Agüimes" "Artenara"
## [21] "Arucas" "Firgas"
## [23] "Gáldar" "Ingenio"
## [25] "Mogán" "Moya"
## [27] "Palmas de Gran Canaria (Las)" "San Bartolomé de Tirajana"
## [29] "Aldea de San Nicolás (La)" "Santa Brígida"
## [31] "Santa Lucía" "Santa María de Guía"
## [33] "Tejeda" "Telde"
## [35] "Teror" "Valsequillo"
## [37] "Valleseco" "Vega de San Mateo"
## [39] "TENERIFE" "Adeje"
## [41] "Arafo" "Arico"
## [43] "Arona" "Buenavista del Norte"
## [45] "Candelaria" "Fasnia"
## [47] "Garachico" "Granadilla de Abona"
## [49] "Guancha (La)" "Guía de Isora"
## [51] "Güimar" "Icod de Los Vinos"
## [53] "Laguna (La)" "Matanza de Acentejo (La)"
## [55] "Orotava (La)" "Puerto de La Cruz"
## [57] "Realejos (Los)" "Rosario (El)"
## [59] "San Juan de La Rambla" "San Miguel"
## [61] "Santa Cruz de Tenerife" "Santa Úrsula"
## [63] "Santiago del Teide" "Sauzal (El)"
## [65] "Silos (Los)" "Tacoronte"
## [67] "Tanque (El)" "Tegueste"
## [69] "Victoria de Acentejo (La)" "Vilaflor"
## [71] "LA GOMERA" "Agulo"
## [73] "Alajeró" "Hermigua"
## [75] "San Sebastián de La Gomera" "Valle Gran Rey"
## [77] "Vallehermoso" "LA PALMA"
## [79] "Barlovento" "Breña Alta"
## [81] "Breña Baja" "Fuencaliente"
## [83] "Garafía" "Llanos de Aridane (Los)"
## [85] "Paso (El)" "Puntagorda"
## [87] "Puntallana" "San Andrés y Sauces"
## [89] "Santa Cruz de La Palma" "Tazacorte"
## [91] "Tijarafe" "Villa de Mazo"
## [93] "EL HIERRO" "Frontera"
## [95] "Valverde" "Pinar de El Hierro (El)"
We conclude that this table gives us information about the population of the municipalities of the Canary Islands between the years 2000 and 2021, including both sexes and adding all ages. No identifying code of the municipality appears.
The INE and ISTAD databases are obtained from a query made on their web server, and it is the query that generates the file with the data. We can edit the file in text mode to see some metadata about the table that is lost when loading it in a tibble
. If reading the PC-Axis files fails, or generates problems, we can always download the query in another file format, for example an Excel sheet, and try to load the data from the Excel sheet.
Text format
Sometimes, when the data is provided by a text file, it is not organized as a table and reading the file using the functions previously seen can give rise to errors. In that case we have to correct and update the file before loading it as a table. This can be done by opening the file with a text editor and manipulating it manually. An alternative, and automatic, way to do it from R
is to use the read_lines
function that loads the text file as a vector containing the lines of the file and does not assume that it has a table format with identifiable variables and records. We can edit this vector and write it back updated using the write_lines
function.
UN_population2 <- read_lines("https://ctim.es/AEDV/data/UN_population.csv")
# remove the first line
UN_population2 <- UN_population2[-1]
# replace the string "Region/Country/Area" with "location"
UN_population2 <- gsub('Region/Country/Area','location',UN_population2)
# we eliminate all the lines that contain the string "Total"
UN_population2 <- UN_population2[str_detect(UN_population2,"Total")==FALSE]
# write the updated file
write_lines(UN_population2,"UN_population2.csv")
ZIP format
The readl
library allows us to manage ZIP compressed files stored locally (not hosted on a server). Let’s look at an example with data from the file “data/UNCTAD.zip” obtained from UNCTAD, an organization linked to the United Nations dedicated to Trade and Development. First let’s see what documents this zip file includes:
## Name
## 1 US_GovExpenditures_ST202306281615_v1.csv
## 2 Glossary_Classification of the functions of government (COFOG) - Statistics Explained.pdf
## 3 UNCTADstat - Productive Capacities Index.pdf
## 4 WordlOfDebt_data.csv
## 5 Un mundo endeudado _ UNCTAD.pdf
## 6 US_PCI_20230704063200.csv
## Length Date
## 1 1890332 2023-09-18 11:25:00
## 2 670903 2023-09-01 10:56:00
## 3 2446315 2023-09-01 10:50:00
## 4 4248979 2023-09-01 11:19:00
## 5 659704 2023-09-01 11:25:00
## 6 3717173 2023-07-18 17:38:00
We note that this zip file includes 3 CSV files with data and 3 explanatory PDF documents. Let’s open one of the data files:
WordlOfDebt_data <- read_csv(unzip("../data/UNCTAD.zip", "WordlOfDebt_data.csv")) %>%
as_tibble()
str(WordlOfDebt_data)
## tibble [28,455 × 10] (S3: tbl_df/tbl/data.frame)
## $ ID : chr [1:28455] "AF" "AL" "DZ" "AO" ...
## $ Name : chr [1:28455] "Afghanistan" "Albania" "Algeria" "Angola" ...
## $ Region : chr [1:28455] "Developing Asia and Oceania" "Europe and Central Asia*" "Africa" "Africa" ...
## $ World Bank income group: chr [1:28455] "Low income countries" "Upper middle income countries" "Lower middle income countries" "Lower middle income countries" ...
## $ Development status : chr [1:28455] "Developing countries" "Developed countries" "Developing countries" "Developing countries" ...
## $ LDC : num [1:28455] 1 0 0 1 0 0 0 0 0 0 ...
## $ SIDS : num [1:28455] 0 0 0 0 1 0 0 0 0 0 ...
## $ Indicator : chr [1:28455] "Public debt as a share of GDP" "Public debt as a share of GDP" "Public debt as a share of GDP" "Public debt as a share of GDP" ...
## $ Year : chr [1:28455] "2010" "2010" "2010" "2010" ...
## $ Value : num [1:28455] 0.077 0.577 0.105 0.372 0.897 ...
## [1] "Africa" "Developing Asia and Oceania"
## [3] "Europe and Central Asia*" "Latin America and the Caribbean"
## [1] "High income countries" "Low income countries"
## [3] "Lower middle income countries" "Upper middle income countries"
## [1] "Developed countries" "Developing countries"
## [1] "Bilateral creditors as a share of external public debt"
## [2] "External public debt as a share of GDP"
## [3] "External public debt in US$ billions"
## [4] "External public debt in US$ per capita"
## [5] "Multilateral creditors as a share of external public debt"
## [6] "Private creditors as a share of external public debt"
## [7] "Public debt as a share of GDP"
## [8] "Public debt in US$ billions"
## [9] "Public debt in US$ per capita"
## [10] "Public debt interest payments as a share of GDP"
## [11] "Public debt interest payments as a share of revenues"
## [12] "Public debt interest payments in US$ per capita"
## [13] "Public education expenditure as a share of GDP"
## [14] "Public health expenditure as share of GDP"
## [15] "Public investment expenditure as a share of GDP"
## [16] "Ratio of public interest payments to education expenditure"
## [17] "Ratio of public interest payments to health expenditure"
## [18] "Ratio of public interest payments to investment expenditure"
## [1] "2010" "2010-2012" "2011" "2012" "2013" "2014"
## [7] "2014-2016" "2015" "2016" "2017" "2018" "2019"
## [13] "2019-2021" "2020" "2021" "2022"
We conclude that this table provides information by country and year (sometimes grouped) on some indicators on their debt. Additionally, countries are classified by their income and level of development.
3.3 Data transformation
The dplyr
package
The dplyr
package, found within the tidyverse
library, is commonly used to manage/transform data. As a summary of this section it may be useful to look at the following summary sheet. We will see, below, the most important functionalities of this package.
filter()
Filters table rows based on conditions. For example, with the following instruction we filter the countries in Europe with a population greater than 40 million inhabitants
## # A tibble: 6 × 17
## iso_code location continent population median_age life_expectancy
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 DEU Germany Europe 83369840 46.6 81.3
## 2 ESP Spain Europe 47558632 45.5 83.6
## 3 FRA France Europe 67813000 42 82.7
## 4 GBR United Kingdom Europe 67508936 40.8 81.3
## 5 ITA Italy Europe 59037472 47.9 83.5
## 6 RUS Russia Europe 144713312 39.6 72.6
## # ℹ 11 more variables: aged_65_older <dbl>, aged_70_older <dbl>,
## # gdp_per_capit <dbl>, extreme_poverty <dbl>, cardiovasc_death_rat <dbl>,
## # diabetes_prevalence <dbl>, female_smokers <dbl>, male_smokers <dbl>,
## # handwashing_facilities <dbl>, hospital_beds_per_thousand <dbl>,
## # human_development_index <dbl>
With the str_detect
function we can filter the table based on the content of a string variable. For example, below we filter the countries in the world whose name contains the string “ma” (case-insensitive)
## # A tibble: 29 × 17
## iso_code location continent population median_age life_expectancy
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 BHS Bahamas North Am… 409989 34.3 73.9
## 2 CYM Cayman Islands North Am… 68722 NA 83.9
## 3 DEU Germany Europe 83369840 46.6 81.3
## 4 DNK Denmark Europe 5882259 42.3 80.9
## 5 GTM Guatemala North Am… 17843914 22.9 74.3
## 6 IMN Isle of Man Europe 84534 NA 81.4
## 7 JAM Jamaica North Am… 2827382 31.4 74.5
## 8 MAC Macao Asia 695180 39.2 84.2
## 9 MAF Saint Martin (Frenc… North Am… 31816 NA 82.1
## 10 MDG Madagascar Africa 29611718 19.6 67.0
## # ℹ 19 more rows
## # ℹ 11 more variables: aged_65_older <dbl>, aged_70_older <dbl>,
## # gdp_per_capit <dbl>, extreme_poverty <dbl>, cardiovasc_death_rat <dbl>,
## # diabetes_prevalence <dbl>, female_smokers <dbl>, male_smokers <dbl>,
## # handwashing_facilities <dbl>, hospital_beds_per_thousand <dbl>,
## # human_development_index <dbl>
select()
Select some variables from the table. Fields can be selected individually
or by groups using the :
symbol.
owid_country %>%
filter(continent=="Europe" & population>4e7) %>%
arrange(desc(population)) %>%
select(location,life_expectancy:gdp_per_capit)
## # A tibble: 6 × 5
## location life_expectancy aged_65_older aged_70_older gdp_per_capit
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Russia 72.6 14.2 9.39 24766.
## 2 Germany 81.3 21.5 16.0 45229.
## 3 France 82.7 19.7 13.1 38606.
## 4 United Kingdom 81.3 18.5 12.5 39753.
## 5 Italy 83.5 23.0 16.2 35220.
## 6 Spain 83.6 19.4 13.8 34272.
select
is also used to remove variables from the table by putting in front of the variable
the -
sign
owid_country %>%
filter(continent=="Europe" & population>4e7) %>%
arrange(desc(population)) %>%
select(-iso_code,-continent)
## # A tibble: 6 × 15
## location population median_age life_expectancy aged_65_older aged_70_older
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Russia 144713312 39.6 72.6 14.2 9.39
## 2 Germany 83369840 46.6 81.3 21.5 16.0
## 3 France 67813000 42 82.7 19.7 13.1
## 4 United King… 67508936 40.8 81.3 18.5 12.5
## 5 Italy 59037472 47.9 83.5 23.0 16.2
## 6 Spain 47558632 45.5 83.6 19.4 13.8
## # ℹ 9 more variables: gdp_per_capit <dbl>, extreme_poverty <dbl>,
## # cardiovasc_death_rat <dbl>, diabetes_prevalence <dbl>,
## # female_smokers <dbl>, male_smokers <dbl>, handwashing_facilities <dbl>,
## # hospital_beds_per_thousand <dbl>, human_development_index <dbl>
arrange()
Sort the table based on one or more variables
owid_country %>%
filter(continent=="Europe" & population>4e7) %>%
arrange(desc(population)) # desc() is used to sort in descending order.
## # A tibble: 6 × 17
## iso_code location continent population median_age life_expectancy
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 RUS Russia Europe 144713312 39.6 72.6
## 2 DEU Germany Europe 83369840 46.6 81.3
## 3 FRA France Europe 67813000 42 82.7
## 4 GBR United Kingdom Europe 67508936 40.8 81.3
## 5 ITA Italy Europe 59037472 47.9 83.5
## 6 ESP Spain Europe 47558632 45.5 83.6
## # ℹ 11 more variables: aged_65_older <dbl>, aged_70_older <dbl>,
## # gdp_per_capit <dbl>, extreme_poverty <dbl>, cardiovasc_death_rat <dbl>,
## # diabetes_prevalence <dbl>, female_smokers <dbl>, male_smokers <dbl>,
## # handwashing_facilities <dbl>, hospital_beds_per_thousand <dbl>,
## # human_development_index <dbl>
mutate()
Calculates new variables or modifies existing one:
owid_country %>%
mutate( gdp = gdp_per_capit * population) %>%
select(location,population,gdp) %>%
arrange(desc(gdp)) %>%
head(10) %>%
knitr::kable( caption = "Countries sorted by GDP", format = "html", table.attr = "style='width:60%;'")
location | population | gdp |
---|---|---|
China | 1425887360 | 2.182850e+13 |
United States | 338289856 | 1.834392e+13 |
India | 1417173120 | 9.107710e+12 |
Japan | 123951696 | 4.834392e+12 |
Germany | 83369840 | 3.770755e+12 |
Russia | 144713312 | 3.583963e+12 |
Indonesia | 275501344 | 3.082514e+12 |
Brazil | 215313504 | 3.036664e+12 |
United Kingdom | 67508936 | 2.683699e+12 |
France | 67813000 | 2.617966e+12 |
relocate()
Change the order of the columns
owid_country %>%
relocate(location) %>% # puts location first
relocate(continent,.after=location) %>%
relocate(life_expectancy,.after=continent) %>%
print(n=5)
## # A tibble: 237 × 17
## location continent life_expectancy iso_code population median_age
## <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 Aruba North America 76.3 ABW 106459 41.2
## 2 Afghanistan Asia 64.8 AFG 41128772 18.6
## 3 Angola Africa 61.2 AGO 35588996 16.8
## 4 Anguilla North America 81.9 AIA 15877 NA
## 5 Albania Europe 78.6 ALB 2842318 38
## # ℹ 232 more rows
## # ℹ 11 more variables: aged_65_older <dbl>, aged_70_older <dbl>,
## # gdp_per_capit <dbl>, extreme_poverty <dbl>, cardiovasc_death_rat <dbl>,
## # diabetes_prevalence <dbl>, female_smokers <dbl>, male_smokers <dbl>,
## # handwashing_facilities <dbl>, hospital_beds_per_thousand <dbl>,
## # human_development_index <dbl>
rename()
Change the name of the variables
owid_country %>%
filter(continent=="Europe" & population>4e7) %>%
arrange(desc(population)) %>%
select(location,population,median_age) %>%
rename(country = location)%>%
knitr::kable( caption = "European countries sorted by population", format = "html", table.attr = "style='width:60%;'")
country | population | median_age |
---|---|---|
Russia | 144713312 | 39.6 |
Germany | 83369840 | 46.6 |
France | 67813000 | 42.0 |
United Kingdom | 67508936 | 40.8 |
Italy | 59037472 | 47.9 |
Spain | 47558632 | 45.5 |
str_replace
This function replaces one string with another within a vector of strings
## # A tibble: 20 × 17
## iso_code location continent population median_age life_expectancy
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 TWN Taiwan Asia 23893396 42.2 80.5
## 2 TZA Tanzania Africa 65497752 17.7 65.5
## 3 UGA Uganda Africa 47249588 16.4 63.4
## 4 UKR Ukraine Europe 39701744 41.4 72.1
## 5 URY Uruguay South Am… 3422796 35.6 77.9
## 6 USA USA North Am… 338289856 38.3 78.9
## 7 UZB Uzbekistan Asia 34627648 28.2 71.7
## 8 VAT Vatican Europe 808 NA 75.1
## 9 VCT Saint Vincent and t… North Am… 103959 31.8 72.5
## 10 VEN Venezuela South Am… 28301700 29 72.1
## 11 VGB British Virgin Isla… North Am… 31332 NA 79.1
## 12 VIR USA Virgin Islands North Am… 99479 42.2 80.6
## 13 VNM Vietnam Asia 98186856 32.6 75.4
## 14 VUT Vanuatu Oceania 326744 23.1 70.5
## 15 WLF Wallis and Futuna Oceania 11596 NA 79.9
## 16 WSM Samoa Oceania 222390 22 73.3
## 17 YEM Yemen Asia 33696612 20.3 66.1
## 18 ZAF South Africa Africa 59893884 27.3 64.1
## 19 ZMB Zambia Africa 20017670 17.7 63.9
## 20 ZWE Zimbabwe Africa 16320539 19.6 61.5
## # ℹ 11 more variables: aged_65_older <dbl>, aged_70_older <dbl>,
## # gdp_per_capit <dbl>, extreme_poverty <dbl>, cardiovasc_death_rat <dbl>,
## # diabetes_prevalence <dbl>, female_smokers <dbl>, male_smokers <dbl>,
## # handwashing_facilities <dbl>, hospital_beds_per_thousand <dbl>,
## # human_development_index <dbl>
To change a value in the entire tibble
we can do:
owid_country[owid_country=="United States"] <- "USA"
owid_country%>%
select(iso_code,location)%>%
filter(location=="USA")
## # A tibble: 1 × 2
## iso_code location
## <chr> <chr>
## 1 USA USA
group_by() and summarise()
group_by
groups the records by one or more variables and later summarise
allows us to perform operations on the groupings obtained. The variable on which the grouping is performed always appears in the result of summarise
. In the following example, GDP per inhabitant is calculated by grouping it by continent. That is, the data for each country is associated with the group that shares the same continent.
owid_country %>%
group_by(continent) %>%
summarise(
populationContinent=sum(population),
gdp_per_capita=sum(na.omit(gdp_per_capit*population))/populationContinent
) %>%
arrange(desc(gdp_per_capita)) %>%
knitr::kable( caption = "GDP per capita", format = "html", table.attr = "style='width:40%;'")
continent | populationContinent | gdp_per_capita |
---|---|---|
North America | 600323657 | 38468.345 |
Europe | 745629155 | 32792.373 |
Oceania | 45038907 | 31272.576 |
South America | 436816679 | 14519.141 |
Asia | 4721455390 | 12148.167 |
Africa | 1426736614 | 4544.862 |
In this case the sum
function has been used so that the result is the sum of the values of the countries of each continent, other functions can be used such as min
, max
, mean
, median
, sd
, first
, last
, prod
or n
(number of values in the vector).
Combining data from multiple tables
Frequently you have to combine information from several tables. The join type we will use here is the “left_join” which keeps all the records in the first table joining them with those in the second table. As a first example, we are going to combine two OWID tables using iso_code
as a common variable, which is a 3-letter code that identifies each country. To do this we will use the left_join
function from the dplyr
library.
## # A tibble: 41,294 × 95
## iso_code location continent population.x median_age life_expectancy
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 ABW Aruba North America 106459 41.2 76.3
## 2 ABW Aruba North America 106459 41.2 76.3
## 3 ABW Aruba North America 106459 41.2 76.3
## 4 ABW Aruba North America 106459 41.2 76.3
## 5 ABW Aruba North America 106459 41.2 76.3
## 6 ABW Aruba North America 106459 41.2 76.3
## 7 ABW Aruba North America 106459 41.2 76.3
## 8 ABW Aruba North America 106459 41.2 76.3
## 9 ABW Aruba North America 106459 41.2 76.3
## 10 ABW Aruba North America 106459 41.2 76.3
## # ℹ 41,284 more rows
## # ℹ 89 more variables: aged_65_older <dbl>, aged_70_older <dbl>,
## # gdp_per_capit <dbl>, extreme_poverty <dbl>, cardiovasc_death_rat <dbl>,
## # diabetes_prevalence <dbl>, female_smokers <dbl>, male_smokers <dbl>,
## # handwashing_facilities <dbl>, hospital_beds_per_thousand <dbl>,
## # human_development_index <dbl>, country <chr>, year <dbl>,
## # population.y <dbl>, gdp <dbl>, cement_co2 <dbl>, …
The statement by = "iso_code"
is equivalent to join_by(iso_code == iso_code)
. The second has the advantage that we could use a common variable with different names in both tables.
The left_join
function only works when the common variable in both tables is identical, if the common variable is a string (for example the name of a municipality) there may be small differences when writing the municipality in both tables which makes the comparison fail. Ideally, in all tables, each record should be accompanied
of a unique and universal identification code. For example, the INE uses an identification code for each municipality, but unfortunately, this code does not accompany
systematically to the tables that result from making a query in the INE or in the ISTAC related to municipalities. And what’s worse, the names of the municipalities can change slightly from one query to another, and therefore using the name of the municipality as an identifier can cause problems. To resolve
these problems, the first thing to do when reading a INE or ISTAC table is to try to associate the INE universal identification code from the name of the municipality (if we are managing municipalities ). We will take the INE codes of the municipalities from the following table that we filter to only be left with the Canarian municipalities.
uclm_mun <- read.xlsx("https://ctim.es/AEDV/data/uclm-list-mun-2012.xlsx",sheet=1) %>%
as_tibble() %>% filter(CA=="05")
uclm_mun %>%
str()
## tibble [88 × 11] (S3: tbl_df/tbl/data.frame)
## $ codine : chr [1:88] "35001" "35002" "35003" "35004" ...
## $ Municipio : chr [1:88] "Agaete" "Agüimes" "Antigua" "Arrecife" ...
## $ Superficie : num [1:88] 45.5 79.3 250.6 22.7 66.7 ...
## $ Capitalidad : chr [1:88] "Agaete" "Agüimes" "Antigua" "Arrecife" ...
## $ Año : num [1:88] NA NA NA NA NA NA NA NA NA NA ...
## $ CA : chr [1:88] "05" "05" "05" "05" ...
## $ Autonomía : chr [1:88] "Canarias" "Canarias" "Canarias" "Canarias" ...
## $ CP : chr [1:88] "35" "35" "35" "35" ...
## $ Provincia : chr [1:88] "Palmas (Las)" "Palmas (Las)" "Palmas (Las)" "Palmas (Las)" ...
## $ CPJ : chr [1:88] "3504" "3508" "3503" "3501" ...
## $ Partido_Judicial: chr [1:88] "Santa María de Guía de Gran Canaria" "Santa Lucía de Tirajana" "Puerto del Rosario" "Arrecife" ...
Next we are going to read a table from ISTAC on municipalities. We have modified the original table by adding variables with the names of some municipalities slightly changed to have several options when comparing the names: :
istac_municipios <- read.xlsx("https://ctim.es/AEDV/data/istac_municipios.xlsx") %>%
as_tibble()
istac_municipios %>%
filter(is.na(nombre3)==FALSE) %>%
print(n=10)
## # A tibble: 19 × 5
## nombre nombre2 nombre3 isla superficie
## <chr> <chr> <chr> <chr> <dbl>
## 1 Aldea de San Nicolás (La) La Aldea de San Nicolás La Ald… Gran… 124.
## 2 Palmas de Gran Canaria (Las) Las Palmas de Gran Can… Las Pa… Gran… 101.
## 3 Llanos de Aridane (Los) Los Llanos de Aridane Los Ll… La P… 35.8
## 4 Puerto de la Cruz <NA> P. de … Tene… 8.73
## 5 Puerto del Rosario <NA> P. del… Fuer… 290.
## 6 San Andrés y Sauces <NA> S. And… La P… 42.8
## 7 San Bartolomé <NA> S. Bar… Lanz… 40.9
## 8 San Bartolomé de Tirajana San Bartolomé de Tiraj… S. B. … Gran… 333.
## 9 San Cristóbal de La Laguna <NA> La Lag… Tene… 102.
## 10 San Juan de la Rambla <NA> S.J. d… Tene… 20.7
## # ℹ 9 more rows
Since the istac_municipios
table does not have the code INE, we are going to add it. To do this, first of all, we are going to check if through the names of the municipalities in both tables it is possible to find a “matching” without errors between the two tables. For this we are going to use the LeftJoinNearestString
function implemented by us in the file utilidades.R which makes a “left join” between two tibbles of strings u
and v
that correspond to the possible names that we are going to compare in the 2 tables using the adist
function. This function returns a tibble with the following variables:
pos
: vector with the index of the record in tablev
that has the smallest distance from that of tableu
value1
: the string within tableu
that most closely resembles the string in tablev
value2
: the string within tablev
that most closely resembles the one in tableu
dis
: the distance between the strings
value1and
value2`
This function has as additional parameters, the parameters ignore.case
and partial
which by default are assigned as ignore.case=FALSE
and partial=FALSE
, but which can be modified by adding them in the function call.
Note that this function does not generate the combined tables, but with the pos
variable we can match the records from one table to another. Therefore, based on this information we can combine the tables as we wish. The `LeftJoinNearestString function has a high computational cost and is only useful for relatively small databases. If applied to tables with many records it can take a long time.
It must be taken into account that the comparison may fail and therefore it is advisable to verify and correct the results. Next we do the “matching” between the names of the municipalities in both tables and print the results of those with a distance greater than zero.
source("utilidades.R")
res <- LeftJoinNearestString(istac_municipios%>%select(nombre:nombre3),uclm_mun%>%select(Municipio))
res%>%
arrange(desc(dis)) %>%
filter(dis>0) %>%
print(n=100)
## # A tibble: 15 × 4
## pos value1 value2 dis
## <int> <chr> <chr> <dbl>
## 1 88 Pinar del Hierro (El) Pinar de El Hierro, El 5
## 2 61 Paso (El) Paso, El 3
## 3 66 Rosario (El) Rosario, El 3
## 4 75 Sauzal (El) Sauzal, El 3
## 5 78 Tanque (El) Tanque, El 3
## 6 20 Aldea de San Nicolás (La) Aldea de San Nicolás, La 3
## 7 52 Guancha (La) Guancha, La 3
## 8 59 Matanza de Acentejo (La) Matanza de Acentejo, La 3
## 9 14 Oliva (La) Oliva, La 3
## 10 60 Orotava (La) Orotava, La 3
## 11 85 Victoria de Acentejo (La) Victoria de Acentejo, La 3
## 12 16 Palmas de Gran Canaria (Las) Palmas de Gran Canaria, Las 3
## 13 58 Llanos de Aridane (Los) Llanos de Aridane, Los 3
## 14 65 Realejos (Los) Realejos, Los 3
## 15 76 Silos (Los) Silos, Los 3
We observe, inspecting the result, that the matching has no errors and we proceed to add a variable to istac_municipios
with the code of each municipality and save the new table on disk. This way, in the future we could directly use this new table with the INE code.
ine_code <- integer(length(istac_municipios$nombre))
for(i in 1:length(ine_code) ){
ine_code[i] <- uclm_mun$codine[res$pos[i]]
}
istac_municipios$ine_code <- ine_code
istac_municipios %>%
relocate(ine_code) %>%
write.xlsx(file="istac_municipios_ine_codes.xlsx")
Now we are going to try to do the same for the istac_population
table that we previously filtered to keep the data for the year 2021
istac_population2021 <- istac_population %>%
filter(Años==2021) %>%
mutate(Municipalities.by.islands=as.character(Municipios.por.islas))
istac_population2021
## # A tibble: 96 × 6
## Sexos Años Edades.año.a.año Municipios.por.islas value
## <fct> <fct> <fct> <fct> <dbl>
## 1 AMBOS SEXOS 2021 TOTAL CANARIAS 2172944
## 2 AMBOS SEXOS 2021 TOTAL LANZAROTE 156189
## 3 AMBOS SEXOS 2021 TOTAL Arrecife 64497
## 4 AMBOS SEXOS 2021 TOTAL Haría 5365
## 5 AMBOS SEXOS 2021 TOTAL San Bartolomé 19058
## 6 AMBOS SEXOS 2021 TOTAL Teguise 23044
## 7 AMBOS SEXOS 2021 TOTAL Tías 20801
## 8 AMBOS SEXOS 2021 TOTAL Tinajo 6447
## 9 AMBOS SEXOS 2021 TOTAL Yaiza 16977
## 10 AMBOS SEXOS 2021 TOTAL FUERTEVENTURA 119662
## # ℹ 86 more rows
## # ℹ 1 more variable: Municipalities.by.islands <chr>
res <- LeftJoinNearestString(istac_municipios%>%select(nombre:nombre3),istac_population2021%>%select(Municipios.por.islas))
res%>%
arrange(desc(dis)) %>%
filter(dis>0) %>%
print(n=100)
## # A tibble: 88 × 4
## pos value1 value2 dis
## <int> <chr> <chr> <dbl>
## 1 1 Santa María de Guía 1 19
## 2 1 Las Palmas de G.C. 1 18
## 3 1 S. Andrés y Sauces 1 18
## 4 1 S. B. de Tirajana 1 17
## 5 1 S.J. de la Rambla 1 17
## 6 1 S.S. de la Gomera 1 17
## 7 1 S.C. de la Palma 1 16
## 8 1 S.C. de Tenerife 1 16
## 9 1 P. del Rosario 1 14
## 10 1 Valle Gran Rey 1 14
## 11 1 Guía de Isora 1 13
## 12 1 P. de la Cruz 1 13
## 13 1 Fuencaliente 1 12
## 14 1 Los Realejos 1 12
## 15 1 S. Bartolomé 1 12
## 16 1 S. del Teide 1 12
## 17 1 Vallehermoso 1 12
## 18 1 La Victoria 1 11
## 19 1 Santa Lucía 1 11
## 20 1 Valsequillo 1 11
## 21 1 Barlovento 1 10
## 22 1 Betancuria 1 10
## 23 1 Breña Alta 1 10
## 24 1 Breña Baja 1 10
## 25 1 Buenavista 1 10
## 26 1 Candelaria 1 10
## 27 1 El Rosario 1 10
## 28 1 Granadilla 1 10
## 29 1 La Guancha 1 10
## 30 1 La Matanza 1 10
## 31 1 La Orotava 1 10
## 32 1 Los Llanos 1 10
## 33 1 Puntagorda 1 10
## 34 1 Puntallana 1 10
## 35 1 San Miguel 1 10
## 36 1 S. Brígida 1 10
## 37 1 El Sauzal 1 9
## 38 1 El Tanque 1 9
## 39 1 Garachico 1 9
## 40 1 Los Silos 1 9
## 41 1 La Laguna 1 9
## 42 1 S. Úrsula 1 9
## 43 1 Tacoronte 1 9
## 44 1 Tazacorte 1 9
## 45 1 Valleseco 1 9
## 46 1 San Mateo 1 9
## 47 1 Arrecife 1 8
## 48 1 Artenara 1 8
## 49 1 Frontera 1 8
## 50 1 Hermigua 1 8
## 51 1 La Aldea 1 8
## 52 1 La Oliva 1 8
## 53 1 El Pinar 1 8
## 54 1 Tegueste 1 8
## 55 1 Tijarafe 1 8
## 56 1 Valverde 1 8
## 57 1 Vilaflor 1 8
## 58 1 Agüimes 1 7
## 59 1 Alajeró 1 7
## 60 1 Antigua 1 7
## 61 1 El Paso 1 7
## 62 1 Garafía 1 7
## 63 1 Ingenio 1 7
## 64 1 Teguise 1 7
## 65 1 Tuineje 1 7
## 66 1 Agaete 1 6
## 67 1 Arucas 1 6
## 68 1 Fasnia 1 6
## 69 1 Firgas 1 6
## 70 1 Gáldar 1 6
## 71 1 Güímar 1 6
## 72 1 Pájara 1 6
## 73 1 Tejeda 1 6
## 74 1 Tinajo 1 6
## 75 1 Adeje 1 5
## 76 1 Agulo 1 5
## 77 1 Arafo 1 5
## 78 1 Arico 1 5
## 79 1 Arona 1 5
## 80 1 Haría 1 5
## 81 1 Mogán 1 5
## 82 1 Telde 1 5
## 83 1 Teror 1 5
## 84 1 Yaiza 1 5
## 85 1 Icos 1 4
## 86 1 Moya 1 4
## 87 1 Tías 1 4
## 88 1 Mazo 1 4
In this case, we observe that in the municipality of “La Laguna” the matching fails. A simple way to solve this is to edit by hand the file where the istac_municipios
table is stored and add as a possible name of the municipality the name with which it appears in the other table and re-launch the R
script to check if the problem has been resolved.
Tidy data
The same information can be organized in many different ways. However, good organization requires meeting certain criteria for the orderly organization of tidy data. These criteria are essentially:
Each variable (or field) you measure must correspond to a single column. That is, instances of the same variable cannot be assigned to different columns.
Each row can only contain data from a single observation. In other word, in each row you can have several categorical variables (including dates) but a single observation value.
There must be a different table for each “class” of variable. That is, heterogeneous data cannot be combined in the same table.
If you have multiple tables, there must be a column in each table that allows them to be linked.
Let’s illustrate this concept with an example of a tibble of some population data (in millions) from Spain, France and Germany
data <- tibble(
year=c(1985,2000,2015),
Spain=c(38.734,40.75,46.122),
France=c(55.38,59.387,64.395),
Germany=c(77.57,81.896,80.689))
data%>%
knitr::kable( caption = "Unsorted data table", format = "html", table.attr = "style='width:50%;'")
year | Spain | France | Germany |
---|---|---|---|
1985 | 38.734 | 55.380 | 77.570 |
2000 | 40.750 | 59.387 | 81.896 |
2015 | 46.122 | 64.395 | 80.689 |
This way of organizing a table is frequently used because it is a natural way we would write the data if we were doing it by hand. However, it does not meet the “tidy data” criteria because the variables with the country names are instances of the country variable, and therefore the table columns should be year, country and population. The pivot_longer
function automatically converts the previous tibble into a “tidy” tibble by removing columns and adding rows:
data %>%
pivot_longer(c(`Spain`, `France`, `Germany`), names_to = "country", values_to = "population")%>%
knitr::kable( caption = "Ordered data table", format = "html", table.attr = "style='width:30%;'")
year | country | population |
---|---|---|
1985 | Spain | 38.734 |
1985 | France | 55.380 |
1985 | Germany | 77.570 |
2000 | Spain | 40.750 |
2000 | France | 59.387 |
2000 | Germany | 81.896 |
2015 | Spain | 46.122 |
2015 | France | 64.395 |
2015 | Germany | 80.689 |
An important advantage of tidy data is that there are many processes and functions designed to work with this information organization. In particular, data visualization processes require “tidy data.” Sometimes it is interesting to do the opposite operation to pivot_longer
, which is pivot_wider
that organizes
the table by adding columns and removing rows. This allows you to separate the variables and manage them individually:
data %>%
pivot_longer(c(`Spain`, `France`, `Germany`), names_to = "country", values_to = "population") %>%
pivot_wider(names_from = country, values_from = population)
## # A tibble: 3 × 4
## year Spain France Germany
## <dbl> <dbl> <dbl> <dbl>
## 1 1985 38.7 55.4 77.6
## 2 2000 40.8 59.4 81.9
## 3 2015 46.1 64.4 80.7
We note that doing a pivot_longer
and then a pivot_wider
takes us to the original table. These operations allow us to manage
the same table in the format that interests us at all times.
As a summary of this section it may be useful to look at the following summary sheet.
The organization of the public databases that we use in this course differs quite a bit between them. For example, the OWID tables contain many columns with different variables and each line of the table is an observation of all the variables for a country on a specific date. However, the UN tables contain a column that contains the name of the variable and a single column with the numerical value for said variable, in such a way that each line includes only the observation of one variable. The INE and ISTAC bases have a similar organization to that of UN.
Combining data from UN and OWID
OWID databases use a 3-letter ISO (International Organization for Standardization) code to identify countries. However, UN databases use their own encoding. To combine tables from the two databases we will use a table that provides the UN with the equivalences between different country codes that we have stored in the file “data/UN_code.xlsx”
UN_code <- read.xlsx("https://ctim.es/AEDV/data/UN_code.xlsx",sheet=1) %>%
as_tibble()
UN_code %>%
str()
## tibble [246 × 9] (S3: tbl_df/tbl/data.frame)
## $ ctyCode : num [1:246] 533 4 24 660 8 20 784 32 51 16 ...
## $ ISO_A2 : chr [1:246] "AW" "AF" "AO" "AI" ...
## $ ISO_A3 : chr [1:246] "ABW" "AFG" "AGO" "AIA" ...
## $ cty.Name.English : chr [1:246] "Aruba" "Afghanistan" "Angola" "Anguilla" ...
## $ cty.Fullname.English: chr [1:246] "Aruba" "Afghanistan" "Angola" "Anguilla" ...
## $ Cty.Abbreviation : chr [1:246] "Aruba" "Afghanistan" "Angola" "Anguilla" ...
## $ Cty.Comments : chr [1:246] "NULL" "NULL" "NULL" "NULL" ...
## $ Start.Valid.Year : num [1:246] 1988 1962 1962 1981 1962 ...
## $ End.Valid.Year : num [1:246] 2061 2061 2061 2061 2061 ...
In this table, 3 codes appear for each country: a numerical code appears in the ctyCode
variable, a 2-character code appears in the ISO_A2
variable, and a 3-character code appears in the ISO_A3
variable.
Next we read a table from UN, we change the name of some variables of interest, we select
the variables and do a pivot_wider
UN_population <- as_tibble(read_csv("https://ctim.es/AEDV/data/UN_population.csv",skip=1))%>%
rename(un_code=`Region/Country/Area`,country=...2) %>%
select(un_code,country,Year,Series,Value) %>%
pivot_wider(names_from = Series, values_from = Value)
UN_population %>%
str()
## tibble [1,056 × 11] (S3: tbl_df/tbl/data.frame)
## $ un_code : num [1:1056] 1 1 1 1 2 2 2 2 15 15 ...
## $ country : chr [1:1056] "Total, all countries or areas" "Total, all countries or areas" "Total, all countries or areas" "Total, all countries or areas" ...
## $ Year : num [1:1056] 2010 2015 2020 2022 2010 ...
## $ Population mid-year estimates (millions) : num [1:1056] 6986 7427 7841 7975 1055 ...
## $ Population mid-year estimates for males (millions) : num [1:1056] 3514 3737 3944 4009 526 ...
## $ Population mid-year estimates for females (millions): num [1:1056] 3471 3689 3897 3967 529 ...
## $ Sex ratio (males per 100 females) : num [1:1056] 101.2 101.3 101.2 101.1 99.3 ...
## $ Population aged 0 to 14 years old (percentage) : num [1:1056] 27.1 26.4 25.7 25.3 41.5 41.3 40.5 40.1 32.4 32.8 ...
## $ Population aged 60+ years old (percentage) : num [1:1056] 11.1 12.3 13.5 13.9 5 5.2 5.4 5.5 6.6 7.3 ...
## $ Population density : num [1:1056] 53.6 57 60.1 61.2 35.7 40.6 46 48.3 26.9 29.7 ...
## $ Surface area (thousand km2) : num [1:1056] NA 136162 130094 NA NA ...
Now, using left_join
we will join the tables and sort their variables a bit
owid_country %>%
select(iso_code:population) %>%
left_join(UN_code,join_by(iso_code==ISO_A3)) %>%
left_join(UN_population,join_by (ctyCode==un_code)) %>%
relocate(Year, .after=iso_code) %>%
str()
## tibble [882 × 22] (S3: tbl_df/tbl/data.frame)
## $ iso_code : chr [1:882] "ABW" "ABW" "ABW" "ABW" ...
## $ Year : num [1:882] 2010 2015 2020 2022 2010 ...
## $ location : chr [1:882] "Aruba" "Aruba" "Aruba" "Aruba" ...
## $ continent : chr [1:882] "North America" "North America" "North America" "North America" ...
## $ population : num [1:882] 106459 106459 106459 106459 41128772 ...
## $ ctyCode : num [1:882] 533 533 533 533 4 4 4 4 24 24 ...
## $ ISO_A2 : chr [1:882] "AW" "AW" "AW" "AW" ...
## $ cty.Name.English : chr [1:882] "Aruba" "Aruba" "Aruba" "Aruba" ...
## $ cty.Fullname.English : chr [1:882] "Aruba" "Aruba" "Aruba" "Aruba" ...
## $ Cty.Abbreviation : chr [1:882] "Aruba" "Aruba" "Aruba" "Aruba" ...
## $ Cty.Comments : chr [1:882] "NULL" "NULL" "NULL" "NULL" ...
## $ Start.Valid.Year : num [1:882] 1988 1988 1988 1988 1962 ...
## $ End.Valid.Year : num [1:882] 2061 2061 2061 2061 2061 ...
## $ country : chr [1:882] "Aruba" "Aruba" "Aruba" "Aruba" ...
## $ Population mid-year estimates (millions) : num [1:882] 0.1 0.1 0.11 0.11 28.19 ...
## $ Population mid-year estimates for males (millions) : num [1:882] 0.05 0.05 0.05 0.05 14.24 ...
## $ Population mid-year estimates for females (millions): num [1:882] 0.05 0.05 0.06 0.06 13.95 ...
## $ Sex ratio (males per 100 females) : num [1:882] 91.2 90.2 89.1 89.2 102.1 ...
## $ Population aged 0 to 14 years old (percentage) : num [1:882] 20.7 18.8 17.4 16.2 49 45.8 43.8 43.1 45.7 45.8 ...
## $ Population aged 60+ years old (percentage) : num [1:882] 14.7 18.2 22.4 23.9 3.8 3.8 3.8 3.8 4.1 4.1 ...
## $ Population density : num [1:882] 557.5 579.2 592.1 591.4 43.4 ...
## $ Surface area (thousand km2) : num [1:882] NA 0 0 NA NA ...
3.4 Initial exploratory analysis
Any data analysis must begin by understanding the content and organization of the data to be used. To do this, we will systematically address the following preliminary steps:
- Open the files with the data in text format or with Excel (if it is an Excel sheet) and analyze its content.
- Locate and understand the possible associated metadata. The metadata that, in general, accompany the data, provide us with very valuable information to understand the content of the data. This metadata may be in other files or included somewhere in the data file. Keep in mind that in this case, the metadata is lost when loading the data as a tibble and you have to look at it in the original data files. This happens, for example, in files in
px
format. - Identify the types of the variables. If there is a variable with date information, identify the format. Study if it is necessary to coerce variables and transform the variable with date information.
- Identify if the data has a ‘tidy’ organization, that is, if each row includes a single observation accompanied by possible categorical variables (used to classify) or, on the contrary, each row contains multiple observations.
- Identify the range of categorical variables to understand the possible data classification items. For example, if the categorical variable is sex, identify what possible values this variable can take. To do this we can use the
R
functionalitylevels(as.factor(VariableName))
- Identify if there is an index variable that will allow us to combine the information from our data table with other tables.
- Analyze the absence of data ‘NA’ and identify how the absence of data is coded in the analyzed data. In the case of time series, which is data that changes over time, the absence of data may be due to missing intermediate dates. This in
R
can be detected using thelag
function to compare a date variable with the same variable shifted by one value. Let’s look at an example:
## Time differences in days
## [1] NA 1 2 1
In this way, we can detect that dates are missing when the value of date-lag(date)
is greater than 1. All this initial exploratory analysis is essential and must be done as a preliminary step to any subsequent data processing.
References
[Pe15] Roger D. Peng. R Programming for Data Science, Lulu, 2015.
[WiÇeGa23] Wickham, Hadley, Mine Çetinkaya-Rundel and Garrett Grolemund. R for Data Science (2e), O’Reilly Media, 2023.