Chapter 3 Data Wrangling

In the examples we have seen so far, the data that we have used has come from a single data set and has largely been neat and clean. This is unfortunately seldom the case for real-world projects! One of the most challenging aspects of making compelling visualizations is to prepare the data. In this chapter, we will cover some basic principles and tools that will help you ensure that your data sets are in shape.

3.1 Tidy Data

Most of the time you will spend working with data will be spend cleaning it! This is because we often have messy data. We should thus attempt to tidy it based on the following principles:

  • each variable has its own column
  • each observation has its own row
  • each value has its own cell

Messy data violates any one of these principles, In this chapter, we will practice some strategies for cleaning, reshaping and merging data sets.

3.2 WDI Example

We can look up the variables available through this package with the command WDIsearch(). Below, we are getting data on country population from the World Development Indicators API.1

library(WDI)
raw_wdi <- WDI(indicator = "NY.GDP.PCAP.CD") 

Some basic tidyverse syntax allows us to clean and subset out data by selecting columns, renaming variables, and filtering rows.

df_wdi <- raw_wdi %>% 
  # Let's select some columns
  select(iso2c, year, NY.GDP.PCAP.CD) %>% 
  # Let's rename some variables
  rename(gdp_pc_usd = NY.GDP.PCAP.CD, 
         country_code = iso2c) %>% 
  # Let's filter for some countries/years
  filter(country_code %in% c("FR","DE","ES","HU","IT","GR"),
         year %in% 1995:2020)
Table 3.1: WDI GDP per capita Data (API)
country_code year gdp_pc_usd
FR 2020 39030.36
FR 2019 40578.64
FR 2018 41572.49
FR 2017 38685.26
FR 2016 37037.37
FR 2015 36638.18
FR 2014 43011.26
FR 2013 42592.94
FR 2012 40874.70
FR 2011 43790.73
FR 2010 40638.34
FR 2009 41575.42
FR 2008 45334.12
FR 2007 41508.43
FR 2006 36443.63
FR 2005 34760.19
FR 2004 33741.26
FR 2003 29568.39
FR 2002 24177.34
FR 2001 22433.55
FR 2000 22364.03
FR 1999 24673.20
FR 1998 24974.27
FR 1997 24228.95
FR 1996 26871.83
FR 1995 26890.22
DE 2020 46208.43
DE 2019 46794.90
DE 2018 47950.18
DE 2017 44542.30
DE 2016 42107.52
DE 2015 41086.73
DE 2014 47959.99
DE 2013 46285.76
DE 2012 43858.36
DE 2011 46644.78
DE 2010 41531.93
DE 2009 41485.90
DE 2008 45427.15
DE 2007 41587.21
DE 2006 36323.45
DE 2005 34507.37
DE 2004 34044.05
DE 2003 30243.58
DE 2002 25077.73
DE 2001 23607.88
DE 2000 23635.93
DE 1999 26725.92
DE 1998 27289.06
DE 1997 26964.05
df_wdi %>% 
  filter(country_code%notin%c("IT","ES")) %>%
  ggplot(aes(x=year,y=gdp_pc_usd))+
  geom_line(aes(group=country_code,
                color=country_code,
                ),
            size=1)+
  geom_point(aes(color=country_code),
             shape=15)+
  scale_color_manual(values = cb_palette)+
  scale_y_continuous(labels=scales::dollar_format())+
  labs(title = "GDP Per Capita in Europe",
       subtitle = "Current USD", 
       y=NULL,
       x=NULL,
       color="Country")+
  theme_classic()

3.2.1 Wide to long format

Our raw_pop object contains the original (raw) file from the World Bank. You can download this directly from the the World Bank’s website or from here. In the code below, we are reading this MS Excel file (.xls) into an object raw_pop, skipping the first three rows of the original document.[^Note that here our file is in a folder called data.]

raw_pop <- readxl::read_excel("data/API_SP.POP.TOTL_DS2_en_excel_v2_3158861.xls",
                                    skip = 3) 
Table 3.2: WDI Population Data (Wide)
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
Aruba ABW Population, total SP.POP.TOTL 54208 55434 56234 56699 57029 57357 57702 58044 58377 58734 59070 59442 59849 60236 60527 60653 60586 60366 60102 59972 60097 60561 61341 62213 62826 63024 62645 61838 61072 61033 62152 64623 68240 72495 76705 80324 83211 85450 87280 89009 90866 92892 94992 97016 98744 100028 100830 101226 101362 101452 101665 102050 102565 103165 103776 104339 104865 105361 105846 106310 106766
Africa Eastern and Southern AFE Population, total SP.POP.TOTL 130836765 134159786 137614644 141202036 144920186 148769974 152752671 156876454 161156430 165611760 170257189 175100167 180141148 185376550 190800796 196409937 202205766 208193045 214368393 220740384 227305945 234058404 240999134 248146290 255530063 263161451 271050065 279184536 287524258 296024639 304648010 313394693 322270073 331265579 340379934 349605660 358953595 368440591 378098393 387977990 398113044 408522129 419223717 430246635 441630149 453404076 465581372 478166911 491173160 504604672 518468229 532760424 547482863 562601578 578075373 593871847 609978946 626392880 643090131 660046272 677243299
Afghanistan AFG Population, total SP.POP.TOTL 8996967 9169406 9351442 9543200 9744772 9956318 10174840 10399936 10637064 10893772 11173654 11475450 11791222 12108963 12412960 12689164 12943093 13171294 13341199 13411060 13356500 13171679 12882518 12537732 12204306 11938204 11736177 11604538 11618008 11868873 12412311 13299016 14485543 15816601 17075728 18110662 18853444 19357126 19737770 20170847 20779957 21606992 22600774 23680871 24726689 25654274 26433058 27100542 27722281 28394806 29185511 30117411 31161378 32269592 33370804 34413603 35383028 36296111 37171922 38041757 38928341
Africa Western and Central AFW Population, total SP.POP.TOTL 96396419 98407221 100506960 102691339 104953470 107289875 109701811 112195950 114781116 117468741 120269044 123184308 126218502 129384954 132699537 136173544 139813171 143615715 147571063 151663853 155882270 160223588 164689764 169279422 173991851 178826553 183785612 188868567 194070079 199382783 204803865 210332267 215976366 221754806 227692136 233807627 240114179 246613750 253302310 260170348 267214544 274433894 281842480 289469530 297353098 305520588 313985474 322741656 331772330 341050537 350556886 360285439 370243017 380437896 390882979 401586651 412551299 423769930 435229381 446911598 458803476
Angola AGO Population, total SP.POP.TOTL 5454938 5531451 5608499 5679409 5734995 5770573 5781305 5774440 5771973 5803677 5890360 6041239 6248965 6497283 6761623 7023994 7279630 7533814 7790774 8058112 8341290 8640478 8952971 9278104 9614756 9961993 10320116 10689247 11068051 11454784 11848385 12248901 12657361 13075044 13503753 13945205 14400722 14871572 15359600 15866871 16395477 16945753 17519418 18121477 18758138 19433604 20149905 20905360 21695636 22514275 23356247 24220660 25107925 26015786 26941773 27884380 28842482 29816769 30809787 31825299 32866268
Albania ALB Population, total SP.POP.TOTL 1608800 1659800 1711319 1762621 1814135 1864791 1914573 1965598 2022272 2081695 2135479 2187853 2243126 2296752 2350124 2404831 2458526 2513546 2566266 2617832 2671997 2726056 2784278 2843960 2904429 2964762 3022635 3083605 3142336 3227943 3286542 3266790 3247039 3227287 3207536 3187784 3168033 3148281 3128530 3108778 3089027 3060173 3051010 3039616 3026939 3011487 2992547 2970017 2947314 2927519 2913021 2905195 2900401 2895092 2889104 2880703 2876101 2873457 2866376 2854191 2837743
Andorra AND Population, total SP.POP.TOTL 13410 14378 15379 16407 17466 18542 19646 20760 21886 23053 24275 25571 26885 28232 29515 30705 31782 32769 33744 34825 36063 37498 39115 40854 42706 44593 46520 48459 50433 52452 54508 56666 58882 60974 62676 63860 64363 64318 64140 64368 65390 67344 70048 73180 76250 78871 80995 82682 83860 84461 84454 83748 82427 80770 79213 77993 77295 76997 77008 77146 77265
Arab World ARB Population, total SP.POP.TOTL 92197715 94724540 97334438 100034191 102832792 105736428 108758634 111899335 115136161 118437193 121785630 125164720 128598743 132161302 135952270 140040580 144453278 149161836 154111160 159218539 164420771 169698978 175061794 180505967 186035280 191650328 197338140 203084958 208889669 214753965 222653371 228731671 232956364 239243294 245449429 253107302 259000937 264822167 270575777 276393809 282344141 288432153 294665202 301113869 307862846 314965776 322452764 330290752 338395936 346629179 354890097 363156846 371437642 379696477 387899835 396028301 404042892 411942825 419851989 427870273 436080728
United Arab Emirates ARE Population, total SP.POP.TOTL 92417 100801 112112 125130 138049 149855 159979 169768 182620 203103 234512 277463 330968 394625 467457 548295 637926 735347 835498 931752 1019507 1096602 1164816 1228457 1293970 1366165 1446386 1533526 1627068 1725676 1828437 1937159 2052892 2173135 2294377 2415099 2539121 2671361 2813214 2966029 3134067 3302722 3478769 3711931 4068577 4588222 5300172 6168846 7089486 7917368 8549998 8946778 9141598 9197908 9214182 9262896 9360975 9487206 9630966 9770526 9890400
Argentina ARG Population, total SP.POP.TOTL 20481781 20817270 21153042 21488916 21824427 22159644 22494031 22828872 23168268 23517613 23880564 24259564 24653172 25056475 25462305 25865775 26264681 26661397 27061041 27471046 27896532 28338514 28794550 29262049 29737097 30216284 30698964 31184411 31668939 32148137 32618648 33079002 33529320 33970103 34402669 34828168 35246376 35657438 36063451 36467218 36870796 37275644 37681743 38087866 38491970 38892924 39289876 39684303 40080159 40482786 40788453 41261490 41733271 42202935 42669500 43131966 43590368 44044811 44494502 44938712 45376763
Armenia ARM Population, total SP.POP.TOTL 1874119 1941498 2009524 2077584 2145004 2211316 2276038 2339133 2401142 2462938 2525067 2587716 2650484 2712780 2773750 2832752 2889583 2944375 2997419 3049107 3099759 3148096 3193696 3238592 3285593 3335935 3392264 3451947 3504667 3536473 3538164 3505249 3442820 3363111 3283664 3217349 3168213 3133081 3108691 3089020 3069597 3050686 3033976 3017938 3000715 2981262 2958301 2932615 2907615 2888094 2877314 2876536 2884239 2897593 2912403 2925559 2936147 2944789 2951741 2957728 2963234
American Samoa ASM Population, total SP.POP.TOTL 20127 20605 21246 22029 22850 23675 24473 25235 25980 26698 27362 27982 28564 29103 29595 30045 30455 30834 31262 31842 32648 33697 34969 36413 37946 39521 41114 42741 44346 45894 47351 48682 49900 51025 52099 53158 54209 55227 56180 57049 57816 58496 59077 59495 59684 59557 59109 58367 57490 56675 56084 55755 55669 55717 55791 55806 55739 55617 55461 55312 55197
Antigua and Barbuda ATG Population, total SP.POP.TOTL 54132 55005 55849 56701 57641 58699 59912 61240 62523 63553 64184 64354 64134 63649 63108 62671 62353 62162 62038 61948 61861 61789 61780 61779 61784 61785 61754 61713 61758 62007 62533 63363 64459 65777 67201 68672 70176 71707 73219 74674 76007 77212 78298 79311 80347 81462 82715 84029 85394 86743 88030 89250 90407 91510 92562 93571 94520 95425 96282 97115 97928
Australia AUS Population, total SP.POP.TOTL 10276477 10483000 10742000 10950000 11167000 11388000 11651000 11799000 12009000 12263000 12507000 12937000 13177000 13380000 13723000 13893000 14033000 14192000 14358000 14514000 14692000 14927000 15178000 15369000 15544000 15758000 16018400 16263900 16532200 16814400 17065100 17284000 17495000 17667000 17855000 18072000 18311000 18517000 18711000 18926000 19153000 19413000 19651400 19895400 20127400 20394800 20697900 20827600 21249200 21691700 22031750 22340024 22733465 23128129 23475686 23815995 24190907 24601860 24982688 25365745 25687041
Austria AUT Population, total SP.POP.TOTL 7047539 7086299 7129864 7175811 7223801 7270889 7322066 7376998 7415403 7441055 7467086 7500482 7544201 7586115 7599038 7578903 7565525 7568430 7562305 7549425 7549433 7568710 7574140 7561910 7561434 7564985 7569794 7574586 7585317 7619567 7677850 7754891 7840709 7905633 7936118 7948278 7959017 7968041 7976789 7992324 8011566 8042293 8081957 8121423 8171966 8227829 8268641 8295487 8321496 8343323 8363404 8391643 8429991 8479823 8546356 8642699 8736668 8797566 8840521 8879920 8917205
Azerbaijan AZE Population, total SP.POP.TOTL 3895398 4030325 4171428 4315127 4456691 4592601 4721528 4843872 4960237 5071930 5180032 5284518 5385266 5483088 5579071 5674129 5768730 5863138 5957927 6053635 6150735 6249312 6349555 6452067 6557581 6666447 6778631 6893486 7010027 7126877 7159000 7271000 7382000 7495000 7597000 7685000 7763000 7838250 7913000 7982750 8048600 8111200 8171950 8234100 8306500 8391850 8484550 8581300 8763400 8947243 9054332 9173082 9295784 9416801 9535079 9649341 9757812 9854033 9939771 10024283 10110116
Burundi BDI Population, total SP.POP.TOTL 2797925 2852438 2907320 2964416 3026292 3094378 3170496 3253215 3336930 3413909 3479070 3530000 3569655 3605120 3646428 3700879 3770870 3854446 3949264 4051239 4157296 4266520 4379727 4497544 4621096 4750832 4886745 5027143 5168703 5307069 5438959 5564923 5685569 5798054 5898964 5987044 6060110 6122130 6185564 6267132 6378871 6525546 6704118 6909161 7131688 7364857 7607850 7862226 8126104 8397661 8675606 8958406 9245992 9540302 9844301 10160034 10488002 10827010 11175379 11530577 11890781
Belgium BEL Population, total SP.POP.TOTL 9153489 9183948 9220578 9289770 9378113 9463667 9527807 9580991 9618756 9646032 9655549 9673162 9711115 9741720 9772419 9800700 9818227 9830358 9839534 9848382 9859242 9858982 9856303 9855520 9855372 9858308 9861823 9870234 9901664 9937697 9967379 10004486 10045158 10084475 10115603 10136811 10156637 10181245 10203008 10226419 10251250 10286570 10332785 10376133 10421137 10478617 10547958 10625700 10709973 10796493 10895586 11038264 11106932 11159407 11209057 11274196 11331422 11375158 11427054 11488980 11555997
Benin BEN Population, total SP.POP.TOTL 2431617 2465865 2502897 2542864 2585961 2632361 2682159 2735308 2791588 2850657 2912338 2976575 3043563 3113681 3187413 3265167 3347169 3433445 3523933 3618519 3717161 3820126 3927717 4039940 4156819 4278502 4404504 4535262 4672844 4820020 4978489 5149496 5331805 5521761 5714215 5905552 6094272 6281644 6470275 6664102 6865946 7076728 7295400 7520556 7750003 7982223 8216893 8454790 8696915 8944713 9199254 9460829 9729254 10004594 10286839 10575962 10872072 11175192 11485035 11801151 12123198
Burkina Faso BFA Population, total SP.POP.TOTL 4829289 4894580 4960328 5027811 5098891 5174874 5256360 5343025 5434046 5528172 5624592 5723378 5825174 5930493 6040045 6154554 6274032 6398933 6530820 6671656 6822837 6985166 7158259 7340910 7531239 7727908 7930689 8140080 8356313 8579818 8811033 9050086 9297110 9552473 9816584 10089880 10372734 10665552 10968722 11282696 11607951 11944589 12293097 12654624 13030576 13421935 13829173 14252029 14689725 15141098 15605211 16081915 16571252 17072791 17586029 18110616 18646350 19193236 19751466 20321383 20903278
Bangladesh BGD Population, total SP.POP.TOTL 48013505 49362834 50752150 52202008 53741721 55385114 57157651 59034250 60918452 62679765 64232486 65531635 66625706 67637541 68742222 70066310 71652386 73463593 75450033 77529040 79639498 81767516 83932132 86142490 88416529 90764180 93187593 95671159 98186350 100695496 103171957 105599125 107983708 110350641 112737684 115169933 117649927 120160571 122682818 125189655 127657862 130088709 132478077 134791598 136986429 139035505 140921154 142660381 144304164 145924795 147575433 149273134 151005733 152761413 154517385 156256287 157977151 159685421 161376713 163046173 164689383
Bulgaria BGR Population, total SP.POP.TOTL 7867374 7943118 8012946 8078145 8144340 8204168 8258057 8310226 8369603 8434172 8489574 8536395 8576200 8620967 8678745 8720742 8758599 8804183 8814032 8825940 8861535 8891117 8917457 8939738 8960679 8960547 8958171 8971359 8981446 8876972 8718289 8632367 8540164 8472313 8443591 8406067 8362826 8312068 8256786 8210624 8170172 8009142 7837161 7775327 7716860 7658972 7601022 7545338 7492561 7444443 7395599 7348328 7305888 7265115 7223938 7177991 7127822 7075947 7025037 6975761 6927288
Bahrain BHR Population, total SP.POP.TOTL 162429 167899 173140 178142 182888 187432 191785 196060 200652 206037 212607 220311 229151 239527 251908 266540 283746 303169 323468 342804 359897 374120 385953 396451 407233 419428 433478 448981 465198 481081 495927 509762 523082 536212 549590 563698 578661 594927 613697 636540 664610 697550 735140 778708 829846 889157 958423 1035924 1114645 1185075 1240864 1278153 1299942 1315029 1336073 1371853 1425793 1494077 1569440 1641164 1701583
Bahamas, The BHS Population, total SP.POP.TOTL 109532 115119 121092 127340 133705 140060 146381 152621 158648 164265 169376 173894 177863 181519 185105 188895 192903 197100 201482 205991 210589 215321 220182 225095 229913 234579 239020 243265 247451 251738 256227 261007 266028 271065 275849 280179 283980 287363 290600 294063 298045 302618 307657 313123 318893 324848 331032 337387 343680 349600 354936 359583 363581 367162 370625 374200 377923 381749 385635 389486 393248
Bosnia and Herzegovina BIH Population, total SP.POP.TOTL 3225664 3288604 3353228 3417573 3478999 3535632 3586630 3632678 3675448 3717476 3760536 3805286 3851153 3897259 3942221 3985107 4025271 4063188 4100355 4138816 4179850 4222479 4265183 4307907 4350566 4392135 4435925 4479516 4509462 4507819 4463422 4369320 4233673 4078940 3936527 3829049 3764419 3736070 3734338 3743353 3751176 3755514 3759389 3762179 3764194 3765332 3765422 3762791 3754261 3735945 3705478 3661173 3604972 3542598 3482106 3429362 3386263 3351534 3323929 3300998 3280815
Belarus BLR Population, total SP.POP.TOTL 8198000 8271216 8351928 8437232 8524224 8610000 8696496 8785648 8874552 8960304 9040000 9115576 9188968 9257272 9317584 9367000 9411000 9463000 9525000 9584000 9643000 9710000 9776000 9843000 9910000 9975000 10043000 10111000 10140000 10170000 10189000 10194000 10216000 10239000 10227000 10194000 10160000 10117000 10069000 10026738 9979610 9928549 9865548 9796749 9730146 9663915 9604924 9560953 9527985 9506765 9490583 9473172 9464495 9465997 9474511 9489616 9501534 9498264 9483499 9417849 9398861
Belize BLZ Population, total SP.POP.TOTL 92068 94700 97392 100165 103069 106120 109348 112707 116065 119269 122184 124795 127152 129294 131305 133264 135143 136990 138975 141302 144142 147572 151486 155820 160341 164918 169569 174333 179023 183471 187554 191136 194324 197625 201679 206962 213660 221575 230248 238979 247310 255068 262387 269428 276516 283798 291338 299031 306822 314655 322465 330236 338001 345707 353366 360926 368399 375775 383071 390351 397621
Bermuda BMU Population, total SP.POP.TOTL 44400 45500 46600 47700 48900 50100 51000 52000 53000 54000 55000 54600 54200 53800 53400 53000 53200 53400 53600 53800 54670 55050 55449 55930 56423 56898 57382 57849 58347 58841 59326 59021 58595 58910 59320 59746 60129 60497 60943 61285 61833 62504 62912 63325 63740 64154 64523 64888 65273 65636 65124 64564 64798 65001 65138 65237 64554 63873 63919 63913 63903
Bolivia BOL Population, total SP.POP.TOTL 3656961 3728954 3802996 3879191 3957759 4038865 4122517 4208683 4297522 4389248 4484004 4581752 4682392 4785916 4892294 5001413 5113458 5228253 5344946 5462413 5579932 5697098 5814344 5932812 6054126 6179460 6309129 6442824 6580318 6721117 6864839 7011456 7160917 7312857 7466792 7622334 7779268 7937453 8096761 8257066 8418270 8580244 8742822 8905820 9069044 9232301 9395449 9558438 9721457 9884790 10048597 10212951 10377677 10542375 10706517 10869732 11031822 11192853 11353140 11513102 11673029
Brazil BRA Population, total SP.POP.TOTL 72179235 74311338 76514329 78772647 81064572 83373533 85696502 88035815 90387079 92746607 95113265 97482928 99859388 102259497 104706193 107216209 109790943 112425392 115121158 117878412 120694012 123570327 126498322 129448815 132383569 135274083 138108915 140891606 143627505 146328305 149003225 151648007 154259382 156849086 159432717 162019889 164614682 167209046 169785253 172318674 174790339 177196051 179537523 181809244 184006479 186127108 188167353 190130445 192030362 193886505 195713637 197514541 199287292 201035904 202763744 204471759 206163056 207833825 209469320 211049519 212559409
Barbados BRB Population, total SP.POP.TOTL 230985 231718 232623 233632 234588 235415 236084 236661 237241 237963 238895 240093 241523 243076 244643 246158 247584 248931 250200 251347 252388 253296 254078 254791 255493 256260 257117 258012 258970 259961 260933 261912 262890 263869 264893 265955 267047 268183 269334 270455 271511 272494 273423 274331 275283 276320 277475 278701 279946 281107 282131 282987 283698 284294 284825 285327 285798 286229 286640 287021 287371
Brunei Darussalam BRN Population, total SP.POP.TOTL 81707 85560 89484 93540 97819 102390 107274 112446 117897 123596 129530 135672 142015 148516 155069 161635 168173 174717 181201 187596 193880 200027 206064 212073 218176 224440 230917 237565 244405 251456 258714 266208 273888 281684 289452 297112 304620 311962 319135 326214 333166 340037 346777 353295 359434 365112 370262 374967 379418 383902 388634 393687 398997 404414 409778 414914 419791 424481 428960 433296 437483
Bhutan BTN Population, total SP.POP.TOTL 223284 228849 234552 240529 246961 253993 261664 269944 278731 287886 297307 306957 316822 326986 337491 348395 359721 371424 383324 395192 406883 418107 428938 439823 451471 464264 478683 494313 509532 522173 530801 534637 534525 532590 531905 534629 541471 551713 564378 577886 591014 603643 616025 627840 638809 648744 657404 664873 671611 678329 685502 693297 701582 710235 719053 727885 736706 745563 754396 763094 771612
Botswana BWA Population, total SP.POP.TOTL 502733 512688 523777 535692 547870 559996 571957 584098 596946 611297 627714 646350 667096 689906 714701 741346 769982 800532 832467 865073 897860 930412 962859 996124 1031439 1069585 1110948 1154904 1200073 1244484 1286756 1326321 1363541 1399110 1434061 1469173 1504724 1540424 1575827 1610260 1643333 1674674 1704637 1734387 1765533 1799077 1835911 1875458 1915636 1953495 1987106 2015406 2039551 2062551 2088619 2120716 2159925 2205076 2254067 2303703 2351625
Central African Republic CAF Population, total SP.POP.TOTL 1501668 1526057 1551908 1579375 1608618 1639706 1673019 1708306 1744198 1778870 1811157 1840517 1867786 1894850 1924386 1958367 1997017 2039914 2087662 2140778 2199359 2264441 2335339 2408322 2478382 2542170 2597765 2646836 2693974 2745735 2806740 2878507 2959236 3046148 3135017 3222662 3308235 3392432 3475485 3558019 3640421 3722016 3802129 3881185 3959883 4038380 4118075 4198004 4273368 4337623 4386765 4418639 4436411 4447945 4464171 4493171 4537683 4596023 4666375 4745179 4829764
Canada CAN Population, total SP.POP.TOTL 17909009 18271000 18614000 18964000 19325000 19678000 20048000 20412000 20744000 21028000 21324000 21962032 22218463 22491777 22807969 23143275 23449808 23725843 23963203 24201544 24515667 24819915 25116942 25366451 25607053 25842116 26100278 26446601 26791747 27276781 27691138 28037420 28371264 28684764 29000663 29302311 29610218 29905948 30155173 30401286 30685730 31020902 31360079 31644028 31940655 32243753 32571174 32889025 33247118 33628895 34004889 34339328 34714222 35082954 35437435 35702908 36109487 36545295 37065178 37593384 38005238
Central Europe and the Baltics CEB Population, total SP.POP.TOTL 91401764 92232738 93009498 93840016 94715795 95440988 96146336 97043270 97884022 98606630 99134548 99635258 100357161 101112680 101939916 102860571 103776068 104616884 105329397 105948616 106541316 107129392 107730380 108297837 108838073 109338285 109824166 110296425 110686740 110801640 110743128 110469467 110111454 110041924 110021594 109864246 109626194 109422013 109238340 109060951 108447824 107660041 106959751 106624167 106331716 106041911 105772481 105378748 105001883 104800475 104421447 104174038 103935318 103713726 103496179 103257886 102994278 102740078 102538451 102398537 102246330
Switzerland CHE Population, total SP.POP.TOTL 5327827 5434294 5573815 5694247 5789228 5856472 5918002 5991785 6067714 6136387 6180877 6213399 6260956 6307347 6341405 6338632 6302504 6281174 6281738 6294365 6319408 6354074 6391309 6418773 6441865 6470365 6504124 6545106 6593386 6646912 6715519 6799978 6875364 6938265 6993795 7040687 7071850 7088906 7110001 7143991 7184250 7229854 7284753 7339001 7389625 7437115 7483934 7551117 7647675 7743831 7824909 7912398 7996861 8089346 8188649 8282396 8373338 8451840 8514329 8575280 8636896
Channel Islands CHI Population, total SP.POP.TOTL 109419 110398 111464 112591 113777 114989 116229 117469 118725 119975 121200 122405 123611 124722 125680 126420 126908 127186 127395 127691 128204 128986 129976 131153 132445 133800 135242 136762 138250 139580 140677 141469 142006 142407 142867 143480 144348 145387 146500 147553 148439 149097 149591 150070 150722 151674 152999 154644 156441 158187 159718 160990 162051 163039 164107 165387 166922 168666 170496 172264 173859
Chile CHL Population, total SP.POP.TOTL 8132988 8303804 8476895 8650390 8821855 8989607 9152848 9312091 9468851 9625304 9783134 9942716 10103675 10265827 10428803 10592310 10756876 10922777 11089165 11254877 11419350 11582020 11743909 11907955 12078137 12257238 12445833 12642917 12847712 13058758 13274617 13495255 13719818 13944934 14166346 14380864 14587367 14786227 14977736 15162801 15342350 15516112 15684413 15849649 16014972 16182713 16354507 16530201 16708255 16886184 17062531 17233584 17400359 17571511 17758969 17969356 18209072 18470435 18729166 18952035 19116209
China CHN Population, total SP.POP.TOTL 667070000 660330000 665770000 682335000 698355000 715185000 735400000 754550000 774510000 796025000 818315000 841105000 862030000 881940000 900350000 916395000 930685000 943455000 956165000 969005000 981235000 993885000 1008630000 1023310000 1036825000 1051040000 1066790000 1084035000 1101630000 1118650000 1135185000 1150780000 1164970000 1178440000 1191835000 1204855000 1217550000 1230075000 1241935000 1252735000 1262645000 1271850000 1280400000 1288400000 1296075000 1303720000 1311020000 1317885000 1324655000 1331260000 1337705000 1344130000 1350695000 1357380000 1364270000 1371220000 1378665000 1386395000 1392730000 1397715000 1402112000
Cote d’Ivoire CIV Population, total SP.POP.TOTL 3503559 3631547 3770756 3918630 4071411 4226843 4383723 4544168 4713134 4897470 5102070 5328735 5576026 5841513 6121295 6412409 6713949 7026497 7350269 7685927 8033652 8393689 8764991 9144940 9530103 9918204 10307268 10698188 11094740 11502453 11924873 12362404 12812428 13271638 13735438 14199759 14665125 15130674 15589407 16032573 16454660 16853027 17231539 17599613 17970493 18354513 18754914 19171250 19605568 20059147 20532944 21028652 21547188 22087506 22647672 23226148 23822726 24437475 25069226 25716554 26378275
Cameroon CMR Population, total SP.POP.TOTL 5176920 5285015 5398730 5518104 5643039 5773538 5909874 6052419 6201410 6357096 6519754 6689659 6867170 7052847 7247284 7451057 7664398 7887571 8121081 8365560 8621409 8888534 9166813 9456496 9757849 10070806 10395481 10731058 11075423 11425807 11780086 12137912 12499499 12864091 13230978 13599984 13970812 14344444 14723772 15112598 15513944 15928910 16357605 16800869 17259322 17733408 18223677 18730283 19252674 19789922 20341236 20906392 21485267 22077300 22681853 23298376 23926549 24566070 25216261 25876387 26545864
Congo, Dem. Rep.  COD Population, total SP.POP.TOTL 15248256 15637700 16041187 16461828 16903830 17369882 17862052 18378620 18913874 19459818 20011033 20564062 21121360 21690448 22282127 22903587 23560470 24249127 24956387 25663598 26358905 27040329 27717293 28403858 29119663 29881222 30683877 31528702 32443784 33464767 34612023 35908240 37333917 38815835 40252973 41576239 42757239 43827191 44849968 45919615 47105830 48428534 49871670 51425583 53068869 54785894 56578046 58453687 60411195 62448572 64563853 66755151 69020749 71358804 73767445 76244532 78789130 81398765 84068092 86790568 89561404
Congo, Rep.  COG Population, total SP.POP.TOTL 1018254 1043119 1069236 1096638 1125354 1155389 1186782 1219547 1253761 1289519 1326894 1365891 1406510 1448632 1492055 1536658 1582361 1629218 1677326 1726865 1777932 1830629 1884873 1940454 1996994 2054308 2112359 2171319 2231462 2293161 2356740 2422312 2489945 2559880 2632345 2707532 2785815 2867283 2951651 3038432 3127420 3217930 3310376 3406915 3510468 3622775 3745143 3876123 4011487 4145400 4273738 4394842 4510197 4622757 4736965 4856093 4980996 5110701 5244363 5380504 5518092
Colombia COL Population, total SP.POP.TOTL 16057714 16567817 17092919 17629978 18175187 18725242 19279734 19837508 20393704 20942453 21480064 22003983 22516429 23024512 23538390 24065502 24608102 25164544 25733669 26312996 26900508 27496608 28101824 28714183 29331230 29951194 30572479 31195417 31822527 32457497 33102569 33758328 34422568 35091272 35758978 36421438 37076387 37723803 38364307 38999468 39629965 40255956 40875363 41483872 42075953 42647731 43200901 43737512 44254972 44750054 45222699 45662747 46075721 46495492 46967706 47520667 48175048 48909844 49661056 50339443 50882884
Comoros COM Population, total SP.POP.TOTL 191122 194149 197202 200371 203762 207429 211477 215900 220574 225324 230055 234645 239229 244207 250107 257285 265958 275903 286628 297451 307831 317617 326944 336088 345455 355337 365765 376647 387964 399638 411598 423873 436454 449270 462280 475394 488625 501955 515382 528853 542358 555895 569480 583213 597230 611625 626427 641624 657227 673251 689696 706578 723865 741511 759390 777435 795597 813890 832322 850891 869595
Cabo Verde CPV Population, total SP.POP.TOTL 201770 205321 210141 216087 222949 230421 238655 247522 256169 263461 268633 271315 271841 271068 270228 270240 271345 273335 276182 279729 283848 288678 294244 300226 306140 311668 316613 321137 325744 331180 337953 346229 355763 366057 376409 386288 395533 404248 412513 420456 428178 435701 442955 449925 456619 463034 469171 475067 480846 486667 492644 498858 505241 511740 518276 524740 531140 537499 543764 549936 555988
Costa Rica CRI Population, total SP.POP.TOTL 1330787 1381187 1433346 1486555 1539942 1592834 1645076 1696742 1747690 1797891 1847394 1896074 1944170 1992522 2042242 2094186 2148681 2205618 2264942 2326460 2389973 2455593 2523358 2593015 2664220 2736719 2810245 2884856 2960932 3039015 3119436 3202083 3286525 3372298 3458829 3545524 3632361 3718952 3803893 3885428 3962369 4034074 4100922 4164053 4225156 4285504 4345421 4404626 4463123 4520739 4577371 4633086 4688003 4742111 4795390 4847805 4899336 4949955 4999443 5047561 5094114
Caribbean small states CSS Population, total SP.POP.TOTL 4194711 4274052 4353623 4432240 4508189 4580382 4648353 4712556 4773890 4833839 4893444 4953088 5012600 5071912 5130797 5189174 5246547 5303289 5360562 5419911 5482191 5548521 5617902 5687113 5751508 5808206 5855445 5894820 5930153 5967028 6009262 6058334 6112784 6170324 6227332 6281204 6331277 6378551 6423851 6468461 6513486 6559097 6604966 6650981 6696985 6742657 6788122 6833394 6878973 6925446 6973193 7022367 7072640 7123315 7173443 7222197 7269385 7314956 7358929 7401389 7442291

3.2.2 Reshaping with gather()

df_pop <- raw_pop %>% 
  gather(year, population, `1960`:`2020`) 
Table 3.3: WDI Population Data (Long)
Country Name Country Code Indicator Name Indicator Code year population
Aruba ABW Population, total SP.POP.TOTL 1960 54208
Africa Eastern and Southern AFE Population, total SP.POP.TOTL 1960 130836765
Afghanistan AFG Population, total SP.POP.TOTL 1960 8996967
Africa Western and Central AFW Population, total SP.POP.TOTL 1960 96396419
Angola AGO Population, total SP.POP.TOTL 1960 5454938
Albania ALB Population, total SP.POP.TOTL 1960 1608800
Andorra AND Population, total SP.POP.TOTL 1960 13410
Arab World ARB Population, total SP.POP.TOTL 1960 92197715
United Arab Emirates ARE Population, total SP.POP.TOTL 1960 92417
Argentina ARG Population, total SP.POP.TOTL 1960 20481781
Armenia ARM Population, total SP.POP.TOTL 1960 1874119
American Samoa ASM Population, total SP.POP.TOTL 1960 20127
Antigua and Barbuda ATG Population, total SP.POP.TOTL 1960 54132
Australia AUS Population, total SP.POP.TOTL 1960 10276477
Austria AUT Population, total SP.POP.TOTL 1960 7047539
Azerbaijan AZE Population, total SP.POP.TOTL 1960 3895398
Burundi BDI Population, total SP.POP.TOTL 1960 2797925
Belgium BEL Population, total SP.POP.TOTL 1960 9153489
Benin BEN Population, total SP.POP.TOTL 1960 2431617
Burkina Faso BFA Population, total SP.POP.TOTL 1960 4829289
Bangladesh BGD Population, total SP.POP.TOTL 1960 48013505
Bulgaria BGR Population, total SP.POP.TOTL 1960 7867374
Bahrain BHR Population, total SP.POP.TOTL 1960 162429
Bahamas, The BHS Population, total SP.POP.TOTL 1960 109532
Bosnia and Herzegovina BIH Population, total SP.POP.TOTL 1960 3225664
Belarus BLR Population, total SP.POP.TOTL 1960 8198000
Belize BLZ Population, total SP.POP.TOTL 1960 92068
Bermuda BMU Population, total SP.POP.TOTL 1960 44400
Bolivia BOL Population, total SP.POP.TOTL 1960 3656961
Brazil BRA Population, total SP.POP.TOTL 1960 72179235
Barbados BRB Population, total SP.POP.TOTL 1960 230985
Brunei Darussalam BRN Population, total SP.POP.TOTL 1960 81707
Bhutan BTN Population, total SP.POP.TOTL 1960 223284
Botswana BWA Population, total SP.POP.TOTL 1960 502733
Central African Republic CAF Population, total SP.POP.TOTL 1960 1501668
Canada CAN Population, total SP.POP.TOTL 1960 17909009
Central Europe and the Baltics CEB Population, total SP.POP.TOTL 1960 91401764
Switzerland CHE Population, total SP.POP.TOTL 1960 5327827
Channel Islands CHI Population, total SP.POP.TOTL 1960 109419
Chile CHL Population, total SP.POP.TOTL 1960 8132988
China CHN Population, total SP.POP.TOTL 1960 667070000
Cote d’Ivoire CIV Population, total SP.POP.TOTL 1960 3503559
Cameroon CMR Population, total SP.POP.TOTL 1960 5176920
Congo, Dem. Rep.  COD Population, total SP.POP.TOTL 1960 15248256
Congo, Rep.  COG Population, total SP.POP.TOTL 1960 1018254
Colombia COL Population, total SP.POP.TOTL 1960 16057714
Comoros COM Population, total SP.POP.TOTL 1960 191122
Cabo Verde CPV Population, total SP.POP.TOTL 1960 201770
Costa Rica CRI Population, total SP.POP.TOTL 1960 1330787
Caribbean small states CSS Population, total SP.POP.TOTL 1960 4194711

3.2.3 Reshaping with melt()

df_pop2 <- raw_pop %>% 
  reshape2::melt(id.vars=c('Country Name',
                          "Indicator Name",
                          "Indicator Code",
                          'Country Code'))

3.2.4 Transforming Variables with mutate()

The countrycode package allows us to create several country codes (e.g. ISO3) from country ID variables (e.g. the name in English). Below, we mutate() a new variable we will call country_code and use the countrycode to output ISO2 codes from the 3 letter codes used by the World Bank. Finally, we select() only the variables we want.

library(countrycode)
df_pop <- df_pop %>% 
  # We mutate() a new variable
  mutate(country_code=countrycode(`Country Code`, # A variable in our df
                                  origin = 'wb', #The WB 3-letter classification input
                                  destination = 'iso2c'#The ISO 2 character output
                                  )) %>% 
  select(country_code, year, population)
Table 3.4: Mutating an ISO2 country code
country_code year gdp_pc_usd
FR 2020 39030.36
FR 2019 40578.64
FR 2018 41572.49
FR 2017 38685.26
FR 2016 37037.37
FR 2015 36638.18
FR 2014 43011.26
FR 2013 42592.94
FR 2012 40874.70
FR 2011 43790.73
FR 2010 40638.34
FR 2009 41575.42
FR 2008 45334.12
FR 2007 41508.43
FR 2006 36443.63
FR 2005 34760.19
FR 2004 33741.26
FR 2003 29568.39
FR 2002 24177.34
FR 2001 22433.55
FR 2000 22364.03
FR 1999 24673.20
FR 1998 24974.27
FR 1997 24228.95
FR 1996 26871.83
FR 1995 26890.22
DE 2020 46208.43
DE 2019 46794.90
DE 2018 47950.18
DE 2017 44542.30
DE 2016 42107.52
DE 2015 41086.73
DE 2014 47959.99
DE 2013 46285.76
DE 2012 43858.36
DE 2011 46644.78
DE 2010 41531.93
DE 2009 41485.90
DE 2008 45427.15
DE 2007 41587.21
DE 2006 36323.45
DE 2005 34507.37
DE 2004 34044.05
DE 2003 30243.58
DE 2002 25077.73
DE 2001 23607.88
DE 2000 23635.93
DE 1999 26725.92
DE 1998 27289.06
DE 1997 26964.05

3.2.5 Merging data sets with left_join

Now let´s get another data set from WDI, this time on GDP in current USD or NY.GDP.MKTP.CD based on the World Bank’s nomenclature. Below, we store our raw data in the object raw_gdp and then create a clean one (df_gdp) where we select() three variables, rename() two of them, and finally filter() for the rows where the values of the column country_code are in the vector c("FR","DE","ES","HU","IT","GR") and where the year is in the range 1995:2020.

raw_gdp <- WDI(indicator = "NY.GDP.MKTP.CD") 

df_gdp <- raw_gdp %>% 
  select(iso2c, year, NY.GDP.MKTP.CD) %>% 
  rename(gdp_usd = NY.GDP.MKTP.CD, 
         country_code = iso2c) %>% 
  filter(country_code %in% c("FR","DE","ES","HU","IT","GR"),
         year %in% 1995:2020)

This results in a new data set df_gdp which has 156 rows and 3 columns which correspond to the variables: country_code, year, gdp_usd. Now lets try to merge this with our data on population stored in df_wdi. Before we merge them together, however, we need to consider some steps:

  • Are our data sets tidy?
  • Do our data sets have overlapping variables?
  • Are the overlapping variables of the same class()?
  • Do these variables have overlapping values?
  • Do the number of observations match?
  • If not, which data set is more important?
  • If neither, which is longer?
Step Data Set 1 Data Set 2
Name df_gdp df_pop
Is it tidy? TRUE TRUE
Number of Variables 3 3
Number of Rows 156 16226
Main Data Set TRUE FALSE
Overlapping Variable(s) countr_code year countr_code year
O. Var. 1 class character character
O. Var. 2 class integer character

We know from the table above that our two data sets are tidy (because we tidied them ourselves above), that we have overlapping variables, and that we care more about df_gdp despite it being shorter. However, in the final row of our checklist table, you will notice a small issue. We have a variable year in both data sets, but they are not of the same class(). This will create some problems if we try to merge our data frames.

df_new <- left_join(df_gdp, df_pop)
## Joining, by = c("country_code", "year")
## Error: Can't join on `x$year` x `y$year` because of incompatible types.
## i `x$year` is of type <integer>>.
## i `y$year` is of type <character>>.

This error is easily fixed, let’s simply overwrite our df_pop$year variable as.integer().

df_pop$year <- as.integer(df_pop$year)

Now that this variable is an integer, we can use the command left_join() to join our main data set or Master df_gdp on the left with the supplementary one gdp_pop on the right and save the output in a new data set called df_new.

df_new <- left_join(df_gdp, df_pop)
## Joining, by = c("country_code", "year")

Notice that we kept the number of observations of our master data set df_gdp, so our merged data set has 156 rows. Now that our data is both tidy and merged, we can do some plotting.

Table 3.5: Our Merged Data Frame
country_code year gdp_usd population
FR 2020 2630317731455 67391582
FR 2019 2728870246706 67248926
FR 2018 2789593979065 67101930
FR 2017 2588740901640 66918020
FR 2016 2471285607082 66724104
FR 2015 2438207896252 66548272
FR 2014 2852165760630 66312067
FR 2013 2811077725704 65998685
FR 2012 2683825225093 65659814
FR 2011 2861408170265 65342789
FR 2010 2642609548930 65027505
FR 2009 2690222283968 64707035
FR 2008 2918382891460 64374979
FR 2007 2657213249384 64016227
FR 2006 2318593651988 63621376
FR 2005 2196126103718 63179356
FR 2004 2115742488205 62704901
FR 2003 1840480812641 62244880
FR 2002 1494286655374 61805266
FR 2001 1376465324385 61357432
FR 2000 1362248940483 60912500
FR 1999 1492647560196 60496708
FR 1998 1503108739159 60186284
FR 1997 1452884917959 59964841
FR 1996 1605675086550 59753095
FR 1995 1601094756210 59541904
DE 2020 3846413928654 83240525
DE 2019 3888326788627 83092962
DE 2018 3975347237443 82905782
DE 2017 3681732583769 82657002
DE 2016 3467498002104 82348669
DE 2015 3356235704120 81686611
DE 2014 3883920155292 80982500
DE 2013 3732743446219 80645605
DE 2012 3527344944140 80425823
DE 2011 3744408602684 80274983
DE 2010 3396354075664 81776930
DE 2009 3397791053070 81902307
DE 2008 3730027830672 82110097
DE 2007 3421229126745 82266372
DE 2006 2992196713085 82376451
DE 2005 2845802760851 82469422
DE 2004 2809187981127 82516260
DE 2003 2496128668172 82534176
DE 2002 2068624129494 82488495
DE 2001 1944107382550 82349925
DE 2000 1943145384190 82211508
DE 1999 2194204133816 82100243
DE 1998 2238990774703 82047195
DE 1997 2211989623280 82034771
df_new %>%
  filter(country_code %in% c("DE", "IT", "ES", "FR")) %>% 
  mutate(population = population/1000000,
         gdp_usd = gdp_usd/1000000000) %>% 
  ggplot(aes(x = year))+
  geom_line(aes(group = country_code,
                y = gdp_usd,
                color = "GDP USD"),
            size = 1)+
  scale_color_manual(values = cb_palette)+
  scale_y_continuous(labels = scales::dollar_format(),
                     sec.axis = sec_axis(~ . / 100, 
                                         name = "Population in Millions",
                                         labels = scales::comma))+
  geom_line(aes(group = country_code,
                y = population * 100,
                color = "Population"),
            size = 1)+
  labs(title = "Population and GDP in Europe",
       y = "GDP in Billions",
       x=NULL,
       color="Indicator:")+
  facet_wrap(~country_code)+
  theme_classic()+
  theme(legend.position = "bottom")

3.3 Text cleaning with stringr

The stringr package is included with the tidyverse and offers a wide array of functions that will help us work with text data or strings.

text_vector <- c("Mr. John", "Mrs. Laura", "Sr. Frogs",
                 "Srta. Penelope", "Mr. Karl", "Sra. Mirza 123",
                 "Hans")

class(text_vector)
## [1] "character"

3.3.1 General cleaning

str_to_lower(text_vector)
## [1] "mr. john"       "mrs. laura"     "sr. frogs"      "srta. penelope"
## [5] "mr. karl"       "sra. mirza 123" "hans"
str_remove_all(text_vector, "[:punct:]")
## [1] "Mr John"       "Mrs Laura"     "Sr Frogs"      "Srta Penelope"
## [5] "Mr Karl"       "Sra Mirza 123" "Hans"
str_remove_all(text_vector, "[:digit:]")
## [1] "Mr. John"       "Mrs. Laura"     "Sr. Frogs"      "Srta. Penelope"
## [5] "Mr. Karl"       "Sra. Mirza "    "Hans"
str_remove_all(text_vector, "[:punct:]|[:digit:]")
## [1] "Mr John"       "Mrs Laura"     "Sr Frogs"      "Srta Penelope"
## [5] "Mr Karl"       "Sra Mirza "    "Hans"
str_length(text_vector)
## [1]  8 10  9 14  8 14  4

3.3.2 Matching

str_detect(text_vector, "Mr")
## [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
str_detect(text_vector, "Mr[:punct:]")
## [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
str_detect(text_vector, "Mr(?!s)")
## [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
str_detect(text_vector, "(Mr|Mrs)[:punct:]")
## [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE

3.3.3 Mutating Strings

str_remove_all(text_vector, "Mr")
## [1] ". John"         "s. Laura"       "Sr. Frogs"      "Srta. Penelope"
## [5] ". Karl"         "Sra. Mirza 123" "Hans"
str_remove_all(text_vector, "Mr. ")
## [1] "John"           "Mrs. Laura"     "Sr. Frogs"      "Srta. Penelope"
## [5] "Karl"           "Sra. Mirza 123" "Hans"
str_remove_all(text_vector, "Mr. |Mrs. |Sra. |Srta. |Sr. ")
## [1] "John"      "Laura"     "Frogs"     "Penelope"  "Karl"      "Mirza 123"
## [7] "Hans"
str_sub(text_vector, 1,3)
## [1] "Mr." "Mrs" "Sr." "Srt" "Mr." "Sra" "Han"
str_extract(text_vector, "Mr. |Mrs. |Sra. |Srta. |Sr. ")
## [1] "Mr. "   "Mrs. "  "Sr. "   "Srta. " "Mr. "   "Sra. "  NA

3.3.4 Building a data set

df <- as.data.frame(text_vector)
df$language <- case_when(
  str_detect(df$text_vector, "(Mr|Mrs)[:punct:]")~"English",
  str_detect(df$text_vector, "(Sr|Sra|Srta)[:punct:]")~"Spanish",
  TRUE~NA_character_
)
## [1] "English" "English" "Spanish" "Spanish" "English" "Spanish" NA
df <- df %>% 
  mutate(
    gender = case_when(
      str_detect(text_vector, "(Mr|Sr)[:punct:]")~"Male",
      str_detect(text_vector, "(Sra|Srta|Mrs)[:punct:]")~"Female",
      TRUE~"Unknown"      
    ),
    first_name = str_remove_all(text_vector, "Mr. |Mrs. |Sra. |Srta. |Sr. |[:digit:]")
  )
text_vector language gender first_name
Mr. John English Male John
Mrs. Laura English Female Laura
Sr. Frogs Spanish Male Frogs
Srta. Penelope Spanish Female Penelope
Mr. Karl English Male Karl
Sra. Mirza 123 Spanish Female Mirza
Hans NA Unknown Hans

  1. Keep in mind that APIs or application programming interfaces are fickle things, the data they offer and the requirements may -alas- change without warning.]↩︎