9.1 Lab 9: Basics of character encoding in R
There is a very good short guide by Ista Zahn: Escaping from character encoding hell in R on Windows. Below we quickly talk through a couple of things.
Detecting the encoding of your system.
Sys.getlocale(category = "LC_CTYPE")
## Find details of the numerical and monetary representations in
## the current locale
Sys.localeconv()
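As a small complement, the sketch below checks whether the current session runs in a UTF-8 locale. It relies only on base R (`Sys.getlocale()` and `l10n_info()`); the variable names `ctype`, `info` and `is_utf8` are made up for illustration.

```r
# Character-type part of the locale, e.g. something like "en_US.UTF-8"
ctype <- Sys.getlocale(category = "LC_CTYPE")

# l10n_info() reports, among other things, whether the locale is UTF-8
info <- l10n_info()
is_utf8 <- isTRUE(info[["UTF-8"]])

if (is_utf8) {
  message("Session locale is UTF-8: ", ctype)
} else {
  message("Session locale is NOT UTF-8: ", ctype)
}
```

If the session is not UTF-8 (common on older Windows installations of R), non-ASCII text is more likely to need the explicit handling shown in the rest of this lab.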
Let’s now consider some text in German that contains a non-ASCII character.
# some text in German
de <- "Einbahnstraße"
# all good! Ahhhh NOT!
message(de)
de
But what if the file was not saved as UTF-8, so that the text looks garbled when we save it and re-open it? As long as we declare the right encoding, we can switch back and forth.
de <- "Einbahnstra\u00dfe"
Encoding(de)
message(de)
# here we assign the wrong encoding
Encoding(de) <- "latin1"
message(de)
# now back to the right encoding
Encoding(de) <- "UTF-8"
message(de)
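It may help to see why switching back and forth is lossless: `Encoding<-` only changes the declared encoding (a mark on the string), not the underlying bytes. The following sketch, using base R's `charToRaw()`, illustrates this; the names `bytes_before` and `bytes_after` are illustrative.

```r
de <- "Einbahnstra\u00dfe"

# Bytes of the string as R stores it (a \u escape yields UTF-8 bytes)
bytes_before <- charToRaw(de)

# Declare the wrong encoding: the mark changes, the bytes do not
Encoding(de) <- "latin1"
bytes_after <- charToRaw(de)

# TRUE: only the interpretation of the bytes was changed
identical(bytes_before, bytes_after)

# Restore the correct declaration
Encoding(de) <- "UTF-8"
```

This is also why re-marking cannot repair text whose bytes are genuinely in another encoding; in that case the bytes themselves have to be converted.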
We can also use the stringi package to fix this:
library(stringi)
stri_unescape_unicode("Einbahnstra\u00dfe")
stri_unescape_unicode("Einbahnstraße")
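One subtlety worth illustrating: inside an R string literal, `\u00df` is already turned into ß by the R parser, so unescaping matters mainly when the backslash sequence is literal text, for example after reading it from a file. The sketch below simulates that case with a doubled backslash; `raw_text` and `fixed` are illustrative names, and the example assumes the stringi package is installed.

```r
library(stringi)

# A doubled backslash keeps "\u00df" as six literal characters,
# as it would appear in text read from a file
raw_text <- "Einbahnstra\\u00dfe"
nchar(raw_text)

# Convert the literal escape sequence into the actual character
fixed <- stri_unescape_unicode(raw_text)
nchar(fixed)
message(fixed)

# The reverse direction: turn non-ASCII characters back into escapes
stri_escape_unicode(fixed)
```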
If you want to translate a string from one encoding scheme to another in a single line of code, you can use iconv:
de <- "Einbahnstra\xdfe"
Encoding(de)
iconv(de, from="windows-1252", to="UTF-8")
de <- "Einbahnstra\u00dfe"
de <- iconv(de, from="UTF-8", to="latin1")
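Unlike re-marking, a conversion rewrites the bytes, so it is worth checking the result. The sketch below uses base R's `validUTF8()` before and after converting, and shows the `sub` argument of `iconv()` for input that does not match the declared source encoding. The variable names are illustrative, and the exact behaviour on invalid input can vary somewhat across platforms.

```r
# A string whose bytes are Latin-1: 0xdf is "ß" in that encoding
latin1_bytes <- "Einbahnstra\xdfe"

# These bytes are not a valid UTF-8 sequence
validUTF8(latin1_bytes)

# Convert the bytes to UTF-8 and check again
utf8 <- iconv(latin1_bytes, from = "latin1", to = "UTF-8")
validUTF8(utf8)

# Declaring the wrong source encoding does not raise an error;
# on most platforms the element simply comes back as NA
wrong <- iconv(latin1_bytes, from = "UTF-8", to = "UTF-8")
is.na(wrong)

# sub = "byte" keeps going and marks unconvertible bytes in hex
iconv(latin1_bytes, from = "UTF-8", to = "UTF-8", sub = "byte")
```

Checking for `NA` after a bulk conversion is a cheap way to spot rows where the assumed source encoding was wrong.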
You’re probably wondering now: how do we know the encoding of some text we want to analyze? Good question! It turns out to be a hard problem, but we can use the guess_encoding function in the rvest package (which uses stri_enc_detect in the stringi package; recent versions of rvest call it html_encoding_guess) to try to figure that out…
library(rvest)
de <- "Einbahnstra\xdfe"
stri_enc_detect(de)
guess_encoding(de)
iconv(de, from="ISO-8859-1", to="UTF-8")
de <- "Einbahnstra\u00dfe"
stri_enc_detect(de)
guess_encoding(de)
message(de) # no need for translation!
x <- data.frame(text = c("Einbahnstra\xdfe", "Einbahnstra\u00dfe"))
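The last line above builds a column that mixes Latin-1 and UTF-8 strings. One possible way to normalize such a column is sketched below. The helper `to_utf8()` is hypothetical (it is not part of base R, stringi or rvest), and it assumes that anything that is not valid UTF-8 is Latin-1, which will not hold for every dataset.

```r
# Hypothetical helper: convert only the elements that are not valid UTF-8,
# assuming (!) that those elements are encoded as `fallback`
to_utf8 <- function(x, fallback = "latin1") {
  bad <- !validUTF8(x)
  # Rewrite the bytes of the invalid elements
  x[bad] <- iconv(x[bad], from = fallback, to = "UTF-8")
  # The remaining elements already hold valid UTF-8 bytes: just declare it
  ok <- x[!bad]
  Encoding(ok) <- "UTF-8"
  x[!bad] <- ok
  x
}

mixed <- c("Einbahnstra\xdfe", "Einbahnstra\u00dfe")
fixed <- to_utf8(mixed)

# Both elements should now be the same valid UTF-8 string
validUTF8(fixed)
fixed[1] == fixed[2]
```

Applied to the data frame above, this would be `x$text <- to_utf8(x$text)`. When the source encoding is unknown, the guesses from `stri_enc_detect()` could inform the choice of `fallback`.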
The same applies to websites… (Although you can also check the <meta> tag for clues.)
url <- "http://www.presidency.ucsb.edu/ws/index.php?pid=96348"
guess_encoding(url)
url <- "http://www.spiegel.de"
guess_encoding(url)
url <- "http://www.elpais.es"
guess_encoding(url)
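To illustrate the <meta> tag hint, here is a rough base-R sketch that pulls a declared charset out of HTML source. It works on an inline, made-up HTML string rather than a live page, and the regular expression is a simplification that only looks for the first `charset=...` it finds; real pages may declare the encoding differently or not at all.

```r
# Made-up HTML snippet standing in for the source of a downloaded page
html <- '<html><head><meta charset="ISO-8859-1"><title>Beispiel</title></head></html>'

# Find the first "charset=..." declaration, with or without quotes
m <- regmatches(
  html,
  regexpr("charset[[:space:]]*=[[:space:]]*[\"']?[A-Za-z0-9_-]+", html,
          ignore.case = TRUE)
)

# Strip everything up to and including "=" and an optional quote
charset <- sub(".*=[[:space:]]*[\"']?", "", m)
charset
```

The same pattern also matches the older form `<meta http-equiv="Content-Type" content="text/html; charset=utf-8">`. A declared charset is only a claim by the page author, so comparing it against the detector's guess is still worthwhile.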