9.1 Lab 9: Basics of character encoding in R

There is a very good short guide by Ista Zahn, "Escaping from character encoding hell in R on Windows". Below we walk through a few of the basics.

First, detect the locale of your system, which determines the character encoding R uses by default.

Sys.getlocale(category = "LC_CTYPE")

# find details of the numerical and monetary representations in
# the current locale
Sys.localeconv()
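As a small complement, base R can also report directly whether the session runs in a UTF-8 locale. The following is a minimal sketch using only base R; the variable names (info, is_utf8, ctype) are illustrative and not part of the original lab.

```r
# l10n_info() returns a list describing the character set of the current locale
info <- l10n_info()

# TRUE if the session's native encoding is UTF-8
is_utf8 <- isTRUE(info[["UTF-8"]])

# the LC_CTYPE locale string, which differs across operating systems
ctype <- Sys.getlocale(category = "LC_CTYPE")

message("UTF-8 session: ", is_utf8, " (LC_CTYPE = ", ctype, ")")
```

If is_utf8 is FALSE, non-ASCII text is more likely to need explicit encoding handling of the kind discussed in the rest of this lab.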

Let’s now consider some text in German that contains a non-ASCII character.

# some text in German
de <- "Einbahnstraße"
# looks fine here, but only as long as the file and the session agree on the encoding
message(de)
de
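To see which characters in a string fall outside ASCII, one option is to look at their Unicode code points. This is a minimal sketch in base R; the variable names are illustrative, and the string is written with a \u escape so that the example does not depend on how this file itself is saved.

```r
# the same German word, with the non-ASCII letter written as a Unicode escape
de <- "Einbahnstra\u00dfe"

# split into single characters and look up each Unicode code point
chars <- strsplit(de, "")[[1]]
codes <- vapply(chars, utf8ToInt, integer(1), USE.NAMES = FALSE)

# ASCII covers code points 0 to 127; anything above is non-ASCII
non_ascii <- chars[codes > 127]
non_ascii_codes <- sprintf("U+%04X", codes[codes > 127])

non_ascii        # the single offending character
non_ascii_codes  # "U+00DF"
```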

But what if the file is not saved as UTF-8, so that the text looks garbled when we save it and re-open it? As long as we declare the right encoding, we can switch back and forth.

de <- "Einbahnstra\u00dfe"
Encoding(de)
message(de)

# here we assign the wrong encoding
Encoding(de) <- "latin1"
message(de)

# now back to the right encoding
Encoding(de) <- "UTF-8"
message(de)
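It may help to see that assigning to Encoding() only changes how R interprets the bytes, not the bytes themselves. The sketch below, in base R with illustrative variable names, compares the raw bytes and the character counts before and after the wrong declaration.

```r
de <- "Einbahnstra\u00dfe"

# in UTF-8 the letter "ß" takes two bytes, so bytes and characters differ
n_bytes <- nchar(de, type = "bytes")   # 14
n_chars <- nchar(de, type = "chars")   # 13

# declare the (wrong) encoding on a copy
mislabelled <- de
Encoding(mislabelled) <- "latin1"

# the underlying bytes are untouched ...
same_bytes <- identical(charToRaw(de), charToRaw(mislabelled))  # TRUE

# ... but each byte is now read as one character, hence the garbled output
n_chars_wrong <- nchar(mislabelled, type = "chars")  # 14
```

This is why switching the declaration back to "UTF-8" restores the text: nothing was ever converted.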

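We can also use the stringi package to turn Unicode escape sequences in a string into the characters they stand for.

library(stringi)
# the backslash is doubled so that stringi, not the R parser, resolves the escape
stri_unescape_unicode("Einbahnstra\\u00dfe")
# a string without escape sequences is returned unchanged
stri_unescape_unicode("Einbahnstraße")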

If you want to translate a string from one encoding scheme to another in a single line of code, you can use iconv:

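# bytes as stored in Windows-1252, where \xdf is the letter "ß"
de <- "Einbahnstra\xdfe"
Encoding(de)
iconv(de, from = "windows-1252", to = "UTF-8")
# and the other way round, from UTF-8 to latin1
de <- "Einbahnstra\u00dfe"
de <- iconv(de, from = "UTF-8", to = "latin1")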
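Unlike re-declaring the encoding, iconv actually rewrites the bytes. The following sketch, using base R only and illustrative variable names, makes that visible with charToRaw and checks that a round trip recovers the original bytes.

```r
# the word as latin1 bytes: "ß" is the single byte 0xdf
latin1_text <- "Einbahnstra\xdfe"

# convert to UTF-8: the single byte becomes the two bytes 0xc3 0x9f
utf8_text <- iconv(latin1_text, from = "latin1", to = "UTF-8")

charToRaw(latin1_text)  # ends in ... df 65
charToRaw(utf8_text)    # ends in ... c3 9f 65

# converting back should give the original bytes again
round_trip <- iconv(utf8_text, from = "UTF-8", to = "latin1")
recovered <- identical(charToRaw(round_trip), charToRaw(latin1_text))
```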

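You’re probably wondering now: how do we know the encoding of some text we want to analyze? Good question! It turns out to be a hard problem, but we can use the guess_encoding function in the rvest package (which uses stri_enc_detect in the stringi package, and was renamed html_encoding_guess in rvest 1.0.0) to try to figure that out. Keep in mind that the result is only a guess, and that it is less reliable for short strings.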

library(rvest)
de <- "Einbahnstra\xdfe"
stri_enc_detect(de)
guess_encoding(de)
iconv(de, from="ISO-8859-1", to="UTF-8")

de <- "Einbahnstra\u00dfe"
stri_enc_detect(de)
guess_encoding(de)
message(de) # no need for translation!

x <- data.frame(text = c("Einbahnstra\xdfe", "Einbahnstra\u00dfe"))
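A column that mixes encodings, like the one above, has to be normalised element by element. One possible approach is sketched below in base R: strings that are already valid UTF-8 are kept, and everything else is assumed to be latin1. The helper to_utf8 and its fallback argument are hypothetical names introduced for this illustration, and the latin1 fallback is an assumption that will not hold for every data set.

```r
# a column that mixes a latin1 string and a UTF-8 string
x <- data.frame(text = c("Einbahnstra\xdfe", "Einbahnstra\u00dfe"),
                stringsAsFactors = FALSE)

# hypothetical helper: keep valid UTF-8 as is, convert the rest from `fallback`
to_utf8 <- function(s, fallback = "latin1") {
  ok <- validUTF8(s)          # TRUE where the bytes are already valid UTF-8
  out <- s
  out[!ok] <- iconv(s[!ok], from = fallback, to = "UTF-8")
  Encoding(out[ok]) <- "UTF-8"  # only declare the encoding, bytes stay the same
  out
}

x$text <- to_utf8(x$text)

# both rows now hold the same bytes
identical(charToRaw(x$text[1]), charToRaw(x$text[2]))
```

The check with validUTF8 works reasonably well in practice because text in a single-byte encoding that contains non-ASCII characters rarely happens to be valid UTF-8, but it is a heuristic rather than a guarantee.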

The same applies to websites (although you can also check the <meta> tag in the page's HTML, which often declares the charset).

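# guess_encoding() analyses text, so we download the page content first;
# passing the URL itself would only analyse the URL string
url <- "http://www.presidency.ucsb.edu/ws/index.php?pid=96348"
guess_encoding(paste(readLines(url, warn = FALSE), collapse = "\n"))
url <- "http://www.spiegel.de"
guess_encoding(paste(readLines(url, warn = FALSE), collapse = "\n"))
url <- "http://www.elpais.es"
guess_encoding(paste(readLines(url, warn = FALSE), collapse = "\n"))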
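The clue in the <meta> tag can also be read programmatically. The sketch below uses base R regular expressions on a made-up HTML snippet; the function meta_charset and the example page are hypothetical, and a regular expression is only a rough stand-in for a real HTML parser.

```r
# hypothetical helper: extract the declared charset from an HTML string
meta_charset <- function(html) {
  pattern <- "charset[[:space:]]*=[[:space:]]*[\"']?[A-Za-z0-9_-]+"
  hit <- regmatches(html, regexpr(pattern, html, ignore.case = TRUE))
  if (length(hit) == 0) return(NA_character_)  # no declaration found
  # strip the leading 'charset=' and an optional opening quote
  sub("charset[[:space:]]*=[[:space:]]*[\"']?", "", hit, ignore.case = TRUE)
}

# made-up page header for illustration
page <- paste0(
  "<html><head>",
  "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">",
  "</head><body></body></html>"
)

meta_charset(page)                        # "ISO-8859-1"
meta_charset("<meta charset=\"utf-8\">")  # "utf-8"
```

A declared charset is not always correct, so it is worth comparing it with the guessed encoding when the text still looks garbled.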