9.1 Lab 9: Basics of character encoding in R
There is a very good short guide by Ista Zahn, "Escaping from character encoding hell in R on Windows". Below we quickly walk through a few essentials.
Detecting the encoding of your system.
Sys.getlocale(category = "LC_CTYPE")
## Find details of the numerical and monetary representations in
## the current locale
Sys.localeconv()
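As a small complementary sketch, base R's l10n_info() reports whether the current session runs in a UTF-8 locale. The variable names is_utf8 and is_multibyte are only illustrative and are not part of the lab.

```r
# Summarise the localisation settings of the running R session.
info <- l10n_info()

# TRUE when the native encoding of the session is UTF-8.
is_utf8 <- isTRUE(info[["UTF-8"]])

# TRUE when the locale uses a multibyte character set (UTF-8 is one).
is_multibyte <- isTRUE(info[["MBCS"]])

message("UTF-8 locale: ", is_utf8)
message("Multibyte character set: ", is_multibyte)
```

If is_utf8 is FALSE, as was typical on older Windows setups, non-ASCII text is more likely to need explicit handling.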
Let’s now consider some text in German that contains a non-ASCII character.
# some text in German
de <- "Einbahnstraße"
# looks fine? Not necessarily: the output depends on the locale and the file encoding
message(de)
de
But what if the file is not saved in UTF-8, so that the text looks garbled when we re-open it? As long as we declare the right encoding, we can switch back and forth.
de <- "Einbahnstra\u00dfe"
Encoding(de)
message(de)
# here we assign the wrong encoding
Encoding(de) <- "latin1"
message(de)
# now back to the right encoding
Encoding(de) <- "UTF-8"
message(de)
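One detail worth checking is that Encoding() only changes the declared encoding of a string, not its bytes. The following minimal sketch inspects the raw bytes with charToRaw() before and after relabelling; the variable names are illustrative.

```r
# A string written with a \u escape is stored as UTF-8.
de <- "Einbahnstra\u00dfe"
bytes_utf8 <- charToRaw(de)

# Declaring the wrong encoding relabels the string but leaves the bytes alone.
Encoding(de) <- "latin1"
bytes_relabelled <- charToRaw(de)

# The two byte sequences are identical; only R's interpretation differed.
unchanged <- identical(bytes_utf8, bytes_relabelled)
print(unchanged)

# Restore the correct declaration.
Encoding(de) <- "UTF-8"
```

This is why switching back to "UTF-8" recovers the text: nothing was ever converted.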
We can also use the stringi package to fix this:
library(stringi)
# the doubled backslash keeps the escape literal, so stringi has something to unescape
stri_unescape_unicode("Einbahnstra\\u00dfe")
stri_unescape_unicode("Einbahnstraße")
If you want to translate a string from one encoding to another in a single line of code, you can use iconv:
de <- "Einbahnstra\xdfe"
Encoding(de)
iconv(de, from="windows-1252", to="UTF-8")
de <- "Einbahnstra\u00dfe"
de <- iconv(de, from="UTF-8", to="latin1")
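Unlike relabelling with Encoding(), iconv() rewrites the bytes themselves. The sketch below compares the byte sequences before and after conversion and then converts back; the variable names are illustrative.

```r
# UTF-8 stores the sharp s as two bytes (c3 9f).
utf8 <- "Einbahnstra\u00dfe"
utf8_bytes <- charToRaw(utf8)

# Latin-1 stores the same character as a single byte (df).
latin1 <- iconv(utf8, from = "UTF-8", to = "latin1")
latin1_bytes <- charToRaw(latin1)

print(utf8_bytes)
print(latin1_bytes)

# Converting back restores the original UTF-8 bytes.
round_trip <- iconv(latin1, from = "latin1", to = "UTF-8")
same_bytes <- identical(charToRaw(round_trip), utf8_bytes)
```

The round trip is lossless here because every character in the string exists in both encodings.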
You’re probably wondering now: how do we know the encoding of some text we want to analyze? Good question! It turns out to be a hard problem, but we can use the guess_encoding function in the rvest package (which uses stri_enc_detect in the stringi package) to try to figure it out.
library(rvest)
de <- "Einbahnstra\xdfe"
stri_enc_detect(de)
guess_encoding(de)
iconv(de, from="ISO-8859-1", to="UTF-8")
de <- "Einbahnstra\u00dfe"
stri_enc_detect(de)
guess_encoding(de)
message(de) # no need for translation!
x <- data.frame(text = c("Einbahnstra\xdfe", "Einbahnstra\u00dfe"))
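A column that mixes encodings, like the one above, can be repaired element by element. The following is a minimal sketch using base R's validUTF8(). It assumes that every entry that is not valid UTF-8 is Latin-1, which holds for this example but not in general. The Latin-1 string is built from raw bytes so that the example does not depend on how the script file itself is encoded.

```r
# Build the Latin-1 version byte by byte (0xdf is the sharp s in Latin-1).
latin1_text <- rawToChar(c(charToRaw("Einbahnstra"), as.raw(0xdf), charToRaw("e")))

x <- data.frame(
  text = c(latin1_text, "Einbahnstra\u00dfe"),
  stringsAsFactors = FALSE
)

# Flag the entries whose bytes are not valid UTF-8.
bad <- !validUTF8(x$text)

# Assumption: the invalid entries are Latin-1, so convert only those.
x$text[bad] <- iconv(x$text[bad], from = "latin1", to = "UTF-8")

print(validUTF8(x$text))
```

When the source encoding is unknown, the from argument would have to come from a guess such as the output of stri_enc_detect.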
The same applies to websites… (Although you can also check the <meta> tag for clues.)
url <- "http://www.presidency.ucsb.edu/ws/index.php?pid=96348"
guess_encoding(url)
url <- "http://www.spiegel.de"
guess_encoding(url)
url <- "http://www.elpais.es"
guess_encoding(url)
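To illustrate the <meta> tag clue mentioned above, here is a rough sketch that pulls the declared charset out of an HTML string with a regular expression. The HTML snippet is made up for the example, and a regular expression is only a heuristic: a real page may declare its encoding differently, or not at all.

```r
# Hypothetical page source; a real one would be downloaded first.
html <- '<html><head><meta charset="ISO-8859-1"><title>Beispiel</title></head></html>'

# Find the first charset declaration, with or without quotes.
pattern <- "charset\\s*=\\s*[\"']?[A-Za-z0-9_-]+"
hit <- regmatches(html, regexpr(pattern, html, ignore.case = TRUE, perl = TRUE))

# Strip everything up to and including the equals sign and optional quote.
declared <- sub(".*=\\s*[\"']?", "", hit)
print(declared)
```

The declared value can then be passed as the from argument of iconv, keeping in mind that pages sometimes declare an encoding they do not actually use.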