Automated Content Analysis
Chapter 1 Introduction
Throughout these seminars we will use R. R is an open-source programme that allows you to carry out a wide variety of statistical tasks. At its core, it is an implementation of the S language with semantics borrowed from Scheme, which makes it very flexible. R is available for Windows, Linux and OS X and receives regular updates. In its basic version, R uses a simple command line interface. To give it a friendlier look, environments such as RStudio and R Commander are available. Apart from looking better, these environments also provide some extra practical features. For this course, we will use RStudio. How to install R depends on your system.
1.0.1 R on Windows
To download R for Windows, go to the website (https://cran.r-project.org/bin/windows/base/), download the installer, and double-click it to run it. Whilst installing, it is best to leave the standard options (such as the installation folder) unchanged. This makes it easier for other programmes to know where to find R. Once installed, you will find two shortcuts for R on your desktop. These refer to the two versions of R that come with the installation - the 32-bit and the 64-bit version. Which version you need depends on your version of Windows. To see which version of Windows you have, go to This PC (or My Computer), right-click it, and select Properties. Here you should find the version of Windows installed on your PC. If you have the 64-bit version of Windows, you can use either version of R. Yet, it is best to use the 64-bit version, as it makes better use of your computer's memory and thus runs more smoothly. If you have the 32-bit version of Windows, you have to use the 32-bit version of R.
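If you are unsure which of the two versions a running R session belongs to, R can report this itself. The lines below are a minimal sketch using only base R; the variable names are our own.

```r
# Pointer size in bytes: 8 on a 64-bit build of R, 4 on a 32-bit build.
pointer_bytes <- .Machine$sizeof.pointer
r_bits <- pointer_bytes * 8
cat("This is a", r_bits, "bit build of R\n")

# The architecture string that R reports about itself.
r_arch <- R.version$arch
print(r_arch)
```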
To install RStudio, again go to the website (https://www.rstudio.com/products/rstudio/download/), and download the free version of RStudio at the bottom of the page. Make sure to choose Installers for Supported Platforms and pick the option for Windows. Once downloaded, install the programme, leaving all settings unchanged. If everything works out fine, RStudio will have found your installation of R and placed a shortcut on the desktop. Whether you have the 32-bit or 64-bit version of Windows or R does not matter for RStudio. What does matter are slashes. R uses forward slashes (/) instead of the backslashes (\) that Windows uses. Thus, whenever you specify a folder or file within R, make sure to replace the backslashes with forward slashes. So, you should refer to a file which in Windows has the address C:\Users\Desktop\data.csv as C:/Users/Desktop/data.csv.
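The reason for this is that R treats a single backslash inside a string as the start of an escape sequence. The sketch below shows the example path written in ways that R accepts. The doubled backslashes and the gsub() conversion are additions of ours, not part of the instructions above, and the file itself is hypothetical.

```r
# Option 1: write the path with forward slashes, as recommended above.
path_forward <- "C:/Users/Desktop/data.csv"

# Option 2: keep the backslashes, but double each one so R reads it literally.
path_escaped <- "C:\\Users\\Desktop\\data.csv"

# Convert a backslash path to forward slashes.
# fixed = TRUE makes gsub() treat the pattern as a plain backslash,
# not as a regular expression.
path_converted <- gsub("\\", "/", path_escaped, fixed = TRUE)
print(path_converted)

# file.path() builds a path from its parts using forward slashes.
path_built <- file.path("C:", "Users", "Desktop", "data.csv")
```

Any of these strings can then be passed to a function that reads files, such as read.csv().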
1.0.2 R on Linux
As there are many flavours of Linux, these instructions are rather general and depend on exactly what type of OS you are running. In most cases, R is already part of your Linux distribution. You can check this by opening a terminal and typing R. If installed, R will launch in the terminal. If R is not part of your system, you can use the Synaptic Package Manager and look for the r-base-dev and r-base packages. Select them, and install them. To install RStudio, go to https://www.rstudio.com/products/rstudio/download/. At the bottom of the page, pick the installer that corresponds to your OS. The installer file is a .deb file that should install as soon as you double-click it. After running the launcher, you can find RStudio in the Dash.
1.0.3 R on OS X
With OS X it is important that you have OS X 10.6 (Snow Leopard) or above. To check this, click on the Apple icon in the top-left of your screen and then on the “About This Mac” option. A window should then appear that tells you which version of OS X (or macOS as it is called now) you have. Installing R on an older version is still possible, but you cannot use a certain number of packages (such as some we use here). To install R, go to https://cran.r-project.org/index.html and click Download R for (Mac) OS X. Once there, download the .pkg file that corresponds to your version of OS X. In addition, download the XQuartz file listed next to the package. Install both and leave the selected options as they are. After the installation, check if R works by launching the programme. If so, go to https://www.rstudio.com/products/rstudio/download/ and download the OS X version at the bottom of the page. Download, install and check whether the programme works. If so, you are all set.
1.1 A Note about Packages
Before we move on, a quick word about the packages R uses. All of the packages we have used so far have been “officially” released. This means that the package has been released on CRAN - the Comprehensive R Archive Network. This network is nothing more than a website that collects and hosts all the material R needs, such as the different distributions, packages, and more. Any package released on CRAN has gone through a vetting process of sorts to ensure that it does not contain any major bugs, has README and NEWS files, and has a clear version number. Not only does this make it likely that the package will work out of the box, having a package hosted on CRAN also allows us to install it easily using the install.packages command or the Packages tab in RStudio (as these are linked to CRAN). Also, having a package hosted on CRAN means that the package is regularly updated and will keep functioning even as other packages it might use are changed or updated. Packages that are no longer maintained are removed from CRAN after a while and - if installed anyway - might no longer work because the packages they depend on have changed. This is the main reason that we stopped using the - rather well-written - RTextTools package to carry out supervised methods: it was no longer maintained and kept crashing on the latest builds of R.
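As an illustration of installing from CRAN, the sketch below installs a package only when it is not yet present. The helper install_if_missing() is a hypothetical name of our own, not a function that ships with R. It is called here on stats, which comes with every R installation, so nothing is downloaded.

```r
# Hypothetical helper: install a CRAN package only if it is not yet installed.
install_if_missing <- function(pkg) {
  # requireNamespace() returns FALSE, rather than stopping, when pkg is absent.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from the CRAN mirror set for the session
    return(invisible(TRUE))
  }
  invisible(FALSE)
}

# stats is part of every R installation, so this call installs nothing.
installed_now <- install_if_missing("stats")
print(installed_now)

# For a CRAN package used in this course, the call would look like this:
# install_if_missing("quanteda")
```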
An alternative for CRAN is Github (https://github.com/). Here, developers can upload packages that are still in development and thus are not yet eligible to be released on CRAN. The advantage of Github is that it allows us to download the latest packages and newest functions, though at the price of the packages being generally unstable. Also, it is slightly more complicated to install Github packages. To do so, we require a package known as devtools. This package then allows us to install packages hosted on Github:
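A minimal sketch of this two-step pattern is given below. The repository name "username/repository" is a placeholder, not a real package, and the run_install switch is our own addition so that nothing is downloaded by accident.

```r
# Set to TRUE to actually download and install; FALSE keeps this a dry run.
run_install <- FALSE

# Placeholder of the form "owner/repository" as it appears in the Github URL.
github_repo <- "username/repository"

if (run_install) {
  install.packages("devtools")  # step 1: get devtools from CRAN (only once)
  library(devtools)             # step 2: attach devtools ...
  install_github(github_repo)   # ... and install the package from Github
}
```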
Note that devtools can be a difficult package to install. This is because the package was conceived for developers, who usually know how to tackle installation problems. Yet, most often the problems that come with the installation can be solved easily, depending on the system we are working with.
1.1.1 For Mac OS
To run on OS X (or macOS), devtools requires the Xcode package to be installed on the computer. To do this, follow these steps:
- Launch the Terminal (which you can find in /Applications/Utilities/)
- In the Terminal, type: xcode-select --install
- A software update window should pop up. If correct, it will ask you something like: “The xcode-select command requires the command line developer tools. Would you like to install the tools now?”. Respond to this by clicking “Install” and agree to the Terms of Service.
- Wait for the download to be complete. If everything goes as it should, the installer should go away when everything is done.
- Then, go to R and run the install.packages("devtools") command
Beyond R, Xcode also lets you build and maintain software in many other programming languages on macOS, such as C++, Java, Python and Ruby. If you are interested, open Xcode to see what you can do with it.
1.1.2 For Windows
For Windows, the only software devtools requires is known as Rtools. To install this, simply go to its website (https://cran.r-project.org/bin/windows/Rtools/), download the latest recommended version (in green), and install it. Then re-open R and install devtools.
1.1.3 For Linux
Whether and how devtools installs on Linux depends on the version of Linux that you have. Most often, you will be fine and R will simply ask you whether you wish to install any missing packages that devtools requires. Other times, the installation will fail and an error message (in red) will appear telling you which packages you are missing. Sometimes installing these packages manually and trying to install devtools again solves the problem. When this still fails, there are two options you can try. The first is to install a number of dependencies through the Terminal. To do so, open it, and type: sudo apt install build-essential libcurl4-gnutls-dev libxml2-dev libssl-dev. Then, close the terminal, open R again, and try the installation of devtools. A simpler option is to use the Synaptic Package Manager (which makes installing packages in Ubuntu that much easier) and search for cran-devtools or just devtools. Then simply select the package, click install, and wait until everything is done. When you then return to R, devtools should be installed.
1.2 More on Installing Packages
In this example, we will use a method known as Support Vector Machines. This method is not (yet) included in quanteda, but is included in quanteda.classifiers, a future extension of quanteda that is currently under development. As such, it is not hosted on CRAN yet and we have to download it from Github. To do so, type:
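A minimal sketch of these commands follows. It assumes that the package lives in the quanteda/quanteda.classifiers repository on Github and that quanteda and devtools are loaded first, which is what the start-up messages shown next suggest. The run_install switch is our own addition to keep the block from downloading anything unless asked.

```r
# Set to TRUE to actually download and install; FALSE keeps this a dry run.
run_install <- FALSE

# Assumed location of the package on Github ("owner/repository").
classifiers_repo <- "quanteda/quanteda.classifiers"

if (run_install) {
  library(quanteda)                 # prints the quanteda start-up messages
  library(devtools)                 # prints "Loading required package: usethis"
  install_github(classifiers_repo)  # builds and installs the package
}
```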
## Package version: 1.5.2
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
##     View
## Loading required package: usethis
Besides this standard way of installation, there is an alternative way to install these packages: using the “::”, or double colon, operator:
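A sketch of what this looks like is given below, again assuming the quanteda/quanteda.classifiers repository and guarded by a run_install switch of our own. The second half shows the same idea with the tools package, which ships with R but is not attached in a default session, so the effect can be checked without installing anything.

```r
# Set to TRUE to actually download and install; FALSE keeps this a dry run.
run_install <- FALSE

if (run_install) {
  # Calls install_github() from devtools without a prior library(devtools).
  devtools::install_github("quanteda/quanteda.classifiers")
}

# The same mechanism with a package that comes with R.
attached_before <- "package:tools" %in% search()
extension <- tools::file_ext("data.csv")  # call one function via "::"
attached_after <- "package:tools" %in% search()

print(extension)
# Using "::" does not change whether the package is on the search path.
print(identical(attached_before, attached_after))
```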
Not only does this make the code shorter (which is most often a plus), it also prevents R from loading the devtools package as a whole. Instead, what “::” does is call the install_github function directly from the devtools package and carry it out, without loading the devtools package. This comes in handy when we want to be very clear about which function we are using. The reason for this is that different packages in R often use the same names for their commands. For example, both the quanteda and the utils package have a function called View. When this happens, we see the following message when we load the package: “The following object is masked from ‘package:utils’: View”. This means that the new package we just loaded contains one or more functions whose names are identical to those of functions in packages we have already loaded. The message is there to inform us that, from then on, referring to these functions makes R look at the new package and not at the previous ones. To see which packages you have loaded, type:
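One way to do this, sketched with base R only, is to call .packages(). The output that follows matches the shape of its result. The variable name attached_pkgs is our own.

```r
# .packages() returns its result invisibly; the outer parentheses print it.
(.packages())

# The same information stored in a variable for later use.
attached_pkgs <- .packages()

# search() shows the full search path, including the global environment.
search()
```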
##  [1] "devtools"  "usethis"   "quanteda"  "stats"     "graphics"  "grDevices"
##  [7] "utils"     "datasets"  "methods"   "base"
Here you see the order in which you loaded the packages, with the latest package occurring first. As quanteda was one of the packages we loaded last, if we now use the View() command, R will carry out the version of this command that is available in quanteda. In order to use the version in utils, we would have to type utils::View().
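The same effect can be reproduced without installing anything. In the sketch below, which is an illustration of ours rather than something quanteda does, we define our own function called head, which masks the head function from utils. A plain call finds our version first, while utils::head() still reaches the original.

```r
# Define a function with the same name as utils::head().
head <- function(x, ...) "my own head()"

# A plain call finds the most recently defined version first.
masked_result <- head(letters)

# The "::" operator bypasses the masking and calls the utils version.
explicit_result <- utils::head(letters, 3)

print(masked_result)
print(explicit_result)

# Remove our version again so that head() refers to utils::head() as before.
rm(head)
```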
Thus, using the colons not only prevents accidents from happening, it also means you do not have to load packages you do not want, and it makes clear to people reading your code which packages you are working with, which increases clarity and prevents misunderstandings.