1.5 Tools and software we use
1.5.1 R: Why use it?
- Free and open source (think of science in developing countries)
- Good online documentation
- Lively community of users (forums etc.)
- Pioneering role
- Visualization capabilities
- Intuitive
- Cooperates with other programs
- Used across wide variety of disciplines
- Object-oriented programming language
- Popularity (see the popularity statistics on books, blogs, forums)
- RStudio as powerful integrated development environment (IDE) for R
- Is evolving into a scientific work suite that optimizes workflow (replication, reproducibility, etc.)
- Institutions/people (Gary King, Andrew Gelman etc.)
- Economic power (Revolution Analytics, Microsoft R Open)
- Python is the only real competitor, and it can be used from R (e.g. via the reticulate package) [1]
1.5.2 R: Where/how to study?
If you haven’t used R so far, you need to learn some basics. As a participant of the seminar you get 6 months of access to all the courses on DataCamp. DataCamp has become the go-to site for self-studying various data science skills (mostly software).
- See this site for an overview of the R courses they offer. Basically, DataCamp offers a track “Data Scientist with R”.
- While the introduction is free for everyone, you also have access to all other courses for six months.
If you like, you can also have a look at the other options below, but I would recommend that you start with DataCamp.
- Try R: A short interactive intro to the language can be found here: http://tryr.codeschool.com/
- Swirl: Learn R interactively within R itself: http://swirlstats.com/
1.5.3 R: Installation and setup
Below are some notes on the installation and setup of R and relevant packages on your own computer:
Install Rtools for Windows machines from CRAN (https://cran.r-project.org/bin/windows/Rtools/). If you are using OS X, you will need to install Xcode, available for free from the App Store. Either one provides a compiler (if you don’t already have one), which is needed when installing packages from GitHub that require compilation from C++ source code.
Install the latest version of R from CRAN (https://cran.r-project.org/).
Install the latest version of RStudio (https://www.rstudio.com/products/RStudio/). RStudio is the editor we’ll rely on, i.e. we’ll write code in RStudio which is subsequently sent to and run within R.
Start RStudio, then install and load the latest versions of the various packages that we need.
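The last step can be sketched as a small helper that installs a package only if it is missing and then loads it. This is a minimal sketch, not the seminar's official setup script: the function name `install_and_load` and the example package list are assumptions made for illustration, since the actual list of packages is not given here.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
install_and_load <- function(pkgs, repos = "https://cloud.r-project.org") {
  for (pkg in pkgs) {
    # requireNamespace() returns FALSE if the package is not installed
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg, repos = repos)
    }
    # character.only = TRUE lets us pass the package name as a string
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Example call (package names are placeholders, not the course's list):
# install_and_load(c("rmarkdown", "reticulate"))
```

Run the example call once in the RStudio console; on later runs the already installed packages are simply loaded.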
You may also read up on how to create and “knit” RMarkdown files. Essentially, such files allow you to integrate the analyses you conduct with the text you write, which is ideal for reproducibility. Here is an intro to the concept and a simple example: http://rmarkdown.rstudio.com/lesson-1.html.
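To make the idea concrete, here is a minimal sketch that writes a tiny RMarkdown file to a temporary location and knits it from code. The file content and the variable names (`rmd_file`, `out_file`) are made up for illustration; in practice you would usually create the file in RStudio and press the “Knit” button. The sketch assumes the rmarkdown package and pandoc (which ships with RStudio) are available and only prints a hint otherwise.

```r
# Build the chunk delimiter programmatically (three backticks)
fence <- strrep("`", 3)

# Write a minimal RMarkdown document: a YAML header, some text, one code chunk
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"A minimal example\"",
  "output: html_document",
  "---",
  "",
  "Text and analysis live in the same file.",
  "",
  paste0(fence, "{r}"),
  "summary(cars$speed)",
  fence
), rmd_file)

# Knit the file if rmarkdown and pandoc are available
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  message("Rendered to: ", out_file)
} else {
  message("Install the rmarkdown package (and pandoc) to knit this file.")
}
```

The resulting HTML file contains both the text and the output of `summary(cars$speed)`, so anyone with the `.Rmd` file can rerun the analysis.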
1.5.4 DataCamp
- Address: https://www.datacamp.com/
- 6-month access to all the courses you like (Q: Did everyone get the invitation?)
- Provides various tracks
- Q: What is your experience with DataCamp? Do you like it?
- Q: What makes DataCamp so powerful? Will it replace humans? (Memory!)
1.5.5 Google Cloud
- We obtained research credits, i.e., money that each of us can use to play around with Google’s cloud products.
- Q: Why does Google (or AWS) provide research credits to use their cloud?
- Steps (at home!)
- I will send you an invitation email.
- Follow the steps in “Google Cloud Setup.gdoc”.
- Download “Google Cloud Research Credits Setup.Rmd” and “nytimes_headlines.csv” from the folder Data & material.
- First try accessing the cloud by adapting and running the code in “Google Cloud Research Credits Setup.Rmd”.
[1] The seminar consists of a mix of theoretical and applied sessions. For the applied sessions we will rely on the software R. While there are various programs one could use, the reasons mentioned above speak for R (my personal view). The only real contender for data science is Python. See here for a nice overview of the differences between the two.