Chapter 3 Create and load data sets

In practice, R is used to analyse data. There are two types of data. On the one hand, we have “real” data, where “real” means that the data was observed. On the other hand, we have data that is created by the user. We need such data, for example, to analyse the small-sample behavior of estimators.

3.1 Create data set

We start by considering data that is created by the user. We can create data sets by drawing random numbers with functions like rnorm(), rexp() or runif(). After the data is created, we can work with it as shown in the following:

n = 100  # number of observations

N = rnorm(n, mean = 5, sd = 3)  # draw data from normal distribution
mean(N)
## [1] 5.246211
sd(N)
## [1] 3.041183
E = rexp(n, rate = 3)           # draw data from exponential distribution 
mean(E)
## [1] 0.374657
sd(E)
## [1] 0.3600557
U = runif(n, min = 3, max = 5)  # draw data from uniform distribution
mean(U)
## [1] 3.993143
sd(U)
## [1] 0.6204115
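
Simulated data is useful for studying estimators in small samples, as mentioned at the start of the chapter. The following is a minimal sketch, not a full simulation study: it draws many small samples from the exponential distribution used above and looks at how the sample mean varies around the true mean 1/rate. The seed, the sample size n_small and the number of replications reps are assumptions chosen for illustration.

```r
set.seed(123)    # assumption: fix the seed so the draws can be reproduced

n_small = 10     # assumed small sample size
reps = 1000      # assumed number of simulated samples
rate = 3         # same rate as in the example above

# draw reps samples of size n_small and store the mean of each sample
means = replicate(reps, mean(rexp(n_small, rate = rate)))

true_mean = 1 / rate         # mean of the exponential distribution
mean(means) - true_mean      # simulated bias of the sample mean
sd(means)                    # simulated standard error of the sample mean
true_mean / sqrt(n_small)    # theoretical standard error, (1/rate)/sqrt(n)
```

The simulated bias should be close to zero, and the simulated standard error should be close to the theoretical value, because the standard deviation of the exponential distribution is also 1/rate.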

Functions like the three shown above (rnorm(), rexp() and runif()) exist for many distributions in R. You can find the one you need by searching the web or by opening the help page ?Distributions.
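
The same naming scheme carries over to other distributions. As a brief sketch, the calls below draw from the binomial, Poisson and Student t distributions with rbinom(), rpois() and rt(); the parameter values and the seed are arbitrary choices for illustration. For each distribution, the prefixes d, p and q give the density, the distribution function and the quantile function, shown here for the normal distribution.

```r
set.seed(1)   # assumption: fixed seed for reproducibility

n = 100       # number of observations

B  = rbinom(n, size = 10, prob = 0.3)  # binomial: 10 trials, success probability 0.3
Po = rpois(n, lambda = 2)              # Poisson with mean 2
T5 = rt(n, df = 5)                     # Student t with 5 degrees of freedom

mean(B)       # should be close to size * prob = 3
mean(Po)      # should be close to lambda = 2

dnorm(0)      # density of the standard normal distribution at 0
pnorm(1.96)   # P(Z <= 1.96), roughly 0.975
qnorm(0.975)  # 97.5% quantile, roughly 1.96
```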

3.2 Load data set

It is also possible to work with observed data. Typically, such data is stored in a file and has to be loaded into R. In the example below, we load a dta file (the Stata format) with read.dta() from the foreign package.

library(foreign)
P = read.dta("ajrcomment.dta")

head(P)
##       longname shortnam step   mort logmort0 risk loggdp campaign slave source0
## 1       Angola      AGO    3 280.00 5.634789 5.36   7.77        1     0       0
## 2    Argentina      ARG    4  68.90 4.232656 6.39   9.13        1     0       0
## 3    Australia      AUS    4   8.55 2.145931 9.32   9.90        0     0       0
## 4 Burkina Faso      BFA    2 280.00 5.634789 4.45   6.85        1     0       0
## 5   Bangladesh      BGD    1  71.41 4.268438 5.14   6.88        1     0       1
## 6      Bahamas      BHS    4  85.00 4.442651 7.50   9.29        0     0       0
##   latitude neoeuro asia africa other edes1975 malaria other2 cons90 lado1995
## 1   0.1367       0    0      1     0        0   1.000      0      3        2
## 2   0.3778       0    0      0     0       90   0.000      0      6        5
## 3   0.3000       1    0      0     1       99   0.000      1      7        6
## 4   0.1444       0    0      1     0        0   1.000      0      1        4
## 5   0.2667       0    1      0     0        0   0.158      0      2        3
## 6   0.2683       0    0      0     0       10      NA      0     NA        4
##   ajr_rnd2
## 1        0
## 2        0
## 3        0
## 4        1
## 5        1
## 6        0
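
Since ajrcomment.dta may not be available on every machine, the following sketch writes a small data frame to a temporary dta file with write.dta() from the foreign package and reads it back. The data frame toy and its three columns are hypothetical stand-ins (the values are copied from the first rows printed above). Besides head(), the functions dim(), names() and str() are common ways to get an overview of a data set after loading.

```r
library(foreign)

# hypothetical toy data; values copied from the first three rows shown above
toy = data.frame(shortnam = c("AGO", "ARG", "AUS"),
                 mort     = c(280.00, 68.90, 8.55),
                 africa   = c(1, 0, 0),
                 stringsAsFactors = FALSE)

f = tempfile(fileext = ".dta")   # temporary file, deleted at the end
write.dta(toy, f)                # save in Stata format
Q = read.dta(f)                  # load it again

dim(Q)      # number of rows and columns
names(Q)    # column names
str(Q)      # type and first values of each column

unlink(f)   # delete the temporary file
```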

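After the file is loaded, we can work with the data. Below, the column mort is accessed once by its name and once by its position (it is the fourth column):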

P$mort[1:5]
## [1] 280.00  68.90   8.55 280.00  71.41
P[1:5,4]
## [1] 280.00  68.90   8.55 280.00  71.41
mean(P[1:5,4])
## [1] 141.772
sd(P[1:5,4])
## [1] 128.6693
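
Two further operations come up quickly when working with loaded data: handling missing values and selecting rows by a condition. The sketch below rebuilds three columns from the six rows printed above as a stand-in data frame P6 (a hypothetical name), so that it runs without the file. The malaria column contains an NA, which mean() passes on to its result unless na.rm = TRUE is set.

```r
# stand-in for the first six rows of P; values copied from the output above
P6 = data.frame(mort    = c(280.00, 68.90, 8.55, 280.00, 71.41, 85.00),
                africa  = c(1, 0, 0, 1, 0, 0),
                malaria = c(1, 0, 0, 1, 0.158, NA))

mean(P6$malaria)                 # NA, because one value is missing
mean(P6$malaria, na.rm = TRUE)   # mean of the five observed values

P6[P6$africa == 1, "mort"]       # mort for the rows with africa equal to 1
summary(P6$mort)                 # minimum, quartiles, mean and maximum
```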

Apart from read.dta(), there are also read.csv(), read.xls() (from the gdata package) and many more functions to load data sets.
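
As a sketch of the csv case, the example below writes a small data frame to a temporary file with write.csv() and loads it again with read.csv(). The file and the data frame D are made up for illustration; with real data, only the read.csv() call is needed.

```r
# made-up example data
D = data.frame(id = 1:3, x = c(2.5, 3.1, 4.8))

f = tempfile(fileext = ".csv")       # temporary file, deleted at the end
write.csv(D, f, row.names = FALSE)   # do not store row names as a column

X = read.csv(f)   # header = TRUE and sep = "," are the defaults
head(X)

unlink(f)         # delete the temporary file
```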