3.5 Short lab: Create data of different sizes
- Data comes in different forms
- Often in dataframe/matrix/table form
- A table/dataframe/matrix has the dimensions m (rows) by n (columns)
- Q: In which units is data size measured (byte, MB, GB, etc.)?
- In R you can easily create datasets of different sizes (try adapting the parameters at home)…
# install.packages("pacman")
pacman::p_load(readr,
               dplyr)
vector_length <- 10000000 # Try 100000000
vector <- sample(1:10, vector_length, replace = TRUE) # random integers between 1 and 10
n.col <- 4
n.rows <- length(vector)/n.col
data <- matrix(vector,
               nrow = n.rows,
               ncol = n.col) %>% data.frame()
write_csv(data, "./data/lab_outputs/data_artificial_10000000.csv") # change the path!
# Put this in a function if you like!
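One way to follow the suggestion above is sketched below. The function name `create_artificial_data` and its arguments are hypothetical (they do not appear in the lab), and the sketch uses base R only so that it runs without `readr` or `dplyr`. Writing the result to disk is left out; `write_csv()` can be applied to the returned data frame as in the code above.

```r
# Hypothetical helper (name and arguments are illustrative, not from the lab):
# build a data frame with vector_length / n.col rows and n.col columns.
create_artificial_data <- function(vector_length, n.col = 4) {
  # The vector has to fill the matrix exactly
  stopifnot(vector_length %% n.col == 0)
  values <- sample(1:10, vector_length, replace = TRUE)
  n.rows <- vector_length / n.col
  as.data.frame(matrix(values, nrow = n.rows, ncol = n.col))
}

# Small example so that it runs quickly
small <- create_artificial_data(vector_length = 1000, n.col = 4)
dim(small) # 250 rows (m) and 4 columns (n)
```

Changing `vector_length` is then enough to produce datasets of different sizes.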
…a more complicated example where the data looks a bit more realistic and includes text data.
# install.packages("pacman")
pacman::p_load(randomNames,
               readr,
               dplyr)
n.rows <- 100000
data <- data.frame(id = 1:n.rows,
                   first.name = randomNames(1:n.rows, which.names="first"),
                   last.name = randomNames(1:n.rows, which.names="last"),
                   age = sample(15:90, n.rows, replace = TRUE),
                   income = sample(500:3000, n.rows, replace = TRUE)
)
write_csv(data, "./Data & material/lab_outputs/data_artificial_100000.csv") # 2.7 MB
# How many rows (m) and columns (n) does that dataset have?
n.rows <- 1000000
data <- data.frame(id = 1:n.rows,
                   first.name = randomNames(1:n.rows, which.names="first"),
                   last.name = randomNames(1:n.rows, which.names="last"),
                   age = sample(15:90, n.rows, replace = TRUE),
                   income = sample(500:3000, n.rows, replace = TRUE)
)
write_csv(data, "./Data & material/lab_outputs/data_artificial_1000000.csv") # 28 MB
# How many rows (m) and columns (n) does that dataset have?
n.rows <- 10000000
data <- data.frame(id = 1:n.rows,
                   first.name = randomNames(1:n.rows, which.names="first"),
                   last.name = randomNames(1:n.rows, which.names="last"),
                   age = sample(15:90, n.rows, replace = TRUE),
                   income = sample(500:3000, n.rows, replace = TRUE)
)
write_csv(data, "data_artificial_10000000.csv") # 294 MB
# You wanna get stuck?
# Try setting n.rows as below
# n.rows <- 100000000
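To check the answers to the questions above (rows, columns, and size in bytes/MB), something like the following sketch may help. It is an illustration rather than part of the lab: it uses a small dataset with numeric columns only, base R's `write.csv()` instead of `write_csv()`, and a temporary file (`tmp` is a hypothetical name) so that it runs without extra packages or a fixed path.

```r
# Small illustrative dataset (kept small on purpose so it runs quickly)
n.rows <- 1000
data <- data.frame(id = 1:n.rows,
                   age = sample(15:90, n.rows, replace = TRUE),
                   income = sample(500:3000, n.rows, replace = TRUE))

# Dimensions: rows (m) and columns (n)
dim(data)

# Size of the object in memory
size_in_memory <- object.size(data)
format(size_in_memory, units = "KB")

# Size on disk: write a CSV to a temporary file and ask the file system
tmp <- tempfile(fileext = ".csv") # hypothetical temporary path
write.csv(data, tmp, row.names = FALSE)
size_on_disk <- file.size(tmp) # in bytes
size_on_disk / 1000^2          # in MB, taking 1 MB = 10^6 bytes
unlink(tmp)                    # clean up the temporary file
```

The size in memory and the size on disk usually differ, because a CSV stores numbers as text while R stores them in a binary format.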