Chapter 4 Data import

4.1 Introduction

Data import is one of the first steps in any data analysis workflow. R provides powerful tools to read and import data from a wide variety of sources, including CSV files, Excel spreadsheets, URLs, and databases like SQL. This chapter will guide you through the most common methods for importing data into R, ensuring you can efficiently work with the data you need.R includes native functions for reading standard data formats and supports external libraries for handling specialized file types. Before importing data, it is essential to understand the structure of your dataset and its file format.

Common data sources:

  • Flat files: CSV, TXT, TSV.

  • Spreadsheets: Excel (.xlsx, .xls).

  • Online resources: Files hosted on URLs.

  • Databases: SQL databases.

  • Other formats: JSON, XML, and more.

4.2 Importing CSV and Text Files

CSV (Comma-Separated Values) is one of the most widely used data formats. R provides multiple methods for importing such files.

4.2.1 Using read.csv()

The base R function read.csv() is a quick and simple method to load CSV files.

Example:

data <- read.csv("data/my_data.csv")  
head(data)  

4.2.2 Using read.table()

For more control over delimiters, use read.table().

Example:

data <- read.table("data/my_data.txt", sep = "\t", header = TRUE)  

4.2.3 Using data.table::fread()

The fread() function from the data.table package is optimized for speed and works with various delimiters.

install.packages("data.table")  
library(data.table)  
data <- fread("data/my_data.csv")  
head(data)  

4.3 Importing Excel Files

Excel files are common in business and academic settings. R supports Excel file imports using several packages.

4.3.1 Using readxl

The readxl package is lightweight and does not require Excel to be installed on your system.

install.packages("readxl")  
library(readxl)  
data <- read_excel("data/my_data.xlsx", sheet = 1)  
head(data)  

4.3.2 Using openxlsx

The openxlsx package provides additional functionality, including writing data to Excel files.

install.packages("openxlsx")  
library(openxlsx)  
data <- read.xlsx("data/my_data.xlsx", sheet = 1)  
head(data)  

4.4 Importing Data from a URL

Sometimes, data is hosted online and can be directly imported into R using URLs.

4.4.1 Reading CSV from a URL

url <- "https://example.com/data.csv"  
data <- read.csv(url)  
head(data)  

4.4.2 Downloading a File and Importing

download.file("https://example.com/data.xlsx", destfile = "data.xlsx")  
library(readxl)  
data <- read_excel("data.xlsx")  

4.5 Importing Data from Databases (SQL)

R supports various databases, including SQLite, MySQL, and PostgreSQL. The DBI and RSQLite packages provide robust tools for database interaction.

4.5.1 Connecting to a SQLite Database

install.packages("DBI")  
install.packages("RSQLite")  
library(DBI)  

# Connect to the database  
con <- dbConnect(RSQLite::SQLite(), "data/my_database.sqlite")  

# Query data  
data <- dbGetQuery(con, "SELECT * FROM my_table")  

# Disconnect  
dbDisconnect(con)  

4.5.2 Connecting to MySQL or PostgreSQL

Use the RMySQL or RPostgres packages.

Example (MySQL):

install.packages("RMySQL")  
library(DBI)  

# Connect to the database  
con <- dbConnect(RMySQL::MySQL(),  
                 dbname = "my_database",  
                 host = "localhost",  
                 user = "username",  
                 password = "password")  

# Query data  
data <- dbGetQuery(con, "SELECT * FROM my_table")  

# Disconnect  
dbDisconnect(con)  

4.6 Other Formats (JSON, XML)

R also supports less common data formats, such as JSON and XML.

4.6.1 Importing JSON Files

Use the jsonlite package to parse JSON files.

install.packages("jsonlite")  
library(jsonlite)  
data <- fromJSON("data/my_data.json")  
head(data)  

4.6.2 Importing XML Files

Use the xml2 package for XML data.

install.packages("xml2")  
library(xml2)  
data <- read_xml("data/my_data.xml")  

4.7 Troubleshooting Data Import Issues

  • File Not Found: Double-check file paths or URLs. Use file.exists() to verify paths.

  • Encoding Problems: Use the encoding parameter in import functions for non-UTF-8 data.

  • Missing Libraries: Ensure required packages are installed.

  • Slow Performance: For large files, use optimized packages like data.table.

4.8 Summary

Efficient data import is essential for seamless data analysis in R. By understanding and using R’s diverse tools, you can quickly load data from various sources, including flat files, spreadsheets, URLs, and databases. Mastering these techniques ensures you are well-equipped to handle diverse datasets in your projects.