2.1 Definitions

A dataset is structured in rows and columns, with one row per observation (e.g., person) and one column per variable (e.g., age, gender, height, weight). Datasets can be stored on a computer in various formats: either a format specific to one program (e.g., Excel, R, SAS, SPSS, Stata) or a structured text file.

A few definitions:

  • A delimiter is the text that separates data values in a file. For example, a “comma delimited” file would have data values separated by commas.
  • A file in fixed-width format has data values that start in specific locations.
  • A text file could be a .csv file (comma delimited) or a .txt file with any kind of delimiter (including comma) or in fixed-width format.
    • A .txt file is a “text” file. Usually, .txt files default to opening in Notepad (Windows) or some other text editor.
    • A .csv file is a specific kind of text file that is comma-delimited and usually opens in Excel by default.

Most statistical software, including R, expects to find the following structure when reading a file:

  • One row per observation
    • With longitudinal data, there may be more than one row per person, but only one row per time point for each person.
  • One column per variable
  • Some consistent way of knowing when one variable ends and the next begins (fixed width or delimited).
  • If there are parts of the datafile with metadata (e.g., descriptions of the data, a codebook) or data summaries (e.g., column means), save a copy of the file with these removed before importing the data into R.
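As a minimal sketch of the "one row per time point for each person" layout, the small data frame below uses made-up values and hypothetical column names (id, time, weight) that are not from any real dataset. The check at the end confirms that no person has two rows for the same time point.

```r
# Illustrative longitudinal data: two people, two time points each.
# All names and values here are invented for the example.
long <- data.frame(
  id     = c(1, 1, 2, 2),
  time   = c(1, 2, 1, 2),
  weight = c(70.1, 70.8, 82.3, 81.9)
)

# anyDuplicated() returns 0 when no id/time combination appears twice.
n_dup <- anyDuplicated(long[, c("id", "time")])
n_dup
```

A nonzero result would point to the first row that repeats an earlier id/time pair, which is worth resolving before analysis.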

Delimited means that data values are separated from one another by the same character throughout the file. If any values have spaces in them (e.g., “no reply” which has a space) then spaces should NOT be used as delimiters.
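To see why, here is a small sketch that splits one line of data on its delimiter. The two example lines are made up for illustration; each is meant to hold two values, an ID and the response “no reply”.

```r
# The same record written with a space delimiter and with a comma delimiter.
line_space <- "7 no reply"
line_comma <- "7,no reply"

# Split each line on its delimiter (fixed = TRUE treats it as a literal character).
fields_space <- strsplit(line_space, " ", fixed = TRUE)[[1]]
fields_comma <- strsplit(line_comma, ",", fixed = TRUE)[[1]]

length(fields_space)  # 3: "no reply" was split into two pieces
length(fields_comma)  # 2: the value with a space stays intact
```

With the space delimiter, software cannot tell that “no” and “reply” belong to the same value, so the row appears to have an extra column.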

Here is an example of the contents of a comma-delimited file:

Gender,Age
0,12
0,25
1,21
1,18
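As a quick sketch of how R interprets such content, the example can be passed to read.csv() as a string via its text argument rather than from a file on disk. The object names csv_text and dat are arbitrary choices for this illustration.

```r
# The comma-delimited example as a single string, one row per line.
csv_text <- "Gender,Age\n0,12\n0,25\n1,21\n1,18"

# read.csv() uses the first line as column names by default.
dat <- read.csv(text = csv_text)

str(dat)
```

Each comma marks the boundary between Gender and Age, so the result has four rows and two columns.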

Fixed width means that each variable takes up the same number of characters in every row, so each variable always starts at the same position. For example, here Gender takes up 7 character columns and Age takes up 4:

Gender Age 
0      12  
0      25  
1      21  
1      18  
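Although fixed-width files are not the focus here, the following is a minimal sketch of reading the example above with read.fwf() from base R. It assumes the widths stated above (7 and 4), writes the lines to a temporary file, and skips the header line so that the column names can be supplied directly. The object names are arbitrary.

```r
# Recreate the fixed-width example in a temporary file (a throwaway path).
fwf_lines <- c(
  "Gender Age ",
  "0      12  ",
  "0      25  ",
  "1      21  ",
  "1      18  "
)
fwf_path <- tempfile(fileext = ".txt")
writeLines(fwf_lines, fwf_path)

# widths: Gender occupies 7 characters, Age occupies 4.
# skip = 1 passes over the header line, so column names are given explicitly.
# strip.white = TRUE is passed on to read.table() to drop the padding spaces.
dat_fwf <- read.fwf(
  fwf_path,
  widths = c(7, 4),
  skip = 1,
  col.names = c("Gender", "Age"),
  strip.white = TRUE
)

dat_fwf
```

Here the widths, not a delimiter character, tell R where one variable ends and the next begins.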

We will focus on delimited data files and files in other formats. For more information on fixed-width files, see ?read.fwf and ?readr::read_fwf.

Read the following sections and try out the example code for yourself in your own R script file.