Chapter 4 Importing data
In this chapter:
Guidelines for importing data
Setting the stage for subsequent chapters
4.1 Introduction
The data has been collected. It might be the result of:
Running trials or experiments in a controlled laboratory setting.
Making observations and recording the results of those observations (as an astronomer or biologist might).
Sampling the people in an area and asking them a series of questions.
Collecting information as part of a business operation. A familiar example are supermarket checkout scanners which collect data about grocery purchases. That data then facilitates operation management of the store by tracking sales and inventory and generating orders.
Once the data has been collected, one of the first steps in the analytic process is to import the data—read the contents of the file (or files) and begin preparing that data for the analysis. You might have collected the data yourself and designed the data storage. Or there may be a layer of processing between the raw source and the way it appears to you. After this processing, you might:
Have access to a database where the raw data is stored, including where additional post-collection manipulation might happen,
Be sent a file that contains a sample of the raw data,
Be able to download data from a website, where the downloaded table has data that has already been compiled and summarized from a larger data set. An example would be a country’s census data tables.
And it’s important to note that your analysis project might require more than one of these methods of data collection.
How you assemble the data you need will depend on many factors, including what is already available, what your budget is (for example, some business-related data is collected by companies that then make it available at a cost), and the legal and regulatory environment (note that the definition of “personal information” varies from one jurisdiction to the next).
The specifics of the “import” step will differ by such things as the format of file and the variables, all of which we will explore in the next chapters.
4.2 Data formats
The data gets stored in a variety of electronic formats. The choice of format might be influenced by any one of the following:
The underlying needs of the data collectors (some file formats are tailored to a specific use);
The technology available to the collector;
The nature of the data being collected.
There is sometimes (often?) no right answer as to the best format for a particular use case—there are pros and cons to each. (With that said, there is often a clearly less good choice for data storage and sharing—we’re looking at you, PDF.) What this means is that in your workflow you will have to deal with data that needs to be extracted from a multitude of systems, and will be available to you in a multitude of formats.
It is essential that a statistician can talk to the database specialist, and, as a team member, the statistician, along with most others, will be expected to be able to use the database facilities for most purposes by themselves, and of course advise on aspects of the design. There is always much preliminary ‘data cleaning’ to do before an analysis can begin, almost regardless of how good a job is done by the database specialist. (Venables 2010)
There are plenty of resources detailing the complexities of the different data storage formats, and the decision process that goes into determining which format is appropriate for a specific use-case. I always approach the task assuming that the professionals who built the data storage system made a well-informed decision, including balancing the various trade-offs between different formats, as well as budgetary and technology constraints that they might have faced.
4.3 Importing data
Here’s some advice that’s worth heeding:
- The arguments in the import functions are your friends. Use them as your first line of defense in your project workflow.
“My main import advice: use the arguments of your import function to get as far as you can, as fast as possible. [Emphasis added.] Novice code often has a great deal of unnecessary post import fussing around. Read the docs for the import functions and take maximum advantage of the arguments to control the import.” (Bryan and The STAT 545 TAs 2019, chap. 9: Writing and reading files)
In practice, this means that the first iteration of an import function will almost invariably not yield what you want. We will soon see some examples of this, but it might be in the variable names or how values are stored, how missing values are represented or the geometry of the spreadsheet. The arguments in the read function will allow you to address all of these issues, and more. Go back and add arguments to the function, and run the code again, and again, until you get to where you need to be.
- Plain-text is boring but in the long run a more flexible and adaptable format; “Today’s outputs are tomorrow’s inputs” (Bryan and The STAT 545 TAs 2019, chap. 9: Writing and reading files)
“A plain text file that is readable by a human being in a text editor should be your default until you have actual proof that this will not work. Reading and writing to exotic or proprietary formats will be the first thing to break in the future or on a different computer. It also creates barriers for anyone who has a different toolkit than you do. Be software-agnostic. Aim for future-proof and moron-proof.” (Bryan and The STAT 545 TAs 2019, chap. 9: Writing and reading files)
4.3.1 Check your results
It’s always a good idea to quickly check your data after any major processing step. This can start with, but is not limited to, importing your data.
Some things to ask about the data:
Did the import step give you as many records (rows) as you expected?
Are there as many variables (columns) as you expected?
Are the variable types (or classes) for those variables what you anticipated?
Did variables that should be numbers load as numeric types, or as character?
If working with labelled variables, did they load as factor type?
How are missing values represented?
- As we will see in the next chapters, many import functions provide the flexibility to encode NA values based on specific characters.
We will look at structured ways investigate the contents of the data in the chapter Validation strategies. In the next few chapters, we will focus on the process of reading the contents of different types of files.
4.3.2 Conclusion
In order to import data from different file types, we require different R functions, which we will delve into in the next chapters. While the specifics vary from one file type to the next, there are two general principles worth noting:
Use the arguments of your import function to get as far as you can, as fast as possible.
Check your results at the end of the import step—don’t wait until later to discover that something went awry.