Chapter 5 Using Models to Explore Your Data

The objectives of this chapter are to describe what a model is in general terms, to explain the purpose of a model with respect to a set of data, and, finally, to describe the process by which a data analyst creates, assesses, and refines a model. In a very general sense, a model is something we construct to help us understand the real world. A common example is the use of an animal that mimics a human disease to help us understand, and hopefully prevent and/or treat, the disease. The same concept applies to a set of data: presumably you are using the data to understand the real world.

In the world of politics a pollster has a dataset on a sample of likely voters and the pollster’s job is to use this sample to predict the election outcome. The data analyst uses the polling data to construct a model to predict what will happen on Election Day. The process of building a model involves imposing a specific structure on the data and creating a summary of the data. In the polling data example, you may have thousands of observations, so the model is a mathematical equation that reflects the shape or pattern of the data, and the equation allows you to summarize the thousands of observations with, for example, one number, which might be the percentage of voters who will vote for your candidate. Right now, these last concepts may be a bit fuzzy, but they will become much clearer as you read on.

A statistical model serves two key purposes in a data analysis, which are to provide a quantitative summary of your data and to impose a specific structure on the population from which the data were sampled. It’s sometimes helpful to understand what a model is and why it can be useful through the illustration of extreme examples. The trivial “model” is simply no model at all.

Imagine you wanted to conduct a survey of 20 people to ask them how much they’d be willing to spend on a product you’re developing. What is the goal of this survey? Presumably, if you’re spending time and money developing a new product, you believe that there is a large population of people out there who are willing to buy this product. However, it’s far too costly and complicated to ask everyone in that population what they’d be willing to pay. So you take a sample from that population to get a sense of what the population would pay.

One of us (Roger) recently published a book titled R Programming for Data Science. Before the book was published, interested readers could submit their name and email address to the book’s web site to be notified about the book’s publication. In addition, there was an option to specify how much they’d be willing to pay for the book. Below is a random sample of 20 responses from people who volunteered this information.

25 20 15 5 30 7 5 10 12 40 30 30 10 25 10 20 10 10 25 5

Now suppose that someone asked you, “What do the data say?” One thing you could do is simply hand over the data, all 20 numbers. Since the dataset is not that big, this would not be a huge burden. Ultimately, the answer to their question is in that dataset, but having all the data isn’t a summary of any sort. Having all the data is important, but on its own it is often not very useful, because the trivial model provides no reduction of the data.

The first key element of a statistical model is data reduction. The basic idea is that you want to take the original set of numbers that make up your dataset and transform them into a smaller set of numbers. If you originally started with 20 numbers, your model should produce a summary that is fewer than 20 numbers. The process of data reduction typically ends up with a statistic. Generally speaking, a statistic is any summary of the data. The sample mean, or average, is a statistic. So are the median, the standard deviation, the maximum, the minimum, and the range. Some statistics are more useful than others, but they are all summaries of the data.
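To make the idea of data reduction concrete, here is a minimal sketch in Python that reduces the 20 responses listed above to the six statistics just named. The variable name `prices` and the helper `summarize` are illustrative assumptions, not part of the original analysis; only the standard library `statistics` module is used.

```python
import statistics

# The 20 sampled responses (in dollars) listed in the text.
# The name `prices` is an assumption made for this illustration.
prices = [25, 20, 15, 5, 30, 7, 5, 10, 12, 40,
          30, 30, 10, 25, 10, 20, 10, 10, 25, 5]


def summarize(values):
    """Reduce a list of numbers to a handful of summary statistics.

    `summarize` is a hypothetical helper written for this sketch.
    """
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


summary = summarize(prices)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Each entry in `summary` is a statistic in the sense described above. Taken together, the six values are still far fewer than the 20 observations they summarize.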

Perhaps the simplest data reduction you can produce is the mean, or the simple arithmetic average, of the data, which in this case is $17.20. Going from 20 numbers to 1 number is about as much reduction as you can do in this case, so it definitely satisfies the summary element of a model.
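For readers who want to check the arithmetic, the mean can be written out as follows. The symbols are assumptions introduced here for illustration: $x_i$ denotes the $i$-th response, $n = 20$ is the sample size, and $\bar{x}$ is the sample mean.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
        = \frac{25 + 20 + 15 + \cdots + 25 + 5}{20}
        = \frac{344}{20}
        = 17.2
```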