The Art of Data Science

3.5 Case Study

Joe works for Fit on Fleek, a company that makes a variety of fitness tracking devices and apps. Like many tech start-ups, Fit on Fleek aims to use the data it collects from users of its devices to do targeted marketing of various products. The product it would like to market is one it has just developed and not yet started selling: a sleep tracker and app that tracks various phases of sleep, such as REM sleep, and also provides advice for improving sleep. The sleep tracker is called Sleep on Fleek.

Joe’s boss asks him to analyze the data that the company has on users of its health tracking devices and apps to identify users for targeted Sleep on Fleek ads. Fit on Fleek has the following data for each of its customers: basic demographic information, number of steps walked per day, number of flights of stairs climbed per day, sedentary awake hours per day, hours of alertness per day, hours of drowsiness per day, and hours slept per day (but not the more detailed information about sleep that the sleep tracker would track).

Although Joe has an objective in mind, gleaned from a discussion with his boss, and he knows what types of data are available in the Fit on Fleek database, he does not yet have a question. This scenario, in which Joe is given an objective but not a question, is common. Joe’s first task is therefore to translate the objective into a question, and this will take some back-and-forth communication with his boss. The approach to the informal communications that take place during a data analysis project is covered in detail in the Communication chapter. After a few discussions, Joe settles on the following question: “Which Fit on Fleek users don’t get enough sleep?” He and his boss agree that the customers most likely to be interested in purchasing the Sleep on Fleek device and app are those who appear to have problems with sleep, and that the easiest problem to track, and probably the most common one, is not getting enough sleep.

You might think that since Joe now has a question, he should move on to downloading the data and doing exploratory analyses, but he still has a bit of work to do to refine the question. The two main tasks Joe needs to tackle are: (1) to think through how his question does, or does not, meet the characteristics of a good question and (2) to determine what type of question he is asking, so that he has a good understanding of what kinds of conclusions can (and cannot) be drawn when he has finished the data analysis.

Joe reviews the characteristics of a good question and his expectations are that his question has all of these characteristics:

- of interest
- not already answered
- grounded in a plausible framework
- answerable
- specific

The answer that he will get at the end of his analysis (when he translates his question into a data problem) should also be interpretable.

He then thinks through what he knows about the question. In his judgment, the question is of interest because his boss expressed interest in it.

He also knows that the question has not already been answered: his boss indicated that it had not, and a review of the company’s previous data analyses reveals no analysis designed to answer it.

Next he assesses whether the question is grounded in a plausible framework. The question, “Which Fit on Fleek users don’t get enough sleep?”, seems to be grounded in plausibility, as it makes sense that people who get too little sleep would be interested in trying to improve their sleep by tracking it. However, Joe wonders whether the duration of sleep is the best marker for whether a person feels that they are getting inadequate sleep. He knows some people who regularly get little more than 5 hours of sleep a night and seem satisfied with their sleep. Joe reaches out to a sleep medicine specialist and learns that a better measure of whether someone is affected by lack of sleep or poor-quality sleep is daytime drowsiness. It turns out that his initial expectation that the question was grounded in a plausible framework did not match the information he received when he spoke with a content expert. So he revises his question to match his expectations of plausibility. The revised question is: “Which Fit on Fleek users have drowsiness during the day?”

Joe pauses to make sure that this question is, indeed, answerable with the data he has available to him, and confirms that it is. He also pauses to think about the specificity of the question. He believes that it is specific, but goes through the exercise of discussing the question with colleagues to gather information about its specificity. When he raises the idea of answering this question, his colleagues ask him many questions about what various parts of the question mean. What is meant by “which users”? Does this mean: What are the demographic characteristics of the users who have drowsiness? Or something else? What about “drowsiness during the day”? Should this phrase mean any drowsiness on any day? Or drowsiness lasting at least a certain amount of time on at least a certain number of days? The conversation with colleagues is very informative and indicates that the question is not very specific. Joe revises his question so that it is now specific: “Which demographic and health characteristics identify users who are most likely to have chronic drowsiness, defined as at least one episode of drowsiness at least every other day?”
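
A definition this specific can be turned directly into a rule that is applied to the data. The Python sketch below shows one possible reading of it, not the company’s actual method. It assumes that an “episode” is a day with more than `min_hours` recorded hours of drowsiness, and that “at least every other day” means no two consecutive days pass without an episode. The function name, the `min_hours` parameter, and the list-of-daily-hours input are all hypothetical.

```python
from typing import Sequence


def is_chronically_drowsy(drowsy_hours_per_day: Sequence[float],
                          min_hours: float = 0.0) -> bool:
    """Flag a user as chronically drowsy under one reading of Joe's definition.

    Assumptions (not from the source data dictionary):
    - a day counts as an "episode" if its drowsy hours exceed min_hours;
    - "at least every other day" means every pair of consecutive days
      contains at least one episode.
    """
    # With fewer than two days of data the rule cannot be checked.
    if len(drowsy_hours_per_day) < 2:
        return False

    has_episode = [hours > min_hours for hours in drowsy_hours_per_day]

    # Every consecutive pair of days must include at least one episode.
    return all(today or tomorrow
               for today, tomorrow in zip(has_episode, has_episode[1:]))


# Hypothetical daily drowsiness hours for two users.
user_a = [1.0, 0.0, 0.5, 0.0, 2.0]   # an episode at least every other day
user_b = [1.0, 0.0, 0.0, 1.0]        # two drowsiness-free days in a row
```

Under these assumptions `user_a` is flagged and `user_b` is not. A different reading, such as “an episode on at least half of all days”, would need a different rule, which is exactly why Joe’s colleagues pressed him on the wording.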

Joe now moves on to thinking about what the possible answers to his question are, and whether they will be interpretable. Joe identifies two possible outcomes of his analysis: (1) there are no characteristics that identify people who have chronic daytime drowsiness or (2) there are one or more characteristics that identify people with chronic daytime drowsiness. These two possibilities are interpretable and meaningful. For the first, Joe would conclude that targeting ads for the Sleep on Fleek tracker to people who are predicted to have chronic daytime drowsiness would not be possible. For the second, he would conclude that targeting the ads is possible, and he would know which characteristic(s) to use to select people for them.

Joe now has a good question in hand, after iterating through the three steps of the epicycle as he considered whether his question met each of the characteristics of a good question. The next step is for him to figure out what type of question he has. He goes through a thought process similar to the one he used for each of the characteristics above. He starts out thinking that his question is an exploratory one. As he reviews the description and examples of an exploratory question, however, he realizes that although some parts of the analysis he will do to answer the question will be exploratory, ultimately his question is more than exploratory: its answer will predict which users are likely to have chronic daytime drowsiness, so his question is a prediction question. Identifying the type of question is very helpful because, in addition to having a good question, he now knows that he needs to use a prediction approach in his analyses, in particular in the model-building phase (see the Formal Modeling chapter).
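
To make concrete what treating this as a prediction question implies, the Python sketch below scores a candidate rule by how well it predicts chronic drowsiness for users who were not used to choose the rule. It is only an illustration under stated assumptions: the function `evaluate_rule`, the `sedentary_hours` characteristic, the cutoff of 8 hours, and the four held-out users are all made up for the example and do not come from the Fit on Fleek data.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Characteristics = Dict[str, float]


def evaluate_rule(users: Iterable[Tuple[Characteristics, bool]],
                  rule: Callable[[Characteristics], bool]
                  ) -> Dict[str, Optional[float]]:
    """Compare a rule's predictions with observed chronic drowsiness.

    Each user is a pair (characteristics, is_chronically_drowsy).
    Returns sensitivity and specificity, or None when undefined.
    """
    tp = fp = tn = fn = 0
    for characteristics, drowsy in users:
        predicted = rule(characteristics)
        if predicted and drowsy:
            tp += 1
        elif predicted and not drowsy:
            fp += 1
        elif not predicted and drowsy:
            fn += 1
        else:
            tn += 1
    return {
        # Share of truly drowsy users that the rule flags.
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        # Share of non-drowsy users that the rule leaves unflagged.
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }


# Made-up held-out users: (characteristics, observed chronic drowsiness).
held_out = [
    ({"sedentary_hours": 10.0}, True),
    ({"sedentary_hours": 4.0}, False),
    ({"sedentary_hours": 9.0}, False),
    ({"sedentary_hours": 11.0}, True),
]


# Hypothetical candidate rule: flag users with 8 or more sedentary hours.
def candidate_rule(c: Characteristics) -> bool:
    return c["sedentary_hours"] >= 8.0


result = evaluate_rule(held_out, candidate_rule)
```

On this toy data the rule flags both drowsy users (sensitivity 1.0) but also flags one of the two non-drowsy users (specificity 0.5). The point is the framing: for a prediction question, the analysis is judged by how well characteristics predict the outcome for new users, not by how well they describe the users already examined.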