Chapter 1 Data Science
What is data science? A useful definition is best derived by breaking the concept into its component parts. The term data refers to raw facts in the form of numbers, symbols, words, images, or any other record of an observed phenomenon. However, data without context is of limited use. We must assign meaning to data before it can be considered useful information. The list of numbers \((32.75,33.32,33.42,...)\) on its own is meaningless. But if we learn these numbers represent the finishing times (in minutes) of the elite women’s division in a 10,000 meter foot race recorded in the month of May in the current year, then we have genuine information. By thoroughly investigating information we gain insight. With insight, we have a deeper understanding of the behaviors and relationships among the observational units of a system or domain. Perhaps we learn that the typical elite woman finishes the aforementioned race in \(35.5\) minutes. Such insight could be applied to answer important questions regarding human performance or race logistics. It is this conversion of data to insight and its application to problem solving that comprise the science in data science.
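To make the example concrete, the brief R sketch below converts such a list of raw numbers into a summary that supports insight. The values beyond the three shown above are invented purely for illustration.

```r
# Finishing times (minutes) for the elite women's division of a 10,000 m race
# (first three values from the text; the rest are made up for illustration)
finish_times <- c(32.75, 33.32, 33.42, 34.10, 35.48, 36.20, 37.05, 38.95, 39.20)

# With context attached, summaries of the data begin to yield insight
mean(finish_times)    # a "typical" finishing time
range(finish_times)   # fastest and slowest finishers
```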
The scientific method has been applied to gain insight and solve problems in a vast array of domains for centuries. Yet the reference to “data science” as a unique academic discipline and industry profession is relatively new. Consequently, we begin this text with a brief discussion of the evolution of data science.
1.1 Data History
Although the mathematical foundations date back to at least the 17th century, many designate the mid-20th century as the origin of modern data science. In his 1962 article The Future of Data Analysis, John W. Tukey advocated for a broader discipline, distinct from mathematical statistics (Tukey 1962). For much of his career, Tukey identified as a mathematical statistician. But his interests expanded and evolved over time.
“All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.”
While Tukey labeled his interest as data analysis, the combined activities described above are much closer to what we now consider data science. Gathering, analyzing, and interpreting data using the tools of mathematical statistics is at the very heart of data science. However, the term itself would not be used explicitly for more than another decade.
In his 1974 book Concise Survey of Computer Methods, computer scientist Peter Naur explicitly defined data science (Naur 1974).
“Data science is the science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.”
Much like Tukey, Naur focused on analyzing (“dealing with”) data and implied that gathering (“have been established”) and interpreting (“what they represent”) data were something separate. In the modern definition, we include the entire process of gathering, analyzing, and interpreting data into the science. Regardless, these early pioneers established the enduring connection between data science, statistics (Tukey), and computer science (Naur). Tukey referred to the “machinery” of data science in a figurative sense, while Naur invoked a much more literal meaning. That said, it would take yet another decade before the physical machinery of personal computers became widely available outside of government labs and academic institutions.
In 1981, IBM Corporation announced the release of the IBM Personal Computer (IBM 1981). Not long after, in 1984, the first Macintosh personal computer was released by Apple Incorporated (Apple 1984). While personal computers grew in popularity and prevalence throughout the 1980s, the limitations of stand-alone systems would not be widely overcome until the mid-1990s with the introduction of the web browser. Though networked computers had existed within the government since the 1960s, public access to the World Wide Web via the internet did not gain widespread popularity until roughly 1995.
On the verge of the 21st century, the academic and technological catalysts for the data science boom were already in place and would continue to advance at a staggering rate. The missing ingredient was advocacy. For his part, statistician William S. Cleveland published Data Science: an action plan for expanding the technical areas of the field of statistics with a focus on the development of practicing data analysts (Cleveland 2001). One of the primary outcomes suggested by Cleveland’s action plan was a strong partnership between mathematical and computer science departments at universities, the implication being that data science faculty should have equal capacity in mathematics and computing. Another critical outcome challenged the idea that data science is reserved for mathematicians and computer scientists alone.
“A very limited view of data science is that it is practiced by statisticians. The wide view is that data science is practiced by statisticians and subject matter analysts alike, blurring exactly who is and who is not a statistician. The wide view is the realistic one because all the statisticians in the world would not have time to analyze a tiny fraction of the databases in the world. The wide view has far greater promise of a widespread influence of the intellectual content of the field of data science.”
In other words, data science is an inherently interdisciplinary and inclusive field that should be learned and applied by all. Cleveland’s advocacy highlighted the need for universities to embrace data science teaching and research across departments. Further advocacy came from industry in 2009, when Google chief economist Hal R. Varian spoke with McKinsey & Company (Varian 2009). On the subject of future technological needs, Varian singled out data science.
“The ability to take data–to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it–that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complimentary scarce factor is the ability to understand that data and extract value from it.”
With growing support, university programs in data science slowly emerged at the graduate and then undergraduate levels. In 2017, the Park City Math Institute (PCMI) published Curriculum Guidelines for Undergraduate Programs in Data Science to guide universities wishing to create a major in data science (DeVeaux et al. 2017). Based on the wisdom of 25 faculty members from universities across the United States, PCMI defined data science as the “science of planning for, acquisition, management, analysis of, and inference from data”. Interestingly, this definition aligns well with what Tukey and Naur described decades prior in terms of gathering, analyzing, and interpreting data. The published guidelines provide key competencies of an undergraduate data science major.
- Computational and statistical thinking
- Mathematical foundations
- Model building and assessment
- Algorithms and software foundation
- Data curation
- Knowledge transference–communication and responsibility
This text seeks to introduce undergraduate students to the PCMI competencies and how they are applied to solve data-driven problems. The topics presented herein provide a strong foundation to support future course work culminating in a bachelor’s degree in data science. However, all students, regardless of academic pursuit, could benefit from the technical problem solving experience gained throughout the text.
At its core, data science is about answering questions and solving problems. Moreover, defensible problem solving requires proficiency and process. By proficiency, we mean the knowledge, skills, and abilities required to extract valuable insights from data and apply them in a particular domain. Without such capabilities, data scientists will struggle to solve the challenging technical problems that unlock a competitive advantage. At the same time, even the most capable practitioners can profit from a transparent, repeatable process. Various disciplines, including data science, have benefited from some form of the scientific method for centuries. The careful application of a defined process ensures the connection between question and answer is logical and repeatable. In the remaining sections of the chapter, we introduce the proficiency and process required for data-driven problem solving.
1.2 Proficiency
All disciplines and professions include requisite knowledge, skills, and abilities. Whether in research laboratories or commercial industries, the practice of data science is no different. In order to provide valuable insights and achieve long-term success, a data scientist must have experience with certain academic fields (knowledge), technical resources (skills), and domain-specific applications (abilities). We’ll explore each of these in turn.
1.2.1 Knowledge
While it is certainly possible to obtain the knowledge of a data scientist outside of formal academic institutions, conferred degrees and transcripted course work are the most common evidence of such knowledge. Degrees with the literal name “data science” are a 21st-century concept, but the academic building blocks reach back much further into history. Foundational disciplines such as applied mathematics, computer science, and project management all contribute to what we now consider the unique field of data science. So, what knowledge should a data scientist draw from these supporting disciplines?
Within the discipline of applied mathematics, the crucial knowledge areas include modeling, probability, and statistics. These areas typically build upon a base of linear algebra and calculus, but our focus remains on the three areas themselves. At its basic level, mathematical modeling seeks to represent the performance of real-world systems using a simulated proxy. This approach is useful when the system is inaccessible or too costly (in terms of time, money, and effort) to manipulate in reality. Probability is the study of uncertain events and is key to modeling stochastic (uncertain) systems. Since the association between inputs and outputs in most real-world systems is not deterministic (certain), knowledge of probability theory is paramount to data science. Finally, statistics involves the estimation of parameters and associations within a population based on sample data. Just as a real-world system is often inaccessible, we seldom have access to the individual characteristics of an entire population of interest. Thus, knowledge of proper sampling methods is critical. Putting all three of these branches of applied mathematics together permits data scientists to develop statistical models of stochastic systems.
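As a minimal illustration of how these three branches work together, the R sketch below simulates a hypothetical stochastic system (probability), draws a random sample from it (sampling), and estimates a population parameter from that sample (statistics). All values are simulated for demonstration only.

```r
set.seed(1)

# Hypothetical population: finishing times (minutes) for 100,000 runners,
# modeled as a normal random variable (the stochastic system)
population <- rnorm(100000, mean = 35.5, sd = 2)

# We rarely observe the whole population, so draw a random sample
race_sample <- sample(population, size = 50)

# Estimate the population mean and quantify the uncertainty of the estimate
mean(race_sample)              # point estimate of the population mean
t.test(race_sample)$conf.int   # 95% confidence interval for that mean
```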
Data science also incorporates knowledge from the discipline of computer science. Key areas include data structures, programming, and automation. Data structures refer to the organization, management, and storage of data. Various data types and applications lend themselves to particular structures, so it is vital that data scientists obtain and apply this knowledge. Computer programming, often referred to as “coding”, comprises much more than learning the syntax of a specific language. In particular, programming includes the development, testing, and implementation of computing algorithms. Modern statistical models are constructed by executing optimization algorithms on massive amounts of data. This cannot be accomplished without a blend of applied mathematics and computer science knowledge. Nor can it be accomplished without automation. Automation here refers to the growing fields of machine learning and artificial intelligence, in which algorithms are deployed on high-performance computing machines and consume data at a scale beyond a human’s manual capacity.
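For readers new to these ideas, the short R sketch below contrasts a few basic data structures; the objects and values are invented for illustration.

```r
# Atomic vector: ordered values of a single type
times <- c(32.75, 33.32, 33.42)

# List: heterogeneous container, useful for nested or irregular data
race <- list(event = "10,000 m", division = "elite women", times = times)

# Data frame: the tabular workhorse of analysis (rows = observations)
results <- data.frame(runner = c("A", "B", "C"), minutes = times)
str(results)
```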
In addition to applied mathematics and computer science, successful data scientists understand the fundamentals of project management. Though by no means an exhaustive list, essential areas of knowledge include collaboration, communication, professional ethics, and quality assurance. Data scientists apply their mathematical and computer science knowledge by collaborating within teams to solve problems. This cannot be achieved without a sufficient understanding of best practices in oral, written, and visual communication. This should not be achieved without ethical guidelines for collecting, storing, and using personally-identifiable information or demographic data. Closely related to ethics, data scientists must document, replicate, and verify their work to ensure high-quality results. Thus, while the mathematical and computing skills often receive much of the focus, prospective data scientists should never lose sight of the valuable knowledge offered within the discipline of project management.
Along with distinct knowledge gained primarily through academic education, data scientists must have the technical skills to implement that knowledge using contemporary hardware, software, and interpersonal communication. We explore these skills in the next section.
1.2.2 Skills
While knowledge represents conceptual understanding, skills imply the capacity to physically demonstrate that understanding. Skills tend to be more dynamic and perishable than knowledge. For example, a data scientist might have a strong fundamental understanding of computer programming that endures for decades, but the preferred coding language could change intermittently over that time period. Thus, it is vital that data scientists remain current with the technical skills in the field as they evolve. Here we describe some of the common “brand names” within each skill set as of this writing.
Contemporary data science requires the ability to code in multiple programming languages. This text focuses on R as the preferred language for data analysis. However, there are many other languages that support the broader data science process. Foremost among those are Python, SQL, and JavaScript. Python is a general-purpose programming language with the flexibility to support a variety of activities. Often it is employed as part of the extract, transform, and load (ETL) process to aggregate distributed data, but it can also be used for data analysis. SQL (structured query language) is specifically designed to create, manage, and query databases. Prior to analysis, data scientists must acquire the necessary data and this is often achieved using SQL code. JavaScript is a popular language among web developers. In many cases, the final results of a data scientist’s work are deployed on the World Wide Web. As a result, some knowledge of JavaScript can ease the collaboration between scientists and developers.
Although data scientists can program in their favorite language directly at a command prompt, it is often more efficient (or at least more user-friendly) to write code using integrated development environments (IDEs). These software applications provide more “point-and-click” and error-checking options than a command console. For R, the most common IDE by far is RStudio. For Python, popular IDEs include IDLE, Spyder, and Jupyter. For SQL, many data scientists employ pgAdmin, Adminer, or DBeaver. For JavaScript, common examples are Visual Studio, Eclipse, and NetBeans. Some IDEs are open-source (i.e., free), but offer little to no formal support. Others can be quite costly, but offer dedicated customer support.
While SQL and its associated IDEs are the most common methods to interface with databases, the databases themselves take on various frameworks. Broadly speaking, data management systems can be distinguished as relational (aka SQL) or non-relational (aka NoSQL). Prevailing relational databases include PostgreSQL, MySQL, MariaDB, and Oracle. These platforms are highly-structured and organize data into tables with well-defined rows and columns. By contrast, non-relational databases store unstructured data such as documents, photos, and videos in highly-flexible formats. Current examples include MongoDB, Redis, DynamoDB, and Cassandra.
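As a hedged illustration of querying a relational database from R, the sketch below uses the DBI and RSQLite packages (assuming both are installed) to build a small in-memory table and issue a SQL query against it. The table and column names are invented; a production system would typically connect to a server-based platform such as PostgreSQL rather than an in-memory SQLite database.

```r
library(DBI)

# In-memory SQLite database stands in for a production relational system
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Create a small, well-structured table (rows and columns)
dbWriteTable(con, "results",
             data.frame(runner  = c("A", "B", "C"),
                        minutes = c(32.75, 33.32, 33.42)))

# SQL query issued from R; the syntax transfers to other relational systems
dbGetQuery(con, "SELECT runner, minutes FROM results WHERE minutes < 33.40")

dbDisconnect(con)
```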
We typically learn data science by solving relatively small problems on our own personal computers. But this is not common practice in industry. Modern data science problems involve massive amounts of data managed on distributed, high-performance computing platforms. This has become known as “big data”. Though the term “big data” is a bit ambiguous and admittedly over-used, there are utilities specifically designed to address what the term implies. The Apache Software Foundation has developed a wide variety of software for large-scale data processing. Well-known platforms include Hadoop, Spark, and Kafka. At a basic level, these systems distribute the workload over multiple machines in order to store and quickly process large-scale data. The primary trade-offs in such systems involve throughput (amount of data) versus latency (time to completion) and batch processing versus streaming.
Related to distributed computing, modern data science is often conducted on cloud-based platforms. Amazon Web Services, Microsoft Azure, and Google Cloud all offer access to remote machines for storing and accessing data, as well as developing and deploying algorithms. Even IDEs such as RStudio Cloud (now Posit Cloud) are often employed remotely. Cloud-based computing can offer greater accessibility, security, and performance when compared to the on-site resources available to most data scientists. The primary trade-off is connectivity. If the cloud-based server is down or the internet connection is interrupted, then data scientists lose access to their data and IDEs.
Data scientists seldom work alone. Instead, technical projects are planned and executed in collaborative environments. Popular code-sharing and version-control platforms include GitHub, GitLab, Bitbucket, and SourceForge. These applications allow multiple data scientists to work on the same source code simultaneously and to aggregate independent edits in an organized fashion. In other words, they support parallel rather than sequential development. More generally, project teams benefit from engaging in a centralized environment. Applications such as Teams, Slack, and Confluence facilitate documentation and transmission of information throughout the completion of a project.
Finally, successful data scientists are well-versed in communication skills. Common tools for visual communication include Tableau, Power BI, and Shiny. These dashboarding tools permit interactive visualizations that can be tailored to a particular audience and allow for live responses to spontaneous questions. In terms of written communication, the Microsoft Office suite remains a popular choice. However, technical reports that require the inclusion of equations and code are often completed using LaTeX editors. In fact, this text is written in such a fashion using RMarkdown. In addition to the technical tools for visual and written communication, the most successful data scientists possess strong oral and interpersonal communication skills. The capabilities to speak confidently, listen carefully, and think critically are imperative.
Academic knowledge and technical skills should be combined and applied to solve real-world problems in a wide variety of domains. In any discipline, the ability to apply knowledge and skills is what separates theory from practice. In the next section, we summarize some of these key application abilities in the world of data science.
1.2.3 Abilities
It is difficult to conceive of a domain that could not benefit from the knowledge and skills of a data scientist. The competitive advantage offered by data-driven insights continues to be recognized more and more throughout government, industry, and academia. Consequently, it is nearly impossible to list all of the application areas in which a data scientist provides value. Here we discuss a few of these domains, but challenge the reader to find application opportunities in every domain they encounter.
One of the major application areas for data science is online customer service. Artificial intelligence chatbots are driven by machine learning algorithms. The development of such algorithms requires a wide array of data science knowledge and skills. The customer service industry continues to expand its use of chatbots to replace human agents for frequently asked questions and common tasks. When coupled with online search engines, chatbots also provide a powerful means of rapid research and development.
Another domain experiencing a growing application of data science is targeted marketing. Statistical models are tailor-made for learning shopping and purchasing habits and using that information to suggest future products of interest. Online companies have mastered the use of statistical learning algorithms to personalize experiences for their shoppers. Other data science techniques, such as spatial analysis, can also be applied to determine the best market location in which to build new facilities.
Within the finance industry, data science competencies are applied for fraud detection and loss prevention. The ability to predict future fraud or loss is an enormous component of risk management, and data science offers solutions. Statistical models also play a big role in price projections for stocks and commodities. Traders gain an incredible competitive advantage from models that accurately predict future market movements. The construction of such models relies heavily on data science knowledge and skills.
The health care domain has prospered greatly from the insights offered by data science. One need look no further than the scientific response to and recovery from the COVID-19 pandemic to find widespread applications of data science. The capacity to gather, analyze, and interpret infection and vaccination rates was fundamental to that response and recovery. Even the design of experiments to develop a vaccine (or any new medicine) is rooted in data science. Another application in health care is image recognition. Statistical algorithms can learn to recognize abnormalities in medical imagery (e.g., CT scans or MRIs) to supplement human diagnoses. Ideally, such models reduce the occurrence of false positives and negatives.
Yet another domain applying data science competencies is the transportation industry. On many toll roads there are no longer human toll collectors at a booth. Instead, image recognition software driven by machine learning algorithms identifies license plate numbers and automates bill delivery. Airlines are increasingly leveraging data science for route planning by predicting delays, demand, and weather. Statistical models also offer tremendous benefits when determining real-time pricing for parking or ticketing across multiple modes of transportation.
As a final example, the sports industry is employing a plethora of data science tools to gain a competitive advantage. At the tactical level, wearable technology and spatial imaging are combined to explain and improve player performance via the associated data and models. Similarly, data scientists predict opportunity and efficiency for players and teams in real and fantasy sports. Even the sports betting industry increasingly relies on data science to determine the appropriate odds to offer customers.
As previously mentioned, these domains are only a few that currently profit from the knowledge and skills of a data scientist. Many other domains across the arts, sciences, and humanities can and do benefit as well. Once a data scientist gains the requisite knowledge, skills, and abilities, how are they employed to solve real-world problems? While the complexity and magnitude of data-driven problems vary, we follow a repeatable problem-solving process to progress from asking a question to providing an answer. We call this process the 5A Method and describe it fully in the next section.
1.3 Process
In order to apply the knowledge, skills, and abilities of a data scientist to solve meaningful problems, it is helpful to follow a repeatable methodology. There exists no shortage of suggested processes and associated guidance on how a data-driven problem should be solved. But the method presented here attempts to synthesize the important ideas shared by all of these processes.
The 5A Method shown in Figure 1.1 consists of asking a research question, acquiring data relevant to that question, analyzing the data to gain important insights, advising others on those insights, and finally answering the question. Though the steps of this process are displayed in a sequential fashion, it is quite common to experience some back-and-forth between and across steps as new issues or ideas reveal themselves. The data science knowledge, skills, and abilities presented in the previous section pervade all five steps of the problem solving process. Let’s examine each step of the 5A Method in more detail.
1.3.1 Ask a Question
Meaningful problems generate interesting questions. Often data scientists are asked questions by domain experts or stakeholders. Other times, data scientists might ask their own questions in the midst of a research project. In any case, a carefully-constructed problem statement should spawn good questions that can be answered with data-driven insights. So, what is a “good” question?
One characteristic of a good question is that it is of genuine concern to the stakeholder. Given a large, diverse data set, there are a seemingly unlimited number of questions we might ask and answer. But many of them are unlikely to be of concern to the stakeholder. Though this characteristic might seem obvious, it is surprisingly easy for a data scientist to expend untold time and effort answering a question that is interesting but not truly of concern to the stakeholder. Careful collaboration can help ensure the focus remains on questions relevant to the problem at hand.
Good questions are also specific. Much like irrelevant questions, vague questions can lead to wasted effort that does not contribute to solving the stakeholder’s problem. One reason for this is that vague questions often comprise many specific questions. What if the stakeholder has only one of those specific questions in mind, but the data scientist interprets the need as a different question? It is worth the time and effort up front for the stakeholder and data scientist to identify the specific question (or questions) that need to be answered.
Another characteristic of a good question is that it is currently unanswered. If the scientific community has already thoroughly and consistently demonstrated the answer to a question, then it may not be worth the time and effort to seek the solution to a problem that has already been solved. That said, replication is an exceedingly important and necessary part of the scientific process. The phrase “thoroughly and consistently demonstrated” is key here. We should not consider a question answered just because a single study found a solution within a single context. Instead, we should reference and contribute to the body of work until the question is truly answered.
On a related note, a good scientific question is answerable. There are many concerning, specific, unanswered questions in disciplines such as philosophy and theology that are unlikely to be answered by data science methods. For our purposes, we focus on problems for which relevant data exists that can plausibly inform an actionable solution. At the very least, it must be true that the data can exist even if it has not yet been collected. Thus, data scientists and stakeholders must collaborate to determine what data is required to inform a solution. Not only should a question be tactically answerable, in terms of available data, but it should also be conceivably answerable. If the question seeks trends or associations within the data that make no practical or logical sense, then the answer is likely spurious.
Just because a question can be answered does not mean it should be. Good questions should be ethical. Much like physicians have an obligation to “do no harm”, data scientists should not contribute to research that seeks to discriminate against, marginalize, or otherwise damage the lives of others. Sadly, this is not always as obvious as we might think. The proliferation of artificial intelligence, driven by machine learning algorithms, might appear to be an objective means of removing human prejudices from decision-making. But if such algorithms are trained on data derived from a biased human system, then the associated output will simply reflect the same biases.
By the end of the first step of the 5A Method, we have a good question (or questions) in hand. Additionally, after collaboration with stakeholders, we have an idea of the data necessary to answer the question(s). The next step is to acquire that data.
1.3.2 Acquire the Data
As part of generating a good question, we identify the data that can plausibly contribute to an answer. The description of the necessary data might initially be relatively conceptual, but we must formalize its structure in the acquisition phase in preparation for analysis. We discuss data collection and wrangling in much greater detail in Chapter 2. For now, we limit our focus to the broader strokes of how data is obtained.
The first step in acquisition is to determine if and where the data already exists. Though seldom the case, it could be that every bit of required data already exists in a single location and is freely available. When employed by an established company with a well-developed and maintained database, a data scientist might find themselves in the enviable position of having access to all the necessary data. More often, it is the case that the data is proprietary, costly, or distributed across multiple platforms that must be aggregated. Other times the data does not exist at all and must be collected from the ground up. Similar to researching whether a question has already been answered, we should also determine whether data has already been collected.
If the data does exist and is accessible, we should investigate the reliability of the source. Was the data collected using sound scientific methods? Could any biases exist? This is the trade-off inherent in using existing data. We can avoid the time, money, and effort required to collect the data ourselves, but we must assess the quality of other people’s research. That said, good data scientists document their work and make it easy for others to build upon what they have accomplished. We should seek to do the same when collecting and distributing our own data.
If the data does not exist or is not accessible, then we must collect it ourselves via observation or experimentation. Observational collection means we are not participating in the data-generating process. This could involve the physical observation and manual tabulation of data during an in-person event or a connection to live streaming data online. Regardless, observational data is collected during some process without any explicit interaction on the part of the observer. By contrast, experimental collection involves careful planning and design of the data-generating process before it happens. In this case, we (or other scientists) are directly involved in controlling the inputs to the process. A common example is a clinical trial where subjects are assigned to control or treatment groups. Controlled experiments tend to be more costly, but provide access to inferences (e.g., causality) that are typically not available with observational studies. Experiments can also be conducted virtually using computer simulations when the associated system is well-defined.
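The brief R sketch below illustrates one element of experimental collection, the random assignment of hypothetical subjects to control and treatment groups; the roster and group sizes are invented for demonstration.

```r
set.seed(42)

# Hypothetical clinical-trial roster of 20 subjects
subjects <- paste0("S", 1:20)

# Experimental collection: the researcher controls the inputs by randomly
# assigning each subject to a control or treatment group
assignment <- sample(rep(c("control", "treatment"), each = 10))

data.frame(subject = subjects, group = assignment)
```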
Once the data is collected, we must decide the structure and format in which it will be stored. In the ideal case, all of the data is already stored and maintained in a database. Depending on the circumstances, we might have direct access to the entire database. More often, we have limited access to specific tables within a database via an application programming interface (API). Regardless, in these situations the data storage structure is already determined and we simply query the database when needed. On the other hand, if the data is collected from scratch, then we might need to aggregate multiple file types and decide how best to store the resulting data for future use. The storage method could be as simple as saving files on a hard drive or as complex as constructing a cloud-based data management system.
The process of collecting, aggregating, and storing data inevitably results in mistakes or oversights. Hence, we often need to clean the data prior to analysis. Common data cleaning procedures include resolving missing or duplicated values. When values are missing, we must decide whether the value can be determined or whether to assign a reasonable proxy value. When values are duplicated, we must decide whether it is an administrative error or a truly-repeated occurrence. Another typical data cleaning practice is to identify extreme or unlikely values. Similar to duplicates, such values could be the result of an error or a genuine outlier. In either case, their presence in the data should be resolved prior to analysis.
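As a small illustration, the R sketch below flags a missing value, a duplicated row, and an implausibly extreme value in an invented data set; how each issue is resolved remains a judgment call for the analyst.

```r
# Hypothetical raw data with a missing value, a duplicate, and an extreme value
raw <- data.frame(runner  = c("A", "B", "B", "C", "D"),
                  minutes = c(32.75, 33.32, 33.32, NA, 333.42))

# Missing values: locate them, then decide whether to impute or investigate
which(is.na(raw$minutes))

# Duplicates: administrative error or a genuinely repeated observation?
raw[duplicated(raw), ]

# Extreme values: flag observations far outside a plausible range for review
subset(raw, minutes > 60)
```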
As with the development of good questions, data should be collected and stored in an ethical manner. Data collection processes raise concerns regarding informed consent. This is particularly true with controlled experiments for which subjects are assigned treatments. For example, in a study of what causes lung cancer it might be technically useful to assign subjects to a treatment group that is required to smoke tobacco as a contrast to a control group that does not smoke. However, given the demonstrated correlations between smoking and lung cancer, this would be ethically questionable. Are the subjects fully informed and genuinely consenting to a treatment likely to negatively impact their health? In addition to issues with informed consent, data storage heightens concern for privacy and security. Many studies and experiments involve personally-identifiable information such as date of birth, location, and demographics. Data scientists have an ethical obligation to care for the security of this information.
By the end of the second step of the 5A Method, we have a good question and a clean, reliable, ethical data set. Ideally, that data set captures all of the information relevant to the question. The next step is to analyze the data to draw insights that can inform an answer.
1.3.3 Analyze the Data
After formulating a good research question and acquiring the requisite data to answer it, we utilize a variety of analytic tools to extract valuable insights. The decision of which tool(s) to employ is largely driven by the type of question. The three basic types of questions that data scientists face are exploratory, inferential, and predictive. The tools associated with each type are examined in much greater detail in Chapters 3, 4, and 5, respectively. However, we discuss them briefly here.
In an exploratory analysis, we seek to describe characteristics of the data and/or to discover associations or trends within the data. An exploratory analysis can be thought of as internal to the data in hand. There is no intent to infer or predict behavior external to the existing data. That said, an exploratory analysis is often employed to generate hypotheses that can later be tested in an inferential analysis, or to identify potentially-valuable features for a predictive analysis. Regardless, the associated insights are restricted to the scope of the extant data.
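The sketch below hints at what an exploratory analysis might look like in R, using the built-in mtcars data purely as a stand-in for a stakeholder’s data set.

```r
# Exploratory analysis: describe the data and discover associations within it
summary(mtcars$mpg)            # numerical summaries of fuel efficiency
hist(mtcars$mpg)               # distribution of the variable
cor(mtcars$mpg, mtcars$wt)     # association between efficiency and weight
plot(mpg ~ wt, data = mtcars)  # visual check of the same association
```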
As alluded to previously, an inferential analysis aims to extend the insights gained from one data set to another data set. More specifically, we identify characteristics or associations within a sample and use them to estimate the same characteristics or associations within a population. Inferential analyses require many technical conditions to ensure that the sample is representative of the population and that the estimates are accurate and precise. Much of the effort associated with inference regards checking and resolving such conditions. However, the benefit is the capacity to estimate the behavior of an inaccessible population using an accessible sample.
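Continuing with the same stand-in data, the sketch below shows one common inferential tool, a two-sample t-test; in practice, checking the conditions behind such a test is a substantial part of the work.

```r
# Inferential analysis: do automatic (am = 0) and manual (am = 1) cars differ
# in mean mpg within the population these data are meant to represent?
t.test(mpg ~ am, data = mtcars)

# The output includes a p-value and confidence interval; verifying the
# conditions (independence, approximate normality, representative sampling)
# is the analyst's responsibility.
```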
A predictive analysis also extends insights from accessible data to inaccessible data. But, rather than from sample to population, the extension can be thought of as from past to future. In other words, we leverage what we observed in the past to predict what we might observe in the future. Predictive analyses tend to require fewer technical conditions, compared to inference, because the primary performance metric is accuracy rather than the validity of a list of assumptions. On the other hand, robust, high-accuracy prediction models can require much more data than inferential models.
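The sketch below illustrates the predictive mindset with the same stand-in data: fit a model on “past” (training) observations and measure its accuracy on held-out “future” (testing) observations. The variables and split are chosen only for demonstration.

```r
set.seed(7)

# Split the data into training ("past") and testing ("future") observations
train_rows <- sample(nrow(mtcars), size = 24)
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

# Fit a simple prediction model on the training data
fit <- lm(mpg ~ wt + hp, data = train)

# Predict the held-out observations and measure accuracy (RMSE)
pred <- predict(fit, newdata = test)
sqrt(mean((test$mpg - pred)^2))
```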
The third step of the 5A Method concludes with relevant analytic insights in the form of statistical models that we present to the stakeholder. For exploratory analyses, the results could be purely descriptive or they could hypothesize some associations within the data that warrant further research. For inferential analyses, we might offer conclusions from formal tests of prior hypotheses. Finally, if the research question implies prediction, we could identify important features and associated prediction accuracies. Regardless, in the next step of the 5A Method we must communicate analytic insights in the context of the stakeholder’s domain and advise on their significance, limitations, and implementation.
1.3.4 Advise on Results
Data scientists understandably expend a great deal of time and effort focusing on the acquisition and analysis of data. Four chapters of this text are devoted to these technical steps of the 5A Method. However, in real-world applications, even the best analytic results are of limited value if they cannot be effectively communicated to a stakeholder in the context of the problem domain. Strong non-technical competencies, such as interpersonal communication, are what truly set the best data scientists apart. Thus, we explore important considerations for advising stakeholders in every case study presented herein.
The first step in advising stakeholders on the results of an analysis is to interpret (aka translate) the insights into common language. Not all audiences have the same technical background in mathematical and computer sciences as a data scientist. Consequently, it is important to avoid discipline-specific jargon that might confuse or alienate stakeholders. Common language interpretations often include analogies, examples, or diagrams that are accessible to a wider audience or specific to the domain in question. Though it can be difficult to distill complex concepts into simpler terms, it is well worth the effort prior to formally presenting results.
After translating analytic results into common language, data scientists present said results to stakeholders in oral, visual, and written form. Clear and confident oral communication is paramount when it comes to assuring an audience of the validity and value of technical advice. Given a gathering of diverse stakeholders, widespread trust in the proposed solution is often attributed to the speaker just as much as the technical methods. Careful preparation and practice are vital to presenting in such a compelling manner. Another vital component of technical presentation is well-designed visual representations of the data and results. Just as a picture is worth a thousand words, a visualization is worth a thousand (or many more!) data points. In most cases, it is difficult for an audience to extract wisdom from a table of numbers or list of equations. Well-constructed data visualizations let the data “speak” for themselves and invite the audience to discover their own insights. The results of analyses are also commonly presented or published in written reports. Written communication is a more appropriate platform for the technical aspects of an analysis, particularly in the realm of publication. However, it is still important to describe the domain-specific implications of the results.
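As one possible visual-communication example, the R sketch below uses the ggplot2 package (assuming it is installed) to produce a labeled plot of the same stand-in data; the titles and axis labels are illustrative.

```r
library(ggplot2)

# A labeled, audience-friendly visualization communicates more than a table
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Vehicle weight (1,000 lbs)",
       y = "Fuel efficiency (mpg)",
       title = "Heavier cars tend to be less fuel efficient")
```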
When advising on results it is also important to explain the significance of the associated conclusions. But there are two types of significance when it comes to solving data science problems. The first type is statistical significance. This is the more technical type and it has to do with how confident the stakeholder can be in the proposed conclusions. Said another way, statistical significance determines the likelihood of committing an error when implementing the proposed solution. The second type of significance is practical. We distinguish the two because it is possible for a conclusion to be statistically significant while not being practically significant. Given enough data, we can often detect statistically significant differences between any two groups. However, those differences could be trivial from a practical standpoint.
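The R sketch below demonstrates the distinction with simulated data: a trivially small difference between two hypothetical groups becomes statistically significant once the samples are large enough, even though it may be practically meaningless.

```r
set.seed(123)

# Two hypothetical groups whose true means differ by a trivial 0.05 minutes
group_a <- rnorm(100000, mean = 35.50, sd = 2)
group_b <- rnorm(100000, mean = 35.55, sd = 2)

# With this much data the difference is statistically significant...
t.test(group_a, group_b)$p.value

# ...but a few seconds over a 35-minute race may be practically unimportant
mean(group_b) - mean(group_a)
```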
Along with significance, data scientists must clearly communicate the limitations of analytic results. This is the place to consider the ethical implications of the recommended solution. Stakeholders must be informed of the conditions in which the results can and cannot be applied. Data scientists must apprise stakeholders of known biases in the original data and obvious populations to which the conclusions do not pertain. Solutions to problems associated with stochastic systems are never universal. There are always limitations and data scientists have an ethical obligation to identify them with due diligence. In addition to ethical concerns, analytic results often come with technical requirements. Certain modeling approaches demand a minimum amount of data or assumptions regarding the process from which the data was generated. Similarly, certain solutions might necessitate particular hardware, software, or connectivity in order to be implemented. Such limitations must be presented to stakeholders, along with the results.
After presenting the significance and limitations of results in domain-specific terms, data scientists should offer advice on implementation. For an exploratory analysis, this might comprise a plan for future research. Following the discovery of compelling trends, patterns, or associations, it is often prudent to establish their broader validity in a formal inferential or predictive analysis. If the analysis was already inferential, then implementation advice could include ways to leverage confirmed trends, patterns, or associations to gain a competitive advantage in the domain. Sometimes such advice involves dramatic changes to business practices and is met with resistance due to the inertia of traditional thinking. Stakeholders must be willing to innovate when faced with opportunities to leverage data-driven insights. Finally, for a predictive analysis, data scientists might advise on deployment strategies for the associated model. In common practice, predictive models ingest new data, execute algorithms, and return results in an automated fashion. Such a process involves seamless integration with data sources on the back end and often public, online presentation of the predictions on the front end. Successful data scientists are familiar with this entire back-end-to-front-end process.
In the 5A Method, the step of advising stakeholders on the insights offered from a data analysis is distinct from the culminating step of answering the original research question. This distinction is intentional and important. Since we seldom ask and answer compelling questions on our own, the 5A Method is a collaborative process. The advising step exemplifies this collaboration wherein we share technical wisdom with the stakeholders and the stakeholders fuse that wisdom with their domain expertise. Only after this combined effort can we truly answer the research question.
1.3.5 Answer the Question
Once the data scientist and stakeholder synthesize analytic results and domain expertise, the 5A Method concludes with an answer to the original question(s). Just as we prefer to ask good questions, we also prefer to render good answers. So, what comprises a “good” answer?
The first and potentially most important characteristic of a good answer is that it is aligned with the question. This might appear obvious, but solutions often address symptoms of a problem rather than the source. When crafting a good question, we seek specificity to ensure the subsequent analysis addresses the precise problem at hand. Similarly, a good answer matches the specificity of the question by solving the source problem rather than neighboring symptoms. Poor alignment can also be an issue of scope. If analysis and collaboration result in a limited solution that cannot match the broader scope of the original problem, then previous steps of the 5A Method must be revisited. Perhaps the original question was too ambitious, more data is required, or a different analysis approach is necessary. Regardless, the ultimate answer must be aligned with the question that initiated the problem solving process.
A good answer is also actionable within the constraints of the stakeholder’s domain. In practice, most questions must be answered to meet related deadlines. If an answer is required in a month, then a proposed solution that cannot be delivered for a year is of limited value. In order to act on a timely answer, stakeholders must have access to the technology and expertise to deploy the solution. Frequently, data science solutions are deployed by automating algorithms on cloud-based machines that draw from remote databases and deliver to web-based interfaces. A data scientist might provide the expertise to construct such a pipeline, but the stakeholder must invest in the technology.
Another characteristic of a good answer is that it is reproducible. An important method for ensuring answers are reproducible is proper documentation and publication. Whether internal or external, documenting the results of the entire 5A Method provides a valuable reference when replication is required. In most cases, solutions will be of limited value if they cannot be repeated by others. In order to avoid a return of the original problem, replicable solutions must also be sustainable. For complex answers, like the pipeline described in the previous paragraph, the question will likely reappear if the solution is not sustainable.
Answers to good questions often prompt more good questions. Thus, good answers are extendable to future research. Seldom does the answer to a single question close the book on a particular area of research. Instead, data scientists should use the provision of one answer as an opportunity to recommend follow-on work that strengthens or broadens their solution. Such a recommendation might include identifying shortfalls in the available data or deployment infrastructure. Stakeholders must then decide whether to invest the time, money, and effort to ask and answer additional good questions.
The data-driven problem solving process must begin and end with ethical concerns in mind. Unfortunately, answers to good questions sometimes unintentionally marginalize, publicize, or otherwise harm individuals or groups. Ethical data scientists seek to identify and remedy models and algorithms that exhibit such prejudiced behavior. If the harmful aspects of the solution cannot be eliminated, then the solution should not be deployed.
The 5A Method provides an organized problem solving process for data scientists to apply their knowledge, skills, and abilities to answer good research questions. While we demonstrate the entire method in multiple case studies, a significant portion of this text focuses on the technical details of the acquire and analyze steps. In the next chapter, we thoroughly introduce the acquisition process by examining how data is structured, imported, and wrangled.
1.4 Resources: Data Science
Below are the learning objectives associated with this chapter, as well as exercises to evaluate student understanding. Afterward, we provide recommendations for further study in the discipline of data science.
1.4.1 Learning Objectives
After completing this chapter, students should be able to:
- Highlight the key historical trailblazers and milestones in the discipline of data science.
- Describe the fundamental academic knowledge required of professional data scientists.
- Detail the current technical and interpersonal skills needed to conduct data science.
- Suggest how the abilities of data scientists can be applied in multiple, diverse domains.
- Define the key characteristics and ethical considerations of good research questions.
- Explain important attributes and collection methods for high-quality data and its sources.
- Distinguish between goals and methods for exploratory, inferential, and predictive analyses.
- Specify technical and non-technical considerations when advising stakeholders on results.
- Define the key characteristics and ethical considerations for good research answers.
1.4.3 Further Study
This chapter provides a brief introduction to the history, proficiency, and process of data science. However, there exist a variety of other resources addressing the discipline and practice of data science. We suggest a few of them below.
One of the more comprehensive textbooks currently available is Modern Data Science with R by Benjamin S. Baumer and others (Baumer et al. 2017). This text explores many different data science methods and tools that support problem solving processes like the 5A Method. Concepts are applied in many different domains, including sports, healthcare, and transportation. In addition to R, the authors also provide an introduction to SQL for data management.
Another excellent book that leverages R as the technical tool is An Introduction to Data Science by Jeffrey S. Saltz and Jeffrey M. Stanton (Saltz and Stanton 2017). Though less comprehensive than the Baumer et al. text, the authors present the essentials in a very approachable manner with plenty of examples and anecdotes. The associated code is embedded with the written explanation of its use throughout the book, which eases consumption for beginners.
For more of a focus specifically on coding skills, Hadley Wickham and Garrett Grolemund’s book R for Data Science is a great option (Wickham and Grolemund 2016). In addition to introductory coding in basic R, the authors demonstrate how to acquire, organize, visualize, and communicate data. There is also an important focus on repeatable workflows and documentation using RMarkdown.
As a final recommendation, The Art of Data Science by Roger D. Peng and Elizabeth Matsui is a fantastic introduction to the data-driven problem solving process (Peng and Matsui 2015). Similar to the 5A Method, the authors present a structure for solving exploratory, inferential, and predictive problems using data. There is a unique focus on formulating questions and communicating answers that all data scientists should find valuable.