Chapter 1 Introduction

Most social scientists perform statistical analysis on personal computers or on servers dedicated to statistical processing. While these environments do provide effective research platforms for the majority of projects, their inherent constraints on data size and processing speed limit the ability of social scientists to perform empirical analysis on large data sets.

We begin this guide by discussing alternative frameworks for storing data and performing statistical analysis, when social scientists may benefit from these approaches, and the solution developed by Research Programming to address the massive-data needs of policy researchers at The Urban Institute (UI).

We subsequently provide an overview of the framework that supports the Research Programming solution, along with best practices for using it. While a detailed understanding of the mechanisms behind the framework is not necessary to leverage the UI solution described below, a basic understanding will help researchers perform their analyses more efficiently.

1.1 Outgrowing the server: When should social scientists consider distributed computing?

When you work with data on a personal computer or a server, the system pulls data from physical storage (the hard drive) and either loads it into memory while you work with it (e.g., Stata) or iteratively reads it into and out of memory (e.g., SAS). Whether you are working from your personal computer's hard drive and memory or using a server, the size of the data you can analyze is constrained by that system's capacity. For the majority of what is traditionally considered social science research, this framework is effective and convenient: as long as the data are smaller than the available memory and the software is capable of interpreting them, researchers can continue to rely on traditional frameworks for statistical analysis.

When data size outgrows this capacity, however, relying on this framework results in inefficiencies and prohibitive costs for most organizations. In some cases it is possible to work with data larger than system memory by treating physical storage as "virtual" memory, but this easily results in protracted processing times and is inadvisable. An alternative approach is to acquire a larger server; this, however, is also inappropriate for most researchers working with massive data. While a larger server would allow researchers to work with more data, the size of the data that can be stored and analyzed on that server is still finite. Consequently, researchers face a significant increase in computing cost by housing and maintaining a server that they may not continuously use, that may not be large enough to store some data sets, or that may not have enough memory to complete certain processes. This is particularly true if multiple researchers who work with massive data share a server.

Subsequent sections of this manual describe the solution developed by UI Research Programming to address the needs of policy researchers who work with massive data. The system described below does not require UI to own and maintain a server capable of handling massive data. Instead, our approach leverages scalable cloud computing hosted by Amazon Web Services (AWS) and the distributed computing platform Apache Spark (Spark). Because the framework scales with the elasticity of researcher demand for computational capacity, it simultaneously provides researchers with greater flexibility and reduces organizational costs.

When should researchers make the jump from the server to distributed computing? The answer depends on the size of the researcher's data, the size of the server the researcher currently relies on, and the number of researchers sharing that server. Researchers should use distributed computing if their data are larger than the server memory and storage available to them (after sharing server resources with other researchers).

1.2 Making distributed computing straightforward & cost-effective

The solution developed by Research Programming to address massive data needs relies on cloud-based, distributed computing rather than an on-site server. The Spark distributed computing platform stores and processes data across a cluster of machines, distributing both the storage and the processing tasks among them.

The cloud computing services hosted by AWS allow researchers to "spin up" and shut down groupings of virtual servers, called Elastic Compute Cloud (EC2) instances, as needed for analysis, while permanently storing data in S3, the AWS data storage infrastructure. The instance types available to researchers are defined by combinations of memory, storage, and networking capacity,1 and researchers are encouraged to consult with Research Programming to identify which instance type is most appropriate for their project.

This cloud-based, distributed framework effectively removes any computing constraint on the size of data that researchers can store and analyze. Researchers can rent any number of machines from AWS and then use Spark, which coordinates tasks between machines, to implement their data analysis and manipulation. Data are stored in S3 and then distributed in memory across the cluster machines when tasks need to be performed. This approach scales to a massive degree: the largest known cluster to have utilized Spark comprises 8,000 machines, and Spark has been shown to perform well processing up to several petabytes (1 petabyte = 1 million gigabytes) of data.2

Spark is installed during AWS cluster configuration, so once a cluster finishes spinning up, researchers can immediately begin working with their data in the distributed environment, accessing Spark through a common programming language of their choice. We describe and compare the supported programming languages in subsequent sections of this manual.

  1. A complete list of EC2 instance types can be found at