Chapter 2 Getting Started with Spark

2.1 What About Stata and SAS?

Unfortunately, Spark supports neither Stata nor SAS. The framework developed by Research Programming uses PySpark and SparkR (the Spark APIs for Python and R, respectively), so researchers can perform Spark operations in either language. The tutorials linked below, written by Research Programming, walk through tasks that social scientists typically perform, carried out with Spark:

           PySpark (Python): https://github.com/UrbanInstitute/pyspark-tutorials

           SparkR (R): https://github.com/UrbanInstitute/sparkr-tutorials

Note: The tutorials linked above assume some familiarity with Python and R, respectively. Please reach out to Research Programming if you’re interested in working with a massive data set but are not familiar with Python or R.

2.2 Choosing Between SparkR and PySpark

If you strongly prefer either R or Python, let that preference guide your decision. SparkR and PySpark do not differ dramatically in speed or available functionality:

  • Speed: Each of the Spark language APIs (R, Python, Java and Scala) works with the Spark DataFrame data type, which is conceptually equivalent to a table in a relational database, a data.frame in R, or a pandas DataFrame in Python. Spark DataFrames rely on Scala and Java (the languages in which core Spark is written) to optimize operations, so execution speeds for standard data processing tasks are functionally the same regardless of which of these languages the tasks are written in.
  • Available Functionality: In Spark 1.6, PySpark exposes more of the core Spark functionality than SparkR does. Spark 2.0, however, significantly narrows that gap.

That being said, there are some notable differences that might affect your decision:

  • Code Syntax: SparkR feels a bit more like ordinary R than PySpark feels like ordinary Python, although both have peculiarities. This is perhaps because R is natively built around data.frames, which are analogous to Spark DataFrames. For this reason, you might prefer SparkR if you’re equally proficient in R and Python. Both of the tutorials linked above are oriented around the Spark DataFrame data type.
  • Open Source Development: Python appears to have the advantage here: more of the packages that extend Spark’s functionality are written in, or accessible through, PySpark than SparkR. This is likely a product of the current Spark community, which includes more data scientists and engineers than traditional statisticians and social scientists.