1.4 Using software in research

Many people use spreadsheets (such as Microsoft Excel) for analysis of data in research.

Spreadsheets must be used with great care: many expensive and dangerous errors have been traced to spreadsheets (AlTarawneh and Thorne 2017), including problems with reporting data during the COVID-19 pandemic in 2020.

Problems may emerge for many different reasons:

Spreadsheets can automatically change entered data, even when this is not appropriate (for example, reformatting an entry as a date if the spreadsheet decides the entry looks like a date). This has led to widespread gene name errors in the scientific literature (Ziemann, Eren, and El-Osta 2016).
Spreadsheets may include formulas with errors (R. R. Panko and Sprague Jr 1998) that are very difficult to locate, and hence to fix (Galletta et al. 1996; R. Panko 2016; London and Slagter 2021).
Spreadsheets do not leave a clear record of how the data have been prepared or analysed; for example, formulas can be very difficult to read and understand. Keeping a record of the analysis, the preparation of variables, and other operations on the data is part of what is called reproducible research (Simons and Holmes 2019). Among other advantages, reproducibility means that the results can be checked by the researchers and by others.
Excel has bugs (Keeling and Pavur 2004; Mélard 2014), even in very basic operations (Berger 2007; Hargreaves and McWilliams 2010). Attempts to fix these bugs have sometimes made them worse (McCullough and Wilson 2002).
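
As an illustration of the first problem, the sketch below checks a column of gene symbols for entries that look like dates, such as "2-Sep" or "1-Mar", which is how a spreadsheet may display symbols like SEPT2 or MARCH1 after converting them. This is only a minimal sketch: it is written in Python purely for illustration, and the data, the pattern, and the function name find_date_like are assumptions made for this example, not part of any package.

```python
import re

# Assumed pattern for date-like entries such as "2-Sep" or "1-Mar".
# Real spreadsheets may display converted dates in other forms too.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def find_date_like(values):
    """Return (position, value) pairs for entries that look like dates."""
    return [(i, v) for i, v in enumerate(values) if DATE_LIKE.match(v.strip())]


# Hypothetical column of gene symbols exported from a spreadsheet.
gene_column = ["TP53", "2-Sep", "BRCA1", "1-Mar", "DEC1"]

suspect = find_date_like(gene_column)
print(suspect)  # [(1, '2-Sep'), (3, '1-Mar')]
```

A check like this only flags suspicious entries; it cannot recover the original symbol, so the safest approach is to prevent the conversion when the data are first entered or imported.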

Spreadsheets can be used for research and analysis... but you must be very careful!

Many of the problems with using spreadsheets are due to human error, but spreadsheets make these errors hard to find. Some errors emerge because Excel is being used for purposes it was not really designed for, such as scientific analysis.

In this subject, we will usually show output from the statistical software package called R (R Core Team 2018), or from other popular statistical software packages such as jamovi (The jamovi Project, n.d.) and SPSS (IBM Corp 2016).

Statistical software packages such as R, jamovi, and SPSS can help us to avoid such problems:

They are designed for large data sets.
They allow for reproducible research.
They allow for a high level of precision in formatting and data visualisation.
With a little programming, these packages can be extremely powerful: one line of code can apply a change to an entire data set, or to part of a data set, in an instant.
They have been designed specifically for the types of statistics and data analysis we will be learning about in this subject.
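
To give a feel for what "one line of code" can mean, here is a minimal sketch. It uses Python rather than R, jamovi, or SPSS, purely for illustration, and the data set (a short list of heights) is hypothetical. Because each step is written down as code, the script itself is also a record of how the data were prepared.

```python
# Hypothetical data set: heights recorded in centimetres.
heights_cm = [172.0, 165.5, 180.2, 158.9]

# One line converts every value in the data set from centimetres to metres.
heights_m = [h / 100 for h in heights_cm]

# One line selects part of the data set: heights above 1.70 metres.
tall = [h for h in heights_m if h > 1.70]

print(heights_m)
print(tall)
```

The same two lines would work unchanged on a data set with four values or four million; nothing needs to be copied down a column by hand.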

References

AlTarawneh, Ghada, and Simon Thorne. 2017. “A Pilot Study Exploring Spreadsheet Risk in Scientific Research.” arXiv Preprint arXiv:1703.09785.

Berger, Roger L. 2007. “Nonstandard Operator Precedence in Excel.” Computational Statistics & Data Analysis 51 (6): 2788–91.

Galletta, Dennis F., Kathleen S. Hartzel, Susan E. Johnson, Jimmie L. Joseph, and Sandeep Rustagi. 1996. “Spreadsheet Presentation and Error Detection: An Experimental Study.” Journal of Management Information Systems 13 (3): 45–63.

Hargreaves, Bruce R., and Thomas P. McWilliams. 2010. “Polynomial Trendline Function Flaws in Microsoft Excel.” Computational Statistics & Data Analysis 54 (4): 1190–96.

IBM Corp. 2016. IBM SPSS Statistics for Windows, Version 24.0. Armonk, NY: IBM Corp.

Keeling, Kellie B., and Robert J. Pavur. 2004. “Numerical Accuracy Issues in Using Excel for Simulation Studies.” In Proceedings of the 2004 Winter Simulation Conference, 2004, 2:1513–18. IEEE.

London, R. E., and H. A. Slagter. 2021. “Statement of Retraction: Effects of Transcranial Direct Current Stimulation over Left Dorsolateral pFC on the Attentional Blink Depend on Individual Baseline Performance.” Journal of Cognitive Neuroscience, 1. https://doi.org/10.1162/jocn_x_01680.

McCullough, B. D., and Berry Wilson. 2002. “On the Accuracy of Statistical Procedures in Microsoft Excel 2000 and Excel XP.” Computational Statistics & Data Analysis 40 (4): 713–21.

Mélard, Guy. 2014. “On the Accuracy of Statistical Procedures in Microsoft Excel 2010.” Computational Statistics 29 (5): 1095–1128.

Panko, Ray. 2016. “What We Don’t Know about Spreadsheet Errors Today: The Facts, Why We Don’t Believe Them, and What We Need to Do.” arXiv Preprint arXiv:1602.02601.

Panko, Raymond R., and Ralph H. Sprague Jr. 1998. “Hitting the Wall: Errors in Developing and Code Inspecting a ‘Simple’ Spreadsheet Model.” Decision Support Systems 22 (4): 337–53.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Simons, Janet E., and Daniel T. Holmes. 2019. “Reproducible Research and Reports with R.” Journal of Applied Laboratory Medicine 4 (3): 471–73.

The jamovi Project. n.d. jamovi (Version 1.0) [Computer Software]. https://www.jamovi.org.

Ziemann, Mark, Yotam Eren, and Assam El-Osta. 2016. “Gene Name Errors Are Widespread in the Scientific Literature.” Genome Biology 17 (1): 1–3.