Chapter 6 Test

Put your practice to the test. Here are some excellent cheatsheets to consider for biostats in R, and this is a useful read on good enough practices in scientific computing (Wilson et al. 2017). The goal here was not to become data scientists or biostatisticians but to encourage you to develop and refine your critical thinking skills in the context of evidence, data, and statistical reasoning.

Learning outcomes

  1. Complete fundamental exploratory data analysis on a representative dataset culminating in a fair and reasonable statistical model.
  2. Interpret a statistical analysis that you completed with a focus on relevance, significance, and logic.
  3. Communicate biostatistical work clearly and effectively to others.

Critical thinking

At times in many disciplines of biological research, we need to be open to experimentation that is fair, transparent, and replicable but that is implemented based on available data. This experimentation can also happen after we have the data. It can be an exercise in fitting the most appropriate or parsimonious models (Cottingham, Lennon, and Brown 2005), applying experimental design principles (Ruxton and Colgrave 2018), and of course invoking critical thinking. This is not to say we are going on fishing expeditions, but that at times we have only certain data to describe a system and are tasked or obligated to use the best evidence we have to infer relevant processes. For instance, we might compile field data, draw on online resources or data products for climate or landscapes, or reuse data on traits or genetics, and link these different evidence streams together to explore a question. Critical thinking in statistics is an important framework that we leverage not only to do the statistics and fit models but also to ensure that we are able to ask the questions we need to. In summary, we have data and need an answer, but we have to use open and transparent thinking with statistics to find the best question.

Workflow for hackathons

A hackathon in data science and the computational sciences is a fixed-duration, collaborative endeavor to develop a solution for a focused challenge. The goal is a reasonably functional first approximation that is viable and/or describes the key processes for a system or dataset. The term is a blend of hacking and marathon: a race or sprint towards a clear endpoint in development. In the data and statistical sciences, we work intensively to deepen our understanding of the evidence, ideally with key data visualizations and a model that predicts or describes key outcomes. The advantage of setting a reasonably short but fair duration is that it reduces the likelihood that tangents are unduly developed. It also hones your coding, research skills, and statistical reasoning through practiced application of statistics to new data to tell a balanced, reproducible, and transparent story.

  1. Get the data.

  2. Read the metadata (and if you get stuck, look up the potential meaning of opaque variable names in online resources or related/similar datasets). Nomenclature and annotation shorthand in a field can be highly specific at times.

  3. Consider and ensure that you understand the individual vectors or variables (inspect the dataframe).

  4. Develop an informal or formal data map - picture a Sankey diagram (conceptual semantic visualization of relationships between variables).

  5. Dig into online resources or literature to ideate on important questions, novel gaps, key theories, or even basic fundamental science that supports these data.

  6. Decide on focus and key purpose and begin to plan out an analytical workflow.

  7. Determine if you have sufficient data, i.e., consider whether you need to augment these data. Augmented data can come from novel data sources or from reclassification of existing data.

  8. Begin your exploration of the dimensions and scope of the variables you are interested in using (skimr, min, max, fitdistrplus, or str-like functions or tools in R); see the first sketch after this list.

  9. Now, adopt the r4ds workflow (e.g., Fig 1.1), use plots such as histograms or boxplots to understand the depth and range of the data, use basic tests such as the t-test (t.test) to explore differences, and prepare for your final statistical model and keystone plot to show the differences you tested; see the second sketch after this list.

  10. Code and test your main model to address the overarching goal. Decide on and refine the most representative data visualization that illuminates the salient process or patterns examined.
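
To make step 8 concrete, here is a minimal sketch in R. It assumes a tibble named df with a numeric focal variable y; df and y are placeholders for your own data, and skimr and fitdistrplus are the packages named in the step.

library(tidyverse)
library(skimr)         # one-line summaries of every column
library(fitdistrplus)  # candidate distributions for numeric variables

# df and y are placeholders for your own dataframe and focal variable
skim(df)                           # ranges, missingness, and rough distributions
descdist(df$y)                     # Cullen and Frey graph to suggest candidate distributions
fit_norm <- fitdist(df$y, "norm")  # fit one candidate distribution
plot(fit_norm)                     # visual check of that fit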
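
Step 9 can follow the same thread. This sketch assumes df also contains a two-level grouping variable named group (again a placeholder) and uses only ggplot2 and base R functions.

# group is a placeholder for a two-level factor in df
ggplot(df, aes(x = y)) +
  geom_histogram(bins = 30)        # depth and range of the response

ggplot(df, aes(x = group, y = y)) +
  geom_boxplot()                   # differences between groups at a glance

t.test(y ~ group, data = df)       # simple two-sample comparison before the final model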

If you favor this method of collaborative work in your lab or team, here are ten simple rules for running a successful BioHackathon.

Test adventure time

York University, Keele Campus is a small urban forest mixed with grasslands and open space. The master gardeners measured nearly 7000 trees over the course of two years. These data were recently compiled and published. There are many fascinating and compelling questions to explore that can support evidence-informed decisions and valuation estimates for this place ecologically, environmentally, and from a trait or species-level perspective. This challenge, as a summative test, is thus relatively more open ended. Given these data, collected and now published, what can we do to enhance our biological and social understanding and appreciation of a university campus that supports people, other animals, and plants? Explore the data, define a relevant challenge or set of questions that would benefit the stakeholders or local community or inform our understanding of a biological theory, and demonstrate your mastery of critical thinking in statistics. Submit your work to turnitin.com as a PDF including the code, annotation, rationale, interpretation, and outputs from the viz, EDA, and model(s) that supported your thinking.

Metadata for test data

attribute description units
FID FID refers to a unique identifier of an object within a table in the ArcGIS data none
OBJECTID unique instance of measurement counting rows none
Date month, day, year format none
Block block that York uses in some maps to organize campus into grid none
Street_or_ road names none
Building_C building code none
Tree_Tag_N number on the metal tag affixed to each tree on campus none
Species_Co species code acronyms used to abbreviate species names none
Common_Nam the common name used for the species, not the Latin binomial none
Genus genus, a taxonomic rank that may contain one (monotypic) or more species none
Species the most basic category in the system of taxonomy none
DBH diameter at breast height measured at approximately 1.3 m (4.3 ft) cm
Number_of_ number of main branches count
Percentage percentage of canopy cover total out of 100
Crown_Widt width of the crown feet
Total_Heig the total height of the tree to the actual top of the canopy feet
Latitude decimal degrees, a notation for expressing latitude and longitude geographic coordinates as decimal fractions of a degree decimal degree
Longitude decimal degrees, a notation for expressing latitude and longitude geographic coordinates as decimal fractions of a degree decimal degree
Height_to_ height to first branch of the main trunk of the tree feet
Unbalanced the number of times a tree splits or branches out from main trunk number of splits
Reduced_Cr a measure of reduced crown treatment by the foresters on campus, number of branches removed count
Weak_Yello an indirect measure of tree health, Likert score from 0 to 3 with 0 being no yellowing and 3 being significant yellowing and evidence of weak branches ordinal data, score
Defoliatio an indirect measure of tree health, Likert score from 0 to 3 with 0 being no evidence of leaf loss and 3 being significant loss ordinal data, score
Dead_Broke number of dead or broken branches count
Poor_Branc an indirect measure of tree health, Likert score from 0 to 3 with 0 being no evidence of poor branches and 3 being many ordinal data, score
Lean lean of the trunk (a tree that leans because it has grown towards the sun often has a curving trunk), score from 0 being upright at 90 degrees to the ground and 3 being significant lean at 45 degrees ordinal data, score
Trunk_Scar number of tree scars on main trunk of tree count

Test data

library(tidyverse)
# read the campus tree survey directly from the KNB data repository
trees <- read_csv(url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/urn%3Auuid%3A1e738f4d-f491-4b40-b55a-e8395c5349ce"))
trees
## # A tibble: 6,951 × 27
##      FID OBJECTID Date   Block Street_or_  Build…¹ Tree_…² Speci…³ Commo…⁴ Genus
##    <dbl>    <dbl> <chr>  <chr> <chr>         <dbl>   <dbl> <chr>   <chr>   <chr>
##  1     0        1 9/7/12 A     Stedman Le…      22       1 lochon  Honey … Gled…
##  2     1        2 9/7/12 A     Stedman Le…      22       2 lochon  Honey … Gled…
##  3     2        3 9/7/12 A     Stedman Le…      22       3 lochon  Honey … Gled…
##  4     3        4 9/7/12 A     Stedman Le…      22       4 lochon  Honey … Gled…
##  5     4        5 9/7/12 A     Stedman Le…      22       5 lochon  Honey … Gled…
##  6     5        6 9/7/12 A     Stedman Le…      22       6 lochon  Honey … Gled…
##  7     6        7 9/7/12 A     Stedman Le…      22       7 lochon  Honey … Gled…
##  8     7        8 9/7/12 A     Stedman Le…      22       8 lochon  Honey … Gled…
##  9     8        9 9/7/12 A     Stedman Le…      22       9 lochon  Honey … Gled…
## 10     9       10 9/7/12 A     Stedman Le…      22      10 lochon  Honey … Gled…
## # … with 6,941 more rows, 17 more variables: Species <chr>, DBH <dbl>,
## #   Number_of_ <dbl>, Percentage <dbl>, Crown_Widt <dbl>, Total_Heig <dbl>,
## #   Latitude <dbl>, Longitude <dbl>, Height_to_ <dbl>, Unbalanced <dbl>,
## #   Reduced_Cr <dbl>, Weak_Yello <dbl>, Defoliatio <dbl>, Dead_Broke <dbl>,
## #   Poor_Branc <dbl>, Lean <dbl>, Trunk_Scar <dbl>, and abbreviated variable
## #   names ¹​Building_C, ²​Tree_Tag_N, ³​Species_Co, ⁴​Common_Nam
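
As a hedged example of first steps with these data, the sketch below uses only column names shown in the metadata above (Genus and DBH) and the tidyverse already loaded; treat it as one possible starting point, not the required analysis.

# quick first pass at the campus tree data
glimpse(trees)                       # confirm column types match the metadata

trees %>%
  count(Genus, sort = TRUE)          # which genera dominate the campus?

ggplot(trees, aes(x = DBH)) +
  geom_histogram(bins = 40) +        # distribution of trunk diameters
  labs(x = "DBH (cm)", y = "Number of trees")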

Clean code

Effective coding so that others - not just machines - can read and understand it is an art and a science. Intuitive object and function naming really helps. Functions to streamline repeated operations and annotation with headers to explain steps are also useful. This approach to literate coding for humans is sometimes called ‘clean code’. Here is a short paper with some tips and tricks relevant to your work when you need to share it (Filazzola and Lortie 2022).
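
As a small, hypothetical illustration of these ideas with the tree data, the function below pairs an intuitive name with a header comment and uses tidy evaluation to avoid repeating the same grouped summary; summarise_tree_size is an invented name, not part of any package.

# summarise_tree_size(): mean, standard deviation, and sample size of a chosen
# numeric column by genus (assumes a Genus column, as in the trees data)
summarise_tree_size <- function(data, size_variable) {
  data %>%
    group_by(Genus) %>%
    summarise(
      mean_size = mean({{ size_variable }}, na.rm = TRUE),
      sd_size   = sd({{ size_variable }}, na.rm = TRUE),
      n_trees   = n(),
      .groups   = "drop"
    )
}

# one call per variable instead of repeated copy-and-paste blocks
summarise_tree_size(trees, DBH)
summarise_tree_size(trees, Total_Heig)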

Rubric

Remember, we are working together to hone our statistical reasoning skills.

The goal is to tell a story with these data.

It does not need to be super complex, but it does need to showcase your understanding of key principles, such as a GLM with appropriate data visualizations - but any reasonable test that MATCHES the story you tell is great.

Show your work of exploring the data in plots and basic stats, develop your idea, test it, and then have a final key plot showing the relationship you tested.
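
For instance, here is one hedged sketch of a final model, assuming your story is about whether trunk diameter predicts total height; the variable names follow the metadata above, and a Gamma GLM with a log link is only one reasonable choice (it assumes strictly positive heights).

# does trunk diameter (cm) predict total height (feet)?
height_model <- glm(Total_Heig ~ DBH, data = trees,
                    family = Gamma(link = "log"))  # assumes heights are strictly positive
summary(height_model)
plot(height_model, which = 1)                      # quick residuals-versus-fitted check of fit

# keystone plot that matches the model
ggplot(trees, aes(x = DBH, y = Total_Heig)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "glm",
              method.args = list(family = Gamma(link = "log"))) +
  labs(x = "DBH (cm)", y = "Total height (feet)")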

item concept description value
1 effective data viz are there figures exploring the data and is the final main figure publishable in terms of legends, labels, axes, appropriateness 10
2 effective EDA is the distribution of and relationship between variables explored 5
3 final data model(s) do the final model(s) address the purpose of the study, are they appropriate, and are the assumptions, including model fit, explored 5
4 annotation and reporting is there annotation in the r-code chunks, reporting in the markdown, and an interpretation even briefly of what you found and why 5
5 total sum of above 25