Chapter 6 Test
Put your practice to the test. Here are some excellent cheatsheets to consider for biostats in R, and this is a useful read on good enough practices in scientific computing (Wilson et al. 2017). The goal here was not to become a data scientist or biostatistician but to encourage you to develop and refine your critical thinking skills in the context of evidence, data, and statistical reasoning.
Learning outcomes
- Complete fundamental exploratory data analysis on a representative dataset culminating in a fair and reasonable statistical model.
- Interpret the statistical analyses that you completed with a focus on relevance, significance, and logic.
- Communicate biostatistical work clearly and effectively to others.
Critical thinking
At times in many disciplines of biological research, we need to be open to experimentation that is fair, transparent, and replicable but that is implemented based on available data. This experimentation can also happen after we have data. It can be an exercise in fitting the most appropriate or parsimonious models (Cottingham, Lennon, and Brown 2005), applying experimental design principles (Ruxton and Colgrave 2018), and of course invoking critical thinking. This is not to say we are going on fishing expeditions, but that at times we have only certain data to describe a system and are tasked or obligated to use the best possible evidence we have to infer relevant processes. For instance, we might compile field data, draw on online resources or data products for climate or landscapes, or reuse data on traits or genetics, and link these different evidence streams together to explore a question. Critical thinking in statistics is an important framework that we leverage not only to do the statistics and fit models but also to ensure that we are able to ask the questions we need to. In summary, we have data and need an answer, but we have to use open and transparent statistical thinking to find the best question.
Workflow for hackathons
A hackathon in data science and the computational sciences is a fixed-duration, collaborative endeavor to develop a solution for a focused challenge. The goal is a reasonably functional first approximation that is viable and/or describes the key processes for a system or dataset. The term blends 'hacking' and 'marathon': a race or sprint towards a clear endpoint in development. In the data and statistical sciences, we work intensively to deepen our understanding of evidence, ideally with key data visualizations and a model that predicts or describes key outcomes. The advantage of setting a reasonably short but fair duration is that it reduces the likelihood that tangents are unduly developed. It also hones your coding, research skills, and statistical reasoning through practiced application of your mental model of statistics to new data to tell a balanced, reproducible, and transparent story.
1. Get the data.
2. Read the metadata (if you get stuck, look up the potential meaning of opaque variable names in online resources or in related/similar datasets). Nomenclature and annotation shorthand in a field can be highly specific.
3. Consider and ensure that you understand the individual vectors or variables (inspect the dataframe).
4. Develop an informal or formal data map, e.g., picture a Sankey diagram (a conceptual, semantic visualization of the relationships between variables).
5. Dig into online resources or the literature to ideate on important questions, novel gaps, key theories, or even the basic fundamental science that supports these data.
6. Decide on a focus and key purpose and begin to plan out an analytical workflow.
7. Determine whether you have sufficient data, i.e., consider if you need to augment these data. Augmented data can come from novel sources or from reclassification of existing data.
8. Begin exploring the dimensions and scope of the variables you are interested in using (skimr, min, max, fitdistrplus, or str-like functions and tools in R).
9. Now adopt the r4ds workflow (such as Fig 1.1): use plots such as histograms or boxplots to understand the depth and range of the data, use basic tests such as t.test to explore differences, and prepare for your final statistical model and keystone plot to show the differences you tested.
10. Code and test your main model to address the overarching goal. Decide on and revise the best/most representative data visualization that illuminates the salient processes or patterns examined. A minimal sketch of steps 8-10 follows this list.
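To make steps 8-10 concrete, here is a minimal, hedged sketch in R that uses the built-in iris data as a stand-in for your own dataset; the particular variables, test, and model are illustrative choices, not a prescription, and the skimr call is optional.

```r
# A minimal sketch of steps 8-10 with the built-in iris data as a stand-in;
# swap in your own dataset and variables.
library(tidyverse)
# library(skimr)  # optional: skimr::skim(iris) gives a compact overview

# Step 8: scope and distributions of the variables of interest
glimpse(iris)
summary(iris$Sepal.Length)

# Step 9: quick plots and a simple test of differences
ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()

two_species <- iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>%
  droplevels()
t.test(Sepal.Length ~ Species, data = two_species)

# Step 10: a candidate final model and a basic look at its fit
m1 <- glm(Sepal.Length ~ Species + Petal.Length, data = iris)
summary(m1)
plot(m1)  # residual diagnostics via plot.lm
```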
If you favor this method of collaborative work in your lab or team, here are ten simple rules to run a successful BioHackathon.
Test adventure time
York University's Keele Campus is a small urban forest mixed with grasslands and open space. The master gardeners measured nearly 7,000 trees over the course of two years, and these data were recently compiled and published. There are many fascinating and compelling questions to explore that can support evidence-informed decisions and valuation estimates for this place ecologically, environmentally, and from a trait or species-level perspective. This challenge, as a summative test, is thus relatively open ended. Given these data, collected and now published, what can we do to enhance our biological and social understanding and appreciation of a university campus that supports people, other animals, and plants? Explore the data, define a relevant challenge or set of questions that would benefit the stakeholders or local community or inform our understanding of a biological theory, and demonstrate your mastery of critical thinking in statistics. Submit your work to turnitin.com as a PDF including the code, annotation, rationale, interpretation, and outputs from the viz, EDA, and model(s) that supported your thinking.
Metadata for test data
attribute | description | units |
---|---|---|
FID | a unique identifier of an object within a table in ArcGIS data | none |
OBJECTID | unique instance of measurement counting rows | none |
Date | month, day, year format | none |
Block | block that York uses in some maps to organize campus into grid | none |
Street_or_ | road names | none |
Building_C | building code | none |
Tree_Tag_N | number on the metal tag affixed to each tree on campus | none |
Species_Co | species code acronyms used to abbreviate species names | none |
Common_Nam | the common (vernacular) name used for a species, not the Latin binomial | none |
Genus | genus is a taxonomic unit that may contain one species (monotypic) | none |
Species | most basic category in the system of taxonomy | none |
DBH | diameter at breast height measured at approximately 1.3 m (4.3 ft) | cm |
Number_of_ | number of main branches | count |
Percentage | percentage of canopy cover | total out of 100 |
Crown_Widt | width of the crown | feet |
Total_Heig | the total height of the tree to the actual top of the canopy | feet |
Latitude | latitude in decimal degrees, a notation expressing geographic coordinates as decimal fractions of a degree | decimal degree |
Longitude | longitude in decimal degrees, a notation expressing geographic coordinates as decimal fractions of a degree | decimal degree |
Height_to_ | height to first branch of the main trunk of the tree | feet |
Unbalanced | the number of times a tree splits or branches out from main trunk | number of splits |
Reduced_Cr | a measure of reduced crown treatment by the foresters on campus, number of branches removed | count |
Weak_Yello | an indirect measure of tree health, Likert score from 0 to 3 with 0 being no yellowing and 3 being significant yellowing and evidence of weak branches | ordinal data, score |
Defoliatio | an indirect measure of tree health, Likert score from 0 to 3 with 0 being no evidence of leaf loss and 3 being significant loss | ordinal data, score |
Dead_Broke | number of dead or broken branches | count |
Poor_Branc | an indirect measure of tree health, Likert score from 0 to 3 with 0 being no evidence of poor branching and 3 being many poor branches | ordinal data, score |
Lean | a tree that leans, often because it has grown towards the sun, has a curving trunk; score from 0 being upright at 90 degrees to the ground to 3 being significant lean at 45 degrees | ordinal data, score |
Trunk_Scar | number of tree scars on main trunk of tree | count |
Test data
library(tidyverse)
trees <- read_csv(url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/urn%3Auuid%3A1e738f4d-f491-4b40-b55a-e8395c5349ce"))
trees
## # A tibble: 6,951 × 27
## FID OBJECTID Date Block Street_or_ Build…¹ Tree_…² Speci…³ Commo…⁴ Genus
## <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 0 1 9/7/12 A Stedman Le… 22 1 lochon Honey … Gled…
## 2 1 2 9/7/12 A Stedman Le… 22 2 lochon Honey … Gled…
## 3 2 3 9/7/12 A Stedman Le… 22 3 lochon Honey … Gled…
## 4 3 4 9/7/12 A Stedman Le… 22 4 lochon Honey … Gled…
## 5 4 5 9/7/12 A Stedman Le… 22 5 lochon Honey … Gled…
## 6 5 6 9/7/12 A Stedman Le… 22 6 lochon Honey … Gled…
## 7 6 7 9/7/12 A Stedman Le… 22 7 lochon Honey … Gled…
## 8 7 8 9/7/12 A Stedman Le… 22 8 lochon Honey … Gled…
## 9 8 9 9/7/12 A Stedman Le… 22 9 lochon Honey … Gled…
## 10 9 10 9/7/12 A Stedman Le… 22 10 lochon Honey … Gled…
## # … with 6,941 more rows, 17 more variables: Species <chr>, DBH <dbl>,
## # Number_of_ <dbl>, Percentage <dbl>, Crown_Widt <dbl>, Total_Heig <dbl>,
## # Latitude <dbl>, Longitude <dbl>, Height_to_ <dbl>, Unbalanced <dbl>,
## # Reduced_Cr <dbl>, Weak_Yello <dbl>, Defoliatio <dbl>, Dead_Broke <dbl>,
## # Poor_Branc <dbl>, Lean <dbl>, Trunk_Scar <dbl>, and abbreviated variable
## # names ¹Building_C, ²Tree_Tag_N, ³Species_Co, ⁴Common_Nam
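As a next step after loading, a hedged first-pass inspection of the trees tibble might look like the sketch below; the column names follow the metadata table above (e.g., DBH in cm, Total_Heig in feet), but confirm them with names(trees) before relying on this.

```r
# A hedged first-pass inspection of the trees tibble loaded above; assumes
# the tidyverse is attached and the columns match the metadata table.
glimpse(trees)

# distribution of stem diameters (cm)
ggplot(trees, aes(x = DBH)) +
  geom_histogram(bins = 30)

# how many trees per genus
trees %>%
  count(Genus, sort = TRUE)

# height (ft) versus diameter (cm): a classic allometric relationship to explore
ggplot(trees, aes(x = DBH, y = Total_Heig)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm")
```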
Clean code
Effective coding so that others - not just machines - can read and understand it is both an art and a science. Intuitive object and function naming really helps, as do functions to streamline repeated operations and annotation with headers to explain steps. This approach to literate coding for humans is sometimes called 'clean code'. Here is a short paper with some tips and tricks relevant to your work when you need to share it (Filazzola and Lortie 2022).
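As a small, hedged illustration of these habits (the helper function and threshold below are hypothetical examples, not part of the dataset or the cited paper):

```r
# Hypothetical illustration of clean-code habits: descriptive names, a small
# reusable function for a repeated step, and brief comments. Assumes the
# tidyverse and the trees tibble from the Test data section.

# convert a height recorded in feet to metres
feet_to_metres <- function(height_ft) {
  height_ft * 0.3048
}

# flag large-diameter trees with a clearly named threshold (hypothetical value)
large_tree_threshold_cm <- 50

trees_clean <- trees %>%
  mutate(
    height_m   = feet_to_metres(Total_Heig),
    large_tree = DBH >= large_tree_threshold_cm
  )
```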
Rubric
Remember, we are working together to hone our statistical reasoning skills.
The goal is to tell a story with these data.
It does not need to be super complex, but it does need to showcase your skills in understanding key principles, such as a GLM with appropriate data visualizations - any reasonable test that MATCHES the story you tell is great.
Show your work of exploring the data in plots and basic stats, develop your idea, test it, and then finish with a final key plot showing the relationship you tested (a hedged sketch of such a figure follows the rubric table).
item | concept | description | value |
---|---|---|---|
1 | effective data viz | are there figures exploring the data and is the final main figure publishable in terms of legends, labels, axes, appropriateness | 10 |
2 | effective EDA | are the distributions of and relationships between variables explored | 5 |
3 | final data model(s) | does the final model(s) address the purpose of the study, is it appropriate, and are its assumptions, including model fit, explored | 5 |
4 | annotation and reporting | is there annotation in the r-code chunks, reporting in the markdown, and an interpretation even briefly of what you found and why | 5 |
5 | total | sum of above | 25 |
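For rubric item 1, here is a minimal, hedged sketch of what a polished keystone figure could look like; the particular mapping (height versus diameter by genus) is only one illustrative choice, not the expected answer.

```r
# A hedged sketch of a polished final figure: clear axis labels with units,
# an informative legend, and a clean theme. The variable choices are illustrative.
library(tidyverse)

top_genera <- trees %>%
  count(Genus, sort = TRUE) %>%
  slice_head(n = 4) %>%
  pull(Genus)

trees %>%
  filter(Genus %in% top_genera) %>%
  ggplot(aes(x = DBH, y = Total_Heig, color = Genus)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm") +
  labs(
    x = "Diameter at breast height (cm)",
    y = "Total tree height (ft)",
    color = "Genus"
  ) +
  theme_minimal()
```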