Chapter 14 Additional resources

This chapter is not part of the required curriculum for the course HE-930. Below, you can find some selected optional resources related to the course or to predictive analytics and machine learning more broadly.

14.1 External resources

Below are other resources related to the process of conducting machine learning and/or coding on your own:

“Chapter 15 Appendix 1 – Selected additional R code and resources” in https://bookdown.org/anshul302/HE902-MGHIHP-Spring2020/appendix-1-selected-additional-r-code-and-resources.html. – This contains many code examples and tips related to using R and RStudio.
MIT 6.S191 Introduction to Deep Learning. introtodeeplearning.com. – A course about deep learning from MIT.

Possible sources of datasets:

Google Dataset Search. https://datasetsearch.research.google.com/.
Best Public Datasets for Machine Learning and Data Science. January 6 2021. Stacy Stanford, Roberto Iriondo, Pratik Shukla. https://pub.towardsai.net/best-datasets-for-machine-learning-data-science-computer-vision-nlp-ai-c9541058cf4f.
20 Free Life Sciences, Healthcare, and Medical Datasets for Machine Learning. July 16 2021. iMerit. https://imerit.net/blog/20-free-life-sciences-healthcare-and-medical-datasets-for-machine-learning-all-pbm/.

14.1.1 Machine learning for longitudinal data

Package LongituRF: Random Forests for Longitudinal Data, by Louis Capitaine. https://cran.r-project.org/web/packages/LongituRF/LongituRF.pdf.

14.2 Frequently asked questions (FAQs)

Below are some common questions and answers that you might find useful.

14.2.2 Predictive analytics and machine learning in general

I want to make some minor edits to my dataset. For example, during data entry, someone wrote “30 days” instead of “30”, and I want to delete “days”. Is it best to make this change in Excel or R? If there are only one or two such edits, this is likely easier to do in Excel than in R. Another benefit of making the change in Excel is that the master copy of your data then holds the correct version, so you do not have to repeat the change every time you load the data into R. However, if there are many changes to make (like removing the word “days” from 1000 cells), automating them in R could be the most efficient strategy. The instructors can help you write the code to do this, so that you do not have to make tens or hundreds of changes manually in Excel. Depending on the situation, the find and replace feature in Excel could also be simpler and is worth exploring.
Do classification problems with more than 2 categories of the dependent variable require a very different approach than the classification methods we learn in this course? No, the approach is the same, and most of the code from our course should work automatically on a dependent variable with 3 or more categories. However, there will be some differences in how you evaluate the results. With 3 categories, you will need a 3 x 3 confusion matrix⁵² instead of the 2 x 2 one that we typically practice with. In this scenario, you might also have to redefine how certain metrics like sensitivity and specificity are calculated, for example by calculating them separately for each category against all of the others combined.
When calculating confusion matrix metrics automatically in R, why don’t my hand-calculated metrics match those calculated by R? This is a common problem. Often, the reason is that the computer doesn’t know which values in our confusion matrix (CM) are the actual ones and which are the predicted ones, so it flips them compared to what we want. It is best to look at both the CM that you made yourself and the one produced by the R package that is doing the calculations for you, and compare them to see whether they are the same. Another piece of advice: you can always figure out what the computer calculated for any given metric by trying the possible calculations by hand and seeing which one matches R’s result.
Can machine learning be used to gather and analyze text data? This is a very active area of research and application at the moment, and there are many tools available to do this. I am not an expert in this, but I do have some basic familiarity, which I will share below. I might not use all of the latest terms or be aware of all of the latest technology. The process of gathering data from a website or other source is called scraping. You tell the computer what to look for, and then it goes through the entire page and captures what you want. There’s a Python package called Beautiful Soup that does this, and R also has packages for it, such as rvest. Once you have your data, which is a collection of text (perhaps in a spreadsheet), there are methods to analyze that text and learn from it. That process is called NLP (natural language processing). There are many tools for NLP in R and on other platforms. ChatGPT is an application of NLP that has received widespread attention. Lots of experimentation is being done to use NLP to analyze qualitative data. It also has many business applications, like the targeted advertising example that we discuss at times in our class. If you would like to do a final project involving NLP, please let the instructors know.
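
For the data-entry question above (removing “days” from many cells), the following is a minimal sketch in base R, not the only way to do it. The data frame d and the column names stay, stay_clean, and stay_num are hypothetical and made up for illustration; your own data would be loaded from your file instead.

```r
# Hypothetical example data: a column that was typed inconsistently
d <- data.frame(
  id = 1:4,
  stay = c("30 days", "12", "7 days", "45"),
  stringsAsFactors = FALSE
)

# Remove the word "days" (and any spaces before it), then trim whitespace
d$stay_clean <- trimws(gsub("\\s*days", "", d$stay, ignore.case = TRUE))

# Convert the cleaned text to numbers
d$stay_num <- as.numeric(d$stay_clean)

print(d)
```

Keeping the original column next to the cleaned one makes it easy to check by eye that each value was converted as intended before you drop the old column.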
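
For the question above about dependent variables with 3 or more categories, here is a rough sketch of a 3 x 3 confusion matrix in base R. The class labels and the actual and predicted values are invented for illustration. The per-category ("one versus the rest") definitions of sensitivity and specificity used here are one common choice and an assumption on our part, not necessarily what any particular R package reports.

```r
# Hypothetical actual and predicted classes for 10 observations
lv <- c("low", "medium", "high")
actual <- factor(
  c("low", "low", "low", "medium", "medium", "medium",
    "high", "high", "high", "high"),
  levels = lv
)
predicted <- factor(
  c("low", "low", "medium", "medium", "medium", "high",
    "high", "high", "high", "low"),
  levels = lv
)

# 3 x 3 confusion matrix: rows are predicted, columns are actual
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# One-versus-rest counts for each category
tp <- diag(cm)               # predicted k and actually k
fn <- colSums(cm) - tp       # actually k, predicted something else
fp <- rowSums(cm) - tp       # predicted k, actually something else
tn <- sum(cm) - tp - fn - fp # everything else

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
accuracy <- sum(tp) / sum(cm)

print(round(rbind(sensitivity, specificity), 3))
print(accuracy)
```

Overall accuracy is still a single number, while sensitivity and specificity now have one value per category.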
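
For the question above about hand-calculated metrics not matching R, the sketch below shows how the same formula gives a different number when the actual and predicted values are swapped. The data are invented, and "yes" is assumed to be the positive class.

```r
# Hypothetical actual and predicted values; "yes" is the positive class
lv <- c("no", "yes")
actual <- factor(
  c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "no"),
  levels = lv
)
predicted <- factor(
  c("yes", "yes", "yes", "no", "yes", "yes", "no", "no", "no", "no"),
  levels = lv
)

# Intended orientation: rows are predicted, columns are actual
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# Sensitivity = true positives / (true positives + false negatives)
sens_correct <- cm["yes", "yes"] / (cm["yes", "yes"] + cm["no", "yes"])

# Same formula, but with actual and predicted swapped by mistake
cm_flipped <- table(Predicted = actual, Actual = predicted)
sens_flipped <- cm_flipped["yes", "yes"] /
  (cm_flipped["yes", "yes"] + cm_flipped["no", "yes"])

print(sens_correct)  # sensitivity
print(sens_flipped)  # this is the positive predictive value, not sensitivity
```

If a number reported by R matches one of your hand calculations on the flipped matrix, the orientation of the confusion matrix is the likely cause of the mismatch.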
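
For the question above about text data, the sketch below shows a very simple first step in analyzing text once it has been collected: counting how often each word appears. It uses only base R. The comments and the short stop word list are hypothetical, and real NLP work would typically rely on dedicated packages and far more careful preprocessing.

```r
# Hypothetical free-text comments, e.g. from a survey
comments <- c(
  "The clinic staff were friendly and helpful.",
  "Long wait times, but the staff were helpful.",
  "Friendly staff; parking was difficult."
)

# Lowercase, replace punctuation with spaces, and split into words
clean <- tolower(comments)
clean <- gsub("[[:punct:]]", " ", clean)
words <- unlist(strsplit(clean, "\\s+"))
words <- words[words != ""]

# Drop a few very common words (a tiny, hand-made stop word list)
stop_words <- c("the", "and", "but", "were", "was")
words <- words[!words %in% stop_words]

# Count each remaining word, most frequent first
word_counts <- sort(table(words), decreasing = TRUE)
print(head(word_counts, 5))
```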

14.3 Examples and commentaries of machine learning

14.3.1 Machine learning and coding process

Walsh B. Aug 11 2021. Meet the AIs that can write — and code. Axios. https://www.axios.com/2021/08/11/openai-ai21-labs-natural-language-process-ai.

14.3.2 Healthcare

Anaya JM, Rojas M, Salinas ML, Rodríguez Y, Roa G, Lozano M, Rodríguez-Jiménez M, Montoya N, Zapata E, Monsalve DM, Acosta-Ampudia Y. Post-COVID syndrome. A case series and comprehensive review. Autoimmunity reviews. 2021 Nov 1;20(11):102947. https://doi.org/10.1016/j.autrev.2021.102947 – This study used unsupervised machine learning to identify clusters of COVID-19 patients, in addition to other methods.
Gichoya JW, Banerjee I, Bhimireddy AR, Burns JL, Celi LA, Chen LC, Correa R, Dullerud N, Ghassemi M, Huang SC, Kuo PC. AI recognition of patient race in medical imaging: a modelling study. The Lancet Digital Health. 2022 May 11. https://doi.org/10.1016/S2589-7500(22)00063-2.
Henry, K.E., Adams, R., Parent, C. et al. Factors driving provider adoption of the TREWS machine learning-based early warning system and its effects on sepsis treatment timing. Nat Med 28, 1447–1454 (2022). https://doi.org/10.1038/s41591-022-01895-z
Landro L. 2022. “How Hospitals Are Using AI to Save Lives.” The Wall Street Journal. https://www.wsj.com/articles/how-hospitals-are-using-ai-to-save-lives-11649610000.
McNemar E. 2021. “What Are the Benefits of Predictive Analytics in Healthcare?” https://healthitanalytics.com/news/what-are-the-benefits-of-predictive-analytics-in-healthcare?eid=CXTEL000000637400&elqCampaignId=20617&utm_source=nl&utm_medium=email&utm_campaign=newsletter&elqTrackId=ede36dffced349ccb0cbbdd209448ea1&elq=3b65f6570b90459d816e093a33cf641a&elqaid=21496&elqat=1&elqCampaignId=20617.
Burk-Rafel J, Reinstein I, Park YS. Identifying Meaningful Patterns of Internal Medicine Clerkship Grading Distributions: Application of Data Science Techniques Across 135 US Medical Schools. Academic Medicine. 2022 Oct 25:10-97. http://doi.org/10.1097/ACM.0000000000005044
Ramesh J, Aburukba R, Sagahyroon A. A remote healthcare monitoring framework for diabetes prediction using machine learning. Healthcare Technology Letters. 2021 Jun;8(3):45. https://doi.org/10.1049/htl2.12010
Szalavitz M. Aug 11 2021. The Pain Was Unbearable. So Why Did Doctors Turn Her Away? Wired. https://www.wired.com/story/opioid-drug-addiction-algorithm-chronic-pain/?utm_source=onsite-share&utm_medium=email&utm_campaign=onsite-share&utm_brand=wired.

14.3.3 Ecology

Kuta S. May 31 2022. A New Candidate for Oldest Tree in the World Is Discovered in Chile. https://www.smithsonianmag.com/smart-news/a-new-candidate-for-oldest-tree-in-the-world-is-discovered-in-chile-180980167.

14.3.4 Ethics

Arogyaswamy B. Big tech and societal sustainability: an ethical framework. AI & society. 2020 Dec;35(4):829-40. https://doi.org/10.1007/s00146-020-00956-6.

14.3.5 Graphics, art, and images

https://openai.com/dall-e-2/ – You can give this system a text description of an image and it will generate the image for you. See the demonstration video.

14.3.6 School safety, violence, and shootings

Cyphert A. Tinker-ing with machine learning: The legality and consequences of online surveillance of students. Nevada Law Journal. 2020 May;20. Abstract: https://ssrn.com/abstract=3602011. PDF: https://scholars.law.unlv.edu/cgi/viewcontent.cgi?article=1813&context=nlj.
K. M. Ikegwu, E. Sowells and H. Hardiman, “Exploring technological preventive methods for school shootings,” SoutheastCon 2015, 2015, pp. 1-6. http://doi.org/10.1109/SECON.2015.7132884.
Lu, S., Christie, G., Nguyen, T., Freeman, J., & Hsu, E. (2021). Applications of Artificial Intelligence and Machine Learning in Disasters and Public Health Emergencies. Disaster Medicine and Public Health Preparedness, 1-8. http://doi.org/10.1017/dmp.2021.125.

14.3.7 Language processing

Bender EM, Gebru T, McMillan-Major A, Shmitchell S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency 2021 Mar 3 (pp. 610-623). DOI: https://doi.org/10.1145/3442188.3445922. PDF: https://dl.acm.org/doi/pdf/10.1145/3442188.3445922.
Gebru T, Mitchell M. June 17 2022. “We warned Google that people might believe AI was sentient. Now it’s happening.” Washington Post. https://www.washingtonpost.com/opinions/2022/06/17/google-ai-ethics-sentient-lemoine-warning/. – This is closely related to the Bender et al. (2021) article.

⁵² Or 4 x 4 if your dependent variable has four levels!