Chapter 14 Additional resources

This chapter is not part of the required curriculum for the course HE-930. Below, you can find some selected optional resources related to the course or to predictive analytics and machine learning.

14.1 External resources

Below are other resources related to the process of conducting machine learning and/or coding on your own:

Possible sources of datasets:

14.2 Frequently asked questions (FAQs)

Below are some common questions and answers that you might find useful.

14.2.2 Predictive analytics and machine learning in general

  • I want to make some minor edits to my dataset. For example, during data entry, someone wrote “30 days” instead of “30”, and I want to delete “days”. Is it best to make this change in Excel or R? If there are only one or two such edits, it is likely easier to make them in Excel than in R. Another benefit of making the change in Excel is that the master copy of your data then holds the correct version (as opposed to having to make the change every time you load the data into R). However, if there are many changes to make (like removing the word “days” from 1000 cells), automating them in R could be the most efficient strategy. The instructors can help you write the code to do this, so that you do not have to make tens or hundreds of changes manually in Excel. (Depending on the situation, the find and replace feature in Excel can also be a simpler option and is worth exploring.)

  • Do classification problems with more than 2 categories of the dependent variable require a very different approach than the classification methods we learn in this course? No, the approach is the same and most of the code from our course should work automatically on a dependent variable with 3 or more categories. However, there will be some differences in how you evaluate the results. With three categories, you will need a 3 x 3 confusion matrix¹ instead of the 2 x 2 one that we typically practice with. In this scenario, you might also have to redefine how certain metrics like sensitivity and specificity are calculated, for example by calculating them separately for each category, treating that category as the positive class and all other categories together as the negative class.

  • When calculating confusion matrix metrics automatically in R, why don’t my hand-calculated metrics match those calculated by R? This is a common problem. Often, the reason is that the computer doesn’t know which values in our confusion matrix (CM) are the actual values and which are the predicted values, so it flips them compared to what we want. It is best to look at both the CM that you made yourself and the one output by the R package that is doing the calculations for you, and compare them to see whether they are the same. Another piece of advice: you can always figure out what the computer calculated for a given metric by trying the possible calculations by hand and seeing which one matches the number that R reported.

  • Can machine learning be used to gather and analyze text data? This is a very active area of research and application at the moment, and there are many tools available to do this. I am not an expert in this, but I do have some basic familiarity, which I will share below. I might not use all of the latest terms or be aware of all of the latest technology. The process of gathering data from a website or other source is called scraping. You tell the computer what to look for, and then it goes through the entire page and captures what you want. There’s a Python package called Beautiful Soup that does this, and R also has packages for it, such as rvest. Once you have your data, which is a collection of text (perhaps in a spreadsheet), there are methods to analyze that text and learn from it. That process is called NLP (natural language processing). There are many tools for NLP in R and on other platforms. ChatGPT is an application of NLP that has received widespread attention. Lots of experimentation is being done to use NLP to analyze qualitative data. It also has many business applications, like the targeted advertising example that we discuss at times in our class. If you would like to do a final project involving NLP, please let the instructors know.
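For the data-entry question above (removing “days” from many cells), a minimal sketch in R might look like the following. The data frame d and the column length_of_stay are hypothetical names invented for this illustration; your own data will differ, and this is only one of several ways to do it.

```r
# Hypothetical example data: some entries include the word "days"
d <- data.frame(
  id = 1:4,
  length_of_stay = c("30 days", "12", "7 days", "45"),
  stringsAsFactors = FALSE
)

# Remove the word "days" along with any spaces around it
d$length_of_stay <- gsub("[[:space:]]*days[[:space:]]*", "", d$length_of_stay)

# Convert the cleaned text to numbers so it can be used in a model
d$length_of_stay <- as.numeric(d$length_of_stay)

print(d)
```

If a cell still contains other stray text after this step, as.numeric() returns NA for it and gives a warning, which is a useful prompt to inspect those cells by hand.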
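For the question above about dependent variables with more than 2 categories, the sketch below builds a 3 x 3 confusion matrix and calculates sensitivity and specificity separately for each category (a "one versus the rest" approach). The data, the category names, and the helper function one_vs_rest() are all made up for illustration; this is one common convention, not the only way to define these metrics.

```r
# Hypothetical actual and predicted categories for a 3-category outcome
lv <- c("low", "medium", "high")
actual <- factor(
  c("low", "low", "low", "medium", "medium", "medium",
    "high", "high", "high", "high"),
  levels = lv
)
predicted <- factor(
  c("low", "low", "medium", "medium", "medium", "high",
    "high", "high", "high", "low"),
  levels = lv
)

# 3 x 3 confusion matrix: rows are predictions, columns are actual values
cm <- table(predicted = predicted, actual = actual)
print(cm)

# Illustrative helper: treat one category as positive and all others as negative
one_vs_rest <- function(cm, class) {
  tp <- cm[class, class]
  fn <- sum(cm[, class]) - tp   # actually this class, predicted as another
  fp <- sum(cm[class, ]) - tp   # predicted as this class, actually another
  tn <- sum(cm) - tp - fn - fp
  c(sensitivity = tp / (tp + fn), specificity = tn / (tn + fp))
}

# One row of metrics per category
metrics <- t(sapply(lv, function(k) one_vs_rest(cm, k)))
print(round(metrics, 2))
```

Each category gets its own sensitivity and specificity, so a model can be good at detecting one category and poor at detecting another.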
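For the question above about hand-calculated metrics not matching R, the sketch below shows how the same formula gives different numbers when the confusion matrix is flipped or when a different category is treated as the positive one. The data and the helper function sens_from_cm() are hypothetical and exist only for this illustration.

```r
# Hypothetical binary outcome; "yes" is the category we want to detect
actual <- factor(
  c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "no"),
  levels = c("no", "yes")
)
predicted <- factor(
  c("yes", "yes", "yes", "no", "yes", "yes", "no", "no", "no", "no"),
  levels = c("no", "yes")
)

# Rows are predictions, columns are actual values
cm <- table(predicted = predicted, actual = actual)
print(cm)

# Illustrative helper; it ASSUMES rows = predicted and columns = actual
sens_from_cm <- function(cm, positive) {
  cm[positive, positive] / sum(cm[, positive])
}

sens_correct        <- sens_from_cm(cm, "yes")     # 0.75, the sensitivity
sens_flipped        <- sens_from_cm(t(cm), "yes")  # 0.60, actually the precision
sens_wrong_positive <- sens_from_cm(cm, "no")      # 0.667, the specificity for "yes"

print(c(correct = sens_correct,
        flipped = sens_flipped,
        wrong_positive = sens_wrong_positive))
```

If a package reports one of the other two numbers as sensitivity, that suggests it received the actual and predicted values in the opposite order, or treated a different category as positive. Functions such as confusionMatrix() in the caret package have arguments for which input is the prediction, which is the reference, and which level is positive, so those are worth checking.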
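For the text-data question above, the sketch below shows one very simple first step in analyzing text that you have already collected: counting how often each word appears. It uses only base R and made-up survey responses; the stop word list is a tiny hypothetical one. Real NLP work usually relies on dedicated packages and much more careful preprocessing, so treat this only as an illustration of the idea.

```r
# Hypothetical free-text responses, for example from a survey
responses <- c(
  "The clinic was clean and the staff were kind.",
  "Staff were kind but the wait was long.",
  "The wait was very long."
)

# Lowercase everything and split on anything that is not a letter
words <- unlist(strsplit(tolower(responses), "[^a-z]+"))
words <- words[words != ""]

# Drop a few very common words (a tiny, made-up stop word list)
stop_words <- c("the", "was", "and", "were", "but", "very")
words <- words[!words %in% stop_words]

# Count how often each remaining word appears, most frequent first
word_counts <- sort(table(words), decreasing = TRUE)
print(word_counts)
```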

14.3 Examples and commentaries of machine learning

14.3.1 Machine learning and coding process

14.3.2 Healthcare

14.3.3 Ecology

14.3.4 Ethics

14.3.5 Graphics, art, and images

  • https://openai.com/dall-e-2/ – Apparently you can give a text description of an image to this system and it will draw the image for you. See the demonstration video.

14.3.6 School safety, violence, and shootings

14.3.7 Language processing


  1. Or 4 x 4 if your dependent variable has four levels!