Part 11 CODE: Using AI to Help You Succeed in Data and Decisions
In this course, you are encouraged to use AI tools to support you in programming in R and advancing your knowledge of course concepts. In particular, there are four valuable use cases for AI in this course, which we will call the CODE framework:
Craft R code from scratch, guided by informative prompts from you.
Observe an existing piece of R code and describe its function and purpose.
Debug a piece of R code that either you or the AI has written, to identify and solve coding errors.
Explain fundamental statistical concepts.
In this short tutorial, we will demonstrate a few examples of each of the four applications above. Throughout the course, we will use the AI model within the Notebook feature in Microsoft’s CoPilot (copilot.microsoft.com), which you have access to through your UCLA Logon ID. There are a number of other AI models you may use (e.g., Claude, Cursor, Replit, ChatGPT, Gemini), although the results may differ across models.
NOTE: This document assumes you are broadly familiar with how to use AI large-language models (LLMs). If you have no previous experience, please watch the “Learn the basics of prompting” videos here.
11.1 Craft
One of the most powerful use case of AI tools in this course is writing R code from scratch. As with all AI prompting, the more precise you are about the result you want from the AI, the more likely you are to get it. Wherever possible, your prompt should direct the AI to use the specific R packages you’re interested in, which will help align the AI’s output with your end goal. Below, we demonstrate several prompts to CoPilot in in blue, followed by the R code it generates. In all examples, the R code output by CoPilot has been copied and pasted into this R Markdown file, and then executed.
To begin, suppose we wish to read in the movie data from Lecture 1 into an R data frame. We can generate code for this by using the following prompt:
I’d like you to help me write some R code. I would like to use the tidyverse package for handling data. Don’t show package startup messages or column types, and don’t include any commands that I don’t ask for. Please include comments. Pay attention to upper and lower cases. To start: Write a code snippet to read in a file called MovieData.csv and store it in a data frame called MOVIES.
Based on the prompt above, the output from CoPilot is as follows:
# Load the tidyverse package without showing startup messages
suppressPackageStartupMessages(library(tidyverse))
# Read the MovieData.csv file and store it in a data frame called MOVIES
MOVIES <- read_csv("MovieData.csv", show_col_types = FALSE)
In the remainder of the document, we will look at some additional example prompts for CoPilot, the associated output from CoPilot, and the results from running the R code. These prompts are intended to serve as examples only – feel free to experiment with different prompts to tailor them to your exact needs.
NOTE: Within a single session, CoPilot’s Notebook “remembers” the previous prompts you’ve input, so it will generally interpret each prompt in the context of those that came before it. However, CoPilot imposes a limit of 10 to 30 responses for a single session depending on the length of your conversation. Once you reach the maximum, you will need to click the “New topic” button. This will start a new session, and you will have to re-describe the issue to the AI.
11.1.1 Data Manipulation and Basic Operations
Here we consider a few basic data handling commands in R.
Write code that shows the first 10 rows of the dataframe MOVIES, and also shows the names of all the columns.
## # A tibble: 10 × 11
## Title Year Budget DomesticBoxOffice TotalBoxOffice Rating Source Method
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 Krrish 2006 1 e7 1430721 32430721 Not R… Origi… Live …
## 2 Crank 2006 1.20e7 27838408 43924923 R Origi… Live …
## 3 The Marine 2006 1.5 e7 18844784 22165608 PG-13 Origi… Live …
## 4 Running S… 2006 1.7 e7 6855137 9729088 R Origi… Live …
## 5 Renaissan… 2006 1.80e7 70644 2401413 R Origi… Digit…
## 6 BloodRayne 2006 2.5 e7 2405420 3711633 R Based… Live …
## 7 Ultraviol… 2006 3 e7 18522064 30962112 PG-13 Origi… Live …
## 8 Wu ji 2006 3 e7 669625 35869934 PG-13 Origi… Live …
## 9 Alex Ride… 2006 4 e7 659210 20722450 PG Based… Live …
## 10 Apocalypto 2006 4 e7 50866635 121032272 R Origi… Live …
## # ℹ 3 more variables: Genre <chr>, Sequel <dbl>, Runtime <dbl>
## [1] "Title" "Year" "Budget"
## [4] "DomesticBoxOffice" "TotalBoxOffice" "Rating"
## [7] "Source" "Method" "Genre"
## [10] "Sequel" "Runtime"
List all the different categories in the column Genre.
## [1] "Action/Adventure" "Comedy" "Drama" "Horror"
## [5] "RomCom" "Thriller"
Write code that calculates the mean and standard deviation of the column TotalBoxOffice, and displays it on the screen.
# Calculate the mean and standard deviation of the variable TotalBoxOffice
mean_total_box_office <- mean(MOVIES$TotalBoxOffice, na.rm = TRUE)
sd_total_box_office <- sd(MOVIES$TotalBoxOffice, na.rm = TRUE)
# Display the results on the screen
cat("Mean of TotalBoxOffice:", mean_total_box_office, "\n")
## Mean of TotalBoxOffice: 159219409
## Standard Deviation of TotalBoxOffice: 237684201
Now write code to calculate the mean and standard deviation of the TotalBoxOffice column, and display it to the screen, but only for comedies. [Note that this prompt didn’t precise state the category of Comedy
, but the AI was still able to figure it out what we meant.]
# Filter the data for the Comedy genre
comedy_data <- subset(MOVIES, Genre == "Comedy")
# Calculate the mean and standard deviation of the TotalBoxOffice column for comedies
mean_total_box_office <- mean(comedy_data$TotalBoxOffice, na.rm = TRUE)
sd_total_box_office <- sd(comedy_data$TotalBoxOffice, na.rm = TRUE)
# Display the mean and standard deviation
mean_total_box_office
## [1] 88030934
## [1] 88776316
11.1.2 Data Visualization
We can also prompt CoPilot to produce R code to create data visualizations. The ggplot
package is a powerful visualization package that will be used throughout the course.
For the Genre of Drama, make a histogram of all TotalBoxOffice values. Set the bin width in the histogram to be 50 million.
# Filter the data for the Drama genre
drama_data <- subset(MOVIES, Genre == "Drama")
# Create a histogram of TotalBoxOffice values for Drama genre with bin width of 50 million
ggplot(drama_data, aes(x = TotalBoxOffice)) +
geom_histogram(binwidth = 50000000, fill = "blue", color = "black") +
labs(title = "Total Box Office for Drama Genre", x = "Total Box Office", y = "Frequency") +
theme_minimal()
Using the ggplot package, create a scatter plot of TotalBoxOffice (on vertical axis) and Budget (on horizontal axis).
# Create a scatter plot of TotalBoxOffice vs Budget
ggplot(MOVIES, aes(x = Budget, y = TotalBoxOffice)) +
geom_point() +
labs(title = "Scatter Plot of Total Box Office vs Budget",
x = "Budget",
y = "Total Box Office") +
theme_minimal()
Because CoPilot maintains a memory of prompts used within the session, you can modify the output from a previous prompt without needing to provide all details again. For example:
Now modify the scatter plot to only include comedy and drama movies, coded as two different colors.
# Filter the dataset to include only comedy and drama movies
filtered_movies <- MOVIES %>%
filter(Genre %in% c("Comedy", "Drama"))
# Create a scatter plot of TotalBoxOffice vs Budget for comedy and drama movies
ggplot(filtered_movies, aes(x = Budget, y = TotalBoxOffice, color = Genre)) +
geom_point() +
labs(title = "Scatter Plot of Total Box Office vs Budget for Comedy and Drama Movies",
x = "Budget",
y = "Total Box Office") +
theme_minimal()
Note: If you hit new session, CoPilot will not remember your previous outputs.
Let’s try a more complex data visualization.
In the MOVIES dataframe, create a new column called “ROI” that is calculated as (TotalBoxOffice - Budget) / Budget. Then, Write R code using ggplot2 to visualize the return on investment (ROI) for movies. The code should: Use the MOVIES dataset. Map the ROI to the x-axis, capping values at 10. Color the histogram bars by Genre. Set the histogram bar color to white. Use a bin width of 0.5. Create separate facets for each Genre. Label the x-axis as “Return on Investment (ROI)” and format the labels as percentages.
# Create a new column called "ROI"
MOVIES <- MOVIES %>%
mutate(ROI = (TotalBoxOffice - Budget) / Budget)
# Cap ROI values at 10
MOVIES <- MOVIES %>%
mutate(ROI = ifelse(ROI > 10, 10, ROI))
# Load ggplot2 package
library(ggplot2)
# Visualize the return on investment (ROI) for movies
ggplot(MOVIES, aes(x = ROI, fill = Genre)) +
geom_histogram(color = "white", binwidth = 0.5) +
facet_wrap(~ Genre) +
scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(x = "Return on Investment (ROI)", y = "Count") +
theme_minimal()
11.1.3 Statistical Analysis
Lastly, we can use CoPilot to generate R code to perform different kinds of statistical analysis. Below, we consider linear regression (which will be covered in weeks 5-7).
Using the dataframe MOVIES, write code that builds a linear regression model to predict TotalBoxOffice from Budget. Display the results, including a scatter plot and line of best fit.
# Build the linear regression model to predict TotalBoxOffice from Budget
model <- lm(TotalBoxOffice ~ Budget, data = MOVIES)
# Display the summary of the model
summary(model)
##
## Call:
## lm(formula = TotalBoxOffice ~ Budget, data = MOVIES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -647077570 -57490168 -11002941 31165099 1324789247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.659e+07 4.812e+06 -5.526 3.73e-08 ***
## Budget 3.478e+00 6.361e-02 54.681 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 147500000 on 1873 degrees of freedom
## Multiple R-squared: 0.6148, Adjusted R-squared: 0.6146
## F-statistic: 2990 on 1 and 1873 DF, p-value: < 2.2e-16
# Plot the data and the regression line
ggplot(MOVIES, aes(x = Budget, y = TotalBoxOffice)) +
geom_point() +
geom_smooth(method = "lm", col = "blue") +
labs(title = "Linear Regression: Total Box Office vs Budget",
x = "Budget",
y = "Total Box Office")
## `geom_smooth()` using formula = 'y ~ x'
Here is a more ambitious prompt, which CoPilot was generally able to handle:
Now build a linear regression model that predicts TotalBoxOffice from Budget, but only for the Genre of “Comedy”. Repeat this model for the Genre of ``Drama”. Display the results including a color-coded scatter plot (using ggplot) for each movie genre, and the line of best fit for each genre. Display the equation for the line of best fit on the scatter plot, in black font.
# Filter the data for Comedy and Drama genres
comedy_movies <- subset(MOVIES, Genre == "Comedy")
drama_movies <- subset(MOVIES, Genre == "Drama")
# Build linear regression models for each genre
comedy_model <- lm(TotalBoxOffice ~ Budget, data = comedy_movies)
drama_model <- lm(TotalBoxOffice ~ Budget, data = drama_movies)
# Summary of the models
summary(comedy_model)
##
## Call:
## lm(formula = TotalBoxOffice ~ Budget, data = comedy_movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -191230267 -43270897 -20890764 26149434 439639789
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.418e+07 6.989e+06 3.46 0.00061 ***
## Budget 1.950e+00 1.728e-01 11.28 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 75840000 on 340 degrees of freedom
## Multiple R-squared: 0.2724, Adjusted R-squared: 0.2703
## F-statistic: 127.3 on 1 and 340 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TotalBoxOffice ~ Budget, data = drama_movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -250695634 -42428062 -23302161 11166459 734873594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.202e+06 7.148e+06 0.448 0.654
## Budget 2.372e+00 1.775e-01 13.366 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 98800000 on 469 degrees of freedom
## Multiple R-squared: 0.2759, Adjusted R-squared: 0.2743
## F-statistic: 178.7 on 1 and 469 DF, p-value: < 2.2e-16
# Extract coefficients for the line equations
comedy_eq <- paste0("y = ", round(coef(comedy_model)[1], 2), " + ", round(coef(comedy_model)[2], 2), "x")
drama_eq <- paste0("y = ", round(coef(drama_model)[1], 2), " + ", round(coef(drama_model)[2], 2), "x")
# Plotting the results
ggplot() +
geom_point(data = comedy_movies, aes(x = Budget, y = TotalBoxOffice, color = "Comedy")) +
geom_point(data = drama_movies, aes(x = Budget, y = TotalBoxOffice, color = "Drama")) +
geom_smooth(data = comedy_movies, aes(x = Budget, y = TotalBoxOffice, color = "Comedy"), method = "lm", se = FALSE) +
geom_smooth(data = drama_movies, aes(x = Budget, y = TotalBoxOffice, color = "Drama"), method = "lm", se = FALSE) +
annotate("text", x = max(comedy_movies$Budget), y = max(comedy_movies$TotalBoxOffice), label = comedy_eq, color = "black", hjust = 1) +
annotate("text", x = max(drama_movies$Budget), y = max(drama_movies$TotalBoxOffice), label = drama_eq, color = "black", hjust = 1) +
theme_minimal() +
labs(title = "Total Box Office vs. Budget by Genre",
x = "Budget (in millions)",
y = "Total Box Office Revenue (in millions)",
color = "Genre")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Note that there is an issue with the output from CoPilot: The equations for the lines of best fit are scattered in different parts of the plot. This suggests further prompting is necessary to get the desired output from CoPilot. You may consider experimenting with CoPilot by asking the AI to assume different roles (e.g., ``assume the role of a Data Scientist”).
11.2 Observe
Although AI models can be helpful in crafting R code from scratch, beginner (and even advanced) users may not understand every single command used. To improve your understanding of the R language, you can copy-and-paste snippets of R code back as prompts and ask the AI to further explain them.
Let’s start by considering the following prompt that we used above:
Now write code to calculate the mean and standard deviation of the TotalBoxOffice column, and display it to the screen, but only for comedies.
# Filter the data for the Comedy genre
comedy_data <- subset(MOVIES, Genre == "Comedy")
# Calculate the mean and standard deviation of the TotalBoxOffice column for comedies
mean_total_box_office <- mean(comedy_data$TotalBoxOffice, na.rm = TRUE)
sd_total_box_office <- sd(comedy_data$TotalBoxOffice, na.rm = TRUE)
# Display the mean and standard deviation
mean_total_box_office
## [1] 88030934
## [1] 88776316
Suppose we wish to better understand what is going on in the R code above. We can feed the code back into CoPilot as a prompt as follows:
Look at the following piece of R code:
mean_total_box_office <- mean(comedy_data$TotalBoxOffice, na.rm = TRUE)
What does the “na.rm = TRUE” mean?
The text output from CoPilot is as follows:
#The na.rm = TRUE parameter in the mean and sd functions tells R to remove (rm) any NA (missing) values before performing the calculation. This ensures that the mean and standard deviation are calculated only on the available data, ignoring any missing values.
Note: If the prompts above or below are unclear, add the following to the prompt “Explain this to me like I am 12” (you can replace this with a different age, or career background). CoPilot will then further simplify its explanation to you.
A few more examples are below.
What does “subset(MOVIES, Genre ==”Comedy”)” mean?
# The function subset(MOVIES, Genre == "Comedy") filters the dataset MOVIES to include only the rows where the Genre column has the value “Comedy”. This creates a new dataset, comedy_data, containing only the comedy movies.
I see both “na.rm = TRUE” and “subset(MOVIES, Genre ==”Comedy”)“. Why is a double equation (==) used in some cases and a single (=) in others?
#In R, the double equals sign (==) and the single equals sign (=) serve different purposes:
#Double Equals (==): Used for comparison. It checks if two values are equal.
x <- 5
y <- 5
x == y # Returns TRUE
## [1] TRUE
#Single Equals (=): Used for assignment, similar to the <- operator. It assigns a value to a variable.
x = 5 # Assigns 5 to x
#In summary, use == for comparisons and = for assignments.
11.3 Debug
AI models like CoPilot can also be valuable in helping you find and correct errors in your R code. We consider two examples below.
11.3.1 Example 1
Suppose we wish to find the average movie budget for all horror movies. We might use the following code:
horror_movies <- subset(MOVIES, genre == "Horror")
average_budget <- mean(horror_movies$Budget, na.rm = TRUE)
print(average_budget)
The code above throws the following error:
## Error in eval(e, x, parent.frame()) : object 'genre' not found.
Next, we can use assistance from CoPilot to identify the error. In your prompt, it is helpful to include both the problematic code and the error message (if any) given in RStudio.
I have a bug in my code. I am going to share the code below, followed by the error message I received. Can you identify where the bug is? Please provide a written explanation. The code is as follows:
horror_movies <- subset(MOVIES, genre == “Horror”)
average_budget <- mean(horror_movies$Budget, na.rm = TRUE)
print(average_budget)
and the error message is:
Error in eval(e, x, parent.frame()) : object ‘genre’ not found.
And the output from CoPilot is:
## The error message “object ‘genre’ not found” indicates that the genre column does not exist in the MOVIES data frame. Here are some possible reasons and solutions:
##
## Column Name Typo: Ensure that the column name is correctly spelled as genre. Column names are case-sensitive in R.
##
## Column Existence: Verify that the MOVIES data frame actually contains a column named genre. You can check the column names using colnames(MOVIES).
##
## Data Frame Loading: Ensure that the MOVIES data frame is correctly loaded and populated with data before running the subset function.
Note: To improve the suggestion further, add the following: “I am new to programming and I need you to walk me through this process step-by-step to understand what is going wrong”.
Based on CoPilot’s suggestions, let’s check the column names in MOVIES:
## [1] "Title" "Year" "Budget"
## [4] "DomesticBoxOffice" "TotalBoxOffice" "Rating"
## [7] "Source" "Method" "Genre"
## [10] "Sequel" "Runtime" "ROI"
We can see that the column name we are looking for is spelled “Genre”, but we have used “genre” in the code above. Correcting this typo gives the desired output:
horror_movies <- subset(MOVIES, Genre == "Horror")
average_budget <- mean(horror_movies$Budget, na.rm = TRUE)
print(average_budget)
## [1] 26303365
11.3.2 Example 2
Suppose we wish to estimate a linear regression model that predicts TotalBoxOffice from Budget and Genre. We might hastily write the following code
However, this code gives the following error:
## Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
## NA/NaN/Inf in 'y'
Let’s prompt CoPilot with the problematic code and error and see if it can help:
I would like to build a linear regression model that predicts TotalBoxOffice from Budget and Genre, using the dataframe MOVIES. I am using the following code
model <- lm(TotalBoxOffice = Budget & Genre, data = Movies) summary(model)
but it is giving me the following error
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, …) : NA/NaN/Inf in ‘y’
Can you spot the error in my code?
## There are a few issues in your code:
##
## Formula Syntax: The formula in lm() should use ~ instead of = and + instead of &.
##
## Dataframe Name: Ensure the dataframe name is correctly spelled (Movies instead of MOVIES).
##
## Handling NA/NaN/Inf: The error suggests there are NA/NaN/Inf values in TotalBoxOffice. You need to handle these values.
##
## Here is the corrected code:
# Build a linear regression model predicting TotalBoxOffice from Budget and Genre
model <- lm(TotalBoxOffice ~ Budget + Genre, data = Movies)
# Summary of the model
summary(model)
From the “corrected” code above, we can see that CoPilot resolved the syntax issues with confusing =
for ~
and &
for +
. However, we can observe that the code is still not completely correct, because the data frame’s correct name is MOVIES, not Movies. Making that fix manually eventually gives us working code:
# Build a linear regression model predicting TotalBoxOffice from Budget and Genre
model <- lm(TotalBoxOffice ~ Budget + Genre, data = MOVIES)
# Summary of the model
summary(model)
##
## Call:
## lm(formula = TotalBoxOffice ~ Budget + Genre, data = MOVIES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -649325219 -56135245 -10201170 31099389 1319945171
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.911e+07 9.026e+06 -3.225 0.00128 **
## Budget 3.495e+00 7.489e-02 46.672 < 2e-16 ***
## GenreComedy 2.661e+06 1.083e+07 0.246 0.80587
## GenreDrama -2.553e+06 1.005e+07 -0.254 0.79945
## GenreHorror 3.390e+07 1.635e+07 2.074 0.03825 *
## GenreRomCom 1.430e+07 1.785e+07 0.801 0.42310
## GenreThriller -6.187e+06 1.195e+07 -0.518 0.60470
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 147500000 on 1868 degrees of freedom
## Multiple R-squared: 0.6162, Adjusted R-squared: 0.615
## F-statistic: 499.9 on 6 and 1868 DF, p-value: < 2.2e-16
In general, AI models may not be able to instantly identify and resolve issues in you code. Instead, we can think of debugging as an iterative process where you obtain clues from the AI and work together to eventually converge on a solution.
11.4 Explain
Beyond programming, AI models can be a useful learning aid by explaining fundamental concepts. Again, specificity is important. As stated above, if the explanations are still unclear, instruct CoPilot to explain it to you like you are 12, or based on your career background.
What is the difference between standard deviation and variance? Don’t show any formulas and just explain it.
## Variance measures how spread out the data points are from the mean. It gives a sense of the overall dispersion in the dataset. Standard Deviation is the square root of the variance. It provides a more intuitive measure of spread by expressing it in the same units as the data points, making it easier to understand and compare.
What does it mean for two variables to be positively correlated?
## Two variables are positively correlated when an increase in one variable is associated with an increase in the other variable. This relationship indicates that the variables move in the same direction. For example:
##
## Height and Weight: Taller individuals tend to weigh more.
##
## Study Time and Test Scores: More study time often leads to higher test scores.
##
## The correlation coefficient, denoted as ( r ), ranges from 0 to +1 for positive correlation, where +1 indicates a perfect positive correlation.
Without using formulas, can you briefly explain conditional probability?
## Conditional probability is the likelihood of an event happening given that another event has already occurred. For example, if you want to know the probability of it raining today given that it was cloudy in the morning, you’re looking at conditional probability. It helps us understand how the occurrence of one event affects the probability of another event.
That’s helpful. Can you give me an example related to e-commerce where conditional probability may be useful?
## In e-commerce, conditional probability can be useful for predicting customer behavior. For example:
##
## Product Recommendations: If a customer has purchased a smartphone, the probability that they will buy a phone case is higher. This helps in suggesting relevant accessories.
##
## Customer Retention: If a customer has made multiple purchases in the past month, the probability that they will continue shopping on the platform increases. This can guide targeted marketing efforts.
##
## Fraud Detection: If a transaction is made from a new device, the probability of it being fraudulent might be higher, prompting additional security checks.
##
## These applications help in enhancing user experience, increasing sales, and ensuring security.