Appendix B: AI & LLMs & data visualization

1 Using LLMs/foundation models to build predictive models

1.1 Attention: Hallucination

Attention: Always cross-validate the information given by an LLM
- Why? Hallucination (see the characterizations collected on Wikipedia)
  - “a tendency to invent facts in moments of uncertainty” (OpenAI, May 2023)
  - “a model’s logical mistakes” (OpenAI, May 2023)
  - fabricating information entirely, but behaving as if spouting facts (CNBC, May 2023)
  - “making up information” (The Verge, February 2023)
A very good overview is available on Wikipedia.
For further discussion, see Zhang et al. (2023), Huang et al. (2023), and Metz (2023).
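For LLM-generated R code, one simple form of cross-validation is to check that every function the model names actually exists in the package it attributes it to. The sketch below shows one way to do this with base R; the helper name `function_exists` is hypothetical and not part of any package.

```r
# Hypothetical helper: TRUE/FALSE if `fun` is (not) exported by package `pkg`,
# NA if the package is not installed and therefore cannot be checked
function_exists <- function(fun, pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(NA)
  }
  fun %in% getNamespaceExports(pkg)
}

# A function that exists
function_exists("median", "stats")

# A made-up name of the kind an LLM might invent
function_exists("median_robust_auto", "stats")
```

A check like this only confirms that a function exists. Whether the arguments and the results are correct still has to be verified against the package documentation.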

1.2 Available LLMs

Closed-source
- ChatGPT X (OpenAI, ~Microsoft): https://chat.openai.com/
- Gemini (Google): https://gemini.google.com/
- Amazon Titan: https://aws.amazon.com/bedrock/titan/
Open-source
- HuggingChat: https://huggingface.co/chat/
LMSYS Chatbot Arena Leaderboard

1.3 Useful prompts

LLMs can be used to generate code for data visualization

I have a dataset called "data" that includes the variable age. Please provide me with ggplot code to produce a histogram.
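An answer to this prompt might look roughly like the following sketch. The data frame is simulated here purely so that the code runs on its own; the bin width, colors, and labels are illustrative choices, not output from any particular LLM.

```r
library(ggplot2)

# Simulated stand-in for the user's dataset "data" with a variable "age"
set.seed(1)
data <- data.frame(age = round(runif(200, min = 18, max = 90)))

# Histogram of age with 5-year bins
p <- ggplot(data, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(x = "Age", y = "Count")

p
```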

Please explain the code (add comments to the code).

I want to change the x-axis labels (angle of 50 degrees).
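For this follow-up, a typical answer adjusts the plot theme. The sketch below takes the requested angle to be 50 degrees and uses the built-in swiss dataset with a bar chart (an assumption made for illustration, because rotated labels matter most for categorical axes).

```r
library(ggplot2)

# Built-in swiss data; province names are stored as row names
provinces <- head(data.frame(province = rownames(swiss), swiss), 10)

p <- ggplot(provinces, aes(x = province, y = Fertility)) +
  geom_col() +
  # Rotate the x-axis labels by 50 degrees; hjust = 1 aligns their ends with the ticks
  theme(axis.text.x = element_text(angle = 50, hjust = 1))

p
```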

How can I encode data dimensions in a graph? What possibilities do I have?
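To make the kind of answer this prompt aims at more concrete, the sketch below encodes five variables of the built-in swiss dataset in one plot using position, color, size, and facets. The variable choices and the 50 percent cut-off are assumptions made for illustration only.

```r
library(ggplot2)

# Derive a binary variable to facet on (cut-off chosen for illustration)
swiss_plot <- transform(swiss, majority_catholic = Catholic > 50)

p <- ggplot(swiss_plot, aes(x = Education, y = Fertility)) +   # dimensions 1 and 2: position
  geom_point(aes(color = Agriculture,                          # dimension 3: color
                 size = Infant.Mortality),                     # dimension 4: size
             alpha = 0.8) +
  facet_wrap(~ majority_catholic, labeller = label_both)       # dimension 5: facets

p
```

Shape, line type, and transparency are further channels that can carry a variable, although each added encoding makes the plot harder to read.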

How can I best visualize a line graph in which two lines overlap perfectly, when that overlap is exactly what I want to show?
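One common suggestion for this problem is to draw the first line thick and solid and the second line thin and dashed on top of it, so that both remain visible. The sketch below uses made-up data with two identical series and assumes ggplot2 version 3.4.0 or later (for the `linewidth` argument).

```r
library(ggplot2)

# Two series with identical values (made-up data for illustration)
x <- 1:10
lines_df <- data.frame(
  x = rep(x, 2),
  y = rep(2 * x, 2),
  series = rep(c("A", "B"), each = length(x))
)

p <- ggplot(mapping = aes(x = x, y = y, color = series)) +
  # Series A underneath: thick, solid line
  geom_line(data = subset(lines_df, series == "A"), linewidth = 3) +
  # Series B on top: thin, dashed line, so A shows through the gaps
  geom_line(data = subset(lines_df, series == "B"), linewidth = 1, linetype = "dashed")

p
```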

How can I translate the code below when I want to use the sf package in R instead of rgdal?

readOGR(dsn="www/data", layer="VG250_GEM", encoding = "ASCII", verbose = FALSE)
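A plausible translation of this call uses `sf::st_read()`. The sketch below reuses the path and layer name from the prompt and only runs when the sf package and the data folder are available. Treat it as a starting point to check against the sf documentation, not as a verified drop-in replacement.

```r
# Path and layer name are taken from the readOGR() call above
dsn <- "www/data"
layer <- "VG250_GEM"

# Guard so that the sketch does nothing when sf or the data folder is missing
if (requireNamespace("sf", quietly = TRUE) && dir.exists(dsn)) {
  # quiet = TRUE plays the role of verbose = FALSE in readOGR()
  gem <- sf::st_read(dsn = dsn, layer = layer, quiet = TRUE)
  print(class(gem))
}
```

Two differences are worth checking in any LLM answer here. First, `st_read()` returns an sf object rather than the sp object returned by `readOGR()`, so downstream code may need adjusting. Second, `st_read()` has no `encoding` argument; for shapefiles the encoding is usually passed as a driver open option (for example `options = "ENCODING=..."`), which is an assumption to verify for the file at hand.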

2 ChatGPT4o: Generating a plot for your own data

Load the data into R (here we use the preloaded swiss dataset)
Run the code below

# Load necessary packages
    # install.packages("synthpop")
    library(datasets)
    library(synthpop)
    library(readr)

# Load the dataset
    data <- swiss # load your own dataset here

# View the original swiss dataset
    head(data)

# Generate synthetic data to anonymize the original dataset
# The syn function fits models to the original data and draws synthetic values that approximate its structure and statistical properties
    synth_data <- syn(data)

# View the synthetic data
    head(synth_data$syn)

# Replace the original data with the synthetic data
    data <- synth_data$syn

# View the modified dataset to ensure it has been replaced correctly
    head(data)

# Save the new dataset locally
    write_csv(data, "data_fake.csv")
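Before uploading the file, it can be worth checking that the synthetic data has the same structure as the original and roughly similar summary statistics. The following is a minimal sketch of such a check on the swiss data; the seed value and the object names `original`, `synth`, and `fake` are arbitrary choices made for this example.

```r
library(synthpop)

original <- swiss

# Fix the seed so the synthetic data is reproducible; suppress progress output
synth <- syn(original, seed = 123, print.flag = FALSE)
fake <- synth$syn

# Same variables and same number of rows as the original
stopifnot(identical(names(fake), names(original)))
stopifnot(nrow(fake) == nrow(original))

# Compare variable means of the original and the synthetic data
round(rbind(original = colMeans(original), synthetic = colMeans(fake)), 1)
```

Similar means do not guarantee that the data is safe to share. Whether synthetic data is sufficiently anonymized depends on the dataset and should be assessed separately.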

Upload data_fake.csv and Figure 1 (this is just an example) into ChatGPT4o.

Use the following prompt. If the plot is not based on that particular dataset, i.e., it does not include the variable names of the dataset, you may also have to specify which variables should be mapped in which way.

I uploaded a dataset and a plot. Please provide me the R code that I need to produce that plot based on the data in one code chunk.

As a follow-up you can refine the plot code through prompts (“Please omit the intercept from the plot”).
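To illustrate what the refined code could look like: Figure 1 is not reproduced in this section, so the sketch below assumes, purely for illustration, that it is a coefficient plot of a linear regression. The model formula and the object names are hypothetical.

```r
library(ggplot2)

# Hypothetical model on the swiss data (formula chosen for illustration)
fit <- lm(Fertility ~ Education + Agriculture + Catholic, data = swiss)

# Collect estimates and 95% confidence intervals in one data frame
ci <- confint(fit)
coefs <- data.frame(
  term = names(coef(fit)),
  estimate = unname(coef(fit)),
  lower = ci[, 1],
  upper = ci[, 2]
)

# The refinement requested in the follow-up prompt: drop the intercept
coefs <- coefs[coefs$term != "(Intercept)", ]

p <- ggplot(coefs, aes(x = estimate, y = term)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_pointrange(aes(xmin = lower, xmax = upper)) +
  labs(x = "Estimate (95% CI)", y = NULL)

p
```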

References

Huang, Lei, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, et al. 2023. “A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions,” November. https://arxiv.org/abs/2311.05232.

Metz, Cade. 2023. “Chatbots May ‘Hallucinate’ More Often Than Many Realize.” The New York Times, November.

Zhang, Yue, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, et al. 2023. “Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models,” September. https://arxiv.org/abs/2309.01219.