Appendix B: AI & LLMs & data visualization
1 Using LLMs/foundation models to build predictive models
1.1 Attention: Hallucination
- Attention: Always cross-validate the information given by an LLM
- Why? Hallucination (see the characterization statements on Wikipedia)
- “a tendency to invent facts in moments of uncertainty” (OpenAI, May 2023)
- “a model’s logical mistakes” (OpenAI, May 2023)
- “fabricating information entirely, but behaving as if spouting facts” (CNBC, May 2023)
- “making up information” (The Verge, February 2023)
- Further reading
- The Wikipedia article on hallucination gives a very good overview
- Discussions in Zhang et al. (2023), Huang et al. (2023) and Metz (2023)
1.2 Available LLMs
- Closed-source
- ChatGPT X (OpenAI, ~Microsoft): https://chat.openai.com/
- Gemini (Google): https://gemini.google.com/
- Amazon Titan: https://aws.amazon.com/bedrock/titan/
- Open-source
- HuggingChat: https://huggingface.co/chat/
- LMSYS Chatbot Arena Leaderboard
1.3 Useful prompts
- LLMs can be used to generate code for data visualization
I have a dataset called "data" that includes the variable age. Please provide me with ggplot code to produce a histogram.
Please explain the code (add comments to the code).
I want to change the x-axis labels (angle 50 degrees).
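For orientation, an answer to the three prompts above might look roughly like the sketch below. It is only one plausible response, not what any particular LLM will return. The simulated data frame, the bin width and the colours are assumptions made here so that the example runs on its own; with your own data you would skip the simulation step.

```r
library(ggplot2)

# Hypothetical stand-in for the dataset "data" with a variable age
set.seed(1)
data <- data.frame(age = round(runif(200, min = 18, max = 90)))

p <- ggplot(data, aes(x = age)) +
  # Histogram of age; the bin width of 5 years is an arbitrary choice
  geom_histogram(binwidth = 5, fill = "grey70", colour = "white") +
  # Readable axis titles
  labs(x = "Age", y = "Count") +
  # Rotate the x-axis labels by 50 degrees and right-align them
  theme(axis.text.x = element_text(angle = 50, hjust = 1))

p
```

As with any LLM output, run the returned code and check that the plot shows what you asked for.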
How can I encode data dimensions in a graph? What possibilities do I have?
How can I best visualize a line graph in which two lines overlap perfectly, when that overlap is exactly what I want to show?
How can I translate the code below when I want to use the sf package in R instead of rgdal?
readOGR(dsn="www/data", layer="VG250_GEM", encoding = "ASCII", verbose = FALSE)
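A translation of the readOGR() call to sf might look like the sketch below. Because the folder www/data is not available here, the runnable part reads the nc shapefile that ships with sf instead. The commented-out line is an assumed equivalent of the original call and should be checked against your own files. In particular, passing the encoding via the driver option ENCODING is an assumption about the closest counterpart to readOGR's encoding argument.

```r
library(sf)

# Runnable stand-in: the folder of example shapefiles shipped with sf.
# As with readOGR(), dsn is the folder and layer is the file name
# without its extension.
shape_dir <- system.file("shape", package = "sf")
nc <- st_read(dsn = shape_dir, layer = "nc", quiet = TRUE)

# The result is an sf object (a data frame with a geometry column)
class(nc)

# Assumed equivalent of the original call (not run here):
# gem <- st_read(dsn = "www/data", layer = "VG250_GEM",
#                options = "ENCODING=ASCII", quiet = TRUE)
```

Note that st_read() returns an sf data frame rather than the Spatial* object produced by readOGR(), so downstream code written for sp classes may need adjusting as well.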
2 ChatGPT4o: Generating a plot for your own data
- Load the data into R (here we use the preloaded swiss dataset) and run the code below.
# Load necessary packages
# install.packages("synthpop")
library(datasets)
library(synthpop)
library(readr)
# Load the dataset
data <- swiss # load your own dataset here
# View the original swiss dataset
head(data)
# Generate synthetic data to anonymize the original dataset
# The syn function will generate synthetic data while preserving the structure and statistical properties
synth_data <- syn(data)
# View the synthetic data
head(synth_data$syn)
# Replace the original data with the synthetic data
data <- synth_data$syn
# View the modified dataset to ensure it has been replaced correctly
head(data)
# Save the new dataset locally
write_csv(data, "data_fake.csv")
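Before uploading the file, it can be worth checking that the synthetic data really keep the structure and rough statistical properties of the original. The sketch below is one simple way to do that. It assumes the built-in swiss data, and the fixed seed and the print.flag setting are choices made here for reproducibility and quieter output.

```r
library(synthpop)

# Synthesize the built-in swiss data with a fixed seed
synth <- syn(swiss, seed = 123, print.flag = FALSE)$syn

# Structure: same variable names and the same number of rows
stopifnot(identical(names(synth), names(swiss)))
stopifnot(nrow(synth) == nrow(swiss))

# Statistical properties: compare the column means side by side
round(data.frame(original = colMeans(swiss),
                 synthetic = colMeans(synth)), 1)
```

The means will not match exactly, since the synthetic values are drawn from a model of the data rather than copied from it.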
- Upload data_fake.csv and Figure 1 (this is just an example) into ChatGPT4o.
- Use the following prompt. If the plot is not based on that particular dataset, i.e., does not include the variable names of the dataset, you may also have to state which variables should be mapped and how.
I uploaded a dataset and a plot. Please provide me the R code that I need to produce that plot based on the data in one code chunk.
- As a follow-up you can refine the plot code through prompts (“Please omit the intercept from the plot”).
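To give a sense of what the returned code could look like, here is a minimal sketch. It assumes that the uploaded figure is a coefficient plot of a linear model, which the follow-up prompt about the intercept suggests but the text does not state. The model formula, the use of the original swiss data instead of the uploaded file, and all object names are illustrative choices made here.

```r
library(ggplot2)

# Assumption: the figure shows coefficients of a linear model of Fertility
fit <- lm(Fertility ~ ., data = swiss)

# Collect estimates and 95% confidence intervals in one data frame
ci <- confint(fit)
coefs <- data.frame(
  term     = names(coef(fit)),
  estimate = unname(coef(fit)),
  lower    = ci[, 1],
  upper    = ci[, 2]
)

# Follow-up refinement: omit the intercept from the plot
coefs <- coefs[coefs$term != "(Intercept)", ]

p <- ggplot(coefs, aes(x = estimate, y = term)) +
  # Reference line at zero
  geom_vline(xintercept = 0, linetype = "dashed") +
  # Point estimate with its confidence interval
  geom_pointrange(aes(xmin = lower, xmax = upper)) +
  labs(x = "Coefficient estimate", y = NULL)

p
```

Each refinement prompt typically changes only a line or two, as with the intercept filter here, which makes it easy to verify what the LLM altered.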
References
Huang, Lei, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, et al. 2023. “A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions,” November. https://arxiv.org/abs/2311.05232.
Metz, Cade. 2023. “Chatbots May ‘Hallucinate’ More Often Than Many Realize.” The New York Times, November.
Zhang, Yue, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, et al. 2023. “Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models,” September. https://arxiv.org/abs/2309.01219.