Further topics, summary & outlook

1 Using LLMs/foundation models to build predictive models

1.1 Attention: Hallucination

  • Attention: Always cross-validate the information given by an LLM
    • Why? Hallucination (see the characterizations collected on Wikipedia):
      • “a tendency to invent facts in moments of uncertainty” (OpenAI, May 2023)
      • “a model’s logical mistakes” (OpenAI, May 2023)
      • fabricating information entirely, but behaving as if spouting facts (CNBC, May 2023)
      • “making up information” (The Verge, February 2023)
  • A very good overview can be found on Wikipedia
  • Discussions in Zhang et al. (2023), Huang et al. (2023) and Metz (2023)

1.2 Available LLMs

1.3 Useful prompts

  • LLMs can be used to discuss the theory underlying ML and to generate code for ML

  • Data preparation & preprocessing

How do I need to prepare and preprocess the data if I want to build a Naive Bayes classifier?


What is particular about data preparation for Naive Bayes that is not necessary for other machine learning models?


How should I ideally preprocess the data that I feed into a Naive Bayes classifier?



I want to build a Naive Bayes classifier. Please outline the preprocessing steps that you would recommend and provide tidymodels recipe code that includes those steps.
Please write the code into a single recipe.
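
A possible answer is sketched below. This is a minimal illustration, not the canonical preprocessing for Naive Bayes: the data set train_data, the outcome class, and the particular steps are assumptions, and the steps an LLM actually recommends may differ.

library(tidymodels)

# Minimal sketch of a Naive Bayes preprocessing recipe
# (hypothetical training data `train_data` with outcome `class`):
nb_recipe <- recipe(class ~ ., data = train_data) %>%
  step_impute_mode(all_nominal_predictors()) %>%        # fill missing categories
  step_impute_median(all_numeric_predictors()) %>%      # fill missing numerics
  step_zv(all_predictors()) %>%                         # drop zero-variance predictors
  step_corr(all_numeric_predictors(), threshold = 0.9)  # drop highly correlated predictors
                                                        # (conditional independence assumption)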
  • Understanding & comparing models
What is the difference between a logistic regression model and Naive Bayes in the machine learning context?


Which machine learning models that we can use for classification have a problem with class imbalance?
  • Understanding code
Please explain the hyperparameters in this model:

library(tidymodels)  # loads parsnip (boost_tree), tune, etc.

xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(), min_n = tune(),
  loss_reduction = tune(),                     ## first three: model complexity
  sample_size = tune(), mtry = tune(),         ## randomness
  learn_rate = tune()                          ## step size
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

Followed by:

Please further explain the learn_rate.
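
For context, a hypothetical follow-up shows how such a spec is typically put to use. The recipe xgb_rec and the resamples folds below are assumptions added for illustration, not part of the original prompt:

# Hypothetical tuning setup around the spec above:
xgb_wf <- workflow() %>%
  add_recipe(xgb_rec) %>%   # assumed preprocessing recipe
  add_model(xgb_spec)

xgb_res <- tune_grid(
  xgb_wf,
  resamples = folds,        # assumed CV folds, e.g. vfold_cv(train_data)
  grid = 20                 # 20 candidate hyperparameter combinations
)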
  • Data management: copy the variable descriptions from the ESS codebook (below) and ask the LLM to create better variable names.
Please provide dplyr code to rename the following variables and give them better names (lowercase). Below is the codebook:

pdwrk - Doing last 7 days: paid work
edctn - Doing last 7 days: education
uempla - Doing last 7 days: unemployed, actively looking for job
uempli - Doing last 7 days: unemployed, not actively looking for job
dsbld - Doing last 7 days: permanently sick or disabled
rtrd - Doing last 7 days: retired
cmsrv - Doing last 7 days: community or military service
hswrk - Doing last 7 days: housework, looking after children, others
dngoth - Doing last 7 days: other
dngref - Doing last 7 days: refusal
dngdk - Doing last 7 days: don't know
dngna - Doing last 7 days: no answer
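
A possible answer is sketched below; the lowercase names are illustrative suggestions (an LLM may propose different ones), and the data frame ess is an assumption:

library(dplyr)

# Illustrative renaming based on the codebook above
# (assumes the ESS data live in a data frame `ess`):
ess <- ess %>%
  rename(
    paid_work          = pdwrk,
    education          = edctn,
    unemployed_active  = uempla,
    unemployed_passive = uempli,
    disabled           = dsbld,
    retired            = rtrd,
    community_service  = cmsrv,
    housework          = hswrk,
    doing_other        = dngoth,
    doing_refused      = dngref,
    doing_dontknow     = dngdk,
    doing_noanswer     = dngna
  )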

References

Huang, Lei, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, et al. 2023. “A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions,” November. https://arxiv.org/abs/2311.05232.
Metz, Cade. 2023. “Chatbots May ‘Hallucinate’ More Often Than Many Realize.” The New York Times, November.
Zhang, Yue, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, et al. 2023. “Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models,” September. https://arxiv.org/abs/2309.01219.