Chapter 7 Basic plot with R
7.1 data steps for visualization basic with ggplot
data import from research tutor, now use ‘international COVID19 survey: post new normal life’ the leader of this project is ‘Laura, Univ of Michigan’. And this tutorial data is only for 1st wave survey of Korea (Jun).
youtube | url |
---|---|
yonsei | https://youtu.be/85HuqHX3JlY |
::read_xlsx('data/covid_korea.XLSX', sheet =1)
covid<-readxl::read_xlsx('data/dic_kor.xlsx')
dic<-readxl::read_xlsx('data/dic_eng.xlsx')
dic.eng<-readxl dic.eng %>%
dic.eng <- rename('variables' = 변수명)
overview of data (just 10 rows)
head(covid) %>% datatable()
head(dic) %>% datatable()
datatable(dic.eng)
7.1.1 create basic variables
create basic variables for tutor, and list of variables.
new.variables | origianal variables | methods |
---|---|---|
age | SQ2 | 2020-SQ2 |
gender | SQ1 | 1 ~ men, 2~ women |
Q6 | Q6 | How long do you think you will be able to maintain your current level of social distancing? |
Q7 | Q7 | Q7. How difficult is it for you to maintain your current level of social distancing? |
Q8 | Q8 | Q8. When do you think your community will be “safe” from the COVID-19 pandemic? |
%>%
covid <-covid mutate(age = SQ2, # already calculate
gender=ifelse(SQ1 == 1, 'men', 'women'))
7.2 ggplot2
7.2.1 bar and histograph
bar and histographs display the distribution according to user defined data points (x-axis).
%>%
covid ggplot(aes(x=Q6)) +
geom_histogram(col ="red",
fill="grey") +
labs(title ="histogram for keep social distancing")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In the same context, We can make bar plot just add geom_bar()
, as below. Almost responses are that ‘6 months or more’.
%>%
covid ggplot(aes(x=Q6)) +
geom_bar()
10 means ‘6 month or more’ so, we can change the x axis label to easy for following. This step is very import to communicate with collaborators. please keep in kind to communicate with your friend using labeling.
%>%
covid ggplot(aes(x=Q6)) +
geom_bar() +
scale_x_continuous(breaks = c(3, 5, 8, 10),
labels = c(
#'1' = '1-2days',
#'2' = 'less than 1 weeks',
'3' = '1-2 week',
#'4' = '3-4 weeks',
'5' = '1 month',
#'6' = '2 months',
#'7' = '3 months',
'8' = '4 months',
#'9' = '5 months',
'10' = '6 months or more'
))
We usually used
group_by, count
code for data summary, so now we plot summarized data using stat='identity'
in ggplot. This process are more conbinience because I usually want to overview my summarized data.
%>%
covid group_by(Q6)%>%
count() %>%
ggplot(aes(x=Q6, y =n)) +
geom_bar(stat='identity')
7.2.2 Boxplot
What are characteristics related to levels of keeping social distancing. To find related characteristics, we can draw boxplot, too. First, let’s find variables which have large difference of expectance of keeping social distance. Now, I assumed that 6 months and more is good indicator of keeping social distance. So, if some expected they can keep social distancing more than 6 months, they will categorized into ‘keeping social distancing’, if else ‘non-keeping social dstancing’ (keeping vs non-keeping). In code book, we can match that 1 means 1-2 days, and 10 means 6 months or more. so we can change it per months while 6 months or more into just 6 months. Although it is measurement error when we used below mutate, but now we use new variable ’keep_sdist, social distancing" just for data overview. I made keep_sdist (unit is months scale), and keep_sdistgp for grouping data.
covid %>%
covid2 <- mutate(keep_sdist=case_when(
==1 ~ 1.5/30,
Q6 ==2 ~ 7/30,
Q6 ==3 ~ (7+3.5)/30,
Q6 ==4 ~ (21+3.5)/30,
Q6 ==5 ~ 1,
Q6 ==6 ~ 2,
Q6 ==7 ~ 3,
Q6 ==8 ~ 4,
Q6 ==9 ~ 5,
Q6 ==10 ~ 6,
Q6 TRUE ~ NA_real_
)) class(covid2$keep_sdist)
## [1] "numeric"
covid2 %>%
covid2 <- mutate(keep_sdist_gp = ifelse(Q6 %in% c(10), 1, 0)) %>%
mutate(keep_sdist_ft = ifelse(Q6 %in% c(10), 'Yes', 'No'))
Here are box plot with gender stratification about the age (median, inter quartile) and
covid2 %>%
figbox<- ggplot(aes(x=keep_sdist_ft, y = age)) +
geom_boxplot()+
facet_wrap(gender~.) +
theme_minimal()+
xlab('keeping social distancing more than 6 months')
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
ggplotly(figbox)
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
7.2.3 scatter plot
Scatter plot display association two variables. Now I want display association between age and “How long do you think you will be able to maintain your current level of social distancing?”. Add geom_point()
make scatter plot in ggplot2.
%>%
covid2 ggplot(aes(x=age, y = Q6)) +
geom_point()
The scatter plot is below.
%>%
covid2 ggplot(aes(x=age, y = keep_sdist)) +
geom_point()
However, it is not easy to overview the relationships between age and perceive response of keeping social distancing. Now we added regression line using
geom_smooth()
%>%
covid2 ggplot(aes(x=age, y = keep_sdist)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
That graphs show that older age responded that they could keep more months of social distancing.
7.2.3.1 add regression line
How about added linear regression lines? or user defined model?
For linear model, we can use method ='lm'
## linear regression line
%>%
covid2 ggplot(aes(x=age, y = keep_sdist)) +
geom_point() +
geom_smooth(method ='lm', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
for user define model, we can plot predicted
value or fitted.values
value as manually.
## regression model
lm(data=covid2,
lm_fit<-~ age)
keep_sdist
%>%
covid2 ggplot(aes(x=age, y = keep_sdist)) +
geom_point()+
geom_line(aes(x=age, y = predict(lm_fit)), color ='red')
7.2.4 scatter plot of summary stat
Now I want measure the percent of '6 month or more' response according to age
%>%
covid group_by(age) %>%
count(Q6==10) # What dose `TRUE` means
## # A tibble: 99 x 3
## # Groups: age [50]
## age `Q6 == 10` n
## <dbl> <lgl> <int>
## 1 20 FALSE 6
## 2 20 TRUE 4
## 3 21 FALSE 5
## 4 21 TRUE 2
## 5 22 FALSE 2
## 6 22 TRUE 5
## 7 23 FALSE 13
## 8 23 TRUE 12
## 9 24 FALSE 12
## 10 24 TRUE 11
## # … with 89 more rows
TRUE
in column of Q6==10
means that participant responded as ‘6 month or more’, and the n
means number of count. I want estimate proportion prob
as n/sum(n)
%>%
covid group_by(age) %>%
count(Q6==10) %>%
mutate(prob = n/sum(n)*100)
## # A tibble: 99 x 4
## # Groups: age [50]
## age `Q6 == 10` n prob
## <dbl> <lgl> <int> <dbl>
## 1 20 FALSE 6 60
## 2 20 TRUE 4 40
## 3 21 FALSE 5 71.4
## 4 21 TRUE 2 28.6
## 5 22 FALSE 2 28.6
## 6 22 TRUE 5 71.4
## 7 23 FALSE 13 52
## 8 23 TRUE 12 48
## 9 24 FALSE 12 52.2
## 10 24 TRUE 11 47.8
## # … with 89 more rows
Can you find what I exact want in above output? for bar_plot?. Yes, the row of ’TRUE in Q6 == 10
column is what we want to know. So, I remain only TRUE
raw.
%>%
covid group_by(age) %>%
count(Q6==10) %>%
mutate(prob = n/sum(n)*100) %>%
filter(`Q6 == 10` == TRUE) %>%
ggplot(aes(x = age, y = prob)) +
geom_point()
Again, we can plot regression line.
%>%
covid group_by(age) %>%
count(Q6==10) %>%
mutate(prob = n/sum(n)*100) %>%
filter(`Q6 == 10` == TRUE) %>%
ggplot(aes(x = age, y = prob)) +
geom_point() +
geom_smooth(se=FALSE, color = 'grey')+
geom_smooth(method='lm', color ='red', se=FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
How about gender stratified graphs? We can use
group_by(gender, age)
and facet_wrap(gender ~.)
%>%
covid group_by(gender, age) %>%
count(Q6==10) %>%
mutate(prob = n/sum(n)*100) %>%
filter(`Q6 == 10` == TRUE) %>%
ggplot(aes(x = age, y = prob)) +
geom_point() +
geom_smooth(se=FALSE, color = 'grey')+
geom_smooth(method='lm', color ='red', se=FALSE)+
facet_wrap(gender ~.)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
When I review above data, I want compare the gender difference according to age
%>%
covid group_by(gender, age) %>%
count(Q6==10) %>%
mutate(prob = n/sum(n)*100) %>%
filter(`Q6 == 10` == TRUE) %>%
ggplot(aes(x = age, y = prob, group = gender)) +
geom_point(aes(color = gender, size = n), alpha = 0.5) +
geom_smooth(method ='loess',
formula = y ~ x,
se=FALSE, aes(color = gender))+
theme_minimal()+
xlab('Age') +ylab('Keeping Social Distance')
There were gender difference, women show different two groups of around 30 years old and 50 years old.
%>%
covid group_by(gender, age) %>%
count(Q6==10) %>%
mutate(prob = n/sum(n)*100) %>%
filter(`Q6 == 10` == TRUE) %>%
ggplot(aes(x = age, y = prob, group = gender)) +
geom_point(aes(color = gender, size = n), alpha = 0.5) +
geom_smooth(method ='loess',
formula = y ~ x,
se=FALSE, aes(color = gender))+
theme_minimal()+
xlab('Age') +ylab('Keeping Social Distance')+
facet_wrap(gender~.)+
stat_density2d()
7.2.5 line chart
The chis-quare test is sometimes displayed via line chart. Particularly, the proportion of True and it trend can be highlighted by line chart. let’s compare the line chart and bar chart.
## <ScaleContinuousPosition>
## Range:
## Limits: 0 -- 1
If there are multiple plot, the line chart can highlight the trends of value change.
covid2 %>%
emotion <- select(gender, keep_sdist_ft, contains('Q18'),
) covid2 %>%
lifepattern <- select(gender, keep_sdist_ft, contains('Q19'),
)
head(emotion)
## # A tibble: 6 x 9
## gender keep_sdist_ft Q18_1 Q18_2 Q18_3 Q18_4 Q18_5 Q18_6 Q18_7
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 women Yes 3 3 3 2 1 2 2
## 2 women No 2 2 2 2 2 2 2
## 3 men Yes 2 2 2 2 2 2 2
## 4 women No 2 2 2 3 1 1 2
## 5 women Yes 3 2 3 2 2 1 2
## 6 men Yes 2 2 2 2 2 2 2
head(lifepattern)
## # A tibble: 6 x 20
## gender keep_sdist_ft Q19_1 Q19_2 Q19_3 Q19_4 Q19_5 Q19_6 Q19_7 Q19_8 Q19_9
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 women Yes 3 2 1 2 1 3 2 2 3
## 2 women No 2 2 1 2 2 2 2 2 2
## 3 men Yes 2 2 2 2 2 2 2 2 2
## 4 women No 2 2 2 1 2 3 2 2 2
## 5 women Yes 1 2 2 2 2 3 1 2 3
## 6 men Yes 2 2 2 1 2 2 2 2 2
## # … with 9 more variables: Q19_10 <dbl>, Q19_11 <dbl>, Q19_12 <dbl>,
## # Q19_13 <dbl>, Q19_14 <dbl>, Q19_15 <dbl>, Q19_16 <dbl>, Q19_17 <dbl>,
## # Q19_18 <dbl>
colnames(emotion) <-c(
'Gender',
'keep social distancing',
'lonely',
'depressive',
'anxious',
'angry',
'Happy',
'Relaxed',
'Focused')
colnames(lifepattern)<-c(
'Gender',
'keep social distancing',
'Exercising',
'Engaging in creative activities',
'Sleeping',
'Avoiding the news',
'Helping others',
'Reading\nlistening to the news',
'Using alcohol or other drugs',
'Seeking emotional \nsupport from others',
'Spending time on social media',
'Playing games',
'Sharing funny memes\n and videos',
'Eating due to stress or boredom',
'Connecting with family\nfriends',
'Watching movies\nTV',
'Praying or meditating',
'Feeling grateful',
'Journaling',
'Working\ndoing schoolwork'
)
transfome the data from wide to long
emotion %>%
em.long <- gather(-`keep social distancing`,-Gender, key ='emotion', value ='trend')
## wide to long transform will be introduced in common data model part
%>%
em.long filter(Gender =='men') %>%
group_by(emotion, trend) %>%
count(`keep social distancing`) %>%
mutate(prob = n/sum(n)) %>%
filter(`keep social distancing` == 'Yes') %>%
ggplot(aes(x=trend, y = prob)) +
geom_line()+
facet_wrap(emotion~., scales = 'free') +
theme_minimal() +
ggtitle('Men, Expecting of keeping social distance and emotional change')
7.3 Diagram
동영상 | link |
---|---|
jinhaslab | https://youtu.be/VU6z2sc9WEY |
순서도를 그리는 것을 간략 정리하겠습니다. A -> B 로 순서도를 그립니다.
library(DiagrammeR)
::grViz(
DiagrammeR"digraph rmarkdown {
A -> B
}", height = 400
)
여기에 박스 모양등 여러가지 옵션을 정할 수 있습니다.
::grViz("
DiagrammeRdigraph boxes_and_circles {
# 노드를 어떻게 할지 설정
node [shape = box]
A; B; C
node [shape = circle]
1; 2; 3;4
# 연결을 어떻게 할지 설정
A->1 B->2 C ->3
A->4
}
")
순서와 위치, 그리고 색
::grViz("
DiagrammeRdigraph boxes_and_circles {
# 박스 모양 할것 정하기
node [shape = box]
A; B; C
# 원 모양 할 것 정하기
node [shape = circle]
1; 2; 4
#색
abc[style = filled, fillcolor = Yellow]
# 연결 부분
A->1 C->2 B ->C
A->4 C-> 4
abc -> 3
# 위치
subgraph{
rank = same; A;B;C
}
}
")
위아래 TB (top buttom), 왼쪽 오른쪽 (LR)에 대해 지정해 줄 수 있습니다.
::grViz("digraph {
DiagrammeR
graph [layout = dot, rankdir = LR] # 왼쪽 오른쪽 방향
# 일반거 스타이 미리 지정
node [shape = rectangle, style = filled, fillcolor = grey]
dat1 [label = 'A 사업장 검진', shape = folder, fillcolor = Yellow]
dat2 [label = 'B 사업장 검진', shape = folder, fillcolor = Yellow]
dat3 [label = 'C 사업장 검진', shape = folder, fillcolor = Yellow]
cdm1 [label = '표준화1', shape = circle, fillcolor = Beige]
cdm2 [label = '표준화2', shape = circle, fillcolor = Beige]
cdm3 [label = '표준화3', shape = circle, fillcolor = Beige]
process [label = '공통 변수/값']
statistical [label = '분산형 계산']
results [label= '결과']
# edge definitions with the node IDs
dat1 -> cdm1 dat2 -> cdm2 dat3 -> cdm3
{cdm1 cdm2 cdm3} -> process -> statistical -> results
}")
기존의 데이터의 값을 불러와서 사용할 수 있습니다. 보건학 표 만들기에서 가장 많이 사용될 만한 코드입니다. 예를 들어 1000명의 inclusion에서 100명씩 exclusion되어 800명이 남는 상황을 표현해 보겠습니다.
# Define some sample data
library(tidyverse)
tibble(inc1=1000,
data <-exc1=100) %>%
mutate(inc2 = inc1 - exc1)
::grViz("
DiagrammeRdigraph graph2 {
node [shape=box, fontsize = 12];
inc1;inc2;exc1
node [shape = point, width = 0, height = 0]
''
graph [layout = dot]
inc1 [label = '@@1']
inc2 [label = '@@2']
exc1 [label = '@@3']
inc1 -> '' [arrowhead = none]
'' -> exc1
'' -> inc2
subgraph {
rank = same; ''; exc1;
}
}
[1]: paste0('Total population (n = ', data$inc1, ')')
[2]: paste0('Economic activity (n = ', data$inc2, ')')
[3]: paste0('No economic activity (n = ', data$exc1, ')')
")