8 Descriptive stats with R and ggplot2

In this chapter you will practice exploratory data analysis and get to know the ggplot2 package. The ggplot2 package is a graphing package that allows you to create all kinds of fancy-looking plots. It is important that you replicate the code and graphs, and experiment with changing parameters. We also use the dplyr package, which allows us to easily handle and arrange data. Install both packages if you have not already done so, and then load them with library().

library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.3.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.3.2
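
One way to handle the "install only if needed" step is to check first which packages are missing. The following is a minimal sketch, not part of the original chapter code; the helper name missing_packages is made up for illustration, and the install line is left commented out so that nothing is downloaded by accident.

```r
# Return the packages from `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("dplyr", "ggplot2"))
to_install   # character(0) if both packages are already installed

# Uncomment to install whatever is missing (needs an internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```

After that, library(dplyr) and library(ggplot2) load the packages as shown above.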

We will use the Titanic data set “titanic_333.csv”, which contains data on the passengers on the Titanic. I removed the passengers’ home towns and some other non-pertinent info. The full data set can be found at https://www.kaggle.com/datasets/vinicius150987/titanic3?resource=download, if you are interested.

8.1 Exploring the data

First, we read in the data set. Note that this only works if you download the file into your working directory first. The commands ‘str’ and ‘glimpse’ give us a first look at the data.

titanic <- read.csv("titanic_333.csv")
str(titanic)
#> 'data.frame':    1309 obs. of  9 variables:
#>  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ pclass  : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ survived: int  1 1 0 0 0 1 1 0 1 0 ...
#>  $ name    : chr  "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...
#>  $ sex     : chr  "female" "male" "female" "male" ...
#>  $ age     : num  29 0.917 2 30 25 ...
#>  $ sibsp   : int  0 1 1 1 1 0 1 0 2 0 ...
#>  $ parch   : int  0 2 2 2 2 0 0 0 0 0 ...
#>  $ fare    : num  211 152 152 152 152 ...
glimpse(titanic)
#> Rows: 1,309
#> Columns: 9
#> $ X        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13…
#> $ pclass   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ survived <int> 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1,…
#> $ name     <chr> "Allen, Miss. Elisabeth Walton", "Allison…
#> $ sex      <chr> "female", "male", "female", "male", "fema…
#> $ age      <dbl> 29.0000, 0.9167, 2.0000, 30.0000, 25.0000…
#> $ sibsp    <int> 0, 1, 1, 1, 1, 0, 1, 0, 2, 0, 1, 1, 0, 0,…
#> $ parch    <int> 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ fare     <dbl> 211.3375, 151.5500, 151.5500, 151.5500, 1…

‘summary’ gives us the five-number summary (minimum, quartiles, and maximum) plus the mean for each of the numeric variables. We see that some of the observations are missing entries for age and/or fare: 263 passengers have no age and 1 has no fare. We need to keep that in mind.

summary(titanic)
#>        X            pclass         survived    
#>  Min.   :   1   Min.   :1.000   Min.   :0.000  
#>  1st Qu.: 328   1st Qu.:2.000   1st Qu.:0.000  
#>  Median : 655   Median :3.000   Median :0.000  
#>  Mean   : 655   Mean   :2.295   Mean   :0.382  
#>  3rd Qu.: 982   3rd Qu.:3.000   3rd Qu.:1.000  
#>  Max.   :1309   Max.   :3.000   Max.   :1.000  
#>                                                
#>      name               sex                 age         
#>  Length:1309        Length:1309        Min.   : 0.1667  
#>  Class :character   Class :character   1st Qu.:21.0000  
#>  Mode  :character   Mode  :character   Median :28.0000  
#>                                        Mean   :29.8811  
#>                                        3rd Qu.:39.0000  
#>                                        Max.   :80.0000  
#>                                        NA's   :263      
#>      sibsp            parch            fare        
#>  Min.   :0.0000   Min.   :0.000   Min.   :  0.000  
#>  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:  7.896  
#>  Median :0.0000   Median :0.000   Median : 14.454  
#>  Mean   :0.4989   Mean   :0.385   Mean   : 33.295  
#>  3rd Qu.:1.0000   3rd Qu.:0.000   3rd Qu.: 31.275  
#>  Max.   :8.0000   Max.   :9.000   Max.   :512.329  
#>                                   NA's   :1
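
The NA's rows in the summary are easy to overlook. As a small sketch, the number of missing entries per column can also be counted directly with is.na and colSums. The data frame toy below is made up for illustration; with the real data you would pass titanic instead.

```r
# Made-up example data (not the Titanic file), with a few missing entries
toy <- data.frame(
  age  = c(29, NA, 2, NA),
  fare = c(211.3, 151.6, NA, 7.9),
  sex  = c("female", "male", "female", "male")
)

# is.na() marks every missing cell; colSums() adds the marks up per column
missing_per_column <- colSums(is.na(toy))
missing_per_column   # counts: age 2, fare 1, sex 0
```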

This looks pretty good, except survived should be logical or a yes/no character, and the passenger class should be a factor. Let’s change that using mutate and the pipe operator %>%.

titanic <- titanic %>%
  mutate(survived = ifelse(survived==0, "No","Yes"))%>%
  mutate(pclass = as.factor(pclass))
str(titanic)
#> 'data.frame':    1309 obs. of  9 variables:
#>  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ pclass  : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ survived: chr  "Yes" "Yes" "No" "No" ...
#>  $ name    : chr  "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...
#>  $ sex     : chr  "female" "male" "female" "male" ...
#>  $ age     : num  29 0.917 2 30 25 ...
#>  $ sibsp   : int  0 1 1 1 1 0 1 0 2 0 ...
#>  $ parch   : int  0 2 2 2 2 0 0 0 0 0 ...
#>  $ fare    : num  211 152 152 152 152 ...
summary(titanic)
#>        X        pclass    survived        
#>  Min.   :   1   1:323   Length:1309       
#>  1st Qu.: 328   2:277   Class :character  
#>  Median : 655   3:709   Mode  :character  
#>  Mean   : 655                             
#>  3rd Qu.: 982                             
#>  Max.   :1309                             
#>                                           
#>      name               sex                 age         
#>  Length:1309        Length:1309        Min.   : 0.1667  
#>  Class :character   Class :character   1st Qu.:21.0000  
#>  Mode  :character   Mode  :character   Median :28.0000  
#>                                        Mean   :29.8811  
#>                                        3rd Qu.:39.0000  
#>                                        Max.   :80.0000  
#>                                        NA's   :263      
#>      sibsp            parch            fare        
#>  Min.   :0.0000   Min.   :0.000   Min.   :  0.000  
#>  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:  7.896  
#>  Median :0.0000   Median :0.000   Median : 14.454  
#>  Mean   :0.4989   Mean   :0.385   Mean   : 33.295  
#>  3rd Qu.:1.0000   3rd Qu.:0.000   3rd Qu.: 31.275  
#>  Max.   :8.0000   Max.   :9.000   Max.   :512.329  
#>                                   NA's   :1
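
A character column works, but a factor with explicit levels also fixes the order in which the categories appear in tables and plots. This is an alternative sketch, not what the chapter's code does; the vector survived_raw is a made-up stand-in for the 0/1 column.

```r
survived_raw <- c(1, 1, 0, 0, 0, 1)   # made-up 0/1 codes

# Map 0 -> "No" and 1 -> "Yes"; the levels argument pins the order No, Yes
survived_fct <- factor(survived_raw, levels = c(0, 1), labels = c("No", "Yes"))

levels(survived_fct)   # "No"  "Yes"
table(survived_fct)    # 3 No, 3 Yes
```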

8.2 Making a bar graph with ggplot2

You already saw how to use base R to create plots, so we will use ggplot2 this time. There are basically two approaches to getting started with ggplot2. You could use one of the many good guides available online, for example https://ggplot2-book.org. Or you can start with a code chunk and modify it until you are satisfied. In either case, you will most likely reach a point where you need help. There is a very active R community online, for example at https://community.rstudio.com/t/welcome-to-the-rstudio-community/8 or https://stackoverflow.com/search?q= . Post your question to one of the forums, or google the answer. It is perfectly legit to learn from other people instead of re-inventing the wheel. Be aware, though, that answers are often needlessly complicated (people showing off). And remember, if you are submitting work for credit in this class, cite your source.

Every ggplot has three components: the data to use, a set of aesthetic mappings between variable(s) in the data and (optionally) some visual properties, and the geom_ functions describing how to plot the observations. Here our data is titanic, we want to use the variable survived, and we want a bar graph. For geom_bar, the height of a bar is proportional to the number of cases in the corresponding group. Note that you can rotate the graph by mapping the variable to y instead of x, or just by adding + coord_flip().

ggplot(titanic,aes(x=survived)) +
  geom_bar()

ggplot(titanic,aes(y=survived)) +
  geom_bar()

ggplot(titanic,aes(x=survived)) +
  geom_bar() +
  coord_flip()

We can add some color, titles, and labels, and vary the opacity of the fill, the width of the lines, and the width of the bars. A list of R color names can be found here: http://sape.inf.usi.ch/quick-reference/ggplot2/colour

ggplot(titanic,aes(x=survived)) +
  geom_bar( color="blue", linewidth = 2, fill="lightgoldenrod1", alpha=0.5, width=1)+
  labs(title = "number of people surviving titanic",x="survived?", y="total count")

Here we tweak the code to show the proportion of people surviving (a value between 0 and 1) instead of the count.

ggplot(titanic,aes(x=survived,y=after_stat(prop), group=1)) +
  geom_bar( color="blue", linewidth = 2, fill="lightgoldenrod1", alpha=0.5, width=1)+
  labs(title = "proportion of people surviving titanic",x="survived?", y="proportion")
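
The bar heights that after_stat(prop) produces can be checked by hand. The sketch below rebuilds the survived column from the counts in the by-class table shown in section 8.3 (809 "No" and 500 "Yes" in total) rather than reading the file, so it runs on its own.

```r
# Rebuild the survived column from the totals reported in this chapter
survived <- rep(c("No", "Yes"), times = c(809, 500))

counts <- table(survived)      # what geom_bar() plots by default
props  <- prop.table(counts)   # each count divided by the overall total
round(props, 3)                # No 0.618, Yes 0.382
```

The value for "Yes" agrees with the mean of the 0/1 survived column (0.382) in the first summary output.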

We can also style each bar separately by passing one value per bar:

ggplot(titanic,aes(x=survived, y=after_stat(prop), group=1)) +
  geom_bar( color=c("darkorchid", "black"), linewidth=c(1,2),
            fill=c("tomato1","lightskyblue"), alpha=c(0.7,0.1), width=c(0.3,0.1))+
  labs(title = "percent of people surviving titanic")

8.3 Grouping data by passenger class

First, let’s examine the survival rate by passenger class. We use the pipe operator to group and summarize the data by class. For this code chunk I have inserted comments with #. I want you to do the same for the rest of the code chunks in this chapter.

set_one <- titanic %>%  # start from the data frame titanic
  group_by(pclass)%>%   # group the data by passenger class
  count(survived) %>%   # count the passengers n for each value of survived within each class
  mutate(percent = n/sum(n)*100) # add percent: the share of each row within its passenger class
set_one   #print out set_one
#> # A tibble: 6 × 4
#> # Groups:   pclass [3]
#>   pclass survived     n percent
#>   <fct>  <chr>    <int>   <dbl>
#> 1 1      No         123    38.1
#> 2 1      Yes        200    61.9
#> 3 2      No         158    57.0
#> 4 2      Yes        119    43.0
#> 5 3      No         528    74.5
#> 6 3      Yes        181    25.5
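
Because the data are still grouped by pclass when mutate runs, sum(n) is the class total, so the percentages add up to 100 within each class, not over the whole ship. The same arithmetic can be sketched in base R from the counts printed above; the matrix layout is just one convenient way to hold them.

```r
# Counts copied from the set_one output: rows are classes, columns are survived
counts <- matrix(
  c(123, 200,
    158, 119,
    528, 181),
  nrow = 3, byrow = TRUE,
  dimnames = list(pclass = c("1", "2", "3"), survived = c("No", "Yes"))
)

# margin = 1 divides each row by its own total (the class size)
percent <- round(prop.table(counts, margin = 1) * 100, 1)
percent
rowSums(counts)   # class sizes: 323, 277, 709
```

The rounded values match the percent column of set_one, for example 200/323 = 61.9% for first-class survivors.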

Below are point graphs (geom_point) and line graphs (geom_line). You should see that neither is appropriate for this situation. Also note that we can choose the color of a point based on its value. ggplot allows us to easily arrange plots on a grid if we name them (I used p1, p2, p3, p4) and use the package gridExtra with the function grid.arrange.

p1 <- ggplot(set_one, aes(x=pclass, y=n, color=survived))+geom_point()+
  labs(title="Point plot, number of people surviving", subtitle = "by passenger class", x="passenger class",y="number of passengers")

p2 <- ggplot(set_one, aes(x=pclass, y=percent, color=survived))+geom_point()+
  labs(title="Point plot, percent of people surviving", subtitle = "by passenger class", x="passenger class",y="percent surviving")

p3 <- ggplot(set_one, aes(x=as.numeric(pclass), y=n, color=survived))+geom_line()+
  labs(title="Line plot, number of people surviving", subtitle = "by passenger class", x="passenger class",y="number of passengers")

p4 <- ggplot(set_one, aes(x=as.numeric(pclass), y=percent, color=survived))+geom_line()+
  labs(title="Line plot, percent of people surviving", subtitle = "by passenger class", x="passenger class",y="percent surviving")

library(gridExtra)
#> 
#> Attaching package: 'gridExtra'
#> The following object is masked from 'package:dplyr':
#> 
#>     combine
grid.arrange(p1, p2, p3, p4, nrow=2)

We will show what is going on using bar graphs. It is a matter of personal preference whether you want the bars stacked or side-by-side, and the shading by passenger class or by survival.

p1 <- ggplot(set_one, aes(x=pclass, y=percent, fill=survived)) + 
  geom_bar(position="dodge", stat="identity") +
  labs(title = "percent of people surviving",subtitle = "by passenger class")

p2 <- ggplot(set_one, aes(x=survived, y=n, fill=pclass)) + 
  geom_bar(position="dodge", stat="identity")+
  labs(title = "number of people surviving",subtitle = "by passenger class")

p3 <-ggplot(set_one, aes(x=pclass, y=percent, fill=survived)) + 
  geom_bar(position="stack", stat="identity") +
  labs(title = "percent of people surviving ",subtitle = "by passenger class")+
  scale_fill_manual(values = c("black","pink"))

p4 <-ggplot(set_one, aes(x=survived, y=n, fill=pclass)) + 
  geom_bar(position="stack", stat="identity")+
  labs(title = "number of people surviving ",subtitle = "by passenger class")+
  scale_fill_manual(values = c("black","grey","white"))
grid.arrange(p1, p2, p3, p4, nrow=2)

The labels themselves are up to you. Here we read the data in again, recode survived as “alive”/“dead”, and leave pclass as an integer.

titanic <- read.csv("titanic_333.csv")
titanic <- titanic %>%
  mutate(survived = ifelse(survived==1, "alive","dead"))
set_one <- titanic %>%  # start from the data frame titanic
  group_by(pclass)%>%   # group the data by passenger class
  count(survived) %>%   # count the passengers n for each value of survived within each class
  mutate(percent = n/sum(n)*100) # add percent: the share of each row within its passenger class

set_one   #print out set_one
#> # A tibble: 6 × 4
#> # Groups:   pclass [3]
#>   pclass survived     n percent
#>    <int> <chr>    <int>   <dbl>
#> 1      1 alive      200    61.9
#> 2      1 dead       123    38.1
#> 3      2 alive      119    43.0
#> 4      2 dead       158    57.0
#> 5      3 alive      181    25.5
#> 6      3 dead       528    74.5

ggplot(set_one, aes(x=pclass, y=percent, fill=survived)) + 
  geom_bar(position="dodge", stat="identity") +
  labs(title = "percent of people surviving",subtitle = "by passenger class")

Assignment 1

  1. What is the best graph for this situation and why?

  2. What graph is totally unsuited?

  3. What does the data tell you about survival?

8.4 Your turn

Now that you have the basics, you can work through the rest of the chapter yourself. Make sure you work through all the assignments.

Assignment 2 Insert comments into the code chunk below.

set_two  <- titanic %>%
  group_by(sex)%>%
  count(survived) %>%
  mutate(percent = n/sum(n)*100)
set_two
#> # A tibble: 4 × 4
#> # Groups:   sex [2]
#>   sex    survived     n percent
#>   <chr>  <chr>    <int>   <dbl>
#> 1 female alive      339    72.7
#> 2 female dead       127    27.3
#> 3 male   alive      161    19.1
#> 4 male   dead       682    80.9

Assignment 3 Pick and code a graph that you think looks good. Or, if you prefer, you can start with this one and pretty it up with labels etc. What does the data tell you about survival?

ggplot(set_two, aes(x=sex, y=percent, fill=survived)) + 
  geom_bar(position="stack", stat="identity")

Assignment 4 Insert comments into the code chunk below.

titanic <- read.csv("titanic_333.csv")
titanic <- titanic %>%
  mutate(survived = ifelse(survived==0, "No","Yes"))%>%
  mutate(pclass = as.factor(pclass))
str(titanic)
#> 'data.frame':    1309 obs. of  9 variables:
#>  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ pclass  : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ survived: chr  "Yes" "Yes" "No" "No" ...
#>  $ name    : chr  "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...
#>  $ sex     : chr  "female" "male" "female" "male" ...
#>  $ age     : num  29 0.917 2 30 25 ...
#>  $ sibsp   : int  0 1 1 1 1 0 1 0 2 0 ...
#>  $ parch   : int  0 2 2 2 2 0 0 0 0 0 ...
#>  $ fare    : num  211 152 152 152 152 ...
no_age <- which(is.na(titanic$age))
titanic <- titanic[-no_age,]
summary(titanic)
#>        X          pclass    survived        
#>  Min.   :   1.0   1:284   Length:1046       
#>  1st Qu.: 299.2   2:261   Class :character  
#>  Median : 575.5   3:501   Mode  :character  
#>  Mean   : 600.2                             
#>  3rd Qu.: 875.5                             
#>  Max.   :1309.0                             
#>                                             
#>      name               sex                 age         
#>  Length:1046        Length:1046        Min.   : 0.1667  
#>  Class :character   Class :character   1st Qu.:21.0000  
#>  Mode  :character   Mode  :character   Median :28.0000  
#>                                        Mean   :29.8811  
#>                                        3rd Qu.:39.0000  
#>                                        Max.   :80.0000  
#>                                                         
#>      sibsp            parch             fare       
#>  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00  
#>  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:  8.05  
#>  Median :0.0000   Median :0.0000   Median : 15.75  
#>  Mean   :0.5029   Mean   :0.4207   Mean   : 36.69  
#>  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 35.50  
#>  Max.   :8.0000   Max.   :6.0000   Max.   :512.33  
#>                                    NA's   :1
set_three <- titanic %>%
  mutate(child = age < 16) %>%
  group_by(child)%>%
  count(survived) %>%
  mutate(percent = n/sum(n)*100)
set_three
#> # A tibble: 4 × 4
#> # Groups:   child [2]
#>   child survived     n percent
#>   <lgl> <chr>    <int>   <dbl>
#> 1 FALSE No         570    61.2
#> 2 FALSE Yes        361    38.8
#> 3 TRUE  No          49    42.6
#> 4 TRUE  Yes         66    57.4
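
One caveat about titanic[-no_age,]: it only does what you expect when at least one age is missing. If which() finds nothing, the empty negative index selects no rows at all. A short sketch on a made-up data frame (toy is hypothetical, not the Titanic file) shows the difference; a logical filter avoids the problem.

```r
toy <- data.frame(name = c("a", "b", "c"), age = c(29, NA, 2))   # made-up data

# Pattern from the chunk above: fine here, because one age is missing
no_age <- which(is.na(toy$age))
nrow(toy[-no_age, ])                     # 2

# Same pattern on data without any missing age
complete <- toy[!is.na(toy$age), ]
no_age2 <- which(is.na(complete$age))    # integer(0)
nrow(complete[-no_age2, ])               # 0: every row is dropped

# A logical filter keeps all complete rows in both situations
nrow(complete[!is.na(complete$age), ])   # 2
```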

Assignment 5 Pick and code a graph that you think looks good. What does the data tell you about survival?

Assignment 6 Insert comments into the code chunk below.

set_four <- titanic %>%
  mutate(senior = ifelse( age >= 60, "senior","")) %>%
  group_by(senior) %>%
  count(survived) %>%
  mutate(percent = n/sum(n)*100)
set_four
#> # A tibble: 4 × 4
#> # Groups:   senior [2]
#>   senior   survived     n percent
#>   <chr>    <chr>    <int>   <dbl>
#> 1 ""       No         591    58.7
#> 2 ""       Yes        415    41.3
#> 3 "senior" No          28    70  
#> 4 "senior" Yes         12    30

Assignment 7 Pick and code a graph that you think looks good. What does the data tell you about survival?

8.5 Now let’s combine all of the above

Assignment 8 Insert comments into the code chunk below.

n_obs <- nrow(titanic)
age_group <- rep("adult",n_obs)
for (i in 1:n_obs) {
  age_group[i] <- if (titanic$age[i]<16) "child" else if (titanic$age[i]>60) "senior" else "adult"
}

titanic <- data.frame(age_group,titanic)

set_five <- titanic %>%
# mutate(child = age < 16,senior = age >= 60) %>%
  mutate(group_name = paste("class", as.character(pclass),",", sex,",", age_group))%>%
  group_by(pclass, sex, age_group, group_name)%>%
  count(survived) %>%
  mutate(percent= n/sum(n)*100)

set_five<- set_five %>%
  arrange(desc(n))
set_five
#> # A tibble: 30 × 7
#> # Groups:   pclass, sex, age_group, group_name [17]
#>    pclass sex    age_group group_name survived     n percent
#>    <fct>  <chr>  <chr>     <chr>      <chr>    <int>   <dbl>
#>  1 3      male   adult     class 3 ,… No         256    84.8
#>  2 2      male   adult     class 2 ,… No         129    92.1
#>  3 1      female adult     class 1 ,… Yes        121    97.6
#>  4 1      male   adult     class 1 ,… No          84    64.1
#>  5 2      female adult     class 2 ,… Yes         76    87.4
#>  6 3      female adult     class 3 ,… No          62    54.4
#>  7 3      female adult     class 3 ,… Yes         52    45.6
#>  8 1      male   adult     class 1 ,… Yes         47    35.9
#>  9 3      male   adult     class 3 ,… Yes         46    15.2
#> 10 3      male   child     class 3 ,… No          29    69.0
#> # ℹ 20 more rows
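
The for loop that builds age_group can also be written without a loop, since ifelse works on whole vectors. This is an alternative sketch using the same cut-offs as the loop (under 16 is a child, over 60 is a senior); the ages vector is made up for illustration.

```r
ages <- c(0.9, 15.9, 16, 60, 60.5, 80)   # made-up ages, including the boundary values

# Loop version, as in the chunk above
age_group_loop <- rep("adult", length(ages))
for (i in seq_along(ages)) {
  age_group_loop[i] <- if (ages[i] < 16) "child" else if (ages[i] > 60) "senior" else "adult"
}

# Vectorized version: nested ifelse() classifies every age in one call
age_group_vec <- ifelse(ages < 16, "child", ifelse(ages > 60, "senior", "adult"))

age_group_vec                              # "child" "child" "adult" "adult" "senior" "senior"
identical(age_group_loop, age_group_vec)   # TRUE
```

Note that a passenger who is exactly 60 counts as an adult with these cut-offs.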

ggplot(set_five, aes(x=group_name, y=n, fill=survived)) + 
  geom_bar(position="dodge", stat="identity") +
  labs(title = "Number of people surviving Titanic", x="Group",y="Number of passengers")+
  scale_fill_manual(values = c("red","green"))+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))


ggplot(data=set_five, aes(x=reorder(group_name,n,decreasing = TRUE),y=n))+ 
  geom_bar(stat="identity", fill="blue") +
  labs(title = "Number of passengers by group",x="Group",y="Total passengers in group")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

set_five<- set_five %>%
  arrange(percent)
set_five
#> # A tibble: 30 × 7
#> # Groups:   pclass, sex, age_group, group_name [17]
#>    pclass sex    age_group group_name survived     n percent
#>    <fct>  <chr>  <chr>     <chr>      <chr>    <int>   <dbl>
#>  1 1      female adult     class 1 ,… No           3    2.42
#>  2 1      male   senior    class 1 ,… Yes          1    6.67
#>  3 2      male   adult     class 2 ,… Yes         11    7.86
#>  4 2      male   child     class 2 ,… No           1    8.33
#>  5 2      female adult     class 2 ,… No          11   12.6 
#>  6 3      male   adult     class 3 ,… Yes         46   15.2 
#>  7 1      female senior    class 1 ,… No           1   16.7 
#>  8 2      male   senior    class 2 ,… Yes          1   16.7 
#>  9 3      male   child     class 3 ,… Yes         13   31.0 
#> 10 1      female child     class 1 ,… No           1   33.3 
#> # ℹ 20 more rows


ggplot(set_five, aes(x=group_name, y=percent, fill=survived)) + 
  geom_bar(position=position_dodge(preserve="single"), stat="identity",width=0.7) +
  labs(title = "Percent of people surviving Titanic",x="Group",y="Percent survivors in group")+
  scale_fill_manual(values = c("red","green"))+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

keep <- which(set_five$survived=="Yes")
set_five_survived <- set_five[keep,]

set_five_survived
#> # A tibble: 16 × 7
#> # Groups:   pclass, sex, age_group, group_name [16]
#>    pclass sex    age_group group_name survived     n percent
#>    <fct>  <chr>  <chr>     <chr>      <chr>    <int>   <dbl>
#>  1 1      male   senior    class 1 ,… Yes          1    6.67
#>  2 2      male   adult     class 2 ,… Yes         11    7.86
#>  3 3      male   adult     class 3 ,… Yes         46   15.2 
#>  4 2      male   senior    class 2 ,… Yes          1   16.7 
#>  5 3      male   child     class 3 ,… Yes         13   31.0 
#>  6 1      male   adult     class 1 ,… Yes         47   35.9 
#>  7 3      female adult     class 3 ,… Yes         52   45.6 
#>  8 3      female child     class 3 ,… Yes         19   51.4 
#>  9 1      female child     class 1 ,… Yes          2   66.7 
#> 10 1      female senior    class 1 ,… Yes          5   83.3 
#> 11 2      female adult     class 2 ,… Yes         76   87.4 
#> 12 2      male   child     class 2 ,… Yes         11   91.7 
#> 13 1      female adult     class 1 ,… Yes        121   97.6 
#> 14 2      female child     class 2 ,… Yes         16  100   
#> 15 1      male   child     class 1 ,… Yes          5  100   
#> 16 3      female senior    class 3 ,… Yes          1  100

ggplot(data=set_five_survived, aes(x=reorder(group_name,percent,decreasing = TRUE),y=percent))+ 
  geom_bar(stat="identity", fill="lightblue") +
  labs(title = "Percent of people surviving Titanic",x="Group",y="Percent survivors in group")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Assignment 9 Have a closer look at the three graphs below. Pick the one you like best, and change the colors to something better.

ggplot(set_five, aes(x=group_name, y=percent, fill=survived, label = paste(round(percent),"%,", n))) + 
  geom_bar(position="stack", stat="identity") +
  labs(title = "percent of people surviving titanic",x="",y="percent")+
  scale_fill_manual(values = c("black","Pink"))+
  geom_text(size = 3, position = position_stack(vjust = 0.5), color="white")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

ggplot(set_five, aes(x=group_name, y=percent, fill=survived, label = paste(round(percent),"%,", n))) + 
  geom_bar(position="stack", stat="identity") +
  labs(title = "percent of people surviving titanic",x="",y="Percent")+
  scale_fill_manual(values = c("black","Pink"))+
  geom_text(size = 3, position = position_stack(vjust = 0.5), color="magenta")+
  theme(axis.text.x = element_text( vjust = 1))+coord_flip()

ggplot(set_five, aes(x=group_name, y=percent, fill=survived, label = paste(round(percent),"%,", n))) + 
  geom_bar(position="stack", stat="identity") +
  labs(title = "percent of people surviving titanic",x="",y="Percent")+
  scale_fill_manual(values = c("black","Pink"))+
  geom_text(size = 3, position = position_stack(vjust = 0.5), color="magenta")+
  theme(axis.text.x = element_text( vjust = 1), axis.text.y=element_text(hjust=0))+coord_flip()
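
The label aesthetic in these plots is plain text built with paste. Here is a small sketch of what that string looks like for a single row, using the values 84.8 and 256 from the first row of the set_five output sorted by n. The paste0 and sprintf variants are suggestions for tidier labels, not the chapter's code.

```r
percent <- 84.8   # percent in the first row of set_five (sorted by n)
n <- 256L         # matching count

paste(round(percent), "%,", n)        # "85 %, 256"  (paste puts a space between pieces)
paste0(round(percent), "%, ", n)      # "85%, 256"
sprintf("%.0f%%, %d", percent, n)     # "85%, 256"
```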

Assignment 10 Fix the graph below. It should look roughly like this:

ggplot(set_five, aes(x=group_name, y=percent, fill=survived, label = paste("this is weird",round(percent),"%,", n))) + 
  geom_bar(position="stack", stat="identity",alpha=0.1) +
  labs(title = "Bad graph,percent of people surviving titanic",x="Should this be here?",y="Percent")+
  scale_fill_manual(values = c("darkgreen","darkgreen"))+
  geom_text(size = 3, position = position_stack(vjust = 0.5), color="white")+
  theme(axis.text.x = element_text(vjust = -4, hjust=12, color="Magenta"), 
        axis.text.y=element_text(hjust=2,face="bold.italic",color="blue",size=16))+
  coord_flip(xlim=c(0,2000), ylim=c(0,10))

Assignment 11

  1. What is your final answer: what does the data tell you about survival?

  2. Discuss the reliability of your answer. You can find more info on the Titanic here: https://www.historyonthenet.com/how-many-people-were-on-the-titanic.