library(tidyverse)
<- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-08-09/wheels.csv') wheels
Data this #TidyTuesday 2022 week32 is from EmilHvitfeldt.
This dataset contains information about ferris wheels around the world. The objective of this visualization is to provide a visualization of the differences among ferris wheels in terms of diameter and number of cabins. There is a 48% and 15% missing values in the diameter, and number of cabins vectors, respectively. So, what we need to do is find a way to impute these missing values. There are several techniques used for imputing missing values, such as trees, KNN, SVM, mean/median, linear, and polynomial imputations. We could use {tidymodels} and provided step_functions()
in order to do that, passing through the use of the recipe()
function, or workout a custom way to imputation. This is when you look at the data and impute missing values based on the overall meaning. I’ll call it custom sintetic imputation.
Before to start tidying our dataset, we exclude the Nippon wheel as it is the only one without height and it is still on development, searching on the internet, it is difficult to find information about it, so we exclude it. The, a second ferris wheel which requires some attention is the Big O, it has a height which is lower than its diameter, and if you Google it: Big O you’ll find that both height and diameter are 60 mt (197 feet), so I decided to swap the values, considering the higher value as the height and the other as the diameter.
Moscow-850 is expressed in mt so it needs to be in ft, about 240 ft. Chicago Wheel is in mt as well, so considering it in ft, it would be
<- wheels%>%
wheels1 filter(!str_detect(name,"Nippon"))%>% # excluded
mutate(location=ifelse(is.na(location),"Dubai",location),
diameter=case_when(name=="Moscow-850"~240,
=="Chicago Wheel"~height,
nameTRUE~diameter),
temp=ifelse(name=="Big O",height,diameter),
height=ifelse(name=="Big O",diameter,height),
diameter=ifelse(name=="Big O",temp,diameter))
%>%
wheels1select(name,country,height,diameter)%>%
filter(name%in%c("Big O","Moscow-850","Chicago Wheel"))
Then, another cleaning step involves the cost
and tickets
variables (renamed from construction_cost
and ticket_cost_to_ride
), these two vectors are important to be just numeric, because we are going to use them inside a model to predict suitable missing values for the number of cabins
; in addition, when multiple values are provided for tickets, such as adult and children, we include just the adults ticket prices. In order to do that, we make some text customization, named regex, text regular expressions with the help of stringr::str_extract()
function, as shown below. We have some missing values among those two vectors, 55% and 43% for costs and tickets, respectively. Also, these two vector will be imputed to be used in our model.
%>%
wheels1mutate(costs=stringr::str_extract(construction_cost, "\\d+\\.*\\d*"),
costs=as.numeric(costs),
tickets=stringr::str_extract(ticket_cost_to_ride, "\\d+\\.*\\d*"),
tickets=as.numeric(tickets)) %>%
filter(!is.na(costs))%>%
select(name, country,costs,tickets)%>%head
As said, we suppose that height is always higher than diameter. A Ferris wheel is generally, lifted on a wheel support and the height is the height of the wheel plus its support. So that, the diameter is just the diameter of the wheel in itself. We do not have missing values on the height vector, the only missing value (Nippon wheel) has been excluded. To replace the missing values in the diameter vector, we have a look at the relationship between diameter versus height. It is about linear.
More considerations are about the ferris wheel support height, the height of the rim steel structure of the ferris wheel, this is given by the difference between the overall height and radius of the ferris wheel. Here is considered just the distance from the end of the wheel and the ground base of the support. We consider the median difference of the height minus the diameter of the all ferris wheel in the dataset, and build a sintetic_diameter vector filling missing values with the difference between height and the median difference of 33ft.
=median(height-diameter,na.rm = T)
med_bs_hg=ifelse(is.na(diameter),height-med_bs_hg,diameter) sintetic_diameter
To adopt the median difference is quite superficial, but for now consider that as a base structure for all ferris wheel to start with. 33 ft will be the median distance of the ferris wheels from the ground, in our model.
Then, the number of cabins vector contains 15% of missing values, and here these values are covered with some extra manipulations.
Consider some geometry, how to calculate the circumference, the arc length, or the distance among cabins, and the angles such as theta. This apparently a redundant procedure, can lead to reduced bias, as theta is of determined range from 0 to 360 degree or \(2\pi\) radiant. The number of cabins can also be of a certain range, as not more than a well specified number of cabins are allowed depending on diameter of the ferris wheel, but this is unknown and potentially be of any unspecified length. For this reason \(\theta\) could be a good estimator.
The number of cabins are points on a circle, and for those values in the dataset which are not missing, an arc length can be calculated, as long as the central angle \(\theta\). Once these values are set, the missing values can be calculated with a backwards procedure.
\[L=\text{Length of an arc}\] \[r=\text{Radius}\] \[L=\theta\frac{\pi}{180}r\] A sintetic number of cabins (sintetic_n_cab
) vector is set to be filled with imputed missing values. To find these values, some other parameters are needed. The sintetic_diameter
vector can be used to find the circumference, as well as the radius, the arc length among points (the arc_distance
) and finally \(\theta\). All these values are now filled with missing values, as are derived from the number of cabins vector.
= pi*sintetic_diameter
circumference = circumference/number_of_cabins
arc_distance = (arc_distance*180)/(pi*sintetic_diameter/2)
theta = circumference/arc_distance sintetic_n_cab
Final data manipulation below shows how new features are made ready to be used in the model. For further reference Feature Engineering and Selection - Engineering Numeric Predictors, pg.122 is about expanding individual predictors into many predictors.
<- wheels1%>%
my_df select(-temp,-`...1`) %>%
mutate(med_bs_hg=median(height-diameter,na.rm = T),
sintetic_diameter=ifelse(is.na(diameter),height-med_bs_hg,diameter),
#sintetic_height=sintetic_diameter+med_bs_hg,
circumference=pi*sintetic_diameter,
arc_distance=circumference/number_of_cabins,
theta=(arc_distance*180)/(pi*sintetic_diameter/2),
sintetic_n_cab=circumference/arc_distance)%>%
select(-construction_cost,-ticket_cost_to_ride)%>%
arrange(country)
So, we are left with some missing values in the number of cabins. Before going to modeling, some missing values can be filled considering the mean of the arc_distance for the group of heights, this values will be used to fill some of the missing number of cabins for those values with a common height.
Looking at the distribution of \(\theta\), it is clear a right skewness of the distribution. As said, the angle \(\theta\), which is between (0,360), might be a good estimator to use for estimating missing number of cabins, passing through the arch length, the distance among cabins on the wheel.
%>%
my_df1ggplot(aes(theta))+
geom_histogram(bins=10,color="white")+
::theme_fivethirtyeight()+
ggthemeslabs(title=expression(paste("Distribution:\t",theta)))
A trees based model would be the best choice for guessing these values (ct. Feature Engineering and Selection - Engineering Numeric Predictors, pg.121).
One more consideration is due to the dimension of the dataset:
%>%dim wheels
Cross validation could be a solution, as it shuffles data on specified number of folds. In this case the group_vfold_cv()
function is used to make the cross validation folds grouped by height while predicting theta.
More about how to extrapolate vfold_cv() assessment/analysis datasets here: article
library(tidymodels)
tidymodels_prefer()
set.seed(1234)
<- group_vfold_cv(my_df1, group = height,v = 10) folds
<- my_df1%>%
rec recipe(theta~circumference+height+sintetic_diameter+sintetic_n_cab) %>%
step_corr(all_numeric()) %>%
step_impute_bag(theta,sintetic_n_cab)
%>%prep()%>%bake(new_data=NULL)%>%DataExplorer::profile_missing() rec
Set a Random forest engine randomForest with tuning parameters. mtry is for feature subset strategy, in this case just three predictors are used. min_n is the min node size, and then there is the number of trees.
show_engines("rand_forest")
show_model_info("rand_forest")
<-
rf_spec rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine('randomForest') %>%
set_mode('regression')
<- workflow() %>%
rf_wfl add_model(rf_spec) %>%
add_recipe(rec)
::registerDoParallel()
doParallelset.seed(123)
<- tune_grid(object = rf_wfl,
rf_res resamples = folds,
grid = 20,
control = control_grid(save_pred = T,
save_workflow = T,
verbose = T,
parallel_over = "everything"))
%>%
rf_res collect_metrics()%>%
filter(.metric=="rmse")%>%
arrange(mean)%>%
select(mtry,trees,min_n,mean)%>%
pivot_longer(cols=mtry:min_n,names_to="parameters",values_to="values")%>%
ggplot(aes(values,mean, color=parameters))+
geom_point(show.legend = F)+
facet_wrap(~parameters,scale="free")
<- grid_regular(
rf_grid trees(range = c(500,1000)),
min_n(range = c(0,5)),
mtry(range=c(2,2)),
levels = 5
) rf_grid
Second tuning with a specified range of parameters.
set.seed(456)
<- tune_grid(object = rf_wfl,
rf_res2 resamples = folds,
grid = rf_grid,
control = control_grid(save_pred = T,
save_workflow = T,
verbose = T,
parallel_over = "everything"))
%>%
rf_res2 collect_metrics()%>%
filter(.metric=="rmse")%>%
arrange(mean)%>%
select(mtry,trees,min_n,mean) %>%
mutate(min_n=as.factor(min_n))%>%
ggplot(aes(trees,mean, color=min_n))+
geom_line()+
geom_point(show.legend = F)
select_best(rf_res2,"rsq")
<-my_df1%>%
my_df2filter(!is.na(theta))
<- my_df1%>%
testfilter(is.na(theta))
<- workflow()%>%
pred_theta add_model(rf_spec)%>%
add_formula(formula=theta~sintetic_diameter)%>%
finalize_workflow(select_best(rf_res2)) %>%
fit(data = my_df2) %>%
augment(new_data=test) %>%
select(sintetic_n_cab,arc_distance,theta,.pred)
pred_theta
$theta[is.na(my_df1$theta)] <- pred_theta$.pred
my_df1
<- my_df1%>%
my_imputed_df mutate(sintetic_distance=ifelse(is.na(arc_distance),
*pi*sintetic_diameter)/360,arc_distance),
(thetasintetic_n_cab=ifelse(is.na(number_of_cabins),
/sintetic_distance),sintetic_n_cab))%>%
(circumferenceselect(name,country,height,sintetic_diameter,diameter,sintetic_n_cab)
%>%head my_imputed_df
<- my_imputed_df %>%
my_imputed_df1 mutate(y = height-sintetic_diameter/2,
r = sintetic_diameter / 2) %>%
group_by(country) %>%
mutate(country_id=1,
name=paste0(name,"\n",country),
tot_wheels=sum(country_id),.after=country)%>%
ungroup() %>%
mutate(country=as.character(country)) %>%
mutate(name=case_when(name=="Diamond and Flower Ferris Wheel\nJapan"~"Diamond and Flower\nFerris Wheel\nJapan",TRUE~name))
%>%head my_imputed_df1
This code is from EmilHvitfeldt, who has provided the data for this #TidyTuesday 2022 week32.
<- my_imputed_df1 %>%
cabin group_by(name) %>% # summarise(number_of_cabins)
summarise(cabin = seq_len(sintetic_n_cab),
# Get x and y for the carts
cabin_x = cos(cabin / sintetic_n_cab * 2 * pi),
cabin_y = sin(cabin / sintetic_n_cab * 2 * pi),
# Size them to be the right distance from the center
cabin_x = cabin_x * (sintetic_diameter / 2),
cabin_y = cabin_y * (sintetic_diameter / 2),
# Make sure the carts are raised enough
cabin_y = cabin_y + height - sintetic_diameter / 2,
# Lower the carts just a bit so it appears they are hanging
cabin_y = cabin_y - 12.5,
cabin_color = as.character(cabin %% 3),
sintetic_diameter,sintetic_n_cab,.groups = "drop"
)
%>%
my_imputed_df1 # filter(str_detect(name,"Flower"))%>%select(name)
ggplot() +
geom_abline(slope = 0, intercept = 0, color = "darkgreen") +
ylim(0, NA) +
::geom_circle(aes(x0 = 0, y0 = y, r = r),
ggforcecolor="midnightblue") +
# # Left leg
geom_segment(aes(x = -(height - sintetic_diameter / 2)/2,
xend = 0,
yend = height - sintetic_diameter / 2,
y = 0)) +
# right leg
geom_segment(aes(x = (height - sintetic_diameter / 2)/2,
xend = 0,
yend = height - sintetic_diameter / 2,
y = 0)) +
geom_point(data = cabin,
mapping=aes(cabin_x, cabin_y,
color = cabin_color,
fill = cabin_color),
size=0.01,
shape = 24) +
labs(title="Ferris Wheels: overview by dimensions",
subtitle = "Diameters and number of cabins are imputed. Ferris wheels are ordered by number of cabins.\nOn average: distance from base ground is 33ft, sintetic diameters are 271ft and number of cabins are 42.\n",
caption = "Data are from #TidyTuesday 2022 week 32 {ferriswheels} by @Emil_Hvitfeldt\nDataViz: Federica Gazzelloni @fgazzelloni")+
theme_void()+
theme(text = element_text(family="Roboto Condensed", size=14),
plot.title = element_text(size=45),
plot.subtitle = element_text(size=20),
plot.caption = element_text(size=20,hjust = 0.5,vjust=0.5),
plot.background = element_rect(fill="gray80",color="gray80"),
panel.background = element_rect(fill="gray80",color="gray80"),
legend.position = "none")+
facet_wrap(~fct_reorder(name,sintetic_n_cab))
ggsave("w32_ferriswheels.png",
dpi=320,
width = 15,
height = 18)
Just a little check for some of the ferris wheels with out of the mean overall height.
%>%filter(str_detect(name,"Roue|HEP"))%>%select(name,diameter,height,sintetic_diameter) my_imputed_df1
%>%select(sintetic_diameter,sintetic_n_cab)%>%map_dfr(mean) my_imputed_df1