16 One Numerical Sample
The Extrasolar Planets Encyclopaedia (https://exoplanet.eu/) offers a catalog of confirmed exoplanets together with some characteristics of these planets. We focus on their mass.
load("data/exoplanet_mass.rda")
= sort(log(mass)) # we use a log scale mass
16.1 Quantiles and other summary statistics
summary(mass) # extremes, quartiles (including the median), and mean
Min. 1st Qu. Median Mean 3rd Qu. Max.
-13.07357 -2.12026 -0.05129 -0.42477 1.13140 4.53903
range(mass) # the range giving the minimum and maximum
[1] -13.07357 4.53903
quantile(mass, c(0, 0.2, 0.4, 0.6, 0.8, 1)) # some sample quantiles
0% 20% 40% 60% 80% 100%
-13.0735732 -2.8949824 -0.4942963 0.4187103 1.5347144 4.5390304
median(mass) # sample median
[1] -0.05129329
mean(mass) # sample mean
[1] -0.4247663
sd(mass) # sample standard deviation
[1] 2.372008
The boxplot is based on the extremes and the quartiles. (In this particular version, no points are labeled as outliers. Note that this is not the default behavior of this function.)
boxplot(mass, range = Inf, horizontal = TRUE, col = "grey", xlab = "mass (log scale)")
rug(mass, lwd = 2) # adds a mark at each data points
16.2 Empirical distribution function
= ecdf(mass) # empirical distribution function
Fun plot(Fun, lwd = 2, xlab = "mass (log scale)", ylab = "cumulative probability", main = "empirical distribution function")
16.3 Histogram
hist(mass, col = "grey") # bins chosen automatically
hist(mass, breaks = 100, xlab = "mass (log scale)", col = "grey") # set number of bins
16.4 Confidence interval for the median
We compute a confidence interval for the median mass of an exoplanet in our system. There is an important caveat. It is not at all clear how the data we are using were collected, and chances are, it’s complicated given the context. In particular, it is not obvious that the sample we have available to us is in some real sense representative of the entire population of exoplanets ‘out there’. Nonetheless, in what follows, we pretend this is the case, and that in fact the sample is iid from that population. (Although arguably farfetched, this leap of faith is common in real data analysis.)
= 0.01
alpha = length(mass)
n = 1 - pbinom(0:n, n, 1/2) + dbinom(0:n, n, 1/2) # refer to the book
q = which(q >= 1 - alpha/2)
tmp = tmp[length(tmp)]
lower = which(q <= alpha/2)
tmp = tmp[1]
upper c(lower, upper)] # 99% CI for the median mass (log scale) mass[
[1] -0.1660546 0.1397619
exp(mass[c(lower, upper)]) # 99% CI for the median mass (original scale)
[1] 0.847 1.150
16.5 Bootstrap confidence interval
With the same caveat, we move to compute a bootstrap confidence interval for the mean mass. We work in log scale. (If we want an interval in the original scale, we need to work in the original scale. This is because the mean, unlike the median, is not equivariant with respect to monotone transformations.)
Although it is straightforward to implement this by hand, for illustration purposes, we use a package. We compute the percentile bootstrap confidence interval. (Note that the level is approximate, as usual with bootstrap confidence intervals.)
require(bootstrap)
= 1e4 # number of bootstrap replicates
B = bootstrap(mass, B, mean)
mean_boot = mean_boot[[1]] # bootstrapped means
mean_boot = 2*mean(mass) - quantile(mean_boot, c(1-alpha/2, alpha/2))
int attributes(int) = NULL
int
[1] -0.5724919 -0.2731621
16.6 Kernel density estimation
We can visualize the sample density using a kernel density estimate. We superimpose it to a histogram. We explore the choice of different values for the bandwidth.
hist(mass, breaks = 100, freq = FALSE, col = "grey", xlab = "mass (log scale)", main = "")
# various choices of bandwidth
= seq(0.1, 1, len = 10)
b = heat.colors(10, alpha = 0.5)
c for (i in 1:10){
lines(density(mass, bw = b[i]), lwd = 2, col = c[i])
}lines(density(mass), lwd = 2) # default bandwidth
lines(density(mass, "ucv"), lwd = 2, col = "green") # CV bandwidth
16.7 Kaplan–Meier estimator
Censored data is commonplace in Survival Analysis. We a specialized package from that field. For illustration, we a dataset on smoking that is available in a different package, and with that dataset, we compute the Kaplan–Meier estimators for the two treatments groups. (Note that more variables are available in that dataset, but we only use time-to-relapse, relapse, and treatment group.)
require(survival)
require(asaur)
= survfit(Surv(ttr, relapse) ~ grp, data = pharmacoSmoking) # Kaplan--Meier curve
fit plot(fit, xlab = "Days Until Relapse", ylab = "Estimated Probability of Not Relapsing", lwd = 2, col = c("black", "darkgreen"))
legend("topright", c("Combination", "Patch Only"), col = c("black", "darkgreen"), lty = 1, lwd = 2)