Ex. 1

(a) Using “scan”

Use scan() to read every single cell in the baseball-salaries-2016.txt file, need to set sep="\t" to avoid a 2-gram term from being separated into two words.
Use data.frame() to create a data.frame (class) by the input vector.
Then use write.csv() and write.table() to export the file.

(b) Draw boxplots

salary = Salary ($M)
winning percentage (WPCT) = W/(W+L)

(c) Check correlation

Using cor.test() to check whether the positive relations between salary and winning percentage.
The Pearson’s correlation between salary and winning percentage is 0.617, p<.001, so we can reject the null hypothesis, i.e. the correlation between WPCT and salary is significant.

## 
##  Pearson's product-moment correlation
## 
## data:  sala and wpct
## t = 4.1436, df = 28, p-value = 0.0002856
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3294329 0.7992681
## sample estimates:
##       cor 
## 0.6165296

(d) Scatter plot

Ex. 2

(a) Length of a vector

The sample is shown below. When we shorten a vector (set a random seed) from length 20 to 15, only the first 15 numbers would be present. When we lengthen it to 18, three (18-15=3) NAs would be added.

the results of rnorm(20).

##  [1] -1.21  0.28  1.08 -2.35  0.43  0.51 -0.57 -0.55 -0.56 -0.89
## [11] -0.48 -1.00 -0.78  0.06  0.96 -0.11 -0.51 -0.91 -0.84  2.42

the results of length(x) = 15

##  [1] -1.21  0.28  1.08 -2.35  0.43  0.51 -0.57 -0.55 -0.56 -0.89
## [11] -0.48 -1.00 -0.78  0.06  0.96

the results of length(x) = 18

##  [1] -1.21  0.28  1.08 -2.35  0.43  0.51 -0.57 -0.55 -0.56 -0.89
## [11] -0.48 -1.00 -0.78  0.06  0.96    NA    NA    NA

(b) Operations of a vector

Change the $x<0$ part of x[1:20] to NA. The results is shown below.

##  [1]   NA 0.28 1.08   NA 0.43 0.51   NA   NA   NA   NA
## [11]   NA   NA   NA 0.06 0.96   NA   NA   NA   NA 2.42

computing the the mean, median, and variance of non-NA entries in x[1:18] from (a).

## [1] "mean: 0.55"

## [1] "median: 0.47"

## [1] "var: 0.16"

(c) Change NA to 0

using is.na() to identify the NAs, then replaced them by 0s.

##  [1] 0.00 0.28 1.08 0.00 0.43 0.51 0.00 0.00 0.00 0.00
## [11] 0.00 0.00 0.00 0.06 0.96 0.00 0.00 0.00 0.00 2.42

Ex. 3

(a) Graph

The graph is shown as 3.(b) below.

(b) Tangent line

Find the slope and intercept of the tangent line of $f(x)=x+2e^{(x-3)}+1$ at $x=5$ , then we can use abline() function to draw the tangent line. (Note. in the R abline() function, the intercept is denoted as a, and the slope is denoted as b.)

Slope. $f'(5)=1+2e^{(5-3)}=1+2e^2$
Point. $(x,y)=(5, f(5))=(5, 2e^2+6)$ ( Note. $x=5$ substituted back into the original equation $f(x)$ )
Intercept. based on $y=bx+a$ , and $b=(1+2e^2)$ is known. (follow the parameters of abline().)

$2e^2+6=5(1+2e^2)+a$ , find $a=-8e^2+1$ .

# calculus
expression((x^2-2*x+3), 'x') |> D('x')

## 2 * x - 2

expression((x+2*exp(x-3)+1), 'x') |> D('x')

## 1 + 2 * exp(x - 3)

Ex. 4

(a) `cat()` and `date()`

The codes and outputs are shown below, combine the labels and system time by cat() function.

cat(
  labels = "Today’s date is: ", 
  paste( substr(date(),1,10), substr(date(),21,24)),
  fill = TRUE
)

## Today’s date is:  Tue Mar 22 2022

cat(
  labels = "The time now is: ", 
  paste( substr(date(),12,19)),
  fill = TRUE
)

## The time now is:  23:30:47

(b) Explain the outputs of time.

With the first 2 lines, we assign the system date to the object today (with a class “Date”), then display it as day-month-year format, and the month %b is an abbreviated month in English.
With the 3rd line, we create a sequence that has 10 days, begins today, and generates every other week.
In the 4th line, we extract the weekday name from the object today. In the 5th line, we check every single month’s name from the sequence tenweeks with 10 values.
In the last line, we check the leap seconds (潤秒) before, then display them as the date format.

(today <- Sys.Date())
format(today, "%d %b %Y") 
(tenweeks <- seq(today, length.out=10, by="1 week")) 
weekdays(today) 
months(tenweeks)
as.Date(.leap.seconds)

Ex. 5

(a) Mid-Square Method

By the requirements, the initial values is a 6 digits number, then square it.
How to make sure the 12-digits (square of 6-digits) at the middle of number (including 0, ranging over [0, 999999])? We have introduced the padding method from the stringr package.
The comment of the gen_5a() function. (1) Create a null vector vec. (2) In the for loop, we consider the input or iterate value may not be 6-digits, so we convert it to a character class first, then use str_pad() function to impute 0 at the left side until 12 digits. (3) Use str_sub() to extract the 6 digits from the 4th number to the 9th. (4) Convert the 6-digit number back into the numeric class. (5) Save the numbers in the vector vec with the ith place.
Goodness-of-fit test. We generated 10000 random numbers from the gen_5a() function. The K-S test showed D = 1, p-value < .001, so we reject the null hypothesis, i.e. the random numbers do not follow the uniform distribution.

## Warning in ks.test(res_5a, y = "punif"): ties should not be present for the
## Kolmogorov-Smirnov test

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  res_5a
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

(b) Vax’s generator

The comment of the gen_5b() function. (1) Create a null vector vec. (2) In the for loop, implement a Vax’s generator (69069*x) %% (2^32). (3) Save the numbers in the vector vec with the ith place.
Goodness-of-fit test. We generated 10000 random numbers from the gen_5b() function. For $\chi^2$ test, we separated the 10000 numbers to 10 groups by the intervals of 1/10 of range (max(res_5b) - min(res_5b))/k. Then check the K-S test D = 1, p < .001, we can reject the null hypothesis, i.e. the random numbers do not follow the uniform distribution. However, the $\chi^2$ test showed $\chi^2=6.9936$ with df=9, p=0.6378, we can not reject the null hypothesis, i.e. there are no significant difference between the random numbers and uniform distribution.
K-S test and $\chi^2$ test have different results. But K-S test considers each value, may be better than $\chi^2$ test.

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  res_5b
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

## 
##  Chi-squared test for given probabilities
## 
## data:  chi_5b
## X-squared = 6.9936, df = 9, p-value = 0.6378

(c) Multiplicative congruential generators

The comment of the gen_5c() function. (1) Create a null vector vec. (2) In the for loop, implement a combined generators, (171*x) %% 30269, (172*y) %% 30307, (170*z) %% 30323, and (x/30269 + y/30307 + z/30323) %% 1. (3) Save the numbers in the vector vec with the ith place.
Goodness-of-fit test. We generated 10000 random numbers from the gen_5c() function. Check the K-S test D = 0.010053, p = 0.2643, we can not reject the null hypothesis, i.e. the random numbers do follow the uniform distribution.
Comparison of (a), (b) and (c). We can find the generator of 5(a) and 5(b) can not create a good random numbers iid from uniform distribution (based on K-S test). But the combination generator of 5(c) can do it, because it has a longer cycle.

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  res_5c
## D = 0.010053, p-value = 0.2643
## alternative hypothesis: two-sided

The histograms of 5(a), 5(b) and 5(c) is shown below.

Ex. 6

(a) `sample` and `ceiling(runif)`

We test a small value (=100) and a large value (=1000000).

Although these histograms above look very close to uniform distributions. But from the K-S tests, we can find any random numbers created by sample() or ceiling(runif()) are not actually iid from uniform distributions.

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  sam100
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

## Warning in ks.test(x = cei100, y = "punif"): ties should not be present for the
## Kolmogorov-Smirnov test

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  cei100
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  sam1000000
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

## Warning in ks.test(x = cei1000000, y = "punif"): ties should not be present for
## the Kolmogorov-Smirnov test

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  cei1000000
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

(b) Hand calculators generator

The comment of the gen_6b() function. (1) Create a null vector vec. (2) In the for loop, implement a generators based on uniform and $\pi$ (pi+u)^5 %% 1. (3) Save the numbers in the vector vec with the ith place.
Goodness-of-fit test. We generated 10000 random numbers from the gen_6b() function. Check the K-S test D = 0.0090575, p = 0.3848, we can not reject the null hypothesis, i.e. the random numbers do follow the uniform distribution.
Compare to #5. In 5(c) we find only the combination generator can create random numbers followed the uniform distribution. Then we check independence by the runs test (from snpar package). The p-value of both res_5c or res_6b is at the 0.05 significance level, so we can conclude that the numbers form them are random (independent).

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  res_6b
## D = 0.0090575, p-value = 0.3848
## alternative hypothesis: two-sided

## 
##  Approximate runs rest
## 
## data:  res_5c
## Runs = 5019, p-value = 0.7188
## alternative hypothesis: two.sided

## 
##  Approximate runs rest
## 
## data:  res_6b
## Runs = 4930, p-value = 0.1556
## alternative hypothesis: two.sided

(c) $\phi$ generator

Replaced $pi$ with $\phi=\dfrac{1+\sqrt{5}}{2}$ .
Check the K-S test D = 0.0066451, p = 0.7692, we can not reject the null hypothesis, i.e. the random numbers do follow the uniform distribution. And from the runs test with p-value = 0.1676, we can find the numbers are random (independent).

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  res_6c
## D = 0.0066451, p-value = 0.7692
## alternative hypothesis: two-sided

## 
##  Approximate runs rest
## 
## data:  res_6c
## Runs = 4932, p-value = 0.1676
## alternative hypothesis: two.sided

Appendix

Note. After R 4.0, some useful new features are built-in.

The native pipeline syntax |> replaced the tidy-style pipeline %>%.
The shorthand syntax for anonymous functions \(x) is more compact than function(x).

## 1(a) --------
dat1 <- scan('baseball-salaries-2016.txt', what = "character" , sep="\t")
dat1a <- matrix(dat1[-c(1:6)], ncol=6, byrow=T) |> data.frame()
names(dat1a) <- dat1[1:6]

# export data frame
write.csv(dat1a, 'dat1a.csv')
write.table(dat1a, 'dat1a.txt')

## 1(b) --------
dat1b <- rio::import('baseball-salaries-2016.txt')
par(mfrow=c(1,2))
boxplot((W/(W+L)) ~ League, data = dat1b, ylab = 'WPCT' )
boxplot(`Salary ($M)` ~ League, data = dat1b, ylab = 'Salary')

## 1(c) --------
sala <- dat1b$`Salary ($M)`
wpct <- dat1b$W/(dat1b$W+dat1b$L)
team <- dat1b$Team
cor.test(sala, wpct )

## 1(d) --------
x11()
plot(wpct~sala)
abline( lm(wpct~sala) , lwd=2) 
identify(wpct~sala, labels = team)

## 2(a) --------
set.seed(1234)
x_2a = rnorm(20) |> round(2); print(x_2a, width = 65)
length(x_2a) = 15; print(x_2a, width = 65)
length(x_2a) = 18; print(x_2a, width = 65)

## 2(b) --------
set.seed(1234)
x_2a = rnorm(20) |> round(2)
x_2a[x_2a<0] <- NA
print(x_2a, width = 55)

paste0( 'mean: ',mean(x_2a[1:18], na.rm = T) |> round(2) )
paste0( 'median: ', median(x_2a[1:18], na.rm = T) |> round(2) )
paste0( 'var: ', var(x_2a[1:18], na.rm = T) |> round(2) )

## 2(c) --------
x_2a[is.na(x_2a)] <- 0; print(x_2a, width = 55)

## 3(a) 3(b) --------
eq1 <- \(x){ ifelse(x<=3, (x^2-2*x+3), NA) }
eq2 <- \(x){ ifelse(x>3,(x+2*exp(x-3)+1), NA) }

# draw the curves
curve(eq2, from=0, to=6, col='blue', lwd=2,
      ylim = c(-5,50),
      xlab="x", ylab="f(x)")
curve(eq1, from=0, to=6, col='red', lwd=2,
      xlab="x", ylab="f(x)", add = T)

# add some lines
abline( h=2, lty=2 ) 
abline( b=(1+2*exp(2)), a=-8*exp(2)+1, lty=2 )

## 4(a) --------
cat(
  labels = "Today’s date is: ", 
  paste( substr(date(),1,10), substr(date(),21,24)),
  fill = TRUE
)

cat(
  labels = "The time now is: ", 
  paste( substr(date(),12,19)),
  fill = TRUE
)

## 5(a) --------
library('stringr')
  # x= initial value, k=length, vec=random number vector
gen_5a <- \(x, k){
  vec <- c()
  for (i in 1:k){
    x <- as.character(x^2) |>
      stringr::str_pad( width=12, side='left', pad='0' ) |>
      stringr::str_sub(4, 9) |> 
      as.numeric()
    vec[i] <- x
  }
  return(vec)
}

set.seed(1234)
res_5a <- gen_5a( floor( runif(1)*(10^6) ), 10000 )
print( ks.test(res_5a, y="punif") )

## 5(b)--------
gen_5b <- \(x, k){
  vec <- c()
  for(i in 1:k){
    x <- (69069*x) %% (2^32)
    vec[i] <- x
  }
  return(vec)
}

set.seed(1234)
res_5b <- gen_5b( runif(1), k=10000 )

# ks test
print( ks.test(res_5b, y="punif") )

# chisq test

grp_5b <- \(x, k) {
  int <- seq( min(x) , max(x) , (max(res_5b) - min(res_5b))/k )
  vec <- c()
  for (i in 1:k){
    vec[i] <- sum( x<int[i+1] ) - sum( x<int[i] )
  }
  return(vec)
}

chi_5b <- grp_5b( res_5b, k=10 ) 
chisq.test( chi_5b, p= rep(1/10, 10) )

## 5(c) --------
gen_5c <- \(x, y, z){
  vec <- c()
  for(i in 1:10000) {
  x <- (171*x) %% 30269
  y <- (172*y) %% 30307
  z <- (170*z) %% 30323
  u <- (x/30269 + y/30307 + z/30323) %% 1
  vec[i] <- u
}
  return(vec)
} 

res_5c <- gen_5c(x=1, y=2, z=3) 
print( ks.test(res_5c, y="punif") )

# hist
par( mfrow=c(2,2) )
hist(res_5a, 20); hist(res_5b, 20); hist(res_5c, 20)

## 6(a) --------
set.seed(1234)
# 100
sam100 <- sample(100)
cei100 <- ceiling(runif(100)*100)
sam1000000 <- sample(1000000)
cei1000000 <- ceiling(runif(1000000)*1000000)

# graphical 
par(mfrow=c(2,2))
hist(sam100, 20); hist(cei100, 20)
hist(sam1000000, 20); hist(cei1000000, 20)

# test
print(ks.test(x=sam100, y="punif")) 
print(ks.test(x=cei100, y="punif")) 
print(ks.test(x=sam1000000, y="punif")) 
print(ks.test(x=cei1000000, y="punif")) 

## 6(b) --------
set.seed(1234)
gen_6b <- \(k){
  vec <- c()
  u <- runif(1)
  for(i in 1:k){
    u <- (pi+u)^5 %% 1
    vec[i] <- u
  }
  return(vec)
}

res_6b <- gen_6b(10000)
print( ks.test(res_6b, y="punif") )
snpar::runs.test( res_5c )
snpar::runs.test( res_6b )

# 6(c) --------
set.seed(1234)
gen_6c <- \(k){
  vec <- c()
  u <- runif(1)
  phi <- (1+sqrt(5))/2
  for(i in 1:k){
    u <- (phi+u)^5 %% 1
    vec[i] <- u
  }
  return(vec)
}

res_6c <- gen_6c(10000)
print( ks.test(res_6c, y="punif") )
snpar::runs.test( res_6c )

# hist
par(mfrow=c(1,2))
hist(res_6b, 20); hist(res_6c, 20)

[Comp&Simu] Assignment 1