5.2 EM/MLE

For this subchapter we need the following packages:

library(tidyverse)
library(Matrix)
library(sf)
library(data.table)

The fundamental difference between the two estimation strategies is that the MLE Poisson approach allows for overlapping coverage area, whereas the Voronoi assigns each tile in its area to the closest seed.

The MLE Poisson approach is an advancement of the known Bayes estimator, developed by Tennekes et al. (reference). Tennekes et al define an informative prior for the estimation of a tile specific population value. The MLE Poisson estimator uses an iterative approach that takes the posterior value as the new prior for the next iteration. In this analysis we will focus as well on scrutinizing how many iterations are needed for the estimator to converge and if there are structural tile differences in convergence behavior.

For this estimator two inputs are needed, an estimated $\hat{P}$ matrix and a prior for each tile. The dimensions of $\hat{P}$ resembles the $P$ matrix used for the device-to-cell-assocation in the toy-world chapter, so $I * J$ , where $I$ resemble the antennas, $J$ resemble the tiles and the elements define the connection likelhood between the specific tile and antenna. This characteristic allows for overlapping coverage areas per tile.

We define again two specifications for this estimator:

Equal probability $\hat{P}$ with a constant prior
True probability $\hat{P}$ (oracle $\hat{P} = P$ with a constant prior

The second specification would require distinct knowledge for the coverage areas and their likelhood to connect to a specific tile which is very unrealistic. However, it can act as a reference for comparison. The first specification and its performance is what this research is most interested in. $\hat{P}$ in this case is defined in a way that one knows which tiles are covered by which antenna, but not to which degree. If a tile for example is covered by three antennas then each respective element in $\hat{P}$ would have a the value of $1/3$ resembling an equal connection likelihood across each relevant antenna. An equal probability approach significantly reduces the level of information needed to calculate tile specific estimates, as one only needs to know which antennas cover which tiles. The final estimator for both specifications is fomalized as follows:

$\hat{u}^{m+1}_{j} = \hat{u}^{m}_{j}\sum_{i = 1}^{I} c_{i} \frac{p_{ij}}{\sum_{k = 1}^{J} p_{ik} \hat{u}^{m}_{k}}.$

$\hat{u}_{j}$ is the tile specific estimate, $m$ specifies the iteration, $c_{i}$ defines the value of phones that are logged to the $i$ ’th antenna, $p_{ij}$ the tile-antenna specific connection likelihood and $k$ defines only antennas with a positive connection likelihood (relevant antennas for a specific tile).

As mentioned above, we need as input a prior which in our case is a row vector of size $J$ and is constant for each tile (here the value 1). This vector will be marked as $a$ . We further need to define $\hat{P}$ which for the oracle specification will be $P$ , i.e. our true P object defined in the chapter before and for the second specification an equal probability modification of $P$ , which we will specify as $P'$ . We further need our c.vector object, also defined in the chapter before. $c$ is assumed to be known exactly.

census.classified.final.sf <- readRDS("C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/working objects/census.classified.final.sf.rds")

C.vec.df <- readRDS("C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/working objects/C.vec.df.final.new.rds")

coverage.areas.final <- readRDS("C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/working objects/coverage.areas.final.rds")

P.mat <- readRDS("C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/working objects/P.mat.rds")

# non-informative prior
a.vec <- rep(1, length(census.classified.final.sf$internal.id))
saveRDS(a.vec, file = "C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/working objects/a.non.inf.vec.rds")

# C vector, adding antennas that have 0 phones to complete the vector, arranging it according to the antenna.ID and saving as vector
c.vec <- C.vec.df %>% 
  right_join(coverage.areas.final, by = "antenna.ID") %>% 
  mutate(phones.sum = case_when(is.na(phones.sum) ~ 0,
                                TRUE ~ phones.sum)) %>% 
  arrange(antenna.ID) %>% 
  dplyr::select(antenna.ID, phones.sum) %>% 
  deframe()
saveRDS(c.vec, file = "C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/working objects/c.vec.rds")

# P matrix oracle (P)
P.oracle.summ <- summary(P.mat)
P.oracle.dt <- data.table(i = P.oracle.summ$i,
                          j = P.oracle.summ$j,
                          pij = P.oracle.summ$x)
saveRDS(P.oracle.dt, file = "C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/working objects/P.oracle.dt.rds")


# P matrix equal prob (P')
P.prime.mat.helper <- P.mat
P.prime.mat.helper@x[P.prime.mat.helper@x > 0] <- 1
P.prime.summ <- summary(P.prime.mat.helper)
p.helper.dt <- data.table(i = P.prime.summ$i,
                          j = P.prime.summ$j,
                          pij = P.prime.summ$x)
P.prime.dt <- p.helper.dt[, .(i = i, pij = pij / sum(pij)), by = "j"]
setcolorder(P.prime.dt, c("i", "j", "pij"))
saveRDS(P.prime.dt, file = "C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/working objects/P.prime.dt.rds")

Furthermore, we will define our function for the estimator. It is written within the data.table function for computational speed reasons.

# New MLE iteration function by Matyas
SB_u_est_EM_mat <- function(c.vec, P.dt, a.vec, n.iter, ldt = 10^-04) {
  
  
  cdt <- data.table(c = c.vec)
  cdt <- cdt[, .(i = 1:.N, c = c)]
  pij <- cdt[P.dt, on = "i"]
  pij <- pij[c > 0] # remove those lines where c==0 because it will create 0 division
  adt <- data.table(a = a.vec)
  tiles <- adt[, .(j = 1:.N, u0 = a)]
  
  for(m in 0:(n.iter - 1)){
    cat(format(Sys.time()), paste0("---- calculating u", m + 1), "----\n")
    cols <- c("j", paste0("u", m))
    ju <- tiles[, cols, with = F]
    setnames(ju, c("j", "u"))
    pij <- ju[pij, on = "j"]
    denom <- pij[, .(sum_pik_uk = sum(u * pij)), by = i]
    pij <- denom[pij, on = "i"]
    faktor <- pij[, .(f = sum(c * pij / sum_pik_uk)), by = j]
    faktor.adj <- faktor[, f := fifelse(test = {is.na(f) | is.nan(f) | is.infinite(f)}, 1, f)] # if else to assure that the posterior is 1 to secure the same estimand value after ldt
    pij[, c("u", "sum_pik_uk") := NULL]
    tiles <- faktor.adj[tiles, on = "j"]
    tiles <- eval(parse(text = paste0("tiles[, u", m + 1, " := u", m, "* f]")))
    tiles <- eval(parse(text = paste0("tiles[,  u", m + 1, " := fifelse(u", m + 1, " < ldt, 0, u", m + 1, ")]")))
    tiles[, "f" := NULL]
  }
  
  return(tiles)
  
}

Our first estimator is the oracle one. We specify for the first results object 500 iterations and then follow through with another results object using the 500th iteration result as the updated prior vector. If we would specify one results object with 1000 iterations the object becomes too big to handle on a machine with 24 GB RAM. One iteration takes about 7 to 10 seconds. Therefore we split this up and then bind them together into u.oracle.final. Furthermore, we create a results object with only selected iterations, u.oracle.selected.

c.vec <- readRDS("C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/working objects/c.vec.rds")
a.vec <- readRDS("C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/working objects/a.non.inf.vec.rds")
P.oracle.dt <- readRDS("C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/working objects/P.oracle.dt.rds")

u.true.oracle <- SB_u_est_EM_mat(c.vec, P.oracle.dt, a.vec, n.iter = 500)
a.vec <- u.true.oracle$u500
u.true.oracle1 <- u.true.oracle %>% 
  select(-u500)
saveRDS(u.true.oracle1, "C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/Estimates/u.oracle.part1.rds")

u.true.oracle2 <- SB_u_est_EM_mat(c.vec, P.oracle.dt, a.vec, n.iter = 500)
saveRDS(u.true.oracle2, "C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/Estimates/u.oracle.part2.rds")


u.true.oracle2.mod <- u.true.oracle2 %>% 
  dplyr::select(-j, -u0)

oracle.names <- c("internal.id", paste0("u.oracle.", 1:1000))

u.oracle.final <- bind_cols(u.true.oracle.1, u.true.oracle2.mod)
names(u.oracle.final) <- oracle.names
saveRDS(u.oracle.final, "C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/Estimates/u.oracle.final.rds")

u.oracle.selected <- u.oracle.final %>% 
  dplyr::select(1:11, 51, 101, 201, 301, 401, 501, 601, 701, 801, 901, 1001)

saveRDS(u.oracle.selected, "C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/Estimates/u.oracle.selected.rds")

We do the same for the equal probability estimator.

# Equal probability
c.vec <- readRDS("C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/working objects/c.vec.rds")
a.vec <- readRDS("C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/working objects/a.non.inf.vec.rds")
P.prime.dt <- readRDS("C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/working objects/P.prime.dt.rds")

u.equal1 <- SB_u_est_EM_mat(c.vec, P.prime.dt, a.vec, n.iter = 500)
a.vec <- u.equal1$u500
u.equal.mod1 <- u.equal1 %>% 
  select(-u500)
saveRDS(u.equal.mod1, "C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/Estimates/u.equal.part1.rds")

u.equal2 <- SB_u_est_EM_mat(c.vec, P.prime.dt, a.vec, n.iter = 500)
saveRDS(u.equal2, "C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/Estimates/u.equal.part2.rds")

# u.equal.mod <- u.equal %>% 
#   dplyr::select(-c(2, 402, 803))

u.equal2.mod <- u.equal2 %>% 
  dplyr::select(-j, -u0)

equal.names <- c("internal.id", paste0("u.equal.", 1:1000))

u.equal.final <- bind_cols(u.equal.mod1, u.equal2.mod)
names(u.equal.final) <- equal.names
saveRDS(u.equal.final, "C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/Estimates/u.equal.final.rds")

u.equal.selected <- u.equal.final %>% 
  dplyr::select(1:11, 51, 101, 201, 301, 401, 501, 601, 701, 801, 901, 1001)

saveRDS(u.equal.selected, "C:/Users/Marco/Vysoká škola ekonomická v Praze/Tony Wei Tse Hung - YAY/Estimates/u.equal.selected.rds")

We save all necessary objects and will evaluate them in the next chapter.