15 Homophily and Exponential Random Graphs (ERGM)
15.1 Homophily
In this tutorial, we cover how to A) calculate homophily on a network and B) fit exponential random graphs to networks. First, we will need one of the Add Health data sets that we have been playing around with in previous tutorials. The code below downloads the data from Moreno’s website and converts it into an igraph object. We went over this in the Transitivy tutorial, so here, I just paste the code.
Through this tutorial, we will rely on igraph to analyze the comm59 Add Health network that we made use of last class.
# read in the edge list from our github
<- read.table("https://raw.githubusercontent.com/mahoffman/stanford_networks/main/data/comm59.dat.txt", header = T)
el # Read in attributes from our github
<- read.table("https://raw.githubusercontent.com/mahoffman/stanford_networks/main/data/comm59_att.dat.txt", header = T)
attributes # add an ID column
$ID <- 1:nrow(attributes)
attributes
# Indexing data so that you only put in certain columns
<- el[,1:2] # We will ignore the ranking variable for now.
el_no_weight <- as.matrix(el_no_weight) # igraph requires a matrix
el_no_weight
# convert ids to characters so they are preserved as names
1] <- as.character(el_no_weight[,1])
el_no_weight[,2] <- as.character(el_no_weight[,2])
el_no_weight[,
# Graph the network
library(igraph)
<- graph.edgelist(el_no_weight, directed = T)
net59
# Finally, add attributes
# First link vertex names to their place in the attribute dataset
<- match(V(net59)$name, attributes$ID)
linked_ids
# Then we can use that to assign a variable to each user in the network
V(net59)$race <- attributes$race[linked_ids]
V(net59)$sex <- attributes$sex[linked_ids]
V(net59)$grade <- attributes$grade[linked_ids]
V(net59)$school <- attributes$school[linked_ids]
<- delete.vertices(net59, which(is.na(V(net59)$sex) | V(net59)$sex == 0))
net59 <- delete.vertices(net59, which(is.na(V(net59)$race) | V(net59)$race == 0))
net59 <- delete.vertices(net59, which(is.na(V(net59)$grade) | V(net59)$grade == 0)) net59
Great, now that we have the network, we can evaluate homophily. We can either use igraph’s built in function…
assortativity(net59, types1 = as.numeric(V(net59)$sex))
## [1] 0.2180534
Or do it ourselves. If you’ll remember from class on Tuesday, assortativity on variable is calculated by simply correlating the values of an attribute for every ego-alter pair in the network. We just grab the edgelist, match in ego and alter’s values for the variable of interest, and correlate with cor.test.
<- data.frame(get.edgelist(net59), stringsAsFactors = F)
df $sex1 <- as.numeric(attributes$sex[match(df$X1, attributes$ID)])
df$sex2 <- as.numeric(attributes$sex[match(df$X2, attributes$ID)])
dfcor.test(df$sex1, df$sex2)
##
## Pearson's product-moment correlation
##
## data: df$sex1 and df$sex2
## t = 14.092, df = 3978, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1882596 0.2474463
## sample estimates:
## cor
## 0.2180534
Race, however, is probably better conceptualized as a categorical variable. Assortativity_nominal can be used to evaluate assortativity for categorical variables. It requires a numeric vector, denoting the different categories, which starts at 1, so we add one to the race variable, which starts at 0.
assortativity_nominal(net59, types = as.numeric(V(net59)$race) + 1)
## [1] 0.04103466
15.2 ERGMs
Now imagine, like Wimmer and Lewis, we wanted to calculate homophily for race, but wanted to control for other network factors, such as transitivity, which might lead to a higher degree of same race friendships, but which don’t actually signal an in-group preference. Just like them, we can use exponential random graphs, which model networks as a function of network statistics. Specifically, ERGMs imagine the observed network to be just one instantiation of a set of possible networks with similar features, that is, as the outcome of a stochastic process, which is unknown and must therefore be inferred.
The package that allows one to fit ergm models is part of the statnet (statistical networks) suite of packages. Much like tidyverse, which you might be familiar with, statnet includes a number of complementary packages for the statistical analysis of networks. Let’s install statnet and load it into R.
install.packages("statnet")
library(statnet)
Great, now we can use statnet’s ergm() function to fit our first ERGM. The only problem? Our network is an igraph object rather than a statnet one. There is some good news though. People have built a package for converting igraph objects to statnet and vise versa - intergraph. Let’s install that and load it in too.
install.packages("intergraph")
library(intergraph)
Now that we have intergraph installed and in our R environment, we can use the asNetwork function to convert it to a statnet object. I will also subset the data here, because ERGMs take forever to run and the network is somewhat large (~1000 nodes).
set.seed(1234)
<- sample(V(net59)$name, 350, replace = F)
sampled_students <- igraph::delete.vertices(net59, !V(net59)$name %in% sampled_students)
net59_for_ergm
<- asNetwork(net59_for_ergm)
statnet59 statnet59
## Network attributes:
## vertices = 350
## directed = TRUE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges= 547
## missing edges= 0
## non-missing edges= 547
##
## Vertex attribute names:
## grade race school sex vertex.names
##
## No edge attributes
It even ported over our node attributes! You can plot it to see if it looks the same as it did in igraph.
plot(statnet59,
vertex.col = "tomato",
vertex.cex = 1)
With our data ready for analysis using statnet, let’s build our first ERGM. There are a number of potential ERG model terms that we could use in fitting our model. You can view a full list by looking at the documentation for:
?ergm.terms
That said, it is common to build up from some basic terms first. The McFarland-Moody paper we read in class is a useful reference point as is the Wimmer-Lewis paper. Generally, the first term that people use is the edges term. It is a statistic which counts how many edges there are in the network.
<- ergm(statnet59 ~ edges, control = control.ergm(seed = 1234)) random_graph
How do we interpret this coefficient? Coefficients in ERGMs represent the change in the (log-odds) likelihood of a tie for a unit change in a predictor. We can use a simple formula for converting log-odds to probability to understand them better.
<- function(logit){
inv.logit <- exp(logit)
odds <- odds / (1 + odds)
prob return(prob)
}
<- coef(random_graph)
theta inv.logit(theta)
## edges
## 0.004478101
So the probability of an edge being in the graph is roughly 0.02. The probability of an edge being drawn should in theory be the same as density - let’s check.
network.density(statnet59)
## [1] 0.004478101
Nice.
To put it all back into the theoretical underpinnings of ERGMs, we have modeled a stochastic generating process, and the only constraint we have put on the stochastic process is the probability with which edges are drawn, which was set equal to the network density of our observed graph.
We can more closely examine the model fit using the summary() function, just like lm() or glm() in base R.
summary(random_graph)
## Call:
## ergm(formula = statnet59 ~ edges, control = control.ergm(seed = 1234))
##
## Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -5.40407 0.04285 0 -126.1 <1e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 169336 on 122150 degrees of freedom
## Residual Deviance: 7009 on 122149 degrees of freedom
##
## AIC: 7011 BIC: 7020 (Smaller is better. MC Std. Err. = 0)
We can also simulate graphs using our ERGM fit. We did something similar to this when we simulated random graphs and set the probability of an edge being drawn to be equal to our network’s density. We just didn’t know it at the time!
set.seed(1234)
<- simulate(random_graph,
hundred_simulations coef = theta,
nsim = 100,
control = control.simulate.ergm(MCMC.burnin = 1000,
MCMC.interval = 1000))
Let’s examine the first nine simulations.
par(mfrow = c(3, 3))
sapply(hundred_simulations[1:9], plot, vertex.cex = 1, vertex.col = "tomato")
We can compare the number of edges our observed graph has to the average of the simulated networks.
<- unlist(lapply(hundred_simulations, network.density))
net_densities
hist(net_densities, xlab = "Density", main = "", col = "lightgray")
abline(v = network.density(statnet59), col = "red", lwd = 3, lty = 2)
abline(v = mean(net_densities), col = "blue", lwd = 3, lty = 1)
Pretty close!
Another way to evaluate our model is to use the built-in goodness of fit measures. Essentially, we will evaluate whether our network has similar structural features as the simulated graphs. ergm has a built-in function - gof() - which calculates the statistics for us. We just have to tell it how many simulations we want to use in the comparison set - the larger the number, the more accurate representation of the model.
<- gof(random_graph)
gof_stats
par(mfrow = c(2, 3))
plot(gof_stats, main = '')
On one measure, edgewise shared partners, our model looks okay. On the others, especially in and out degree, it looks awful. How to we improve our fit? By adding more terms to the model!
First, let’s build a model with only dyad-independent terms. Just like with the random graph, we are essentially fitting a logistic regression.
nodematch is the homophily term in ergm. We can specify the attribute we want to examine as well as the diff argument, which allows for differential levels of homophily for different groups.
<- ergm(statnet59 ~ edges +
model1 nodematch("race") +
nodematch("sex") +
nodematch("grade"))
summary(model1)
## Call:
## ergm(formula = statnet59 ~ edges + nodematch("race") + nodematch("sex") +
## nodematch("grade"))
##
## Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -6.91859 0.12039 0 -57.468 < 1e-04 ***
## nodematch.race 0.05272 0.09125 0 0.578 0.56341
## nodematch.sex 0.26996 0.08717 0 3.097 0.00195 **
## nodematch.grade 2.87780 0.10361 0 27.774 < 1e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 169336 on 122150 degrees of freedom
## Residual Deviance: 6007 on 122146 degrees of freedom
##
## AIC: 6015 BIC: 6054 (Smaller is better. MC Std. Err. = 0)
Every variable is significant! Grade and race have especially large coefficients and all three are positive.
Let’s try it with diff set to T. We will limit our examination to only grade/race/sex categories represented by a large number of vertices in our network. You can examine this using the table function
table(V(net59_for_ergm)$race) # 1 = white, 2 = black, 3 = hispanic, 4 = asian, and 5 = mixed/other
table(V(net59_for_ergm)$sex) # 1 = male, 2 = female
table(V(net59_for_ergm)$grade)
<- ergm(statnet59 ~ edges +
model2 nodematch("race", diff = T, levels = c("1", "2", "5")) +
nodematch("sex", diff = T, levels = c("1", "2")) +
nodematch("grade", diff = T, levels = as.character(c(7:12))))
summary(model2)
## Call:
## ergm(formula = statnet59 ~ edges + nodematch("race", diff = T,
## levels = c("1", "2", "5")) + nodematch("sex", diff = T, levels = c("1",
## "2")) + nodematch("grade", diff = T, levels = as.character(c(7:12))))
##
## Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -6.96247 0.12167 0 -57.226 < 1e-04 ***
## nodematch.race.1 0.11676 0.09397 0 1.243 0.214032
## nodematch.race.2 -Inf 0.00000 0 -Inf < 1e-04 ***
## nodematch.race.5 -0.07078 0.28052 0 -0.252 0.800805
## nodematch.sex.1 0.17007 0.11570 0 1.470 0.141582
## nodematch.sex.2 0.34239 0.09813 0 3.489 0.000485 ***
## nodematch.grade.7 2.86262 0.14430 0 19.838 < 1e-04 ***
## nodematch.grade.8 2.65692 0.16632 0 15.975 < 1e-04 ***
## nodematch.grade.9 2.97460 0.12574 0 23.657 < 1e-04 ***
## nodematch.grade.10 2.42069 0.19922 0 12.151 < 1e-04 ***
## nodematch.grade.11 2.54508 0.21460 0 11.860 < 1e-04 ***
## nodematch.grade.12 3.31829 0.14238 0 23.306 < 1e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 169336 on 122150 degrees of freedom
## Residual Deviance: 5976 on 122138 degrees of freedom
##
## AIC: 5998 BIC: 6105 (Smaller is better. MC Std. Err. = 0)
##
## Warning: The following terms have infinite coefficient estimates:
## nodematch.race.2
Interesting! Now let’s add some dyad-dependent terms. These can be a bit finnicky, especially transitivity ones.
mutual is the term for reciprocity.
<- ergm(statnet59 ~ edges +
model3 nodematch("race") +
nodematch("sex") +
nodematch("grade") +
mutual)
summary(model3)
## Call:
## ergm(formula = statnet59 ~ edges + nodematch("race") + nodematch("sex") +
## nodematch("grade") + mutual)
##
## Monte Carlo Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -6.95232 0.10987 0 -63.275 <1e-04 ***
## nodematch.race 0.04004 0.07970 0 0.502 0.6154
## nodematch.sex 0.19509 0.07657 0 2.548 0.0108 *
## nodematch.grade 2.32082 0.10386 0 22.345 <1e-04 ***
## mutual 4.39947 0.15846 0 27.764 <1e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 169336 on 122150 degrees of freedom
## Residual Deviance: 5418 on 122145 degrees of freedom
##
## AIC: 5428 BIC: 5477 (Smaller is better. MC Std. Err. = 1.117)
Now let’s add a term for triadic closure. There are a few terms for triads - one of them, triangles, tends to lead to degeneracy. The gwesp term behaves better, though convergence is far from guaranteed.
It may be a good time to use the bathroom. This will take a while…
<- ergm(statnet59 ~ edges +
model4 nodematch("race") +
nodematch("sex") +
nodematch("grade") +
+
mutual gwesp(0.25, fixed = T),
control=control.ergm(MCMLE.maxit= 40))
# you can change a number of other things about the MCMC algorithm - from its burn-in to its step and sample size
# here we just up the maximum iterations we wait to see if it has converged
summary(model4)
Let’s take a look at the goodness-of-fit statistics for this most elaborate model.
<- gof(model4)
model4_gof
par(mfrow = c(3, 2))
plot(model4_gof, main = '')
Definitely an improvement over the random graph.
Lab: Run ergm on a graph of interest to you and include balance (reciprocity and triadic closure) and homophily terms. Interpret the results. Simulate a graph from this model and compare it to your original graph.