11 Transitivity, structural balance, and hierarchy

The main goal of this tutorial is to delve more deeply into the microfoundations of networks, such as dyads and tryads. We will learn the basic functions for measuring reciprocity, transitivity, the triad census, and for identifying cliques. We will also learn how to “ban” triads from a graph, and begin to think about whether our graphs deviate, in a statistical sense, from what we might expect by chance, one way of evaluating whether our results are “meaningful”.

Through this tutorial, we will rely on igraph to analyze the comm59 Add Health network that we made use of last class.

# read in the edge list from our github
el <- read.table("https://raw.githubusercontent.com/mahoffman/stanford_networks/main/data/comm59.dat.txt", header = T)
# Read in attributes from our github
attributes <- read.table("https://raw.githubusercontent.com/mahoffman/stanford_networks/main/data/comm59_att.dat.txt", header = T)
# add an ID column
attributes$ID <- 1:nrow(attributes)

Next let’s graph it as a network, ignoring the ranking of friendships for now.

# First read in igraph
library(igraph)

# Indexing data so that you only put in certain columns
el_no_weight <- el[,1:2] # We will ignore the ranking variable for now.
el_no_weight <- as.matrix(el_no_weight) # igraph requires a matrix

# convert ids to characters so they are preserved as names
el_no_weight[,1] <- as.character(el_no_weight[,1])
el_no_weight[,2] <- as.character(el_no_weight[,2])

# Graph the network
net59 <- graph.edgelist(el_no_weight, directed = T)

# Finally, add attributes  
# First link vertex names to their place in the attribute dataset
linked_ids <- match(V(net59)$name, attributes$ID)

# Then we can use that to assign a variable to each user in the network
V(net59)$race <- attributes$race[linked_ids]
V(net59)$sex <- attributes$sex[linked_ids]
V(net59)$grade <- attributes$grade[linked_ids]
V(net59)$school <- attributes$school[linked_ids]

net59 # Great!
## IGRAPH d05711b DN-- 975 4160 -- 
## + attr: name (v/c), race (v/n), sex (v/n), grade (v/n), school (v/n)
## + edges from d05711b (vertex names):
##  [1] 1 ->191 1 ->245 1 ->272 1 ->413 1 ->447 3 ->21  3 ->221 3 ->480 3 ->495
## [10] 3 ->574 5 ->96  5 ->258 5 ->335 6 ->271 6 ->374 6 ->400 6 ->489 6 ->491
## [19] 6 ->573 6 ->586 7 ->134 7 ->159 7 ->464 7 ->478 8 ->221 8 ->284 8 ->378
## [28] 8 ->557 9 ->137 9 ->442 9 ->473 9 ->498 10->20  10->22  10->64  10->75 
## [37] 10->89  10->219 10->272 10->276 11->101 11->155 11->190 11->273 11->337
## [46] 11->339 11->353 11->616 12->188 12->475 14->20  14->151 14->597 15->106
## [55] 15->189 15->233 15->325 15->333 15->449 15->491 15->552 15->624 15->627
## [64] 16->30  16->201 16->217 16->267 16->268 16->466 16->569 17->625 19->45 
## + ... omitted several edges

11.1 The Dyad

We can break large social networks down into their constituent parts. These constituent parts are referred to as “motifs”. The most basic motif consists of two nodes and is called a dyad. Edges in a network signify the presence or absence of dyadic relations. It follows that a dyad in an undirected network can have two unique configurations: connected or disconnected; and three unique configurations in a directed network (mutual, assymetric, and null)

Density captures, at the macro-level, the proportion of dyads that are present over the possible total number of dyads in the network. We are simply re-framing what we discussed last class, except we are focusing on the configuration of nodes as opposed to edges.

graph.density(net59)
## [1] 0.004380561

A related concept is that of reciprocity, a measure which pertains only to directed graphs. Reciprocity is the tendency with which affect, or network ties, sent out by egos are returned by alters. Edges are reciprocal when ego and alter both send each other ties; reciprocity is the graph-level analogue, evaluating the tendency for edges to be reciprocal across the whole network.

reciprocity(net59)
## [1] 0.39375

So our graph has a reciprocity score of 0.39. Is that high or low? It depends on your expectation. If you are from a society with a strong taboo against unrequited affect, then it might seem low. If you come from an individualistic society, it might seem high.

11.2 Generating a random graph for comparison

One way network scholars evaluate whether a given descriptive statistic is high or low is to compare it to the value that obtains under a random network of similar density. In random graphs, the chance that any two dyads are in a relation is determined by chance (i.e. the flip of a coin). This means that the likelihood of observing a given tie is independent from observing a tie between any other dyad.

There are a lot of reasons that this is unrealistic.. For example, if A and B are friends and A and C are friends, then we would expect the probability that B and C are friends to be higher. That is, you are more likely to be friends with your friends friends than with strangers. This is a basic feature of most social networks and one that was emphasized in our readings this week on transitivity. Even in dyads, if someone shows you affection you are more likely to return it. We therefore might want to factor in this human tendency towards triadic and dyadic closure into our null model - later in the class we will discuss ways of doing this. That said, random graphs have many properties that are mathematically and heuristically useful, which is why they are commonly used as null models.

igraph has a fast and easy function for generating random graphs.

?erdos.renyi.game

In an erdos.renyi.graph, each edge has the same probability of being created. We determine the probability and it returns a random graph with a density that equals (in expectation) this probability.

First, we need to calculate the density and number of nodes in our graph.

net59_n <- vcount(net59)
net59_density <- graph.density(net59)

Then we can input those into the erdos.renyi.game function provided by igraph to generate a network of the same size and density, but with edges that are random re-arranged.

random_graph <- erdos.renyi.game(n = net59_n, p.or.m = net59_density, directed = TRUE) # where n is the number of nodes, p.or.m is the probability of drawing an edge, directed is whether the network is directed or not

Let’s take a look at the graph.

plot(random_graph,  
     vertex.size = 2, 
     vertex.label = NA, 
     edge.curved = .1, 
     vertex.color = "tomato", 
     edge.arrow.size = .1, 
     edge.width = .5, 
     edge.color = "grey60")

It looks like a bowl of spaghetti - that is how you know you indeed have a random graph on your hands. What is its reciprocity?

reciprocity(random_graph)
## [1] 0.003703704

0.003! So our network exhibits far more reciprocity than we would expect if people were affiliating randomly.

11.3 The Triad

Triads consist of three nodes and are therefore more complex than dyads, with more possible arrangements.

This becomes clear when you realize an undirected triad consists of three dyads: A and B, B and C, and A and C.

In an undirected graph, there are eight possible triads (2^3… sort of obvious given that each dyad can be present or absent and there are three dyads). Of those eight possible triads, four are isomorphic, so that there are four unique triads.

In an directed graph, there will be sixteen unique triads (the unique number of motifs in directed graphs is not easily reduced to a functional form). These sixteen unique triads give rise to the MAN framework as well as the triad census, which we discussed in class.

The triad census calculates how many triads there are of each type (which, as I just mentioned, in a directed network amounts to 16). If we see a network with very few complete (003) triads, then we know something about the macro-level structure, just by looking at the frequencies of its constituent parts at the micro-level. By extension, if the whole distribution of triads is very different than the distribution that obtains under a random network, then we hopefully learn something about the macro-level structure that we couldn’t observe just by looking at a visualization of the network.

11.4 Calculating a triad census

igraph has a built in function for the triad census: triad.census()

It takes a network object as an argument and returns the number of each type of triad in the network.

triad.census(net59)
##  [1] 150776065   2416731    785249      3621      4325      5165      4028
##  [8]      3619       450        23      1019       294       320       137
## [15]       385       144

As you can see, it returns 16 different numbers. It uses the M-A-N classification structure: M stands for the number of Mutual ties, A stands for the number of Asymmetric ties, and N stands for the number of Null ties.

Mutual means that ego and alter (say A and B) have a mutual relation with each other (A likes B and B likes A). A means that ego and alter have an asymmetric relation with one another, i.e. A likes B but B doesn’t like A… the relation is not reciprocated. Finally, N means that A and B do not have any relation.

The image below visualizes the different types of triads possible in a directed graph.

?triad.census also describes these possible types, telling you the order that the triad types are presented in the triad.census() output.

003 for example means there are 0 mutual relation, 0 asymmetric relations and 3 null relations. This triad contains no relations. 201 would mean there are two mutual relations, zero asymmetric relations, and 1 null relation. Of three dyads in the triad, two dyads are in a relation, and one dyad is not.

For now let’s look at the triad census of our random graph.

triad.census(random_graph)
##  [1] 149843876   4112275      7659      9519      9363     18647        67
##  [8]        57        86        25         0         0         0         1
## [15]         0         0

Most triads are null. This is because edges only have a 5% chance of being drawn. It follows that 95% of the total possible edges are missing.

There are not many cases of transitive triads, or even in which all three dyads have a relationship. Thus, the majority of triads are concentrated in the left side of the triad distribution

Let’s compare it our Add Health network.

triad.census(net59)
##  [1] 150776065   2416731    785249      3621      4325      5165      4028
##  [8]      3619       450        23      1019       294       320       137
## [15]       385       144

Which triads are common in our network, but not in the random graph? What might this tell us?

11.5 Random graphs galore!

One problem is that we are basing our analysis on a single random graph. Because edges are drawn randomly, there is a lot of variation in the structure of random graphs, especially when the number of nodes in the graph is small (less than one thousand).

What we really want is a distribution of random graphs and their triad censuses, against which our own could be compared. So let’s generate one hundred random graphs, and create a distribution of random graph triad censuses and see where our graph lies on that distribution

trial <- vector("list", 100) # this creates a list with 100 spaces to store things.  We will store each result here. 

for ( i in 1:length(trial) ){
  random_graph <- erdos.renyi.game(n = net59_n, p.or.m = net59_density, directed = TRUE) 
  trial[[i]] <- triad.census(random_graph) # We assign to the ith space the result. So for the first iteration, it will assign the result to the first space in the list
}

trial_df <- do.call("rbind", trial) # We can use the do.call and "rbind" functions together to combine all of the results into a matrix, where each row is one of our trials

colnames(trial_df) <- c("003", "012", "102", "021D", "021U", "021C", "111D", "111U", "030T", "030C", "201", "120D", "120U", "120C", "210", "300") # It is worth naming the columns too.

trial_df_w_observed <- rbind(trial_df, as.numeric(triad.census(net59))) # add in the observed results

Now we have this 100 row dataset of simulation results with the observed results tacked on at the end (of course, we could have done 1,000 or 10,000 iterations!) Let’s produce, for each column, some simple statistics, like a mean and a confidence interval.

# First, standardize all of the columns by dividing each of their values by the largest value in that column, so that each will be on a similar scale (0 to 1), we can visualize them meaningfully 
trial_df_w_observed <- as.data.frame(trial_df_w_observed)

trial_df_w_observed[,1:ncol(trial_df_w_observed)] <- sapply(trial_df_w_observed[,1:length(trial_df_w_observed)], function(x) x/max(x))

# Then split the observed from the simulation results
trial_df <- as.data.frame(trial_df_w_observed[1:100,])
observed <- as.numeric(trial_df_w_observed[101,])

# Summarize the simulation results and add the observed data set back in for comparison
summarized_stats <- data.frame(TriadType = colnames(trial_df), 
                               Means = sapply(trial_df, mean), 
                               LowerCI = sapply(trial_df, function(x) quantile(x, 0.05)),
                               UpperCI = sapply(trial_df, function(x) quantile(x, 0.95)), 
                               Observed = observed) 

summarized_stats
##      TriadType        Means     LowerCI      UpperCI  Observed
## 003        003 0.9949328855 0.994278750 0.9955706706 1.0000000
## 012        012 0.9608529038 0.938119928 0.9853509927 0.5885101
## 102        102 0.0113484895 0.006088451 0.0170609577 1.0000000
## 021D      021D 0.9154071799 0.869776554 0.9634861802 0.3834586
## 021U      021U 0.9270289079 0.880053533 0.9740256959 0.4630621
## 021C      021C 0.9239931522 0.880662850 0.9680933019 0.2763214
## 111D      111D 0.0196573982 0.010166336 0.0298659384 1.0000000
## 111U      111U 0.0212683062 0.008565902 0.0326471401 1.0000000
## 030T      030T 0.1702000000 0.144333333 0.2090000000 1.0000000
## 030C      030C 0.6840540541 0.458108108 0.9472972973 0.6216216
## 201        201 0.0001079490 0.000000000 0.0009813543 1.0000000
## 120D      120D 0.0005782313 0.000000000 0.0034013605 1.0000000
## 120U      120U 0.0004062500 0.000000000 0.0031250000 1.0000000
## 120C      120C 0.0021897810 0.000000000 0.0072992701 1.0000000
## 210        210 0.0000000000 0.000000000 0.0000000000 1.0000000
## 300        300 0.0000000000 0.000000000 0.0000000000 1.0000000

Now that we have this dataset, we can use ggplot to plot it.

library(ggplot2)
ggplot(summarized_stats) + 
    geom_point(aes(x=TriadType, y=Means, colour=TriadType)) + 
    geom_errorbar(aes(x = TriadType, ymin=LowerCI, ymax=UpperCI, colour=TriadType), width=.1) +
    geom_point(aes(x=TriadType, y=Observed, colour="Observed")) +
    coord_flip()

Beautiful! What do you see? Which triads are more or less common in our graph than in a random graph? What might this tell us about the macro-structure?

11.6 Producing a tau statistic

Given a weighting scheme, which values some triads and not others, we can evaluate whether our network fits a macrolevel model of our choice. For example, below is a weighting scheme for the ranked-clustering weighting scheme, drawn from Daniel A. McFarland, et al. “Network Ecology and Adolescent Social Structure”. We can compare this to that produced by a random graph to see if our tau count is more or less than we should expect by chance.

weighting_scheme <- c(0,0,0,1,1,-1,0,0,1,0,0,1,1,0,0,0)

sum(triad.census(net59) * weighting_scheme)
## [1] 3845

11.7 Banning triads

What if we want to ban some triads, but allow others? How might we go about building a simulation that does that? Below is an example. The basic idea is to sample nodes, draw an edge randomly with some probability (i.e. to control density), and then choose to keep or delete that edge, depending on if the triangles that it produces are allowed or not.

# A basic function which prevents the formation of specified triads in a random graph simulation
banning_triads_game = function(n = 100, porm = .05, banned = c(2), sim_max = 1000000, probrecip = .5){
  
  require(igraph) # Ensures igraph is loaded
  
  if(any(c(1) %in% banned)){
    stop("Can't ban 003s") # Stops the simulation if the user tried to bad 003 or 012 triads
  }
  
  num_edges = round(n*(n-1)*porm, 2) # calculates the desired number of edges according to the N and Porm parameters
  
  net = make_empty_graph(n = n, directed = TRUE) # initializes an empty network

  edge_count = 0
  sim_count = 0
  
  while(edge_count < num_edges){ # Begins a loop, which ends once the requisite number of edges is reached. 
    
    # This part samples two nodes, checks whether the two sampled nodes are the same node, and whether an edge is already present in the network between these nodes
    
    uniq = TRUE
    edge_present = TRUE
    while(uniq == TRUE | edge_present == TRUE){
      edge_id = sample(1:n, 2, replace = T)
      uniq = edge_id[1] == edge_id[2]
      reciprocated = sample(c(FALSE, TRUE), 1, prob = c(1-probrecip, probrecip))
      edge_present_1 = are.connected(net, edge_id[1], edge_id[2])
      if (reciprocated){
        edge_present_2 = are.connected(net, edge_id[2], edge_id[1])
        edge_present = edge_present_1|edge_present_2
      } else {
        edge_present = edge_present_1
      }
    }
    
    # Calculates the traid census for the network before adding an edge
    before = triad.census(net)
    net_new = net + edge(edge_id) # Adds in the edge
    if(reciprocated){
      edge_id_rev = edge_id[2:1]
      net_new = net_new + edge(edge_id_rev) # Adds in the edge
    }
    after = triad.census(net_new) # Calculates the triad census again
    triad_diff = after - before # Checks to see how much the triad census changed
    
    if(all(triad_diff[banned] == 0)){
      net = net_new # If the banned triads still aren't observed, then the new network is accepted.
    }
    
    edge_count = ecount(net) # number of edges updated
    sim_count = sim_count + 1 # Simulation count updated
    if(sim_count > sim_max){
      print("Warning: Failed to converge, banned triads may be incompatible") # exits simulation if simulation max count is exceeded
      return(net)
    }
  }
  return(net) # Returns the simulated network
}

Cool - let’s try it. Here, we ban three triad types.

no_cycles = banning_triads_game(banned = c(4,5,7))
triad_census(no_cycles)
##  [1] 137539   2682  20010      0      0      9      0    123      0      0
## [11]   1291      0      2      3      9     32
plot(no_cycles, vertex.size = 2, vertex.label = NA, vertex.color = "tomato", edge.arrow.size = .2)