20 Network Analysis in R

20.1 Basic Concepts and Applications

20.1.1 Basic Concepts about Graph

  • vertex(or node)
  • edge
  • direct or undirect graph This course is using igraph for working with network data and network visualizing.

For creating an igraph object , you can use graph.edgelist() ,but remember convert data to matrix form. The position of the nodes and edges of the graph is not important (for now), what is important is the relationship between each node.

Another function to create an igraph object is graph_from_data_frame(). Two data frames need to be provided - one contains all edges in the network with attributes and one all vertices in the network.

  • V(): returns all vertices in a network.
  • E(): returns all edges.
  • gorder() : returns number of vertices.
  • gsize() : returns number of edges.
# Create an igraph object, first you need to convert data to matrix form
friends.mat <- as.matrix(friends)

# Convert friends matrix to an igraph object
g <- graph.edgelist(friends.mat, directed = FALSE)

# Another way to create igraph object
g1 <- graph_from_data_frame(d = friends.edges, vertices = friends.nodes, directed = FALSE)

# Subset vertices and edges
V(g)
## + 16/16 vertices, named, from 0592cac:
##  [1] Jessie  Sidney  Britt   Donnie  Karl    Berry   Rene    Shayne  Elisha 
## [10] Whitney Odell   Lacy    Eugene  Jude    Rickie  Tommy
E(g)
## + 27/27 edges from 0592cac (vertex names):
##  [1] Jessie --Sidney  Jessie --Britt   Sidney --Britt   Sidney --Donnie 
##  [5] Karl   --Berry   Sidney --Rene    Britt  --Rene    Sidney --Shayne 
##  [9] Sidney --Elisha  Sidney --Whitney Jessie --Whitney Donnie --Odell  
## [13] Sidney --Odell   Rene   --Whitney Donnie --Shayne  Jessie --Lacy   
## [17] Rene   --Lacy    Elisha --Eugene  Eugene --Jude    Berry  --Odell  
## [21] Odell  --Rickie  Karl   --Odell   Britt  --Lacy    Elisha --Jude   
## [25] Whitney--Lacy    Britt  --Whitney Karl   --Tommy

gorder(g) # as as vcount(g)
## [1] 16
gsize(g)
## [1] 27
# Make a very basic plot of the network
plot(g)

20.1.2 Add and Visualizing attributes

Vertices and edges could have attributes.

  • set_vertex_attr() : add attributes to vertices.
  • set_edge_attr() : add attributes to edges.

You can use V() to view attributes of vertices in a dataframe.

When attributes are added, something interesting is about to begin . You can subset vertices and edges by specific conditions or delete them.

# suppose you have three attributes
genders <- friends.nodes$gender
ages <- sample(18:23, size = 16, replace = TRUE)
hours <- friends.edges$hours


# Create new vertex attributes called 'gender' and 'age'
g <- set_vertex_attr(g, "gender", value = genders)
## Warning in vattrs[[name]][index] <- value: number of items to replace is not a
## multiple of replacement length
g <- set_vertex_attr(g, "age", value = ages)

# Create new edge attribute called 'hours'
g <- set_edge_attr(g, 'hours', value = hours)
## Warning in eattrs[[name]][index] <- value: number of items to replace is not a
## multiple of replacement length

# View all vertex and edge attributes  in a list
vertex_attr(g)
## $name
##  [1] "Jessie"  "Sidney"  "Britt"   "Donnie"  "Karl"    "Berry"   "Rene"   
##  [8] "Shayne"  "Elisha"  "Whitney" "Odell"   "Lacy"    "Eugene"  "Jude"   
## [15] "Rickie"  "Tommy"  
## 
## $gender
##  [1] "M" "F" "F" "M" "M" "F" "M" "F" "M" "F" "F" "M" "M" "M" "M" "M"
## 
## $age
##  [1] 21 18 18 21 18 19 21 18 19 21 18 18 18 20 18 18
edge_attr(g)
## $hours
##  [1] 1 3 2 3 5 1 3 5 2 1 3 5 3 2 6 1 2 2 1 1 2 1 1 3 2 1 3

# Find edges with conditions, this grammar is like the way Python takes a subset

# V(g)[[age == 22]] 
# E(g)[[inc('Lacy')]]  
# E(g)[[hours>=3]]  

# Delete edges and vertices
g1_1 <- delete_edges(g1, E(g1)[hours < 2])
g1_2 <- delete_vertices(g, V(g)[[age <20]])
# Visualizing attributes
paste("This graph tells me that boys prefer to make friends and their relationship graph is more aggregated")
## [1] "This graph tells me that boys prefer to make friends and their relationship graph is more aggregated"
# Set vertex color by gender
V(g1)$color <- ifelse(V(g1)$gender == "F", "orange", "#24addb")

# Plot the graph
plot(g1, vertex.label.color = "black")

20.1.3 Network Visualization

The most effective network visualizations should immediately provide insight and understanding to the viewer. These styles below can be done separately or in conjunction with each other.

Vertices Style - size - labels - color - shape

Edges Style - color - line type


Appropriate Layout

  • Minimize edge crossing
  • Do not allow vertices to overlap
  • Make edge lengths as uniform as possible
  • Increase symmetry of the network as much as possible
  • Position more influential nodes towards the center

In the plot() function, change the layout argument to produce different graph layout. Choosing a correct layout can be bewildering. Fortunately igraph has a function layout_nicely() that tries to choose the most appropriate layout function for a given graph object.

  • circle
    • layout_in_circle()
  • fruchterman-reingold:
  • kamada-kawai:
    • layout_with_fr()
  • grid:
    • layout_on_grid()
  • lgl:
    • layout_with_lgl()
  • tree:
    • layout_as_tree()

(For more layouts see the manual on igraph website)

You can also stipulate the layout by providing a matrix of (x, y) coordinates for each vertex. Here you use the layout_as_tree() function to generate the matrix m of coordinates. Then pass m to the layout function in plot() to plot.

# Visualization in layout
plot(g1, vertex.label.color = "black", layout = layout_in_circle(g1))

plot(g1, vertex.label.color = "black", layout = layout_with_fr(g1))

m <- layout_as_tree(g1)
plot(g1, vertex.label.color = "black", layout =  m)

m1 <- layout_nicely(g1)
plot(g1, vertex.label.color = "black", layout = m1)

# Visualiztion in edges

# Create a vector of weights based on the number of hours each pair spend together
w1 <- E(g1)$hours

# Plot the network varying edges by weights
m1 <- layout_nicely(g1)
plot(g1, 
        vertex.label.color = "black", 
        edge.color = 'black',
        edge.width = w1,
        layout = m1)

20.2 Directed networks

20.2.1 Basic Concepts

  • head -> tail

The edges in undirected networks simply indicate that a relationship of some kind exists between two vertices. However, in many networks edges do have directionality. In such directed networks, an arrow represent an edge going from one vertex to another vertex.

Subset each vertex from which each edge originates by using head_of(). This function takes two arguments, the first being the graph object and the second the edges to examine. For all edges you can use E(g)

The example network you will use is of a severe measles outbreak that spread through the town of Hagelloch, Germany, in 1861, affecting 188 children. Each edge represents the transmission of measles from one individual to the next.

# Get the graph object
g2 <- graph_from_data_frame(measles, directed = TRUE)

# is the graph directed?
is.directed(g2)
## [1] TRUE
# Where does each edge originate from? But I don't understand the result ...
# table(head_of(g2, E(g2)))

plot(g2, 
     vertex.label.color = "black", 
     edge.color = 'gray77',
     vertex.size = 0,
     edge.arrow.size = 0.1,
     layout = layout_nicely(g2))

20.2.2 Relationships between edges

Check if there is an edge going in each direction between vertex 184 to vertex 178 using single brackets subsetting of the graph object.

Using the function incident() identify those edges that go in either direction from vertex 184 or only those going out from vertex 184. Choose mode from all, in and out.

# Is there an edge going from vertex 184 to vertex 178?
g2['184', '178']
## [1] 1

# Is there an edge going from vertex 178 to vertex 184?
g2['178', '184']
## [1] 0

# Show all edges going to or from vertex 184
incident(g2, '184', mode = c("all"))
## + 6/184 edges from 061a724 (vertex names):
## [1] 184->45  184->182 184->181 184->178 184->183 184->177

# Show all edges going out from vertex 184
incident(g2, '184', mode = c("out"))
## + 6/184 edges from 061a724 (vertex names):
## [1] 184->45  184->182 184->181 184->178 184->183 184->177

20.2.3 Relationships between vertices

  • neighbors(): A vertex is a neighbor of another one (in other words, the two vertices are adjacent), if they are incident to the same edge.
    • intersection(): to identify if there are any vertices that exist in both neighbors1 and neighbors1.

A path is simply the chain of connections between vertices. The number of intervening edges between two vertices represents the geodesic distance between vertices.

  • farthest_vertices(): Find the length of the longest path in the network(Sometimes it can be more than one).
  • get_diameter() : Identify the sequence of the path.
  • ego(): Find the vertices not farther than a given limit from another fixed vertex, these are called the neighborhood of the vertex.
# Identify neighbors of vertex 10 in specific mode
neighbors(g2, '10', mode = c('all'))
## + 5/187 vertices, named, from 061a724:
## [1] 45 27 48 71 95

neighbors(g2, '10', mode = c('in'))
## + 1/187 vertex, named, from 061a724:
## [1] 45

# Identify any vertices that receive an edge from vertex 42 and direct an edge to vertex 124
n1 <- neighbors(g2, '42', mode = c('out'))
n2 <- neighbors(g2, '124', mode = c('in'))
intersection(n1, n2)
## + 1/187 vertex, named, from 061a724:
## [1] 7


# distance between two vertices
farthest_vertices(g2)
## $vertices
## + 2/187 vertices, named, from 061a724:
## [1] 184 162
## 
## $distance
## [1] 5

get_diameter(g2)
## + 6/187 vertices, named, from 061a724:
## [1] 184 178 42  7   123 162

ego(g2, 2, '10', mode = c('out'))
## [[1]]
## + 5/187 vertices, named, from 061a724:
## [1] 10 27 48 71 95

20.2.4 Identifying key vertices

Measurement metrics: - in degree - out degree - Betweenness

Betweenness is an index of how frequently the vertex lies on shortest paths between any two vertices in the network. It can be thought of as how critical the vertex is to the flow of information through a network.

# Calculate the out-degree of each vertex
g.outd <- degree(g2, mode = c("out"))


table(g.outd)
## g.outd
##   0   1   2   3   4   6   7   8  30 
## 125  21  16  12   6   2   3   1   1

# Make a histogram of out-degrees
hist(g.outd, breaks = 30)


# Find the vertex that has the maximum out-degree
which.max(g.outd)
## 45 
##  1

plot(g2, 
     vertex.label = NA,
     edge.color = 'black',
     vertex.size = sqrt(g.outd)+1,
     edge.arrow.size = 0.05,
     layout = layout_nicely(g2))


# Calculate betweenness of each vertex
g.b <- betweenness(g2, directed = TRUE)

hist(g.b, breaks = 80)


# Create plot with vertex size determined by betweenness score
plot(g2, 
     vertex.label = NA,
     edge.color = 'black',
     vertex.size = sqrt(g.b)+1,
     edge.arrow.size = 0.05,
     layout = layout_nicely(g2))

make_ego_graph() : Create a subset of our network comprised of vertices that are connected to fixed vertex , the first argument is the original graph g, the second argument is the maximal number of connections that any vertex needs to be connected to our vertex of interest.

One issue with the measles dataset is that there are three individuals for whom no information is known about who infected them. One of these individuals (vertex 184) appears ultimately responsible for spreading the disease to many other individuals even though they did not directly infect too many individuals. (This entire paragraph is excerpted from the course)

However, because vertex 184 has no incoming edge in the network they appear to have low betweenness. One way to explore the importance of this vertex is by visualizing the geodesic distances of connections going out from this individual. In this exercise you shall create a plot of these distances from this patient zero. (This entire paragraph is excerpted from the course)

# Make an ego graph
g184 <- make_ego_graph(g2, diameter(g2), nodes = '184', mode = c("all"))[[1]]

# Get a vector of geodesic distances of all vertices from vertex 184 
dists <- distances(g184, "184")

# Create a color palette of length equal to the maximal geodesic distance plus one.
colors <- c("black", "red", "orange", "blue", "dodgerblue", "cyan")

# Set color attribute to vertices of network g184.
V(g184)$color <- colors[dists+1]

# Visualize the network based on geodesic distance from vertex 184 (patient zero).
plot(g184, 
     vertex.label = dists, 
     vertex.label.color = "white",
     vertex.label.cex = .6,
     edge.color = 'black',
     vertex.size = 7,
     edge.arrow.size = .05,
     main = "Geodesic Distances from Patient Zero"
     )

20.3 Network Structure

20.3.1 Basic Structure metrics

You will use a social network based on the movie Forrest Gump. Each edge of the network indicates that those two characters were in at least one scene of the movie together. Therefore this network is undirected.

# Make an undirected network
g3 <- graph_from_data_frame(gump, directed = FALSE)

# Identify key nodes using eigenvector centrality
g.ec <- eigen_centrality(g3)
which.max(g.ec$vector)
## FORREST 
##      36

# Plot Forrest Gump Network
plot(g3,
vertex.label.color = "black", 
vertex.label.cex = 0.6,
vertex.size = 25*(g.ec$vector),
edge.color = 'gray88',
main = "Forrest Gump Network"
)

  • edge_density() : calculate the edge density of the graph
  • mean_distance() : calculate the average path length of the graph
# Get density of a graph
gd <- edge_density(g3)

# The diameter of a graph is the length of the longest geodesic.
diameter(g3, directed = FALSE)
## [1] 4

# Get the average path length of the graph g
g.apl <- mean_distance(g3, directed = FALSE)
g.apl
## [1] 1.994967

20.3.2 Random graphs

Generating random graphs is an important method for investigating how likely or unlikely other network metrics are likely to occur given certain properties of the original graph.

The simplest random graph is one that has the same number of vertices as your original graph and approximately the same density as the original graph.(This entire paragraph is excerpted from the course)

erdos.renyi.game() : generate random graphs according to the Erdos-Renyi model, more function details on igraph.

more infomation in random graph

g.random <- erdos.renyi.game(n = gorder(g3), p.or.m = gd, type = "gnp")

plot(g.random, 
    vertex.label = NA,
    vertex.size = 1,
    )


# Get density of new random graph `g.random`
edge_density(g.random)
## [1] 0.06382979

# Get the average path length of the random graph g.random
mean_distance(g.random, directed = FALSE)
## [1] 2.703729

20.3.3 Randomization test

The Forrest Gump network was smaller than the average path length of the random network, you can more formally address this by creating 1000 random graphs based on the number of vertices and density of the original Forrest Gump graph. Then, you can see how many times the average path length of the random graphs is less than the original Forrest Gump network. This is called a randomization test.(This entire paragraph is excerpted from the course)

Conclusion : the Forrest Gump network is far more interconnected than we would expect by chance.

# Generate 1000 random graphs
gl <- vector('list', 1000)
  
for(i in 1:1000){
  gl[[i]] <- erdos.renyi.game(n = gorder(g3), p.or.m = gd, type = "gnp")
}

# Calculate average path length of 1000 random graphs
gl.apls <- unlist(lapply(gl, mean_distance, directed = FALSE))

# Plot the distribution of average path lengths
hist(gl.apls, xlim = range(c(1.5, 6)), breaks = 5)
abline(v = mean_distance(g3,directed=FALSE), col = "red", lty = 3, lwd = 2)


# Calculate the proportion of graphs with an average path length lower than our observed
mean(gl.apls < mean_distance(g3,directed = FALSE))
## [1] 0

20.3.4 Substructures’ metrics

  • triangles() : find all closed triangles in the graph

    • count_triangles() : count how many triangles a vertex is part of, in a graph, or just list the triangles of a graph.
  • transitivity() : transitivity measures the probability that the adjacent vertices of a vertex are connected. You can use this function to get the global or local transitivity. Also, you can use randomization test to investigate if the global transitivity of the Forrest Gump network is significantly higher.(Conclusion : Yes)

  • Clique : In a clique, every vertex is connected to every other vertex.

    • largest_cliques() : identify the largest cliques in the network
    • max_cliques() : determine all the maximal cliques in the network
# Show all triangles in the network. 
matrix(triangles(g3), nrow = 3)[1:3,1:3]
##      [,1] [,2] [,3]
## [1,]   36   36   36
## [2,]    1    1    1
## [3,]   83   38   39

# Count the number of triangles that vertex "BUBBA" is in.
count_triangles(g3, vids='BUBBA')
## [1] 37

# Calculate  the global transitivity of the network.
g.tr <- transitivity(g3)
g.tr
## [1] 0.1918082

# Calculate the local transitivity for vertex BOB HOPE.
transitivity(g3, vids='BOB HOPE', type = "local")
## [1] 1


gl.tr <- lapply(gl, transitivity)
gl.trs <- unlist(gl.tr)


# Calculate the proportion of graphs with a transitivity score higher than Forrest Gump's network
mean(gl.trs > g.tr)
## [1] 0

largest_cliques(g3)
## [[1]]
## + 9/94 vertices, named, from 0756b10:
## [1] EMCEE   FORREST MEN     MAN #5  MAN #3  MAN #2  MAN #1  MAN #   JENNY  
## 
## [[2]]
## + 9/94 vertices, named, from 0756b10:
## [1] FORREST   LT DAN    STRONGARM SONG      SOLDIER   SGT SIMS  MAN      
## [8] DALLAS    BUBBA

clq <- max_cliques(g3) # some confusing in function name

# Calculate the size of each maximal clique.
table(unlist(lapply(clq, length)))
## 
##  2  3  4  5  6  7  9 
## 12 24  7  2  4  2  2

20.3.5 Visualize largest cliques

Steps are: - You need get the largest cliques(may not one) - Subset from the original graph - Plot it with circle layout

Then you can change some arguments in plot function, I want to make them simple .

lc <- largest_cliques(g3)

# Create two new undirected subgraphs, each containing only the vertices of each largest clique.
gs1 <- as.undirected(subgraph(g3, lc[[1]]))
gs2 <- as.undirected(subgraph(g3, lc[[2]]))

par(mfrow=c(1,2)) # To plot two plots side-by-side

plot(gs1,
     vertex.label = NA, 
     vertex.size = 0,
     edge.color = 'gray28',
     #main = "Largest Clique 1",
     layout = layout.circle(gs1)
)

plot(gs2,
     vertex.label = NA, 
     #vertex.label.color = "black", 
     #vertex.label.cex = 0.9,
     vertex.size = 0,
     edge.color = 'gray28',
     #main = "Largest Clique 2",
     layout = layout.circle(gs2)
)

20.4 Do birds of a feather flock together?

  • Assortativity
    • assortativity() : calculate the assortativity based on specific attribute .The assortativity coefficient is positive is similar vertices (based on some external property) tend to connect to each, and negative otherwise.
    • assortativity.degree() : calculate the degree assortativity of the network .
  • Reciprocity
    • reciprocity() : the reciprocity of a directed network reflects the proportion of edges that are symmetrical.

You can also do randomization test to these two metrics.

# Convert the gender attribute into a numeric value
values <- as.numeric(factor(V(g1)$gender))

# Calculate the assortativity of the network based on gender
assortativity(g1, values)
## [1] 0.1319444

# Calculate the assortativity degree of the network
assortativity.degree(g1, directed = FALSE)
## [1] 0.4615385

# Calculate the reciprocity of the direct graph
reciprocity(g2)
## [1] 0

20.4.1 Community detection

If a network is said to have a community structure then it is possible to assign vertices to unique sets.

  • fastgreedy.community() : a community detection method .
  • edge.betweenness.community() : an alternative community detection method.
  • sizes() : determine how many communities were detected and how many club members are in each.
  • membership() : display which club members are in which community.
  • disjoint_union(): the union of two or more graphs are created. The graphs are assumed to have disjoint vertex sets. Notice:
  • x %du% y is a simplified form of disjoint_union(...), and this example is from igraph document.
  • Because the division of this figure is so obvious, there is no difference between the two methods
# Create a graph
g <- make_full_graph(5) %du% make_full_graph(5) %du% make_full_graph(5)

g <- add_edges(g, c(1,6, 1,11, 6, 11))

# make community detection
fc <- fastgreedy.community(g)
gc <- edge.betweenness.community(g)

# view metrics
sizes(fc)
## Community sizes
## 1 2 3 
## 5 5 5
membership(fc)
##  [1] 3 3 3 3 3 1 1 1 1 1 2 2 2 2 2
sizes(gc)
## Community sizes
## 1 2 3 
## 5 5 5

# Because this graph is
par(mfrow = c(1, 2)) 
plot(fc, g)
plot(gc, g)

20.4.2 Interactive network visualizations

There are many other R packages that also make network visualizations.

igraph visNetwork
statnet networkD3
ggnet sigma
ggnetwork rgexf(igraph to Gephi)
ggraph threejs

The threejs package is especially useful for visualizing large networks and integrates very well with igraph. And you can use it to create interactive network visualization.

graphjs(g1, vertex.size = 1)
# Another unrelated but unprecedented operation that I have not encountered before
i <-  membership(fc)
c("yellow", "blue", "red")[i]
##  [1] "red"    "red"    "red"    "red"    "red"    "yellow" "yellow" "yellow"
##  [9] "yellow" "yellow" "blue"   "blue"   "blue"   "blue"   "blue"