20 Network Analysis in R
20.1 Basic Concepts and Applications
20.1.1 Basic Concepts about Graph
- vertex(or node)
- edge
- direct or undirect graph This course is using igraph for working with network data and network visualizing.
For creating an igraph object , you can use graph.edgelist()
,but remember convert data to matrix form. The position of the nodes and edges of the graph is not important (for now), what is important is the relationship between each node.
Another function to create an igraph object is graph_from_data_frame()
. Two data frames need to be provided - one contains all edges in the network with attributes and one all vertices in the network.
V()
: returns all vertices in a network.E()
: returns all edges.gorder()
: returns number of vertices.gsize()
: returns number of edges.
# Create an igraph object, first you need to convert data to matrix form
<- as.matrix(friends)
friends.mat
# Convert friends matrix to an igraph object
<- graph.edgelist(friends.mat, directed = FALSE)
g
# Another way to create igraph object
<- graph_from_data_frame(d = friends.edges, vertices = friends.nodes, directed = FALSE)
g1
# Subset vertices and edges
V(g)
## + 16/16 vertices, named, from 0592cac:
## [1] Jessie Sidney Britt Donnie Karl Berry Rene Shayne Elisha
## [10] Whitney Odell Lacy Eugene Jude Rickie Tommy
E(g)
## + 27/27 edges from 0592cac (vertex names):
## [1] Jessie --Sidney Jessie --Britt Sidney --Britt Sidney --Donnie
## [5] Karl --Berry Sidney --Rene Britt --Rene Sidney --Shayne
## [9] Sidney --Elisha Sidney --Whitney Jessie --Whitney Donnie --Odell
## [13] Sidney --Odell Rene --Whitney Donnie --Shayne Jessie --Lacy
## [17] Rene --Lacy Elisha --Eugene Eugene --Jude Berry --Odell
## [21] Odell --Rickie Karl --Odell Britt --Lacy Elisha --Jude
## [25] Whitney--Lacy Britt --Whitney Karl --Tommy
gorder(g) # as as vcount(g)
## [1] 16
gsize(g)
## [1] 27
# Make a very basic plot of the network
plot(g)
20.1.2 Add and Visualizing attributes
Vertices and edges could have attributes.
set_vertex_attr()
: add attributes to vertices.set_edge_attr()
: add attributes to edges.
You can use V()
to view attributes of vertices in a dataframe.
When attributes are added, something interesting is about to begin . You can subset vertices and edges by specific conditions or delete them.
# suppose you have three attributes
<- friends.nodes$gender
genders <- sample(18:23, size = 16, replace = TRUE)
ages <- friends.edges$hours
hours
# Create new vertex attributes called 'gender' and 'age'
<- set_vertex_attr(g, "gender", value = genders)
g ## Warning in vattrs[[name]][index] <- value: number of items to replace is not a
## multiple of replacement length
<- set_vertex_attr(g, "age", value = ages)
g
# Create new edge attribute called 'hours'
<- set_edge_attr(g, 'hours', value = hours)
g ## Warning in eattrs[[name]][index] <- value: number of items to replace is not a
## multiple of replacement length
# View all vertex and edge attributes in a list
vertex_attr(g)
## $name
## [1] "Jessie" "Sidney" "Britt" "Donnie" "Karl" "Berry" "Rene"
## [8] "Shayne" "Elisha" "Whitney" "Odell" "Lacy" "Eugene" "Jude"
## [15] "Rickie" "Tommy"
##
## $gender
## [1] "M" "F" "F" "M" "M" "F" "M" "F" "M" "F" "F" "M" "M" "M" "M" "M"
##
## $age
## [1] 21 18 18 21 18 19 21 18 19 21 18 18 18 20 18 18
edge_attr(g)
## $hours
## [1] 1 3 2 3 5 1 3 5 2 1 3 5 3 2 6 1 2 2 1 1 2 1 1 3 2 1 3
# Find edges with conditions, this grammar is like the way Python takes a subset
# V(g)[[age == 22]]
# E(g)[[inc('Lacy')]]
# E(g)[[hours>=3]]
# Delete edges and vertices
<- delete_edges(g1, E(g1)[hours < 2])
g1_1 <- delete_vertices(g, V(g)[[age <20]]) g1_2
# Visualizing attributes
paste("This graph tells me that boys prefer to make friends and their relationship graph is more aggregated")
## [1] "This graph tells me that boys prefer to make friends and their relationship graph is more aggregated"
# Set vertex color by gender
V(g1)$color <- ifelse(V(g1)$gender == "F", "orange", "#24addb")
# Plot the graph
plot(g1, vertex.label.color = "black")
20.1.3 Network Visualization
The most effective network visualizations should immediately provide insight and understanding to the viewer. These styles below can be done separately or in conjunction with each other.
Vertices Style - size - labels - color - shape
Edges Style - color - line type
Appropriate Layout
- Minimize edge crossing
- Do not allow vertices to overlap
- Make edge lengths as uniform as possible
- Increase symmetry of the network as much as possible
- Position more influential nodes towards the center
In the plot()
function, change the layout argument to produce different graph layout. Choosing a correct layout can be bewildering. Fortunately igraph has a function layout_nicely()
that tries to choose the most appropriate layout function for a given graph object.
- circle
layout_in_circle()
- fruchterman-reingold:
layout_with_fr()
, details on Wiki
- kamada-kawai:
layout_with_fr()
- grid:
layout_on_grid()
- lgl:
layout_with_lgl()
- tree:
layout_as_tree()
(For more layouts see the manual on igraph website)
You can also stipulate the layout by providing a matrix of (x, y) coordinates for each vertex.
Here you use the layout_as_tree()
function to generate the matrix m of coordinates. Then pass m to the layout function in plot() to plot.
# Visualization in layout
plot(g1, vertex.label.color = "black", layout = layout_in_circle(g1))
plot(g1, vertex.label.color = "black", layout = layout_with_fr(g1))
<- layout_as_tree(g1)
m plot(g1, vertex.label.color = "black", layout = m)
<- layout_nicely(g1)
m1 plot(g1, vertex.label.color = "black", layout = m1)
# Visualiztion in edges
# Create a vector of weights based on the number of hours each pair spend together
<- E(g1)$hours
w1
# Plot the network varying edges by weights
<- layout_nicely(g1)
m1 plot(g1,
vertex.label.color = "black",
edge.color = 'black',
edge.width = w1,
layout = m1)
20.2 Directed networks
20.2.1 Basic Concepts
- head -> tail
The edges in undirected networks simply indicate that a relationship of some kind exists between two vertices. However, in many networks edges do have directionality. In such directed networks, an arrow represent an edge going from one vertex to another vertex.
Subset each vertex from which each edge originates by using head_of()
. This function takes two arguments, the first being the graph object and the second the edges to examine. For all edges you can use E(g)
The example network you will use is of a severe measles outbreak that spread through the town of Hagelloch, Germany, in 1861, affecting 188 children. Each edge represents the transmission of measles from one individual to the next.
# Get the graph object
<- graph_from_data_frame(measles, directed = TRUE)
g2
# is the graph directed?
is.directed(g2)
## [1] TRUE
# Where does each edge originate from? But I don't understand the result ...
# table(head_of(g2, E(g2)))
plot(g2,
vertex.label.color = "black",
edge.color = 'gray77',
vertex.size = 0,
edge.arrow.size = 0.1,
layout = layout_nicely(g2))
20.2.2 Relationships between edges
Check if there is an edge going in each direction between vertex 184 to vertex 178 using single brackets subsetting of the graph object.
Using the function incident()
identify those edges that go in either direction from vertex 184 or only those going out from vertex 184. Choose mode
from all, in and out.
# Is there an edge going from vertex 184 to vertex 178?
'184', '178']
g2[## [1] 1
# Is there an edge going from vertex 178 to vertex 184?
'178', '184']
g2[## [1] 0
# Show all edges going to or from vertex 184
incident(g2, '184', mode = c("all"))
## + 6/184 edges from 061a724 (vertex names):
## [1] 184->45 184->182 184->181 184->178 184->183 184->177
# Show all edges going out from vertex 184
incident(g2, '184', mode = c("out"))
## + 6/184 edges from 061a724 (vertex names):
## [1] 184->45 184->182 184->181 184->178 184->183 184->177
20.2.3 Relationships between vertices
neighbors()
: A vertex is a neighbor of another one (in other words, the two vertices are adjacent), if they are incident to the same edge.intersection()
: to identify if there are any vertices that exist in both neighbors1 and neighbors1.
A path is simply the chain of connections between vertices. The number of intervening edges between two vertices represents the geodesic distance between vertices.
farthest_vertices()
: Find the length of the longest path in the network(Sometimes it can be more than one).get_diameter()
: Identify the sequence of the path.ego()
: Find the vertices not farther than a given limit from another fixed vertex, these are called the neighborhood of the vertex.
# Identify neighbors of vertex 10 in specific mode
neighbors(g2, '10', mode = c('all'))
## + 5/187 vertices, named, from 061a724:
## [1] 45 27 48 71 95
neighbors(g2, '10', mode = c('in'))
## + 1/187 vertex, named, from 061a724:
## [1] 45
# Identify any vertices that receive an edge from vertex 42 and direct an edge to vertex 124
<- neighbors(g2, '42', mode = c('out'))
n1 <- neighbors(g2, '124', mode = c('in'))
n2 intersection(n1, n2)
## + 1/187 vertex, named, from 061a724:
## [1] 7
# distance between two vertices
farthest_vertices(g2)
## $vertices
## + 2/187 vertices, named, from 061a724:
## [1] 184 162
##
## $distance
## [1] 5
get_diameter(g2)
## + 6/187 vertices, named, from 061a724:
## [1] 184 178 42 7 123 162
ego(g2, 2, '10', mode = c('out'))
## [[1]]
## + 5/187 vertices, named, from 061a724:
## [1] 10 27 48 71 95
20.2.4 Identifying key vertices
Measurement metrics: - in degree - out degree - Betweenness
Betweenness is an index of how frequently the vertex lies on shortest paths between any two vertices in the network. It can be thought of as how critical the vertex is to the flow of information through a network.
# Calculate the out-degree of each vertex
<- degree(g2, mode = c("out"))
g.outd
table(g.outd)
## g.outd
## 0 1 2 3 4 6 7 8 30
## 125 21 16 12 6 2 3 1 1
# Make a histogram of out-degrees
hist(g.outd, breaks = 30)
# Find the vertex that has the maximum out-degree
which.max(g.outd)
## 45
## 1
plot(g2,
vertex.label = NA,
edge.color = 'black',
vertex.size = sqrt(g.outd)+1,
edge.arrow.size = 0.05,
layout = layout_nicely(g2))
# Calculate betweenness of each vertex
<- betweenness(g2, directed = TRUE)
g.b
hist(g.b, breaks = 80)
# Create plot with vertex size determined by betweenness score
plot(g2,
vertex.label = NA,
edge.color = 'black',
vertex.size = sqrt(g.b)+1,
edge.arrow.size = 0.05,
layout = layout_nicely(g2))
make_ego_graph()
: Create a subset of our network comprised of vertices that are connected to fixed vertex , the first argument is the original graph g, the second argument is the maximal number of connections that any vertex needs to be connected to our vertex of interest.
One issue with the measles dataset is that there are three individuals for whom no information is known about who infected them. One of these individuals (vertex 184) appears ultimately responsible for spreading the disease to many other individuals even though they did not directly infect too many individuals. (This entire paragraph is excerpted from the course)
However, because vertex 184 has no incoming edge in the network they appear to have low betweenness. One way to explore the importance of this vertex is by visualizing the geodesic distances of connections going out from this individual. In this exercise you shall create a plot of these distances from this patient zero. (This entire paragraph is excerpted from the course)
# Make an ego graph
<- make_ego_graph(g2, diameter(g2), nodes = '184', mode = c("all"))[[1]]
g184
# Get a vector of geodesic distances of all vertices from vertex 184
<- distances(g184, "184")
dists
# Create a color palette of length equal to the maximal geodesic distance plus one.
<- c("black", "red", "orange", "blue", "dodgerblue", "cyan")
colors
# Set color attribute to vertices of network g184.
V(g184)$color <- colors[dists+1]
# Visualize the network based on geodesic distance from vertex 184 (patient zero).
plot(g184,
vertex.label = dists,
vertex.label.color = "white",
vertex.label.cex = .6,
edge.color = 'black',
vertex.size = 7,
edge.arrow.size = .05,
main = "Geodesic Distances from Patient Zero"
)
20.3 Network Structure
20.3.1 Basic Structure metrics
eigen_centrality()
: takes a graph (graph) and returns the eigenvector centralities of positions v within it
You will use a social network based on the movie Forrest Gump. Each edge of the network indicates that those two characters were in at least one scene of the movie together. Therefore this network is undirected.
# Make an undirected network
<- graph_from_data_frame(gump, directed = FALSE)
g3
# Identify key nodes using eigenvector centrality
<- eigen_centrality(g3)
g.ec which.max(g.ec$vector)
## FORREST
## 36
# Plot Forrest Gump Network
plot(g3,
vertex.label.color = "black",
vertex.label.cex = 0.6,
vertex.size = 25*(g.ec$vector),
edge.color = 'gray88',
main = "Forrest Gump Network"
)
edge_density()
: calculate the edge density of the graphmean_distance()
: calculate the average path length of the graph
# Get density of a graph
<- edge_density(g3)
gd
# The diameter of a graph is the length of the longest geodesic.
diameter(g3, directed = FALSE)
## [1] 4
# Get the average path length of the graph g
<- mean_distance(g3, directed = FALSE)
g.apl
g.apl## [1] 1.994967
20.3.2 Random graphs
Generating random graphs is an important method for investigating how likely or unlikely other network metrics are likely to occur given certain properties of the original graph.
The simplest random graph is one that has the same number of vertices as your original graph and approximately the same density as the original graph.(This entire paragraph is excerpted from the course)
erdos.renyi.game()
: generate random graphs according to the Erdos-Renyi model, more function details on igraph.
more infomation in random graph
<- erdos.renyi.game(n = gorder(g3), p.or.m = gd, type = "gnp")
g.random
plot(g.random,
vertex.label = NA,
vertex.size = 1,
)
# Get density of new random graph `g.random`
edge_density(g.random)
## [1] 0.06382979
# Get the average path length of the random graph g.random
mean_distance(g.random, directed = FALSE)
## [1] 2.703729
20.3.3 Randomization test
The Forrest Gump network was smaller than the average path length of the random network, you can more formally address this by creating 1000 random graphs based on the number of vertices and density of the original Forrest Gump graph. Then, you can see how many times the average path length of the random graphs is less than the original Forrest Gump network. This is called a randomization test.(This entire paragraph is excerpted from the course)
Conclusion : the Forrest Gump network is far more interconnected than we would expect by chance.
# Generate 1000 random graphs
<- vector('list', 1000)
gl
for(i in 1:1000){
<- erdos.renyi.game(n = gorder(g3), p.or.m = gd, type = "gnp")
gl[[i]]
}
# Calculate average path length of 1000 random graphs
<- unlist(lapply(gl, mean_distance, directed = FALSE))
gl.apls
# Plot the distribution of average path lengths
hist(gl.apls, xlim = range(c(1.5, 6)), breaks = 5)
abline(v = mean_distance(g3,directed=FALSE), col = "red", lty = 3, lwd = 2)
# Calculate the proportion of graphs with an average path length lower than our observed
mean(gl.apls < mean_distance(g3,directed = FALSE))
## [1] 0
20.3.4 Substructures’ metrics
triangles()
: find all closed triangles in the graphcount_triangles()
: count how many triangles a vertex is part of, in a graph, or just list the triangles of a graph.
transitivity()
: transitivity measures the probability that the adjacent vertices of a vertex are connected. You can use this function to get the global or local transitivity. Also, you can use randomization test to investigate if the global transitivity of the Forrest Gump network is significantly higher.(Conclusion : Yes)Clique : In a clique, every vertex is connected to every other vertex.
largest_cliques()
: identify the largest cliques in the networkmax_cliques()
: determine all the maximal cliques in the network
# Show all triangles in the network.
matrix(triangles(g3), nrow = 3)[1:3,1:3]
## [,1] [,2] [,3]
## [1,] 36 36 36
## [2,] 1 1 1
## [3,] 83 38 39
# Count the number of triangles that vertex "BUBBA" is in.
count_triangles(g3, vids='BUBBA')
## [1] 37
# Calculate the global transitivity of the network.
<- transitivity(g3)
g.tr
g.tr## [1] 0.1918082
# Calculate the local transitivity for vertex BOB HOPE.
transitivity(g3, vids='BOB HOPE', type = "local")
## [1] 1
<- lapply(gl, transitivity)
gl.tr <- unlist(gl.tr)
gl.trs
# Calculate the proportion of graphs with a transitivity score higher than Forrest Gump's network
mean(gl.trs > g.tr)
## [1] 0
largest_cliques(g3)
## [[1]]
## + 9/94 vertices, named, from 0756b10:
## [1] EMCEE FORREST MEN MAN #5 MAN #3 MAN #2 MAN #1 MAN # JENNY
##
## [[2]]
## + 9/94 vertices, named, from 0756b10:
## [1] FORREST LT DAN STRONGARM SONG SOLDIER SGT SIMS MAN
## [8] DALLAS BUBBA
<- max_cliques(g3) # some confusing in function name
clq
# Calculate the size of each maximal clique.
table(unlist(lapply(clq, length)))
##
## 2 3 4 5 6 7 9
## 12 24 7 2 4 2 2
20.3.5 Visualize largest cliques
Steps are: - You need get the largest cliques(may not one) - Subset from the original graph - Plot it with circle layout
Then you can change some arguments in plot function, I want to make them simple .
<- largest_cliques(g3)
lc
# Create two new undirected subgraphs, each containing only the vertices of each largest clique.
<- as.undirected(subgraph(g3, lc[[1]]))
gs1 <- as.undirected(subgraph(g3, lc[[2]]))
gs2
par(mfrow=c(1,2)) # To plot two plots side-by-side
plot(gs1,
vertex.label = NA,
vertex.size = 0,
edge.color = 'gray28',
#main = "Largest Clique 1",
layout = layout.circle(gs1)
)
plot(gs2,
vertex.label = NA,
#vertex.label.color = "black",
#vertex.label.cex = 0.9,
vertex.size = 0,
edge.color = 'gray28',
#main = "Largest Clique 2",
layout = layout.circle(gs2)
)
20.4 Do birds of a feather flock together?
- Assortativity
assortativity()
: calculate the assortativity based on specific attribute .The assortativity coefficient is positive is similar vertices (based on some external property) tend to connect to each, and negative otherwise.assortativity.degree()
: calculate the degree assortativity of the network .
- Reciprocity
reciprocity()
: the reciprocity of a directed network reflects the proportion of edges that are symmetrical.
You can also do randomization test to these two metrics.
# Convert the gender attribute into a numeric value
<- as.numeric(factor(V(g1)$gender))
values
# Calculate the assortativity of the network based on gender
assortativity(g1, values)
## [1] 0.1319444
# Calculate the assortativity degree of the network
assortativity.degree(g1, directed = FALSE)
## [1] 0.4615385
# Calculate the reciprocity of the direct graph
reciprocity(g2)
## [1] 0
20.4.1 Community detection
If a network is said to have a community structure then it is possible to assign vertices to unique sets.
fastgreedy.community()
: a community detection method .edge.betweenness.community()
: an alternative community detection method.sizes()
: determine how many communities were detected and how many club members are in each.membership()
: display which club members are in which community.disjoint_union()
: the union of two or more graphs are created. The graphs are assumed to have disjoint vertex sets. Notice:x %du% y
is a simplified form ofdisjoint_union(...)
, and this example is from igraph document.- Because the division of this figure is so obvious, there is no difference between the two methods
# Create a graph
<- make_full_graph(5) %du% make_full_graph(5) %du% make_full_graph(5)
g
<- add_edges(g, c(1,6, 1,11, 6, 11))
g
# make community detection
<- fastgreedy.community(g)
fc <- edge.betweenness.community(g)
gc
# view metrics
sizes(fc)
## Community sizes
## 1 2 3
## 5 5 5
membership(fc)
## [1] 3 3 3 3 3 1 1 1 1 1 2 2 2 2 2
sizes(gc)
## Community sizes
## 1 2 3
## 5 5 5
# Because this graph is
par(mfrow = c(1, 2))
plot(fc, g)
plot(gc, g)
20.4.2 Interactive network visualizations
There are many other R packages that also make network visualizations.
igraph | visNetwork |
statnet | networkD3 |
ggnet | sigma |
ggnetwork | rgexf(igraph to Gephi) |
ggraph | threejs |
The threejs package is especially useful for visualizing large networks and integrates very well with igraph. And you can use it to create interactive network visualization.
graphjs(g1, vertex.size = 1)
# Another unrelated but unprecedented operation that I have not encountered before
<- membership(fc)
i c("yellow", "blue", "red")[i]
## [1] "red" "red" "red" "red" "red" "yellow" "yellow" "yellow"
## [9] "yellow" "yellow" "blue" "blue" "blue" "blue" "blue"