# 3 Tutorial probabilistic modeling with Bayesian networks and bnlearn

Lecture notes by Sara Taheri

## 3.1 Installing bnlearn

Open RStudio and in console type:

install.packages(bnlearn)
install.packages(Rgraphviz)

If you experience problems installing Rgraphviz, try the following script:

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Rgraphviz")

Then type “bnlearn” in the window that appears and click on the install button. Do the same thing for the other package.

## 3.2 Understanding the directed acyclic graph representation

In this part, we introduce survey data set and show how we can visualize it with bnlearn package.

### 3.2.1 The survey data dataset

Survey data is a data set that contains information about usage of different transportation systems with a focus on cars and trains for different social groups. It includes these factors:

• Age (A): It is recorded as young (young) for individuals below 30 years, adult (adult) for individuals between 30 and 60 years old, and old (old) for people older than 60.

• Sex (S): The biological sex of individual, recorded as male (M) or female (F).

• Education (E): The highest level of education or training completed by the individual, recorded either high school (high) or university degree (uni).

• Occupation (O): It is recorded as an employee (emp) or a self employed (self) worker.

• Residence (R): The size of the city the individual lives in, recorded as small (small) or big (big).

• Travel (T): The means of transport favoured by the individual, recorded as car (car), train (train) or other (other)

Travel is the target of the survey, the quantity of interest whose behaviour is under investigation.

### 3.2.2 Visualizing a Bayesian network

We can represent the relationships between the variables in the survey data by a directed graph where each node correspond to a variable in data and each edge represents conditional dependencies between pairs of variables.

In bnlearn, we can graphically represent the relationships between variables in survey data like this:

# empty graph
library(bnlearn)
## Warning: package 'bnlearn' was built under R version 3.5.2
##
## Attaching package: 'bnlearn'
## The following objects are masked from 'package:BiocGenerics':
##
##     path, score
## The following object is masked from 'package:stats':
##
##     sigma
dag <- empty.graph(nodes = c("A","S","E","O","R","T"))
arc.set <- matrix(c("A", "E",
"S", "E",
"E", "O",
"E", "R",
"O", "T",
"R", "T"),
byrow = TRUE, ncol = 2,
dimnames = list(NULL, c("from", "to")))
arcs(dag) <- arc.set
nodes(dag)
## [1] "A" "S" "E" "O" "R" "T"
arcs(dag)
##      from to
## [1,] "A"  "E"
## [2,] "S"  "E"
## [3,] "E"  "O"
## [4,] "E"  "R"
## [5,] "O"  "T"
## [6,] "R"  "T"

### 3.2.3 Plotting the DAG

In this section we discuss the ways that we can visually demonstrate Bayesian networks.

You can either use the simple plot function or use the graphviz.plot function from Rgraphviz package.

# plot dag with plot function
plot(dag)

# plot dag with graphviz.plot function. Default layout is dot
graphviz.plot(dag, layout = "dot")

# plot dag with graphviz.plot function. change layout to "fdp"
graphviz.plot(dag, layout = "fdp")

# plot dag with graphviz.plot function. change layout to "circo"
graphviz.plot(dag, layout = "circo")

### 3.2.4 Highlighting specific nodes

If you want to change the color of the nodes or the edges of your graph, you can do this easily by adding a highlight input to the graphviz.plot function.

Let’s assume that we want to change the color of all the nodes and edges of our dag to blue.

hlight <- list(nodes = nodes(dag), arcs = arcs(dag),
col = "blue", textCol = "blue")
pp <- graphviz.plot(dag, highlight = hlight)

The look of the arcs can be customised as follows using the edgeRenderInfo function from Rgraphviz.

edgeRenderInfo(pp) <- list(col = c("S~E" = "black", "E~R" = "black"),
lwd = c("S~E" = 3, "E~R" = 3))

Attributes being modified (i.e., col for the colour and lwd for the line width) are specified again as the elements of a list. For each attribute, we specify a list containing the arcs we want to modify and the value to use for each of them. Arcs are identified by labels of the form parent∼child, e.g., S → E is S~E.

Similarly, we can highlight nodes with nodeRenderInfo. We set their colour and the colour of the node labels to black and their background to grey.

nodeRenderInfo(pp) <- list(col = c("S" = "black", "E" = "black", "R" = "black"),
fill = c("E" = "grey"))

Once we have made all the desired modifications, we can plot the DAG again with the renderGraph function from Rgraphviz.

renderGraph(pp)

### 3.2.5 The directed acyclic graph as a representation of joint probability

The DAG represents a factorization of the joint probability distribution into a joint probability distribution. In this section we show how to add custom probability distributions to a DAG, as well as how to estimate the parameters of the conditional probability distribution using maximum likelihood estimation or Bayesian estimation.

### 3.2.6 Specifying the probability distributions on your own

Given the DAG, the joint probability distribution of the survey data variables factorizes as follows:

$$Pr(A, S, E, O, R, T) = Pr(A) Pr(S) Pr(E | A, S) Pr(O | E) Pr(R | E) Pr(T | O, R).$$

A.lv <- c("young", "adult", "old")
S.lv <- c("M", "F")
E.lv <- c("high", "uni")
O.lv <- c("emp", "self")
R.lv <- c("small", "big")
T.lv <- c("car", "train", "other")

A.prob <- array(c(0.3,0.5,0.2), dim = 3, dimnames = list(A = A.lv))
S.prob <- array(c(0.6,0.4), dim = 2, dimnames = list(S = S.lv))
E.prob <- array(c(0.75,0.25,0.72,0.28,0.88,0.12,0.64,0.36,0.70,0.30,0.90,0.10), dim = c(2,3,2), dimnames = list(E = E.lv, A = A.lv, S = S.lv))
O.prob <- array(c(0.96,0.04,0.92,0.08), dim = c(2,2), dimnames = list(O = O.lv, E = E.lv))
R.prob <- array(c(0.25,0.75,0.2,0.8), dim = c(2,2), dimnames = list(R = R.lv, E = E.lv))
T.prob <- array(c(0.48,0.42,0.10,0.56,0.36,0.08,0.58,0.24,0.18,0.70,0.21,0.09), dim = c(3,2,2), dimnames = list(T = T.lv, O = O.lv, R = R.lv))
cpt <- list(A = A.prob, S = S.prob, E = E.prob, O = O.prob, R = R.prob, T = T.prob)
# custom cpt table
cpt
## $A ## A ## young adult old ## 0.3 0.5 0.2 ## ##$S
## S
##   M   F
## 0.6 0.4
##
## $E ## , , S = M ## ## A ## E young adult old ## high 0.75 0.72 0.88 ## uni 0.25 0.28 0.12 ## ## , , S = F ## ## A ## E young adult old ## high 0.64 0.7 0.9 ## uni 0.36 0.3 0.1 ## ## ##$O
##       E
## O      high  uni
##   emp  0.96 0.92
##   self 0.04 0.08
##
## $R ## E ## R high uni ## small 0.25 0.2 ## big 0.75 0.8 ## ##$T
## , , R = small
##
##        O
## T        emp self
##   car   0.48 0.56
##   train 0.42 0.36
##   other 0.10 0.08
##
## , , R = big
##
##        O
## T        emp self
##   car   0.58 0.70
##   train 0.24 0.21
##   other 0.18 0.09

Now that we have defined both the DAG and the local distribution corresponding to each variable, we can combine them to form a fully-specified BN. We combine the DAG we stored in dag and a list containing the local distributions, which we will call cpt, into an object of class bn.fit called bn.

# fit cpt table to network
bn <- custom.fit(dag, cpt)

## 3.3 Estimating parameters of conditional probability tables

For the hypothetical survey described in this chapter, we have assumed to know both the DAG and the parameters of the local distributions defining the BN. In this scenario, BNs are used as expert systems, because they formalise the knowledge possessed by one or more experts in the relevant fields. However, in most cases the parameters of the local distributions will be estimated (or learned) from an observed sample.

survey <- read.table("data/survey.txt", header = TRUE)
head(survey)
##       A     R    E   O S     T
## 1 adult   big high emp F   car
## 2 adult small  uni emp M   car
## 3 adult   big  uni emp F train
## 4 adult   big high emp M   car
## 5 adult   big high emp M   car
## 6 adult small high emp F train

In the case of this survey, and of discrete BNs in general, the parameters to estimate are the conditional probabilities in the local distributions. They can be estimated, for example, with the corresponding empirical frequencies in the data set, e.g.,

$$\hat{Pr}(O = emp | E = high) = \frac{\hat{Pr}(O = emp, E = high)}{\hat{Pr}(E = high)}= \frac{\text{number of observations for which O = emp and E = high}}{\text{number of observations for which E = high}}$$

This yields the classic frequentist and maximum likelihood estimates. In bnlearn, we can compute them with the bn.fit function. bn.fit complements the custom.fit function we used in the previous section; the latter constructs a BN using a set of custom parameters specified by the user, while the former estimates the same from the data.

bn.mle <- bn.fit(dag, data = survey, method = "mle")
bn.mle
##
##   Bayesian network parameters
##
##   Parameters of node A (multinomial distribution)
##
## Conditional probability table:
## 0.472 0.208 0.320
##
##   Parameters of node S (multinomial distribution)
##
## Conditional probability table:
##      F     M
## 0.402 0.598
##
##   Parameters of node E (multinomial distribution)
##
## Conditional probability table:
##
## , , S = F
##
##       A
##   high 0.6391753 0.8461538 0.5384615
##   uni  0.3608247 0.1538462 0.4615385
##
## , , S = M
##
##       A
##   high 0.7194245 0.8923077 0.8105263
##   uni  0.2805755 0.1076923 0.1894737
##
##
##   Parameters of node O (multinomial distribution)
##
## Conditional probability table:
##
##       E
## O            high        uni
##   emp  0.98082192 0.92592593
##   self 0.01917808 0.07407407
##
##   Parameters of node R (multinomial distribution)
##
## Conditional probability table:
##
##        E
## R            high       uni
##   big   0.7178082 0.8666667
##   small 0.2821918 0.1333333
##
##   Parameters of node T (multinomial distribution)
##
## Conditional probability table:
##
## , , R = big
##
##        O
## T              emp       self
##   car   0.58469945 0.69230769
##   other 0.19945355 0.15384615
##   train 0.21584699 0.15384615
##
## , , R = small
##
##        O
## T              emp       self
##   car   0.54700855 0.75000000
##   other 0.07692308 0.25000000
##   train 0.37606838 0.00000000

Note that we assume we know the structure of the network, so dag is an input of bn.fit function.

As an alternative, we can also estimate the same conditional probabilities in a Bayesian setting, using their posterior distributions. In this case, the method argument of bn.fit must be set to “bayes”.

bn.bayes <- bn.fit(dag, data = survey, method = "bayes", iss = 10)

The estimated posterior probabilities are computed from a uniform prior over each conditional probability table. The iss optional argument, whose name stands for imaginary sample size (also known as equivalent sample size), determines how much weight is assigned to the prior distribution compared to the data when computing the posterior. The weight is specified as the size of an imaginary sample supporting the prior distribution.

### 3.3.1 Fit dag to data and predict the value of latent variable

# predicting a variable in the test set.
training = bn.fit(model2network("[A][B][E][G][C|A:B][D|B][F|A:D:E:G]"),
gaussian.test[1:2000, ])
test = gaussian.test[2001:nrow(gaussian.test), ]
predicted <- predict(training, node = "A", data = test, method = "bayes-lw")
head(predicted)
## [1] 3.2905037 0.5755309 0.6127167 1.7173029 1.0264255 1.0253681

about the method bayes-lw: the predicted values are computed by averaging likelihood weighting simulations performed using all the available nodes as evidence (obviously, with the exception of the node whose values we are predicting). If the variable being predicted is discrete, the predicted level is that with the highest conditional probability. If the variable is continuous, the predicted value is the expected value of the conditional distribution.

## 3.4 Conditional independence in Bayesian networks

Using a DAG structure we can investigate whether a variable is conditionally independent from another variable given a set of variables from the DAG. If the variables depend directly on each other, there will be a single arc connecting the nodes corresponding to those two variables. If the dependence is indirect, there will be two or more arcs passing through the nodes that mediate the association.

If $$\textbf{X}$$ and $$\textbf{Y}$$ are separated by $$\textbf{Z}$$, we say that $$\textbf{X}$$ and $$\textbf{Y}$$ are conditionally independent given $$\textbf{Z}$$ and denote it with,

$\textbf{X } { \!\perp\!\!\!\perp}_{G} \textbf{Y } | \textbf{ Z}$ Conditioning on $$\textbf{Z}$$ is equivalent to fixing the values of its elements, so that they are known quantities.

$$\textbf{Definition (MAPS).}$$ Let M be the dependence structure of the probability distribution P of data $$\textbf{D}$$, that is, the set of conditional independence relationships linking any triplet $$\textbf{X}$$, $$\textbf{Y}$$, $$\textbf{Z}$$ of subsets of $$\textbf{D}$$. A graph G is a dependency map (or D-map) of M if there is a one-to-one correspondence between the random variables in $$\textbf{D}$$ and the nodes $$\textbf{V}$$ of G such that for all disjoint subsets $$\textbf{X}$$, $$\textbf{Y}$$, $$\textbf{Z}$$ of $$\textbf{D}$$ we have

$\textbf{X } {\!\perp\!\!\!\perp}_{P} \textbf{ Y } | \textbf{ Z} \Longrightarrow \textbf{X } {\!\perp\!\!\!\perp}_{G} \textbf{ Y } | \textbf{ Z}$

Similarly, G is an independency map (or I-map) of M if

$\textbf{X } {\!\perp\!\!\!\perp}_{P} \textbf{ Y } | \textbf{ Z} \Longleftarrow \textbf{X } {\!\perp\!\!\!\perp}_{G} \textbf{ Y } | \textbf{ Z}$ G is said to be a perfect map of M if it is both a D-map and an I-map, that is

$\textbf{X } {\!\perp\!\!\!\perp}_{P} \textbf{ Y } | \textbf{ Z} \Longleftrightarrow \textbf{X } {\!\perp\!\!\!\perp}_{G} \textbf{ Y } | \textbf{ Z}$ and in this case G is said to be faithful or isomorphic to M.

$$\textbf{Definition.}$$ A variable V is a collider or has a V structure, if it has 2 upcoming parents.

You can find all the V structures of a DAG:

vstructs(dag)
##      X   Z   Y
## [1,] "A" "E" "S"
## [2,] "O" "T" "R"

Note that conditioning on a collider induces dependence even though the parents aren’t directly connected.

$$\textbf{Definition (d-separation)}$$ If G is a directed graph in which $$\textbf{X}$$, $$\textbf{Y}$$ and $$\textbf{Z}$$ are disjoint sets of vertices, then $$\textbf{X}$$ and $$\textbf{Y}$$ are d-connected by $$\textbf{Z}$$ in G if and only if there exists an undirected path U between some vertex in $$\textbf{X}$$ and some vertex in $$\textbf{Y}$$ such that for every collider C on U, either C or a descendent of C is in $$\textbf{Z}$$, and no non-collider on U is in $$\textbf{Z}$$.

$$\textbf{X}$$ and $$\textbf{Y}$$ are d-separated by $$\textbf{Z}$$ in G if and only if they are not d-connected by $$\textbf{Z}$$ in G.

We assume that graphical separation ($${\!\perp\!\!\!\perp}_{G}$$) implies probabilistic independence ($${\!\perp\!\!\!\perp}_{P}$$) in a Bayesian network.

We can investigate whether two nodes in a bn object are d-separated using the dsep function. dsep takes three arguments, x, y and z, corresponding to $$\textbf{X}$$, $$\textbf{Y}$$ and $$\textbf{Z}$$; the first two must be the names of two nodes being tested for d-separation, while the latter is an optional d-separating set. So, for example,

dsep(dag, x = "S", y = "R")
## [1] FALSE
dsep(dag, x = "O", y = "R")
## [1] FALSE
dsep(dag, x = "S", y = "R", z = "E")
## [1] TRUE

### 3.4.1 Markov Property, Equivalence classes and CPDAGS

$$\textbf{Definition (Local Markov property)}$$ Each node $$X_i$$ is conditionally independent of its non-descendants given its parents.

Compared to the previous decomposition, it highlights the fact that parents are not completely independent from their children in the BN; a trivial application of Bayes’ theorem to invert the direction of the conditioning shows how information on a child can change the distribution of the parent.

Second, assuming the DAG is an I-map also means that serial and divergent connections result in equivalent factorisations of the variables involved. It is easy to show that

\begin{align} Pr(X_i) Pr(X_j | X_i) Pr(X_k | X_j) &= Pr(X_j,X_i) Pr(X_k | X_j)\\ &= Pr(X_i | X_j) Pr(X_j) Pr(X_k | X_j) \end{align}

Then $$X_i \longrightarrow X_j \longrightarrow X_k$$ and $$X_i \longleftarrow X_j \longrightarrow X_k$$ are equivalent. As a result, we can have BNs with different arc sets that encode the same conditional independence relationships and represent the same global distribution in different (but probabilistically equivalent) ways. Such DAGs are said to belong to the same equivalence class.

### 3.4.2 Skeleton of a network, CPDAGs and equivalence classes

The skeleton of a network is the network without any direction. Here is the skeleton of the dag for survey dataset.

graphviz.plot(skeleton(dag))

$$\textbf{Theorem. (Equivalence classes)}$$ Two DAGs defined over the same set of variables are equivalent and only they have the same skeleton (i.e., the same underlying undirected graph) and the same v-structures.

data(learning.test)
learn.net1 = empty.graph(names(learning.test))
learn.net2 = empty.graph(c("A","B","C","D","E","F"))
modelstring(learn.net1) = "[A][C][F][B|A][D|A:C][E|B:F]"
arc.set2 <- matrix(c("B", "A",
"A", "D",
"C", "D",
"B", "E",
"F", "E"),
byrow = TRUE, ncol = 2,
dimnames = list(NULL, c("from", "to")))
arcs(learn.net2) <- arc.set2
graphviz.compare(learn.net1,learn.net2)

score(learn.net1, learning.test, type = "loglik")
## [1] -23832.13
score(learn.net2, learning.test, type = "loglik")
## [1] -23832.13
# type == "loglik" means you get the log likelihood of the data given the dag and the MLE of the parameters

In other words, the only arcs whose directions are important are those that are part of one or more v-structures.

The skeleton of a DAG and it’s V structures identify the equivalence class the DAG belongs to, which is represented by the completed partially directed graph (CPDAG). We can obtain it from a DAG with cpdag function.

X <- paste("[X1][X3][X5][X6|X8][X2|X1][X7|X5][X4|X1:X2]",
"[X8|X3:X7][X9|X2:X7][X10|X1:X9]", sep = "")
dag2 <- model2network(X)
par(mfrow = c(1,2))
graphviz.plot(dag2)
graphviz.plot(cpdag(dag2))

### 3.4.3 Moral Graphs

In previous Section we introduced an alternative graphical representation of the DAG underlying a BN: the CPDAG of the equivalence class the BN belongs to. Another graphical representation that can be derived from the DAG is the moral graph.

The moral graph is an undirected graph that is constructed as follows:

1. connecting the non-adjacent nodes in each v-structure with an undirected arc;

2. ignoring the direction of the other arcs, effectively replacing them with undirected arcs.

This transformation is called moralisation because it “marries” non-adjacent parents sharing a common child. In the case of our example dag, we can create the moral graph with the moral function as follows:

graphviz.plot(moral(dag2))

Moralisation has several uses. First, it provides a simple way to transform a BN into the corresponding Markov network, a graphical model using undirected graphs instead of DAGs to represent dependencies.

In a Markov network, we say that $$\textbf{X} {\!\perp\!\!\!\perp}_{G} \textbf{Y} | \textbf{Z}$$ if every path between $$\textbf{X}$$ and $$\textbf{Y}$$ contains some node $$Z \in \textbf{Z}$$.

## 3.5 Plotting Conditional Probability Distributions

Plotting the conditional probabilities associated with a conditional probability table or a query is also useful for diagnostic and exploratory purposes. Such plots can be difficult to read when a large number of conditioning variables is involved, but nevertheless they provide useful insights for most synthetic and real-world data sets.

As far as conditional probability tables are concerned, bnlearn provides functions to plot barcharts (bn.fit.barchart) and dot plots (bn.fit.dotplot) from bn.fit objects. Both functions are based on the lattice package. For example let’s look at the conditional plot of $$Pr(T | R,O)$$:

bn.fit.barchart(bn.mle\$T, main = "Travel",
xlab = "Pr(T | R,O)", ylab = "")
## Loading required namespace: lattice