Chapter 11 Neural networks

Artificial neural networks are a class of extremely powerful techniques that have become quite popular in recent years. The reason is that they can produce very accurate predictions when used in supervised data mining applications.

These networks are very flexible algorithms that can be applied to different types of modeling including supervised and unsupervised problems.

Neural networks can be used in place of, or in conjunction with, logistic regression and decision trees when there is a categorical dependent variable. Neural networks are very flexible – they also work with continuous dependent variables, so they can be used in a regression-type setting.

In applications where regression, logit, decision trees, and other techniques might be used, neural nets can evolve much more complex, more flexible, and potentially more accurate models. The downside is that the models are often difficult to interpret and explain.

Neural nets are especially effective where there are many input variables, and these have non-linear relationships with the target variable. What’s fascinating about neural nets is that the model structure needs only to be specified in terms of the number of nodes and hidden layers. The analyst does not have to be concerned about non-linearities and/or interactions among predictors.

In a sense, when using neural nets, the computer learns from the data. A specific model is not specified, as with regression models. Instead, the process works like this: “Here’s my data, and this is how complicated the net can be. Develop a predictive model.” These are not statistical models but rather powerful computer programs, so no assumptions are made about normality, linearity, etc. This has led to the concept of “machine learning.”

The flexibility and complexity of neural net models are both the source of their attractiveness and part of the challenge of using them effectively. Neural nets work best when there are many observations, so that training, validation, and test subsets can be formed.

Neural nets can actually be very easy to apply with modern software, and many software programs are available. The resulting models can be quite complicated, even though in one sense they are just combinations of non-linear regression models. It is the combination of many simple models that makes artificial neural nets complicated.

11.1 What are artificial neural networks?

The “artificial” adjective is used because these models were inspired by attempts to simulate biological neural systems. The first neural networks were not originally developed by data analysts, computer experts, or statisticians. It was the original research into human brain activity that led to the development of the computer models.

The artificial neural net works this way, too, although the number of elements in even the most complicated neural networks is nowhere near the roughly 100 billion neurons thought to be in the human brain. So artificial neural nets as used in data mining are nowhere near as proficient or as complicated as the human brain. Despite this, much of the terminology from the original brain research persists: terms such as neurons, learning, nodes, activation functions, and synapses are still used for machine learning neural networks.

11.1.1 Human neurons to mathematical models

Figure 11.1 (Source: (Unal and Başçiftci 2021)) is a simplified model of a typical human neuron. In very basic terms, the neuron works as follows. The dendrites receive chemical and electrical signals from other neurons. The soma (cell body) processes the information from the dendrites and creates an output, which is transmitted by the axon. The axon is in turn connected via synapses to other neurons. With many neurons combined in a network, the result is the powerful capabilities of the human mind.

Simplified model of human neural net

Figure 11.1: Simplified model of human neural net

In 1943 the neurophysiologist Warren McCulloch and the logician Walter Pitts, who were interested in understanding the anatomy and functioning of the human brain, proposed a mathematical model to explain how human neurons work to make decisions and create insights (McCulloch and Pitts 1943). They hypothesized that the human brain works by using millions of relatively simple elements, essentially on-off switches.

Their idea was that a very complicated set of behaviors, such as those evidenced by the human brain, can arise from a set of relatively simple units if enough of them are acting in concert or sequence. McCulloch and Pitts proposed that the neurons were activated in a binary manner - either “fire” or “not-fire.” The basic element in their model can be stated mathematically as:

\[\begin{equation} {S} = \sum_{i=1}^{n} I_{i} W_{i} \end{equation}\]

\[y(S) = \begin{cases} 1, & \text{if } S\geq T\\ 0, & \text{otherwise} \end{cases} \]

where \(I_1\), \(I_2\), …, \(I_n\) are binary input values, \(W_1\), \(W_2\), …, \(W_n\) are the weights associated with each input, \(S\) is the weighted sum of the inputs, and \(T\) is the threshold for neuron activation.
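The McCulloch-Pitts unit can be expressed directly in code. A minimal sketch (the weights and threshold below are hypothetical, chosen so that the unit acts as a logical AND gate):

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts unit: fire (1) if the weighted sum of binary
    inputs reaches the threshold T, otherwise stay silent (0)."""
    s = sum(i * w for i, w in zip(inputs, weights))
    return 1 if s >= threshold else 0

# With unit weights and T = 2, the unit fires only when both inputs are 1,
# i.e., it implements a logical AND.
def and_gate(a, b):
    return mp_neuron([a, b], weights=[1, 1], threshold=2)
```

Lowering the threshold to 1 would turn the same unit into an OR gate.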

11.1.2 Activation functions

The weighted sum is passed to an activation function, which translates the sum into a value within the range of that function (Figure 11.2). While it is possible to have a linear activation function, most activation functions are non-linear; using only linear activations would essentially re-create ordinary regression within a neural network.

Examples of activation functions used in neural nets

Figure 11.2: Examples of activation functions used in neural nets

Non-linear activation functions enable neural networks to model complex relationships between inputs and outputs. In fact, a neural net with enough hidden nodes can approximate any continuous function to any desired degree of accuracy. This is known as the “universal approximation theorem” (Nielsen 2019).10
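Activation functions of the kind pictured in Figure 11.2 are simple to write down. A sketch of three common ones (the function names are ours):

```python
import math

def sigmoid(s):
    # Logistic function: maps any real value into (0, 1).
    return 1.0 / (1.0 + math.exp(-s))

def tanh(s):
    # Hyperbolic tangent: maps into (-1, 1), centered at 0.
    return math.tanh(s)

def relu(s):
    # Rectified linear unit: 0 for negative inputs, identity otherwise.
    return max(0.0, s)
```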

11.2 The road to machine learning with neural nets

Beginning in the 1950s, when digital computers became available, computer scientists began building perceptrons, models based on the work of McCulloch and Pitts.

Scientists began trying to teach computers to learn. One example of the problems solved by these early neural networks was how to balance a broom standing upright on a moving cart by controlling the motions of the cart back and forth. As the broom starts falling to the left, the cart learns to move to the left to keep the broom upright. While this was interesting, the promises of this early work were not realized.


The excitement of the early 1950s gave way to disillusionment by the late 1960s. The disillusionment stemmed from a 1969 book by Marvin Minsky and Seymour Papert that demonstrated some basic problems with perceptrons (Minsky and Papert 1969). For example, a perceptron cannot model the so-called XOR (exclusive OR) problem (Table 11.1). The effect of Minsky and Papert’s book was that funding for research into neural nets dried up for more than 10 years.

Table 11.1: The XOR problem
Input A Input B Output
0 0 0
0 1 1
1 0 1
1 1 0

By the early 1980s, however, researchers had devised a way to incorporate multiple layers of perceptrons into models, and this multiple layering made the models extremely flexible. A flurry of research followed.11

11.3 Example of a neural network

The classic Iris data set consists of 150 observations on 4 measured attributes (sepal length, sepal width, petal length, and petal width), with the type of Iris as the target (setosa, versicolor, and virginica). The data were divided randomly into a training set (60%) and a validation set (40%). A neural network with a single hidden layer of two nodes was fitted to the training data.

The resultant network is shown with the 19 parameter estimates in Figure 11.3. The circles with “1” represent the constant terms.

Neural net for Iris data

Figure 11.3: Neural net for Iris data

The results are shown in two confusion matrices, one for the training data (Table 11.2) and one for the validation data (Table 11.3). No errors were made with the training data and just two with the validation data. Reduced accuracy with the validation data is expected, since these data were not used to create the model.

Table 11.2: Neural net results for the Iris training data
Training data setosa versicolor virginica
setosa 32 0 0
versicolor 0 28 0
virginica 0 0 30
Table 11.3: Neural net results for the Iris validation data
Validation data setosa versicolor virginica
setosa 18 0 0
versicolor 0 22 2
virginica 0 0 18
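The example above can be reproduced approximately with scikit-learn in Python. This is a sketch, not the software used to produce Figure 11.3; the random seed and optimizer differ, so the resulting confusion matrices will not match the tables exactly:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)

# 60% training / 40% validation split, as in the example above.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, train_size=0.6, random_state=1, stratify=y)

# A single hidden layer with two nodes, as in Figure 11.3.
net = MLPClassifier(hidden_layer_sizes=(2,), max_iter=5000, random_state=1)
net.fit(X_tr, y_tr)

cm_train = confusion_matrix(y_tr, net.predict(X_tr))
cm_valid = confusion_matrix(y_val, net.predict(X_val))
print(cm_train)
print(cm_valid)
```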

11.4 Training a neural net

Several methods have been developed for estimating the weights in a neural net. With neural nets, however, we don’t speak of estimation but of training: training data are used to adjust the weights in the model.

Probably the most common structure of a neural network is the so-called feed-forward model. This means that when the network is trained, the input data flows through the model in only one direction toward the output or target. There is no feedback built into the model. (This is not to be confused with the estimation technique called back-propagation, discussed below.)

Training a neural net is analogous to finding the coefficients for the best fit in a regression model. One key difference, however, is that with regression there is a single best-fitting linear model that optimizes the fit to the set of training observations. There is no equivalent method for calculating the best set of weights for a neural network. Instead, an optimization routine is used to minimize some error function, such as the average squared error. This does not guarantee an optimal result but instead looks for a good one. (It is possible to get stuck in local optima.)

Probably the most common training method is the back-propagation method. It starts by randomly assigning a set of weights in the model and then calculating the value of the target. This provides the initial model, which most likely is not very good.

Then, the error of the initial model is calculated by subtracting the target value predicted by the neural net from the actual value of the target. This error is then fed back through the model, and the weights are adjusted up or down to try to reduce the error. The name back-propagation comes from the errors being propagated back through the network.

The adjustments made to the model weights are determined through a strategy called gradient descent. A gradient is a partial derivative of a function with more than one input variable; it is a measure of how much, and in which direction, the output of the function changes with small changes in the inputs. The sizes of the steps taken along the gradient are set by the learning rate. Setting the learning rate too high may cause the algorithm to overshoot the optimum; setting it too low is likely to reach the optimum, but at the cost of excessive computing time.

The learning process is repeated many times until some stopping criterion is reached, such as a pre-set amount of processing time, a specified maximum number of iterations, or a negligible error associated with the weights.

This can be a slow process in terms of the number of iterations required, but with modern computers many analyses can be completed in a matter of seconds. For some problems, however, it can take hours. The actual mechanisms are quite sophisticated, having been developed over many years by mathematicians and computer scientists.
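The training loop described above can be sketched for a single sigmoid neuron. This toy example trains on OR-gate data; the learning rate and number of epochs are arbitrary choices for illustration:

```python
import math
import random

random.seed(0)

# Toy training data: the OR gate.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

w = [random.uniform(-1, 1), random.uniform(-1, 1)]  # random initial weights
b = 0.0
lr = 0.5                                            # learning rate

for epoch in range(2000):
    for x, target in data:
        y = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        # Gradient of the squared error 0.5*(y - target)^2 with respect
        # to the weighted sum: (y - target) * y * (1 - y).
        grad = (y - target) * y * (1 - y)
        # Step each weight against the gradient, scaled by the learning rate.
        w[0] -= lr * grad * x[0]
        w[1] -= lr * grad * x[1]
        b -= lr * grad

def predict(x):
    return round(sigmoid(w[0] * x[0] + w[1] * x[1] + b))
```

After training, the neuron reproduces the OR gate on all four input patterns.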

11.5 Considerations in using neural nets

11.5.1 Missing data

Neural nets cannot handle missing data, so imputation of values must be performed if any data values are missing.
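One simple imputation strategy, replacing each missing value with its column mean, can be sketched as follows (the data matrix here is a made-up example):

```python
import numpy as np

# A small data matrix with two missing values (np.nan).
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan]])

# Replace each missing entry with the mean of its column,
# computed over the non-missing values only.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)
```

More sophisticated strategies (regression imputation, nearest-neighbor imputation) follow the same pattern of filling gaps before the net is trained.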

11.5.2 Representative data

The training, validation, and test data must be representative of the underlying model. The old computer science adage “garbage in, garbage out” could not apply more strongly than in neural modeling. If the training data are not representative, the model’s worth is at best compromised; at worst, it may be useless. It is worth spelling out the kinds of problems that can corrupt a training set:

11.5.3 All eventualities must be covered

A neural network can only learn from cases that are present. If people with incomes over $100,000 per year might be bad credit risks but your training data include no one with an income over $40,000 per year, you cannot expect the model to make correct decisions on such previously unseen cases. Extrapolation is dangerous with any model, but some types of neural network may make particularly poor predictions in these circumstances.

A network learns the easiest features it can. A classic (possibly apocryphal) illustration of this is a vision project designed to automatically recognize tanks. A network is trained on a hundred pictures including tanks, and a hundred not. It achieves a perfect 100% score. When tested on new data, it proves hopeless. The reason? The pictures of tanks are taken on dark, rainy days, the pictures without on sunny days. The network learns to distinguish the (trivial matter of) differences in overall light intensity. To work, the network would need training cases including all weather and lighting conditions under which it is expected to operate - not to mention all types of terrain, angles of shot, and distances.

11.5.4 Unbalanced data sets

Since a network minimizes an overall error, the proportion of types of data in the set is critical. A network trained on a data set with 900 good cases and 100 bad will bias its decision towards good cases, as this allows the algorithm to lower the overall error (which is much more heavily influenced by the good cases). If the representation of good and bad cases is different in the real population, the network’s decisions may be wrong. A good example would be disease diagnosis. Perhaps 90% of patients routinely tested are clear of a disease. A network is trained on an available data set with a 90/10 split. It is then used in diagnosis on patients complaining of specific problems, where the likelihood of disease is 50/50. The network will react over-cautiously and fail to recognize disease in some unhealthy patients.

In contrast, if trained on the “complaints” data, and then tested on “routine” data, the network may raise a high number of false positives. In such circumstances, the data set may need to be crafted to take account of the distribution of data (e.g., you could replicate the less numerous cases, or remove some of the numerous cases), or the network’s decisions modified by the inclusion of a loss matrix (Bishop, 1995). Often, the best approach is to ensure even representation of different cases, then to interpret the network’s decisions accordingly.
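Replicating the less numerous cases, as suggested above, can be sketched as follows (the 900/100 split mirrors the example; the records themselves are hypothetical):

```python
import random

random.seed(0)

good = [("good", i) for i in range(900)]  # majority class
bad = [("bad", i) for i in range(100)]    # minority class

# Sample the minority class with replacement until the two classes are even.
bad_oversampled = bad + [random.choice(bad)
                         for _ in range(len(good) - len(bad))]
balanced = good + bad_oversampled
random.shuffle(balanced)
```

The opposite adjustment, removing some of the numerous cases (undersampling), works the same way using `random.sample` on the majority class.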

11.5.5 The overfitting problem

As with other data mining techniques, a neural net model is trained on one set of data and then tested and validated on separate data sets. This is particularly important when using neural nets.

Neural nets can predict too well. That is, given enough flexibility, with several hidden layers and a large number of nodes, a neural net model can be developed that perfectly predicts the target in the training data.

The problem is that overfitting like this does not generalize well. In other words, a model fit perfectly to the training data may not predict the validation or test data well at all. This is overlearning. With sufficient iterations and enough nodes, neural nets can even fit random data.

To illustrate this, the Iris data set was used again. This time, however, the rows of the predictor variables were ordered randomly, so that the predictors (sepal length, sepal width, petal length, and petal width) were no longer correctly matched with the target values (setosa, versicolor, and virginica). A larger neural net was specified this time, with two hidden layers of 10 nodes each.

The results are shown in the following two tables. The first confusion matrix is for the randomized training data (Table 11.4). Note that perfect assignment of the types of Iris flowers was obtained.
Table 11.4: Neural net results on training data: Randomized Iris data
Training data setosa versicolor virginica
setosa 30 0 0
versicolor 0 31 0
virginica 0 0 29

The second confusion matrix, which applied the model to the validation data, showed that the model was overfit (Table 11.5). The model could not accurately predict new data.

Table 11.5: Neural net results on validation data: Randomized Iris data
Validation data setosa versicolor virginica
setosa 8 7 8
versicolor 2 5 6
virginica 10 7 7

11.6 Neural network example

The German credit data set was used to illustrate a neural network. It contains 1,000 observations with 20 predictors and a binary target: “Credit risk.” The source is the UCI Machine Learning Repository (Gromping 2019b), with a detailed report on corrections to the data provided by Gromping (2019a). The number of “bad” credit ratings has been oversampled; the actual prevalence of “bad” credit is about 5%. The “good” versus “bad” ratings are based on the creditor’s assessment of risk prior to granting credit. There are also unequal costs of errors in this example.

To account for the differences in the cost of errors, the cutoff threshold for predicting from the neural net had to be changed. The structure of the cost matrix and the resultant threshold are shown in Figure 11.4. The threshold was computed as 0.93 from the costs and revenues associated with each cell of the 2×2 table, using the approach developed by Elkan (2001).

Threshold calculations

Figure 11.4: Threshold calculations
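Elkan’s cost-based threshold can be computed directly from the cells of the 2×2 cost matrix. A sketch (the cost values below are hypothetical placeholders, not the actual figures behind the 0.93 threshold shown in Figure 11.4):

```python
def optimal_threshold(c_fp, c_fn, c_tp=0.0, c_tn=0.0):
    """Cost-minimizing probability threshold for predicting the positive
    class ("good" credit here), following Elkan (2001): predict positive
    when the estimated probability is at least this value.

    c_fp: cost of a false positive (granting credit to a bad risk)
    c_fn: cost of a false negative (turning away a good customer)
    c_tp, c_tn: costs of correct predictions (negative values = revenues)
    """
    return (c_fp - c_tn) / ((c_fp - c_tn) + (c_fn - c_tp))

# With symmetric costs the threshold is the usual 0.50; a false positive
# five times as costly as a false negative pushes the threshold up.
print(optimal_threshold(c_fp=1, c_fn=1))  # 0.5
print(optimal_threshold(c_fp=5, c_fn=1))
```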

The variables in the German credit data set are shown in Table 11.6.

Table 11.6: Variables in the German credit data set.
Variable Description
Status of existing checking account Status of the debtor’s checking account with the bank
Duration in months Credit duration in months
Credit history History of compliance with previous or concurrent credit contracts
Purpose Purpose for which the credit is needed
Credit amount Credit amount
Savings account/bonds Debtor’s savings
Present employment since Duration of debtor’s employment with current employer
Installment rate in percentage of disposable income Credit installments as a percentage of debtor’s disposable income
Personal status and sex Combined information on sex and marital status
Other debtors / guarantors Is there another debtor or a guarantor for the credit?
Present residence since Length of time (in years) the debtor lives in the present residence
Property The debtor’s most valuable property
Age in years Age in years
Other installment plans Installment plans from providers other than the credit-giving bank
Housing Type of housing the debtor lives in
Number of existing credits at this bank Number of credits the debtor has (or had) at this bank
Job Type of debtor’s job
Number of people being liable to provide maintenance for Number of dependents (1 or 2)
Telephone Is there a telephone landline registered on the debtor’s name?
Foreign worker Is the debtor a foreign worker?
Score Has the credit contract been complied with (good) or not (bad)?

The KNIME workflow for this example is shown below (Figure 11.5).

Workflow for neural net analysis of German Credit data set

Figure 11.5: Workflow for neural net analysis of German Credit data set

A description of each node is shown in Table 11.7.

Table 11.7: Description of workflow nodes for the German Credit neural net analysis.
Node Label Description
1 File Reader Read CreditScore.csv
2 Category to Number Transform all categorical variables to dummy indicators.
3 Normalizer Normalize all variables to the 0-1 range (min-max).
4 Partitioning Create training and validation data sets (70/30 ratio).
5 Rprop MLP Learner Neural net analysis using multi-layer perceptron model.
6 MLP Predictor Predict target using the validation data.
7 Scorer Assess predictions using threshold of 0.50.
8 Rule Engine Change threshold to 0.93.
9 Scorer Assess predictions using threshold of 0.93.

The results of the neural net analysis are shown in Figure 11.6 for both the 0.50 threshold and the 0.93 threshold. Note that raising the threshold for assigning a prediction to the “good” category to 0.93 reduced the overall accuracy of the model: the number of correct “good” predictions decreased, while the number of correct “bad” predictions increased.

Results from neural net analysis

Figure 11.6: Results from neural net analysis

As noted above, the number of “bad” credit ratings has been oversampled to be 30% but the actual percentage of “bad” credit is about 5%. So, the confusion matrices need to be rebalanced to reflect the correct percentages. The rebalanced confusion matrices are shown below. (Figure 11.7) Rebalancing is a straightforward process of adjusting each cell of the matrices so that the row margin totals (the actual numbers of “good” and “bad” credit cases) match the population.

Rebalanced confusion matrices for the German Credit data set

Figure 11.7: Rebalanced confusion matrices for the German Credit data set

Also shown in Figure 11.7 are the costs (and revenues, represented by negative costs) associated with each cell of the confusion matrix. Multiplying each cell of the predictions by the corresponding cell of the cost matrix yields an estimate of the overall net cost or revenue. In this example, the predictions using the 0.93 threshold produced a revenue of 8,771, compared with 4,900 for the default 0.50 threshold. So, despite the reduced accuracy at the higher threshold, the bank would be better off forgoing some good customers to avoid making bad credit decisions.
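The rebalancing step can be sketched as follows. The confusion-matrix counts here are hypothetical placeholders, not the values in Figure 11.7:

```python
import numpy as np

# Confusion matrix on the oversampled data (rows = actual, cols = predicted).
cm = np.array([[180.0, 30.0],   # actual "good"
               [40.0, 50.0]])   # actual "bad"

# Scale each row so the actual class proportions match the population:
# about 95% "good" and 5% "bad", keeping the same overall total.
total = cm.sum()
target_row_totals = np.array([0.95, 0.05]) * total
rebalanced = cm * (target_row_totals / cm.sum(axis=1))[:, None]
```

Each cell of the rebalanced matrix can then be multiplied by the corresponding cell of the cost matrix to estimate the net cost or revenue.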

References

Elkan, Charles. 2001. “The Foundations of Cost-Sensitive Learning.” In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence. http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=B24CDCB3FA2EBC3A4D5AEB2B35160B90?doi=10.1.1.29.514&rep=rep1&type=pdf.
Gromping, U. 2019a. “South German Credit Data: Correcting a Widely Used Data Set.” http://www1.beuth-hochschule.de/FB_II/reports/Report-2019-004.pdf.
———. 2019b. “South German Credit (UPDATE) Data Set.” https://archive.ics.uci.edu/ml/datasets/South+German+Credit+.
Jaspreet. 2016. “A Concise History of Neural Networks.” https://towardsdatascience.com/a-concise-history-of-neural-networks-2070655d3fec.
McCulloch, Warren S., and Walter H. Pitts. 1943. “A Logical Calculus of the Ideas Immanent in Nervous Activity.” Bulletin of Mathematical Biophysics, 114–33.
Minsky, Marvin, and Seymour Papert. 1969. Perceptrons: An Introduction to Computational Geometry. Cambridge, Mass.: MIT Press.
Nielsen, Michael. 2019. “Neural Networks and Deep Learning.” http://neuralnetworksanddeeplearning.com/.
Unal, Hamit Taner, and Fatih Başçiftci. 2021. “Evolutionary Design of Neural Network Architectures: A Review of Three Decades of Research.” https://link.springer.com/article/10.1007/s10462-021-10049-5.

  1. Michael Nielsen has an excellent, free online book called Neural Networks and Deep Learning (Nielsen 2019).↩︎

  2. For more on the history of neural nets, consult “A Concise History of Neural Networks” by Jaspreet (Jaspreet 2016).↩︎