12.3 Genetic Algorithms

A genetic algorithm (GA) is an optimization tool that is based on concepts of evolution population biology (Mitchell 1998; Haupt, Haupt, and Haupt 1998). These algorithms have been shown to be able to locate the optimal or near-optimal solutions of complex functions (Mandal, Jeff Wu, and Johnson 2006). To effectively find optimal solutions the algorithm mimics the evolutionary process of a population by generating a candidate set of solutions for the optimization problem, allowing the solutions to reproduce and create new solutions (reproduction), and promoting competition to give the most fit solutions (i.e., optimal) the best chance to survive and populate the subsequent generation (natural selection). This process enables GAs harness good solutions over time to make better solutions, and has been shown to converge to an optimization plateau (Holland 1992).

The problem of identifying an optimal feature subset is also a complex optimization problem as discussed in Section 10.2. Therefore, if the feature selection problem can be framed in terms of the concepts of evolutionary biology, then the GA engine can be applied to search for optimal feature subsets. In biology, each subject in the population is represented by a chromosome, which is constructed from a sequence of genes. Each chromosome is evaluated based on a fitness criterion. The better the fitnes, the more likely a chromosome will be chosen to use in reproduction to create the next generation of chromosomes. The reproduction stage involves the process of crossover and mutation, where two chromosomes exchange a subset of genes. A small fraction of the offsprings’ genes are then changed due to the process of mutation.

For the purpose of feature selection, the population consists of all possible combinations of features for a particular data set. A chromosome in the population is a specific combination of features (i.e., genes), and the combination is represented by a string which has a length equal to the total number of features. The fitness criterion is the predictive ability of the selected combination of features. Unlike simulated annealing, the GA feature subsets are grouped into generations instead of considering one subset at a time. But a generation in a GAs is similar to an iteration in simulated annealing.

In practice, the initial population of feature subsets needs to be large enough to contain a sufficient amount of diversity across the feature subset space. Often 50 or more randomly selected feature subsets are chosen to seed the initial generation. Our approach is to create random subsets of features that contain between 10% and 90% of the total number of features.

Once the initial generation has been created, the fitness (or predictive ability) of each feature subset is estimated. As a toy example, consider a set of nine predictors (A though I) and an initial population of 12 feature subsets:

ID										Performance	Probability (%)
1			C			F				0.54	6.4
2	A			D	E	F		H		0.55	6.5
3				D						0.51	6.0
4					E					0.53	6.2
5				D			G	H	I	0.75	8.8
6		B			E		G		I	0.64	7.5
7		B	C			F			I	0.65	7.7
8	A		C		E		G	H	I	0.95	11.2
9	A		C	D		F	G	H	I	0.81	9.6
10			C	D	E				I	0.79	9.3
11	A	B		D	E		G	H		0.85	10.0
12	A	B	C	D	E	F	G		I	0.91	10.7

Each subset has an associated performance value and, in this case, larger values are better. Next, a subset of these features sets will be selected as parents to reproduce and form the next generation. A logical approach to selecting parents would be to choose the top-ranking features subsets as parents. However, this greedy approach often leads to lingering in a locally optimal solution. To avoid a local optimum, the selection of parents should be a function of the fitness criteria. The most common approach is to select parents is to use a weighted random sample with a probability of selection as a function of predictive performance performance. A simple way to compute the selection probability is to divide an individual feature subset’s performance value by the the sum of all the performance values. The rightmost column in the table above shows the results for this generation.

Suppose subsets 6 and 12 were selected as parents for the next generation. For these parents to reproduce, their genes will undergo a process of crossover and mutation which will result in two new feature subsets. In single-point crossover, a random location between two predictors is chosen⁸⁹. The parents then exchange features on one side of the crossover point⁹⁰. Suppose that the crossover point was between predictors D and E represented here:

ID
6		B			E		G		I
12	A	B	C	D	E	F	G		I

The resulting children of these parents would be:

ID
13		B			E	F	G		I
14	A	B	C	D	E		G		I

One last modification that can occur for a newly created feature subset is a mutation. A small random probability (1%–2%) is generated for each possible feature. If a feature is selected in the mutation process, then its state is changed within the current feature subset. This step adds an additional protection to lingering in local optimal solutions. Continuing with the children generated above, suppose that feature I in feature subset 13 was seleted for mutation. Since this feature is already present in subset 13, it would be removed since it was identified for mutation. The resulting children to be evaluated for the next generation would then be:

ID
13		B			E	F	G
14	A	B	C	D	E		G		I

The process of selecting two parents to generate two more offspring would continue until the new generation is populated.

The reproduction process does not gaurantee that the most fit chromosomes in the current generation will appear in the subsequent generation. This means that there is a chance that the subsequent generation may have less fit chromosomes. To ensure that the subsequent generation maintains the same fitness level, the best chromosome(s) can be carried, as is, into the subsequent generation. This process is called elitism. When elitism is used within a GA, the most fit solution across the observed generations will always be in the final generation. However, this process can cause the GA to linger longer in local optimums since the best chromosome will always have the highest probability of reproduction.

12.3.1 External Validation

An external resampling procedure should be used to determine how many iterations of the search are appropriate. This would average the assessment set predictions across all resamples to determine how long the search should proceed when it is directly applied to the entire training set. Basically, the number of iterations is once again a tuning parameter.

A genetic algorithm makes gradual improvements in predictive performance through changes to feature subsets over time. Similar to simulated annealing, an external resampling procedure should be used to select an optimal number of generations. Specifically, GAs are executed within each external resampling loop. Then the internal resamples are used to measure fitness so that different data are used for modeling and evaluation. This process helps prevent the genetic algorithm from finding a feature subset that overfits to the available data, which is a legitimate risk with this type of optimization tool.

Once the number of generations is established, the final selection process is performed on the training set and the best subset is used to filter the predictors for the final model.

To evaluate the performance of GAs for feature selection, we will again use the the OkCupid data with a naive Bayes model. In addition, the same data splitting scheme was used as with simulated annealing; 10-fold cross-validation was used for external resampling along with a 10% internal validation set. The search parameters were mostly left to the suggested defaults:

Generation size: 50,
Crossover probability: 80%,
Mutation probability: 1%,
Elitism: No,
Number of generations: 14⁹¹

For this genetic algorithm, it is helpful to examine how the members of each generation (i.e., the feature subsets) change across generations (i.e., search iterations). As we saw before with simulated annealing, simple metrics to understand the algorithm’s characteristics across generations are the feature subset size and the predictive performance estimates. Understanding characteristics of the members within a generation can also be useful. For example, the diversity of the subsets within and between generations can be quantified using a measure such as Jaccard similarity (Tan, Steinbach, and Kumar 2006). This metric is the proportion of the number of predictors in common between two subsets divided by total number of unique predictors in both sets. For example, if there were five possible predictors, the Jaccard similarity between subsets ABC and ABE would be 2/4 or 50%. This value provides insight on how efficiently the generic algorithm is converging towards a common solution or towards potential overfitting.

Figure 12.8 shows curves that monitor the subset size and predictive performance of the current best subset of each generation for each external resample. In addition, this figure illustrates the average similarity of the subsets within a generation to the current best subset for each external resample. Several characteristics stand out from this figure. First, despite the resamples having considerable variation in the size of the first generation, they all converge to subset sizes that contain between 61 and 85 predictors. The similarity plot indicates that the subsets within a generation are becoming more similar to the best solution. At the same time, the area under the ROC curves are beginning to plateau to an average value of 0.843.

Figure 12.8: The change in subset size, similarity, and performance over time for each external resample of the genetic algorithm.

Due to an an overall increasing trend in the average area under the ROC curve, the generation with the largest numerical ROC value was 15). Because the overall trend is still ongoing, additional generations may lead to a more optimal feature subset in a subsequent generation.

The final search was conducted on the entire training set and Figure 12.9 has the results. These are very similar to the internal trends shown previously.

Figure 12.9: Genetic algorithm results for the final search using the entire training set.

The final predictor set at generation 15 included more variables than the previous SA search and contains the following predictors:

age	education	height	income	time since last on-line
pets	astrological sign	smokes	town	religion (modifer)
astrological sign (modifer)	‘speaks’ C++	‘speaks’ C++ fluently	‘speaks’ C++ poorly	lisp
‘speaks’ lisp fluently	‘speaks’ lisp poorly	asian	black	indian
middle eastern	native american	other	pacific islander	white
total essay length	`software`	`engineer`	`startup`	`tech`
`computers`	`engineering`	`computer`	`internet`	`technology`
`technical`	`developer`	`programmer`	`scientist`	`code`
`geek`	`lol`	`biotech`	`matrix`	`geeky`
`solving`	`problems`	`data`	`fixing`	`teacher`
`student`	`silicon`	`law`	`mechanical`	`electronic`
`mobile`	`math`	`apps`	the number of digits	sentiment score (bing)
the number of first person text	the number of second person	the number of ‘to be’ occurances

To gauge the effectiveness of the search, 100 random subsets of size 63 were chosen and a naive Bayes model was developed for each. The GA selected subset performed better than 100% of the randomly selected subsets. This indicates that the GA did find a useful subset for predicting the response.

12.3.2 Coercing Sparsity

Genetic algorithms often tend to select larger feature subsets than other methods. This is likely due to the construction of the algorithm in that if a feature is useful for prediction, then it will be included in the feature subset. However, there is less of a penality for keeping a feature that has no impact on predictive performance. This is especially true when coupling a GA with a modeling technique that does not incur a substantial penalty for including irrelevant predictors (Section 10.3). In these cases, adding non-informative predictors does not coerce the selection process to strongly seek sparse solutions.

There are several benefits to a having a sparse solution. First, the model is often simpler, which may make it easier to understand. Fewer predictors also require less computational resources. Is there a way, then, that could be used to compell a search procedure to move towards a more sparse predictor subset solution?

A simple method for reducing the number of predictors is to use a surrogate measure of performance that has an explicit penalty based on the number of features. Section 11.4 introduced the Akaike information criterion (AIC) which augments the objective function with a penalty that is a function of the training set size and the number of model terms. This metric is specifically tailored to models that use the likelihood as the objective (i.e., linear or logistic regression), but cannot be applied to models whose is to optimize other metrics like area under the ROC curve or $R^2$. Also, AIC usually involves the number of model parameters instead of the number of predictors. For simple parametric regression models, these two quantities are very similar. But there are many models that have no notion of the number of parameters (e.g., trees, $K$-NN, etc.), while others can have many more parameters than samples in the training set (e.g., neural networks). However the principle of penalizing the objective based on the numof features can be extended to other models.

Penalizing the measure of performance enables the problem to be framed in terms of multi-parameter optimization (MPO). The MPO approach uses a function to combine many objective values into a final single value. The final value is then used to compare the goodness of different samples across the objectives. For the purpose of feature selection, one objective is to optimize predictive performance, and the other objective is to include as few features as possible. There are many different MPO approaches but a simple and straightforward approach is through desirability functions (Harrington 1965; Del Castillo, Montgomery, and McCarville 1996; Mandal et al. 2007). When using this methodology, each objective is translated to a new scale on [0, 1] where larger values are more desirable. For example, an undesirable area under the ROC curve would be 0.50 and a desirable area under the ROC curve would be 1.0. A simple desirability function would be a line that has zero desirability when the AUC is less than or equal to 0.50 and is 1.0 when the AUC is 1.0. Conversely, a desirability function for the number of predictors in the model would be 0.0 when all of the predictors are in the model would be 1.0 when only one predictor is in the model. Additional desirability functions can be incorporated to guide the modeling process towards other important characteristics.

Once all of the individual desirability functions are defined, the overall desirability statistic is created by taking the geometric mean of all of the individual functions. If there are $q$ desirability functions $d_i (i = 1\ldots q)$, the geometric mean is defined as:

\[ D = \left(\prod_{i=1}^q d_j\right)^{1/q} \]

Based on this function, if any one desirability function is unacceptable (i.e., $d_i = 0$), the overall measure is also unacceptable⁹². For feature selection applications, the internal performance metric, which is the one that the genetic algorithm uses to accept or reject subsets, can use the overall desirability to encourage the selection process to favor smaller subsets while still taking performance into account.

The most commonly used desirability functions for maximizing a quantity use piecewise linear or polynomial functions proposed by Derringer and Suich (1980):

\[\begin{equation} d_i= \begin{cases} 0 &\text{if $x < A$} \\ \left(\frac{x-A}{B-A}\right)^{s} &\text{if $A \leq x \leq B$} \\ 1 &\text{if $x>B$} \end{cases} \end{equation}\]

Figure 12.10(a) shows an example of a maximization function where $A = 0.50$, $B = 1.00$, and three different values of the scaling factor $s$. Note that values of $s$ less than 1.0 weaken the influence of the desirability function whereas values of $s > 1.0$ strengthen the influence on the desirability function.

When minimizing a quantity, an analogous function is used:

\[\begin{equation} d_i= \begin{cases} 0 &\text{if $x> B$} \\ \left(\frac{x-B}{A - B}\right)^{s} &\text{if $A \leq x \leq B$} \\ 1 &\text{if $x<A$} \end{cases} \end{equation}\]

Also shown in Figure 12.10(a) is a minimization example of scoring the number of predictors in a subset with values $A = 10$ and $B = 90$.

When a maximization function with $s = 1/5$ and a minimization function with $s = 1$ are combined using the geometric mean, the results are shown Figure 12.10(b). In this case, the contours are not symmetric since we have given more influence to the area under the ROC curve based on the value of $s$.

Figure 12.10: Examples of two types of desirability functions and the overall desirability surface when the two are combined.

Similar desirability functions exist for cases when a target value is best (i.e., a value between a minimum and maximum), for imposing box constraints, or for using qualitative inputs. Smoother functions could also be used in place of the piecewise equations.

For the OkCupid data, a desirability function with the following objectives was used to guide the GA:

maximize the area under the ROC curve between $A = 0.50$ and $B = 1.00$ with a scale factor of $s = 2.0$.
minimize the subset size to be within $A = 10$ and $B = 100$ with $s = 1.0$.

Overall desirability was used for internal performance while the external area under the ROC curve was still used as the primary external metric. Figure 12.11 shows a plot of the internal results similar to Figure 12.8. The overall desirability continually increased and probably would continue to do so with additional generations. The average subset size decreased to be about 28 by the last generation. Based on the averages of the internal resamples, generation 14 was associated with the largest desirability value and with a matching external estimate of the area under the ROC curve of 0.802. This is smaller than the unconstrained GA but the loss in efficacy is offset by the substantial reduction in the number of of predictors in the model.

Figure 12.11: Internal performance profiles for naive Bayes models using the genetic algorithm in conjunction with desirability functions.

For the final run of the genetic algorithm, 32 predictors were chosen at generation 15. These were:

education	height	religion	smokes	town
‘speaks’ C++ okay	lisp	‘speaks’ lisp poorly	asian	native american
white	`software`	`engineer`	`computers`	`science`
`programming`	`technical`	`code`	`lol`	`matrix`
`geeky`	`fixing`	`teacher`	`student`	`electronic`
`pratchett`	`math`	`systems`	`alot`	`valley`
`lawyer`	the number of puncts

Depending on the context, a potentially strong argument can be made for a smaller predictor set even if it shows slightly worse results (but better than the full model). If naive Bayes was a better choice than the other models, it would be worth evaluating the normal and sparse GAs on the test set to make a choice.

There are also multi-point crossover methods that create more than two regions where features are exchanged.↩
Often a random probability is also generated to determine if crossover should occur.↩
This is a low value for a genetic algorithm. For other applications, the number of generations is usually in the 100’s or 1000’s. However, a value this large would make the feature selection process computationally challenging due to the internal and external validation scheme.↩
If we do not want desirability scores to be forced to zero due to just one undesriable characterisitc, then the individual desirability functions can be constrained such that the minimum values are greater than zero.↩