R: Search for the best subset of size k from n choices.

kofnGA {kofnGA}

R Documentation

Search for the best subset of size k from n choices.

Description

kofnGA implements a genetic algorithm for subset selection. The function searches for a subset of a fixed size, k, from the integers 1:n, such that user-supplied function OF is minimized at that subset. The selection step is done by tournament selection based on ranks, and elitism may be used to retain the best solutions from one generation to the next. Population objective function values can be evaluated in parallel.

Usage

kofnGA(n, k, OF, popsize = 200, keepbest = floor(popsize/10),
  ngen = 500, tourneysize = max(ceiling(popsize/10), 2),
  mutprob = 0.01, mutfrac = NULL, initpop = NULL, verbose = 0,
  cluster = NULL, sharedmemory = FALSE, ...)

Arguments

`n`	The maximum permissible index (i.e., the size of the set we are doing subset selection from). The algorithm chooses a subset of integers from 1 to `n`.
`k`	The number of indices to choose (i.e., the size of the subset).
`OF`	The objective function. The first argument of OF should be an index vector of length `k` containing integers in the range [1, `n`]. Additional arguments can be passed to `OF` through `...`.
`popsize`	The size of the population; equivalently, the number of offspring produced each generation.
`keepbest`	The `keepbest` least fit offspring each generation are replaced by the `keepbest` most fit members of the previous generation. Used to implement elitism.
`ngen`	The number of generations to run.
`tourneysize`	The number of individuals involved in each tournament at the selection stage.
`mutprob`	The probability of mutation for each of the `k` chosen indices in each individual. An index chosen for mutation jumps to any other unused index, uniformly at random. This probability can be set indirectly through `mutfrac`.
`mutfrac`	The average fraction of offspring that will experience at least one mutation. Equivalent to setting `mutprob` to `1 - (1 - mutfrac)^(1/k)`. Only used if `mutprob` is not supplied. This method of controlling mutation may be preferable if the algorithm is being run at different values of `k`.
`initpop`	A `popsize`-by-`k` matrix of starting solutions. The final populations from one GA search can be passed as the starting point of the next search. Possibly useful if using this function in an adaptive, iterative, or parallel scheme (see examples).
`verbose`	An integer controlling the display of progress during search. If `verbose` takes positive value `v`, then the iteration number and best objective function value are displayed at the console every `v` generations. Otherwise nothing is displayed. Default is zero (no display).
`cluster`	If non-null, the objective function evaluations for each generation are done in parallel. `cluster` can be either a cluster as produced by `makeCluster`, or an integer number of parallel workers to use. If an integer, `makeCluster(cluster)` will be called to create a cluster, which will be stopped on function exit.
`sharedmemory`	If `cluster` is non-null and `sharedmemory` is `TRUE`, the parallel computation will employ shared-memory techniques to reduce the overhead of repeatedly passing the population matrix to worker threads. Uses code from the `Rdsm` package, which depends on `bigmemory`.
`...`	Additional arguments passed to `OF`.

Details

Tournament selection involves creating mating "tournaments" where two groups of tourneysize solutions are selected at random without regard to fitness. Within each tournament, victors are chosen by weighted sampling based on within-tournament fitness ranks (larger ranks given to more fit individuals). The two victors become parents of an offspring. This process is carried out popsize times to produce the new population.
Crossover (reproduction) is carried out by combining the unique elements of both parents and keeping k of them, chosen at random.
Increasing tourneysize will put more "selection pressure" on the choice of mating pairs, and will speed up convergence (to a local optimum) accordingly. Smaller tourneysize values will conversely promote better searching of the solution space.
Increasing the size of the elite group (keepbest) also promotes more homogeneity in the population, thereby speeding up convergence.

Value

A list of S3 class "GAsearch" with the following elements:

`bestsol`	A vector of length `k` holding the best solution found.
`bestobj`	The objective function value for the best solution found.
`pop`	A `popsize`-by-`k` matrix holding the final population, row-sorted in order of increasing objective function. Each row is an index vector representing one solution.
`obj`	The objective function values corresponding to each row of pop.
`old`	A list holding information about the search progress. Its elements are:
`old$best`	The sequence of best solutions known over the course of the search (an `(ngen+1)`-by-`k` matrix)
`old$obj`	The sequence of objective function values corresponding to the solutions in `old$best`
`old$avg`	The average population objective function value over the course of the search (a vector of length `ngen+1`). Useful to give a rough indication of population diversity over the search. If the average fitness is close to the best fitness in the population, most individuals are likely quite similar to each other.

Notes on parallel evaluation

Specifying a cluster allows OF to be evaluated over the population in parallel. The population of solutions will be distributed among the workers in cluster using static dispatching. Any cluster produced by makeCluster should work, though the sharedmemory option is only appropriate for a cluster of workers on the same multicore processor.

Solutions must be sent to workers (and results gathered back) once per generation. This introduces communication overhead. Overhead can be reduced, but not eliminated, by setting sharedmemory=TRUE. The impact of parallelization on run time will depend on how the run time cost of evaluationg OF compares to the communication overhead. Test runs are recommended to determine if parallel execution is beneficial in a given situation.

Note that only the objective function evaluations are distributed to the workers. Other parts of the algorithm (mutation, crossover) are computed serially. As long as OF is deterministic, reproducibility of the results from a given random seed should not be affected by the use of parallel computation.

References

Mark A. Wolters (2015), “A Genetic Algorithm for Selection of Fixed-Size Subsets, with Application to Design Problems,” Journal of Statistical Software, volume 68, Code Snippet 1, available online.

Examples

#---Find the four smallest numbers in a random vector of 100 uniforms---
# Generate the numbers and sort them so the best solution is (1,2,3,4).
Numbers <- sort(runif(100))
Numbers[1:6]                                              #-View the smallest numbers.
ObjFun <- function(v, some_numbers) sum(some_numbers[v])  #-The objective function.
ObjFun(1:4, Numbers)                                      #-The global minimum.
out <- kofnGA(n = 100, k = 4, OF = ObjFun, ngen = 50, some_numbers = Numbers)  #-Run the GA.
summary(out)
plot(out)

## Not run: 
# Note: the following two examples take tens of seconds to run (on a 2018 laptop).

#---Harder: find the 50x50 principal submatrix of a 500x500 matrix s.t. determinant is max---
# Use eigenvalue decomposition and QR decomposition to make a matrix with known eigenvalues.
n <- 500                                         #-Dimension of the matrix.
k <- 50                                          #-Size of subset to sample.
eigenvalues <- seq(10, 1, length.out=n)          #-Choose the eigenvalues (all positive).
L <- diag(eigenvalues)
RandMat <- matrix(rnorm(n^2), nrow=n)
Q <- qr.Q(qr(RandMat))
M <- Q %*% L %*% t(Q)
M <- (M+t(M))/2                                  #-Enusre symmetry (fix round-off errors).
ObjFun <- function(v,Mat)  -(determinant(Mat[v,v],log=TRUE)$modulus)
out <- kofnGA(n=n, k=k, OF=ObjFun, Mat=M)
print(out)
summary(out)
plot(out)

#---For interest: run GA searches iteratively (use initpop argument to pass results)---
# Alternate running with mutation probability 0.05 and 0.005, 50 generations each time.
# Use the same problem as just above (need to run that first).
mutprob <- 0.05
result <- kofnGA(n=n, k=k, OF=ObjFun, ngen=50, mutprob=mutprob, Mat=M)  #-First run (random start)
allavg <- result$old$avg                       #-For holding population average OF values
allbest <- result$old$obj                      #-For holding population best OF values
for(i in 2:10) {
  if(mutprob==0.05) mutprob <- 0.005 else mutprob <- 0.05
  result <- kofnGA(n=n, k=k, OF=ObjFun, ngen=50, mutprob=mutprob, initpop=result$pop, Mat=M)
  allavg <- c(allavg, result$old$avg[2:51])
  allbest <- c(allbest, result$old$obj[2:51])
}
plot(0:500, allavg, type="l", col="blue", ylim=c(min(allbest), max(allavg)))
lines(0:500, allbest, col="red")
legend("topright", legend=c("Pop average", "Pop best"), col=c("blue", "red"), bty="n",
       lty=1, cex=0.8)
summary(result)

## End(Not run)

[Package kofnGA version 1.3 Index]