问题
I need to solve a problem which entails comparing two matrices with the same number of columns. One of these is manipulated until the best match is obtained. The way I score the differences between the two matrices is quite convoluted and I still have to finalize it. What I'm really interested at the moment in is finding a search/optimization algorithm that works with positive integers only. I've created a simple example with a simple function to maximise. Let's say I have a dataset D.
D <- data.frame(rbind(c(1,1,1),
c(1,1,0),c(1,1,0),c(1,1,0),c(1,0,0),
c(0,0,0),c(1,0,0),c(1,0,0),c(1,1,0),
c(1,0,0),c(1,1,1),c(1,1,0),c(1,0,0),
c(1,0,0),c(1,0,1)))
I want to find which re-arrangement of Dx gives me the lowest absolute difference.
Dx<-data.frame(rbind(c(1,1,0),c(1,0,0),c(0,0,0),c(1,1,0)))
So I could go through all the possible permutations using the function below
library(combinat)
SPACE <- t(as.data.frame(list(permn(1:3))))
f <- function(x){
if(anyDuplicated(x)>0){return(0)}
Dist<-NA
for (i in 1:nrow(D)){
Dist[i]<-sum(abs(Dx[,x]-t(D[i,])))}
return(sum(Dist))}
apply(SPACE,1,f)
and get the right result.However this has 2 disadvantages for the data I'm actually using:
- I have to specify SPACE- all the possible column orders and
apply
goes through each possible permutations and calculates my error score.
Both A and B become computationally difficult as the number of columns in my matrix increases. I think even keeping all the possible permutations of the numbers 1 to 14 in one R session is impossible on most computers.
An optimization algorithm I found is grid search. This starts to address A. It means that I don't have to specify SPACE (i.e. all the possible permuatations), so it's one step in the right direction, as I want to look at much larger datasets.
library(NMOF)
gridSearch(f, rep(list(seq(1,ncol(D))),ncol(D)))
But obviously this does not address B, as it goes through each possible iteration. What if my dataset was very large, let's say 15 or even more columns?
Keeping in mind that my parameters can only be positive integers (i.e. they are column numbers), is there an R algorithm that would allow me to find the best column order (or at least a good approximation) within a reasonable amount of time (e.g. 1-2 days), when I'm dealing with much larger datasets? This may look like a silly example, but it emulates very well the problem I'm trying to solve. I've tried optim()
with method="SANN"
, but got nowhere. Unfortunately I have very little experience so do let me know if you think this is an unworkable problem. Just to start with an easier dataset (few rows but lots of columns) problem, do you think it's possible to find the best column order as shown above for D2 by using some kind of clever optimization?
#D2
D<-cbind(D,D,D,D,D)
ncol(D)
Dx<-cbind(Dx,Dx,Dx,Dx,Dx)
#examples
f(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))
f(c(13,2,4,3,5,6,7,8,9,10,11,12,1,14,15))
EDIT: My main interest is in understanding how to use optimization algorithms that use a series of unique positive integrals (basically ranks) in the search process, rather than solving this particular problem. I've used a simple example in this case so that it's easy to replicate, but the two datasets I'm comparing often differ in number of rows and other aspects which I'm not detailing here....the distance function I'm building handles this well so understanding how to apply an optimization algorithm (e.g. Genetic Algorithm was suggested below) to the function f above using D2 is therefore my main problem at the moment.
回答1:
If your objective function f
must truly be seen as a black box, then we'll need to resort to approximate approaches such as a genetic algorithm. Here is a solution using the gaoptim
package, which is maximizing f(p)
among all permutations p
of the columns of Dx
:
library(gaoptim)
myGA = GAPerm(f, ncol(Dx), popSize=10)
myGA$evolve(10)
myGA
# Results for 10 Generations:
# Mean Fitness:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 95.0 107.4 115.6 112.4 118.3 120.6
#
# Best Fitness:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 125 125 125 125 125 125
#
# Best individual:
# [1] 3 1 2
#
# Best fitness value:
# [1] 125
In this case it found the best possible solution, with objective value 125, though in general there are no guarantees about the quality of the solution that will be returned by a genetic algorithm.
回答2:
As I understand it, you are searching for the best assignments from a set of candidate columns to a set of target columns, and there is some cost associated with matching a candidate column to a target column. You are searching for the one-to-one matching that minimizes overall cost.
This is called the assignment problem, which is a classical problem in operations research. Your approach of grid search will have an exponential runtime (you need to search all possible assignments), but there are much more efficient approaches to this problem, many of which rely on linear programming.
You can solve your problem in R using the lp.assign
function from the lpSolve
package, providing your pairwise distances between the columns of your matrix:
# Build cost matrix
costs <- as.matrix(dist(t(D), method="manhattan"))
costs
# X1 X2 X3
# X1 0 7 11
# X2 7 0 6
# X3 11 6 0
# Solve assignment problem
library(lpSolve)
solution <- lp.assign(costs)$solution
apply(solution > 0.999, 2, which)
# [1] 1 2 3
This means we selected permutation 1, 2, 3 as the most promising.
来源:https://stackoverflow.com/questions/22022680/optimization-with-positive-integer-parameters