Genetic algorithms: fitness function for feature selection algorithm

This cost function should do what you want: sum the factor loadings that correspond to the features comprising each subset.

The higher that sum, the greater the share of variability in the response variable that is explained by just those features. If I understand the OP correctly, this cost function is a faithful translation of "represents the whole set quite well."

Reducing to code is straightforward:

  1. Calculate the covariance matrix of your dataset (first remove the column that holds the response variable, i.e., probably the last one). If your dataset is m x n (m rows/observations x n columns/features), then this matrix will be n x n. Note that the sample calculation below actually uses NP.corrcoef, which returns the normalized covariance--i.e., the correlation matrix--which is why it has "1"s down the main diagonal.

  2. Next, perform an eigenvalue decomposition on this matrix; this gives you, for each eigenvalue, the proportion of the total variability in the response variable that it contributes (each eigenvalue corresponds to a feature, or column). [Note: singular-value decomposition (SVD) is often used for this step, but it's unnecessary--an eigenvalue decomposition is much simpler and always does the job, as long as your matrix is square and symmetric, which covariance matrices always are.]

  3. Your genetic algorithm will, at each iteration, return a set of candidate solutions (feature subsets, in your case). The next task in a GA, or any combinatorial optimization, is to rank those candidate solutions by their cost-function score. In your case, the cost function is a simple summation of the eigenvalue proportions for the features in that subset (a minimal sketch follows the sample calculation below). You would probably want to scale/normalize that score, though, so that the higher numbers are the least fit.

A sample calculation (using Python + NumPy):

>>> # there are many ways to do an eigenvalue decomp, this is just one way
>>> import numpy as NP
>>> import numpy.linalg as LA

>>> # calculate the correlation (i.e., normalized covariance) matrix of the
>>> # dataset; 'd3' is the data with the response-variable column removed
>>> C = NP.corrcoef(d3, rowvar=0)
>>> C.shape
     (4, 4)
>>> C
     array([[ 1.  , -0.11,  0.87,  0.82],
            [-0.11,  1.  , -0.42, -0.36],
            [ 0.87, -0.42,  1.  ,  0.96],
            [ 0.82, -0.36,  0.96,  1.  ]])

>>> # now calculate eigenvalues & eigenvectors of the correlation matrix:
>>> eva, evc = LA.eig(C)
>>> # now get the value proportion of each eigenvalue:
>>> # first, sort the eigenvalues, highest to lowest:
>>> eva1 = NP.sort(eva)[::-1]
>>> # get the cumulative value proportion of each eigenvalue:
>>> eva2 = NP.cumsum(eva1/NP.sum(eva1))   # "cumsum" is just cumulative sum
>>> # stack index, eigenvalue, & cumulative proportion into one table:
>>> q = NP.column_stack((NP.arange(1, eva1.size + 1), eva1, eva2))
>>> title1 = "ev value proportion"
>>> print(title1)
ev value proportion
>>> print("{0}".format("-"*len(title1)))
-------------------
>>> for row in q:
        print("{0:2d} {1:6.2f} {2:10.3f}".format(int(row[0]), row[1], row[2]))

 1   2.91      0.727
 2   0.92      0.953
 3   0.14      0.995
 4   0.02      1.000

So it's the third column of values just above (the cumulative value proportion, one row per feature) that gets summed--selectively, depending on which features are present in the given subset you are evaluating with the cost function.
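For concreteness, here is a minimal sketch of that cost function. The boolean-mask encoding of a candidate subset and the names fitness/subset/proportions are my assumptions, not something fixed by the question:

>>> # hypothetical fitness function: 'subset' is assumed to be a boolean mask
>>> # with one flag per feature (a common GA chromosome encoding), and
>>> # 'proportions' is the third column of the table above (i.e., eva2)
>>> def fitness(subset, proportions):
        mask = NP.asarray(subset, dtype=bool)
        # sum the value-proportion entries of the selected features only
        return NP.sum(proportions[mask])

>>> # e.g., score a candidate that keeps the first & third features:
>>> fitness([True, False, True, False], eva2)   # 0.727 + 0.995 = 1.722

Each GA generation would then rank its candidate masks by this score, rescaling or inverting it first if, per step 3 above, your GA treats higher cost as less fit.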
