Question
This question was completely rewritten after I confirmed my results (the Python Notebook can be found here) with a piece of code written by someone else (it can be found here). Here is that code, instrumented by me to work with my data and to count epochs until convergence:
import numpy as np
from matplotlib import pyplot as plt

class Perceptron(object):
    """Implements a perceptron network"""
    def __init__(self, input_size, lr=0.1, epochs=1000000):
        # One extra weight for the bias term.
        self.W = np.zeros(input_size + 1)
        #self.W = np.random.randn(input_size + 1)
        self.epochs = epochs
        self.lr = lr

    def predict(self, x):
        z = self.W.T.dot(x)
        return 1 if z >= 0 else 0

    def fit(self, X, d):
        errors = []
        for epoch in range(self.epochs):
            if (epoch + 1) % 10000 == 0:
                print('Epoch', epoch + 1)
            total_error = 0
            for i in range(d.shape[0]):
                x = np.insert(X[i], 0, 1)  # prepend the bias input
                y = self.predict(x)
                e = d[i] - y
                total_error += np.abs(e)
                self.W = self.W + self.lr * e * x
                #print('W: ', self.W)
            errors += [total_error]
            if total_error == 0:
                print('Done after', epoch, 'epochs')
                # Plot the total error of the last 100 epochs.
                nPlot = 100
                plt.plot(list(range(len(errors) - nPlot, len(errors))), errors[-nPlot:])
                plt.show()
                break

if __name__ == '__main__':
    trainingSet = np.array([[279.25746446, 162.44072328, 1.],
                            [306.23240054, 128.3794866, 1.],
                            [216.67811217, 148.58167262, 1.],
                            [223.64431813, 197.75745016, 1.],
                            [486.68209275, 96.09115377, 1.],
                            [400.71323154, 125.18183395, 1.],
                            [288.87299305, 204.52217766, 1.],
                            [245.1492875, 55.75847006, -1.],
                            [14.95991122, 185.92681911, 1.],
                            [393.92908798, 193.40527965, 1.],
                            [494.15988362, 179.23456285, 1.],
                            [235.59039363, 175.50868526, 1.],
                            [423.72071607, 9.50166894, -1.],
                            [76.52735621, 208.33663341, 1.],
                            [495.1492875, -7.73818431, -1.]])
    X = trainingSet[:, :2]
    d = trainingSet[:, -1]
    d = np.where(d == -1, 1, 0)  # relabel: original -1 -> 1, original 1 -> 0

    perceptron = Perceptron(input_size=2)
    perceptron.fit(X, d)
    print(perceptron.W)
The training set consists of 15 points, with a large separation margin. The Perceptron algorithm finds a separator as shown below, but after as many as 122,346 epochs:
As the Wikipedia article explains, the number of epochs needed by the Perceptron to converge is proportional to the square of the size of the vectors and inversely proportional to the square of the margin. In my data, the size of the vectors is large, but so is the margin.
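For reference, the classical bound (Novikoff's perceptron convergence theorem, which is what the Wikipedia article paraphrases) can be written as below. Note that it bounds the number of mistakes (weight updates) rather than epochs, and it assumes the labels are ±1 and the bias is absorbed into the input vector, as the code above does:

$$
k \;\le\; \left(\frac{R}{\gamma}\right)^{2},
\qquad
R = \max_i \lVert x_i \rVert,
\qquad
\gamma = \min_i \, y_i \,(w^{*} \cdot x_i) \ \text{for some unit vector } w^{*}.
$$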
I seek to understand why so many epochs are required.
Update: As per the request in the comments, I updated the code to plot the total errors of the last 100 epochs. Here is the plot:
P.S.: After scaling the features to be distributed as N(0,1), the algorithm converges after two epochs. However, I do not understand why the algorithm does not converge in a reasonable amount of time even without such scaling.
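For concreteness, here is a minimal sketch of that standardization using sklearn (StandardScaler is my choice here; the Notebook may do it differently), with X, d and Perceptron as defined in the script above:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()            # zero mean, unit variance per feature
X_std = scaler.fit_transform(X)      # X from the script above

perceptron = Perceptron(input_size=2)
perceptron.fit(X_std, d)             # reported to converge after two epochs on this data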
Answer 1:
The problem you're facing can be summarized in a simple statement: the numbers in your example do not favor convergence of your perceptron.
Honestly, I'm not sure what exactly can be learned from your synthetic example; anyway, please don't get me wrong, it is always good to play around in the laboratory and learn from it. There are a number of generic recommendations for fitting neural nets, and some of them are reflected in the comments on your question. This paper is old but good and you'll see it referenced around.
About your problem in particular: it is not really a matter of standardizing but of centering. The problem is that when you update your weights
self.W = self.W + self.lr * e * x
your error term e will be either +1 or -1 depending on the example that you misclassify (e.g. +1 if the example target is 1 and it is classified as 0), but mostly +1 since there are more positive classes, and your coordinates in x are mostly positive values. So, most of the time, you will be adding to your weights, not subtracting, and this way it is obviously quite slow for the perceptron to find a solution.
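A quick way to see this asymmetry in the numbers, assuming X and d as defined in the question's script (a small sketch, not part of the original code):

import numpy as np

print('feature means:', X.mean(axis=0))   # both columns have large positive means
print('class counts :', np.bincount(d))   # 12 examples of one class vs. 3 of the other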
If you just center your X (scale here is sklearn.preprocessing.scale; with with_std=False it only removes the mean)

from sklearn.preprocessing import scale
X = scale(X, with_mean=True, with_std=False)
convergence takes only 1,461 epochs.
The classifier looks like this
and it makes sense that the boundary is very close to the positive classes, since there are many of them; as soon as the perceptron gets all the positive classes right, the job is nearly done.
Additionally, if you rebalance your data (I've done it in this lazy way as a test, by duplicating the negative examples)
trainingSet = np.array([[279.25746446, 162.44072328, 1. ],
[306.23240054, 128.3794866 , 1. ],
[216.67811217, 148.58167262, 1. ],
[223.64431813, 197.75745016, 1. ],
[486.68209275, 96.09115377, 1. ],
[400.71323154, 125.18183395, 1. ],
[288.87299305, 204.52217766, 1. ],
[245.1492875 , 55.75847006, -1. ],
[245.1492875 , 55.75847006, -1. ],
[245.1492875 , 55.75847006, -1. ],
[245.1492875 , 55.75847006, -1. ],
[ 14.95991122, 185.92681911, 1. ],
[393.92908798, 193.40527965, 1. ],
[494.15988362, 179.23456285, 1. ],
[235.59039363, 175.50868526, 1. ],
[423.72071607, 9.50166894, -1. ],
[423.72071607, 9.50166894, -1. ],
[423.72071607, 9.50166894, -1. ],
[423.72071607, 9.50166894, -1. ],
[423.72071607, 9.50166894, -1. ],
[ 76.52735621, 208.33663341, 1. ],
[495.1492875 , -7.73818431, -1. ],
[495.1492875 , -7.73818431, -1. ],
[495.1492875 , -7.73818431, -1. ],
[495.1492875 , -7.73818431, -1. ]])
it takes 2 epochs (surprisingly) to get this classifier
Hope it helps.
EDIT after comments
(1) About errors that only add to or only subtract from the weights
Let's take an example from the positive class:
[279.25746446, 162.44072328, 1. ]
For these, since d is equal to 0, e can only be 0 if the classifier gets it right and -1 if it gets it wrong:
e = d[i] - self.predict(x)
(predict returns either 0 or 1.)
When the weights are updated, the rule adds nothing if the classifier gets it right, and -1 * x * learning rate if it is wrong. For this example, assuming lr == 1, it will subtract exactly (1, 279.25746446, 162.44072328) whenever there is an error on this positive example.
Now, take a look at all the positive examples. If you don't transform X, all coordinates have positive values, so every classification error on a positive example will subtract from the weights.
Now let's take a negative example:
[245.1492875 , 55.75847006, -1. ]
For these, since d is equal to 1, e can only be 0 if the classifier gets it right and +1 if it gets it wrong. Again, all coordinates are positive, except for one coordinate in the 3rd negative example, so nearly every mistake on the negative class will add to the weights.
But there are only 3 examples of the negative class and 12 of the positive class, so the errors will mostly be subtracting from, not adding to, the weights. (Sorry, I put it the other way around in my text before the edit.) It is reasonable, then, to think that convergence will be slow if you do nothing and faster if you center the data. (One could even wonder how it converges at all.)
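To make that arithmetic concrete, here is a minimal sketch of a single update for the first positive example, assuming lr == 1 and zero-initialized weights as in the code above:

import numpy as np

W = np.zeros(3)                                     # [bias, w1, w2]
x = np.insert([279.25746446, 162.44072328], 0, 1)   # bias input prepended
d_i = 0                                             # the original +1 class is relabelled to 0
y = 1 if W.T.dot(x) >= 0 else 0                     # 0 >= 0, so the prediction is 1: a mistake
e = d_i - y                                         # e == -1
W = W + 1 * e * x                                   # subtracts exactly (1, 279.257..., 162.440...)
print(W)                                            # [-1., -279.25746446, -162.44072328]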
(2) About resampling
I meant to say that convergence with resampling (and centering) is surprisingly fast, 2 epochs. However, it is reasonable that resampling makes convergence faster, since there is more balance between errors pulling the solution in one direction or the other.
Hope it is more clear now.
EDIT after more comments
I understand that the importance of the balance between samples, and of how they pull the solution, may not be very intuitive. Actually, the way I approached your question was probably the opposite: by looking at your loss function, thinking about what the problem could be, and recalling similar problems I had faced in the past and the intuitions I had, I thought about rebalancing; then I tried rebalancing and afterwards centering the data, and confirmed my intuitions about your loss function. Only afterwards did I try to build an explanation for you.
Of course, it is not that I process the loss function in my mind and know what it is doing. Anyway, I would suggest that you build your own intuitions, since your goal is to learn, and you could do it this way: plot how the separation line moves epoch after epoch.
From your code:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import scale

labels = [1, 0]
labelColors = ['blue', 'green']

def showData(X, y, plt=plt):
    colors = [(labelColors[0] if el == labels[0] else labelColors[1]) for el in y]
    plt.scatter(X[:, 0], X[:, 1], c=colors)

def plotW(xs, w):
    # Decision boundary: w0 + w1*x + w2*y = 0  =>  y = -(w0 + w1*x) / w2
    plt.plot(xs, (w[0] + w[1] * xs) / -w[2], color='red', linewidth=4)

class Perceptron(object):
    """Implements a perceptron network"""
    def __init__(self, input_size, lr=0.1, epochs=1000000):
        # One extra weight for the bias term.
        self.W = np.zeros(input_size + 1)
        #self.W = np.random.randn(input_size + 1)
        self.epochs = epochs
        self.lr = lr

    def predict(self, x):
        z = self.W.T.dot(x)
        return 1 if z >= 0 else 0

    def fit(self, X, d):
        errors = []
        for epoch in range(self.epochs):
            if (epoch + 1) % 10000 == 0:
                print('Epoch', epoch + 1)
            total_error = 0
            for i in range(d.shape[0]):
                x = np.insert(X[i], 0, 1)  # prepend the bias input
                y = self.predict(x)
                e = d[i] - y
                total_error += np.abs(e)
                self.W = self.W + self.lr * e * x
                #print('W: ', self.W)
            errors += [total_error]
            # Show the data and the current separation line after every epoch.
            showData(X, d)
            plotW(X[:, 0], self.W)
            plt.show()
            if epoch == 100:
                break
            if total_error == 0:
                print('Done after', epoch, 'epochs')
                nPlot = 100
                plt.plot(list(range(len(errors) - nPlot, len(errors))), errors[-nPlot:])
                plt.show()
                break

if __name__ == '__main__':
    trainingSet = np.array([[279.25746446, 162.44072328, 1.],
                            [306.23240054, 128.3794866, 1.],
                            [216.67811217, 148.58167262, 1.],
                            [223.64431813, 197.75745016, 1.],
                            [486.68209275, 96.09115377, 1.],
                            [400.71323154, 125.18183395, 1.],
                            [288.87299305, 204.52217766, 1.],
                            [245.1492875, 55.75847006, -1.],
                            [14.95991122, 185.92681911, 1.],
                            [393.92908798, 193.40527965, 1.],
                            [494.15988362, 179.23456285, 1.],
                            [235.59039363, 175.50868526, 1.],
                            [423.72071607, 9.50166894, -1.],
                            [76.52735621, 208.33663341, 1.],
                            [495.1492875, -7.73818431, -1.]])
    X = trainingSet[:, :2]
    X = scale(X, with_mean=True, with_std=False)  # center the features
    d = trainingSet[:, -1]
    d = np.where(d == -1, 1, 0)

    perceptron = Perceptron(input_size=2)
    perceptron.fit(X, d)
    print(perceptron.W)
Then compare the evolution of the line in the different setups. If you compare the first 100 epochs with and without centering, you will see that when you do not center the data, the line tends to bounce around in a sort of loop, while with centering the line moves more smoothly. (That's actually the same kind of effect you usually get when slowing down the learning rate, as some people suggested in the comments.)
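If you want to see the learning-rate analogy for yourself, changing the constructor call in the script above is enough (lr=0.001 is an arbitrary value picked for illustration, not something tuned):

perceptron = Perceptron(input_size=2, lr=0.001)  # smaller steps: the line moves more smoothly per epoch
perceptron.fit(X, d)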
I don't mean to say that looking at those plots is analytical evidence for the behavior of your loss function, and I don't pretend that this is a real answer to your question. But if it helps you build an intuition, then it will be worth it.
There is a lot of work on convergence, which has been studied extensively in Deep Learning since it is a key issue, as you probably know. You have surely heard about the different optimizers and how they affect the convergence of a loss function that, in Deep Learning or in complex neural nets in general, is certainly difficult to understand and impossible to tackle analytically.
Source: https://stackoverflow.com/questions/59319643/why-does-single-layer-perceptron-converge-so-slow-without-normalization-even-wh