Popular kernel functions used in Support Vector Machines are Linear, Radial Basis Function and Polynomial. Can someone please explain what a kernel function is in a simple way?
Let us start from the beginning. A support vector machine is a linear model: it always looks for a hyperplane that separates one class from another. I will focus on the two-dimensional case because it is easier to comprehend and possible to visualize, which gives some intuition. Bear in mind, however, that the same holds in higher dimensions (lines simply become planes, parabolas become paraboloids, etc.).
What kernels do is change the definition of the dot product in the linear formulation. What does that mean? SVM works with dot products, which for finite dimension are defined as <x,y> = x^T y = SUM_{i=1}^{d} x_i*y_i. This more or less captures the similarity between two vectors (it is also the geometric operation of projection, and it is closely related to the angle between the vectors). What the kernel trick does is replace each occurrence of <x,y> in the math of SVM with K(x,y), saying "K is a dot product in SOME space": for each (valid) kernel there exists a mapping f_K such that K(x,y) = <f_K(x), f_K(y)>. The trick is that you never use f_K directly; you only compute the dot products, which saves you tons of time (sometimes an infinite amount, as f_K(x) might have an infinite number of dimensions).

OK, so what does it mean for us? We still "live" in the space of x, not of f_K(x). The result is quite nice: if you build a hyperplane in the space of f_K, separate your data, and then look back at the space of x (so you might say you project the hyperplane back through f_K^{-1}), you get non-linear decision boundaries! The type of boundary depends on f_K, and f_K depends on K, so the choice of K will (among other things) affect the shape of your boundary.
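To make the identity K(x,y) = <f_K(x), f_K(y)> concrete, here is a minimal sketch for one kernel where the mapping can still be written out by hand. It assumes the homogeneous degree-2 polynomial kernel K(x,y) = <x,y>^2 on 2-D inputs; the function names are my own and not part of any library.

```python
import math

def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: K(x, y) = <x, y>^2."""
    return dot(x, y) ** 2

def poly2_feature_map(x):
    """Explicit mapping f_K for the kernel above (valid for 2-D inputs only)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, -1.0)

# Kernel route: never leaves the original 2-D space.
via_kernel = poly2_kernel(x, y)

# Explicit route: map both points into 3-D first, then take the dot product.
via_mapping = dot(poly2_feature_map(x), poly2_feature_map(y))

# The two values agree up to floating-point rounding.
print(via_kernel, via_mapping)
```

For this kernel the explicit route is still cheap, but the number of monomials grows quickly with the degree and the input dimension, and for RBF the mapped vector is infinite-dimensional, so only the kernel route remains practical.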
With the linear kernel we in fact do not have any kernel at all: you just have the "normal" dot product, so in 2D your decision boundary is always a line.
As you can see, we can separate most of the points correctly, but due to the "stiffness" of our assumption we will never capture all of them.
With the polynomial kernel, our kernel induces a space of polynomial combinations of our features, up to a certain degree. Consequently, we can work with slightly "bent" decision boundaries, such as parabolas for degree=2.
As you can see, we separated even more points! OK, can we get all of them by using higher-order polynomials? Let's try degree 4!
Unfortunately not. Why? Because polynomial combinations are not flexible enough: they will not "bend" our space hard enough to capture what we want (maybe that is not so bad? I mean, look at this point, it looks like an outlier!).
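A small sketch of how a flat hyperplane in the polynomial feature space turns into a curved boundary in the original space. It assumes the degree-2 monomial features of a 2-D point, and the hyperplane (w, b) below is picked by hand for illustration, not learned by an SVM.

```python
import math

def feature_map(x):
    """Degree-2 monomials of a 2-D point (the space induced by K(x, y) = <x, y>^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def decision(x, w, b):
    """Linear decision rule in feature space: +1 if <w, f(x)> + b > 0, else -1."""
    score = sum(wi * fi for wi, fi in zip(w, feature_map(x))) + b
    return 1 if score > 0 else -1

# Hypothetical hyperplane, chosen by hand.
# In the original space it reads x1^2 + x2^2 - 1 = 0, i.e. the unit circle.
w, b = (1.0, 0.0, 1.0), -1.0

inside = [(0.0, 0.0), (0.5, 0.5), (-0.3, 0.6)]
outside = [(2.0, 0.0), (0.0, -1.5), (-1.2, 1.2)]

print([decision(p, w, b) for p in inside])   # points within the circle
print([decision(p, w, b) for p in outside])  # points beyond the circle
```

No straight line in 2D can put a class in the middle and the other class all around it, yet in the 3-D feature space the same split is done by a single plane. The boundary is still restricted to degree-2 curves, which is the limited flexibility described above.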
With the RBF kernel, our induced space is a space of Gaussian distributions: each point becomes the probability density function (up to scaling) of a normal distribution. In such a space, dot products are integrals (as we have an infinite number of dimensions!), and consequently we have extreme flexibility. In fact, using such a kernel you can separate everything (but is that good?).
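To illustrate the "separate everything" claim without a full SVM solver, here is a rough sketch that swaps in a kernel perceptron (a much simpler algorithm than SVM, with no margin maximization and no bias term) on the four XOR points. The RBF kernel is written in its usual form K(x,y) = exp(-gamma * ||x-y||^2). The value gamma=1.0, the epoch limit and all function names are my own assumptions.

```python
import math

def linear_kernel(x, y):
    """Plain dot product."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2); gamma is a hyperparameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def train_kernel_perceptron(X, labels, kernel, epochs=100):
    """Kernel perceptron (NOT an SVM): returns per-point mistake counts alpha."""
    alpha = [0] * len(X)
    for _ in range(epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(X, labels)):
            score = sum(a * yj * kernel(xj, xi)
                        for a, xj, yj in zip(alpha, X, labels))
            if yi * score <= 0:      # misclassified (or on the boundary)
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:            # every training point is classified correctly
            break
    return alpha

def predict(x, X, labels, alpha, kernel):
    """Sign of the kernel expansion, with ties mapped to -1."""
    score = sum(a * yj * kernel(xj, x) for a, xj, yj in zip(alpha, X, labels))
    return 1 if score > 0 else -1

def training_accuracy(X, labels, kernel):
    alpha = train_kernel_perceptron(X, labels, kernel)
    hits = sum(predict(x, X, labels, alpha, kernel) == y
               for x, y in zip(X, labels))
    return hits / len(X)

# XOR: no single line separates the two classes.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, 1, 1, -1]

print("linear:", training_accuracy(X, labels, linear_kernel))
print("rbf:   ", training_accuracy(X, labels, rbf_kernel))
```

The only thing that changes between the two runs is the kernel function passed in, which is exactly the substitution of <x,y> by K(x,y) described earlier. Fitting every training point is of course not the same as generalizing well, hence the "but is that good?" above.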
OK, so what are the main differences? I will now sort these three kernels under a few measures.
So which one to choose? It depends. Vapnik and Cortes (inventors of SVM) supported quite well the idea that you should always try to fit the simplest model possible, and go for more complex ones only if it underfits. So you should generally start with the linear model (the linear kernel, in the case of SVM), and if it gets really bad scores, switch to poly/rbf (but remember that they are much harder to work with because of the number of hyperparameters).
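The "simplest model first" rule can be written down as a tiny helper. This is only a sketch of the heuristic: the ordering linear < poly < rbf, the `good_enough` threshold and the scores are all my own assumptions (the scores are made up for illustration and would normally come from cross-validation).

```python
# Kernels ordered from simplest to most complex (an assumed ordering).
KERNELS_BY_COMPLEXITY = ["linear", "poly", "rbf"]

def pick_kernel(val_scores, good_enough):
    """Return the simplest kernel whose validation score reaches `good_enough`.

    Falls back to the best-scoring kernel if none reaches the threshold.
    """
    for name in KERNELS_BY_COMPLEXITY:
        if val_scores.get(name, float("-inf")) >= good_enough:
            return name
    return max(val_scores, key=val_scores.get)

# Made-up validation accuracies, for illustration only.
example_scores = {"linear": 0.71, "poly": 0.88, "rbf": 0.90}

print(pick_kernel(example_scores, good_enough=0.85))
```

In practice you would not train all three up front: fit the linear kernel, check its score, and only pay for tuning the extra hyperparameters of poly/rbf if the linear model clearly underfits.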
All images were made using a nice applet on the libSVM site. Give it a try; nothing gives you more intuition than lots of images and interaction :-) https://www.csie.ntu.edu.tw/~cjlin/libsvm/