I have been searching for a practical example of KNN implementation using weka, but all I find is too general for me to understand the data that it needs to be able to work (or
It's fairly simple. To understand why the divisors are always (69-35) and (150000-38000), you first need to understand what normalization means.
Normalization:
Normalization usually means scaling a variable so that its values lie between 0 and 1.
The formula is as follows:
normalized_x = (x - min(x)) / (max(x) - min(x))
If you look closely at the denominator of the formula above, you will see that it is the minimum of all the values subtracted from the maximum of all the values.
Now, back to your question. Look at the 5th line of the question. It says the following:
The easiest and most common distance calculation is the "Normalized Euclidian Distance."
In your Age column, you can see that the min value is 35 and the max value is 69. Similarly, in your Income column the min value is 38k and the max is 150k.
This is exactly why you always divide by (69-35) and by (150000-38000).
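To make the arithmetic concrete, here is a minimal sketch in plain Java (not Weka). The class and method names are made up for illustration, and only the min and max values come from the question. Note that the difference between two normalized values, (a - min)/(max - min) - (b - min)/(max - min), simplifies to (a - b)/(max - min), which is where the /(69-35) comes from.

```java
class NormalizedDistance {
    // Column ranges quoted above: Age 35..69, Income 38000..150000.
    static final double AGE_MIN = 35, AGE_MAX = 69;
    static final double INCOME_MIN = 38_000, INCOME_MAX = 150_000;

    /** Min-max normalization: maps a value into [0, 1] given the column's min and max. */
    static double normalize(double value, double min, double max) {
        return (value - min) / (max - min);
    }

    /** Euclidean distance between two (age, income) points, each difference divided by its column range. */
    static double distance(double age1, double income1, double age2, double income2) {
        double dAge = (age1 - age2) / (AGE_MAX - AGE_MIN);
        double dIncome = (income1 - income2) / (INCOME_MAX - INCOME_MIN);
        return Math.sqrt(dAge * dAge + dIncome * dIncome);
    }

    public static void main(String[] args) {
        System.out.println(normalize(35, AGE_MIN, AGE_MAX)); // 0.0, the minimum maps to 0
        System.out.println(normalize(69, AGE_MIN, AGE_MAX)); // 1.0, the maximum maps to 1
        // The two opposite corners of the data are sqrt(2) apart (about 1.414).
        System.out.println(distance(35, 38_000, 69, 150_000));
    }
}
```

With this scaling, a difference of the full age range counts exactly as much as a difference of the full income range.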
Hope you have understood it.
PEACE
KNN is a machine learning technique usually classified as an "instance-based" predictor. It takes all the labelled samples and places them in an n-dimensional space.
Using a distance measure such as Euclidean distance, KNN looks for the points closest to a new sample in this n-dimensional space and estimates which class the sample belongs to based on these neighbors. If it is closer to the blue dots, it is blue; if it's closer to the red dots, it is red; and so on.
But now, how could we apply it to your problem?
Imagine that you only have two attributes, price and calories (a 2-dimensional space). You want to classify customers into three classes: fit, junk food, gourmet. With this, you can offer a deal in a restaurant that matches the customer's preferences.
You have the following data:
+-------+----------+-----------+
| Price | Calories | Food Type |
+-------+----------+-----------+
| $2 | 350 | Junk Food |
+-------+----------+-----------+
| $5 | 700 | Junk Food |
+-------+----------+-----------+
| $10 | 200 | Fit |
+-------+----------+-----------+
| $3 | 400 | Junk Food |
+-------+----------+-----------+
| $8 | 150 | Fit |
+-------+----------+-----------+
| $7 | 650 | Junk Food |
+-------+----------+-----------+
| $5 | 120 | Fit |
+-------+----------+-----------+
| $25 | 230 | Gourmet |
+-------+----------+-----------+
| $12 | 210 | Fit |
+-------+----------+-----------+
| $40 | 475 | Gourmet |
+-------+----------+-----------+
| $37 | 600 | Gourmet |
+-------+----------+-----------+
Now, let's see it plotted in a 2D space:
What happens next?
For every new entry, the algorithm calculates the distance to all the dots (instances) and finds the k nearest ones. From the classes of these k nearest ones, it decides the class of the new entry.
Take k = 3 and values $15 and 165 cal. Let's find the 3 nearest neighbors:
That's where the distance formula comes in. The algorithm computes the distance from the new entry to every dot. These distances are then ranked, and the k closest dots determine the final class, typically by majority vote.
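The steps above can be sketched in plain Java (not Weka). This is only an illustration: it assumes unscaled Euclidean distance and a simple majority vote, and the class and method names are made up. The data is the table above and the query is the $15, 165 cal entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SimpleKnn {
    /** One labelled row of the table: price in dollars, calories, food type. */
    static final class Sample {
        final double price;
        final double calories;
        final String label;

        Sample(double price, double calories, String label) {
            this.price = price;
            this.calories = calories;
            this.label = label;
        }
    }

    static final List<Sample> TRAINING = Arrays.asList(
        new Sample(2, 350, "Junk Food"), new Sample(5, 700, "Junk Food"),
        new Sample(10, 200, "Fit"), new Sample(3, 400, "Junk Food"),
        new Sample(8, 150, "Fit"), new Sample(7, 650, "Junk Food"),
        new Sample(5, 120, "Fit"), new Sample(25, 230, "Gourmet"),
        new Sample(12, 210, "Fit"), new Sample(40, 475, "Gourmet"),
        new Sample(37, 600, "Gourmet"));

    /** Plain (unscaled) Euclidean distance from a sample to the query point. */
    static double distance(Sample s, double price, double calories) {
        double dp = s.price - price;
        double dc = s.calories - calories;
        return Math.sqrt(dp * dp + dc * dc);
    }

    /** Ranks all samples by distance and returns the majority label among the k nearest. */
    static String classify(double price, double calories, int k) {
        List<Sample> ranked = new ArrayList<>(TRAINING);
        ranked.sort(Comparator.comparingDouble(s -> distance(s, price, calories)));
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        for (Sample s : ranked.subList(0, Math.min(k, ranked.size()))) {
            int count = votes.merge(s.label, 1, Integer::sum);
            // On a tie, the label that reached the count first (nearest first) is kept.
            if (best == null || count > votes.get(best)) {
                best = s.label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The 3 nearest rows are ($8, 150), ($10, 200) and ($12, 210), all "Fit".
        System.out.println(classify(15, 165, 3)); // Fit
    }
}
```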
Now, why the values /(69-35) and /(150000-38000)? As mentioned in the other answers, this is due to normalization. Our example uses price and calories. As you can see, calories are on a much larger scale than price (more units per value). Without normalization, calories would weigh far more than price in the distance, which would effectively kill the Gourmet class, for example. To avoid such imbalances, all attributes need to be made similarly important, hence the use of normalization.
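A small sketch in plain Java (not Weka) shows the effect on the table above. The query point ($30, 300 cal) is a hypothetical entry chosen for illustration, and the class and method names are made up. With raw values, the nearest dot is a Junk Food row because the calorie difference dominates; once each difference is divided by its column range (40-2 for price, 700-120 for calories), the nearest dot is a Gourmet row.

```java
class ScalingEffect {
    // Rows of the table above: {price in dollars, calories}.
    static final double[][] POINTS = {
        {2, 350}, {5, 700}, {10, 200}, {3, 400}, {8, 150}, {7, 650},
        {5, 120}, {25, 230}, {12, 210}, {40, 475}, {37, 600}
    };
    static final String[] LABELS = {
        "Junk Food", "Junk Food", "Fit", "Junk Food", "Fit", "Junk Food",
        "Fit", "Gourmet", "Fit", "Gourmet", "Gourmet"
    };

    /** Returns max - min for each column of POINTS. */
    static double[] ranges() {
        double[] min = POINTS[0].clone();
        double[] max = POINTS[0].clone();
        for (double[] p : POINTS) {
            for (int j = 0; j < p.length; j++) {
                min[j] = Math.min(min[j], p[j]);
                max[j] = Math.max(max[j], p[j]);
            }
        }
        double[] range = new double[min.length];
        for (int j = 0; j < range.length; j++) {
            range[j] = max[j] - min[j];
        }
        return range;
    }

    /** Label of the single nearest point; each attribute difference is divided by scale[j]. */
    static String nearestLabel(double[] query, double[] scale) {
        int best = -1;
        double bestSquared = Double.POSITIVE_INFINITY;
        for (int i = 0; i < POINTS.length; i++) {
            double sum = 0;
            for (int j = 0; j < query.length; j++) {
                double d = (POINTS[i][j] - query[j]) / scale[j];
                sum += d * d;
            }
            // Comparing squared distances gives the same ranking as comparing distances.
            if (sum < bestSquared) {
                bestSquared = sum;
                best = i;
            }
        }
        return LABELS[best];
    }

    public static void main(String[] args) {
        double[] query = {30, 300};          // hypothetical new entry: $30, 300 cal
        double[] noScaling = {1, 1};         // dividing by 1 leaves the raw differences
        System.out.println(nearestLabel(query, noScaling)); // Junk Food
        System.out.println(nearestLabel(query, ranges()));  // Gourmet
    }
}
```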
Weka abstracts that for you, but you can visualize it as well. See an example of visualization from a project I made for a Weka ML course:
Notice that, since there are many more than 2 dimensions, there are a lot of plots, but the idea is similar.
Explaining the code:
import java.util.ArrayList;

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.neighboursearch.LinearNNSearch;

public class Wekatest {
    public static void main(String[] args) {
        // These two ArrayLists are the inputs of your algorithm.
        // atts holds the attribute definitions of the dataset (the features, usually called X, plus the class attribute).
        // classVal holds the possible values of the target class that is to be predicted, usually called y.
        ArrayList<Attribute> atts = new ArrayList<>();
        ArrayList<String> classVal = new ArrayList<>();
        // Here you initialize a "dictionary" of all the distinct types of restaurants that can be targeted.
        classVal.add("A");
        classVal.add("B");
        classVal.add("C");
        classVal.add("D");
        classVal.add("E");
        classVal.add("F");
        // The next two lines define the attributes: a string attribute called "content" and a nominal attribute holding the class of the already labelled values.
        atts.add(new Attribute("content", (ArrayList<String>) null));
        atts.add(new Attribute("@@class@@", classVal));
        // This creates an empty Weka dataset named "TestInstancesPlatos" with the attributes above. Nothing is loaded from a file here.
        // dataRaw will hold the previously labelled instances used to "train the model" (kNN doesn't actually train anything; it uses all the data for predictions).
        Instances dataRaw = new Instances("TestInstancesPlatos", atts, 0);
        // Here you build the labelled instances one by one. This is where you would substitute your own data.
        double[] instanceValue1 = new double[dataRaw.numAttributes()];
        // You only have 2 attributes: a food product (string) and its class, stored as an index into classVal (0 = "A", 1 = "B", ...).
        instanceValue1[0] = dataRaw.attribute(0).addStringValue("Pizzas");
        instanceValue1[1] = 0;
        // Append the instance to the dataset.
        dataRaw.add(new DenseInstance(1.0, instanceValue1));
        double[] instanceValue2 = new double[dataRaw.numAttributes()];
        instanceValue2[0] = dataRaw.attribute(0).addStringValue("Tunas");
        instanceValue2[1] = 1;
        dataRaw.add(new DenseInstance(1.0, instanceValue2));
        double[] instanceValue3 = new double[dataRaw.numAttributes()];
        instanceValue3[0] = dataRaw.attribute(0).addStringValue("Pizzas");
        instanceValue3[1] = 2;
        dataRaw.add(new DenseInstance(1.0, instanceValue3));
        double[] instanceValue4 = new double[dataRaw.numAttributes()];
        instanceValue4[0] = dataRaw.attribute(0).addStringValue("Hamburguers");
        instanceValue4[1] = 3;
        dataRaw.add(new DenseInstance(1.0, instanceValue4));
        double[] instanceValue5 = new double[dataRaw.numAttributes()];
        instanceValue5[0] = dataRaw.attribute(0).addStringValue("Pizzas");
        instanceValue5[1] = 4;
        dataRaw.add(new DenseInstance(1.0, instanceValue5));
        // After adding 5 instances, time to test:
        System.out.println("---------------------");
        try {
            // Load the search algorithm with the data. It is created inside the try block because the constructor is declared to throw Exception.
            LinearNNSearch knn = new LinearNNSearch(dataRaw);
            // Ask for the nearest neighbour of instance 0 of dataRaw. The second argument is k, the number of neighbours (1 here).
            Instances nearestInstances = knn.kNearestNeighbours(dataRaw.get(0), 1);
            // This prints the neighbour(s) found; their class is one of A to F, the classes passed in.
            System.out.println(nearestInstances);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
How should you do it?
-> Gather data.
-> Define a set of attributes that help you predict which cuisine you have (e.g. prices, dishes or ingredients, with one attribute for each dish or ingredient).
-> Organize this data.
-> Define a set of labels.
-> Manually label a set of data.
-> Load labelled data to KNN.
-> Label new instances by passing their attributes to KNN. It will return the label of the k nearest neighbors (good values for k are often 3 or 5, but you have to test).
-> Have fun!