问题
I am trying to implement KNN classifier in R from scratch on iris data set and as a part of this i have written a function to calculate the Euclidean distance. Here is my code.
known_data <- iris[1:15,c("Sepal.Length", "Petal.Length", "Class")]
unknown_data <- iris[16,c("Sepal.Length", "Petal.Length")]
# euclidean distance
euclidean_dist <- function(k,unk) {
distance <- 0
for(i in 1:nrow(k))
distance[i] <- sqrt((k[,1][i] - unk[,1][i])^2 + (k[,2][i] - unk[,2][i])^2)
return(distance)
}
euclidean_dist(known_data, unknown_data)
However, when i call the function it's returning the first value correctly and rest as NA. Could anyone show where i could have gone wrong with the code? Thanks in advance.
回答1:
The aim is to calculate the distance between the ith row of known_data, and the single unknown_data point.
How to fix your code
When you calculate distance[i]
, you're trying to access the ith row of the unknown data point, which doesn't exits, and is hence NA
. I believe your code should run fine if you make the following edits:
known_data <- iris[1:15,c("Sepal.Length", "Petal.Length", "Class")]
unknown_data <- iris[16,c("Sepal.Length", "Petal.Length")]
# euclidean distance
euclidean_dist <- function(k,unk) {
# Make distance a vector [although not technically required]
distance <- rep(0, nrow(k))
for(i in 1:nrow(k))
# Change unk[,1][i] to unk[1,1] and similarly for unk[,2][i]
distance[i] <- sqrt((k[,1][i] - unk[1,1])^2 + (k[,2][i] - unk[1,2])^2)
return(distance)
}
euclidean_dist(known_data, unknown_data)
One final note - in the version of R I'm using, the known dataset uses a Species
as opposed to Class
column
An alternative method
As suggested by @Roman Luštrik, the entire aim of getting the Euclidean distances can be achieved with a simple one-liner:
sqrt((known_data[, 1] - unknown_data[, 1])^2 + (known_data[, 2] - unknown_data[, 2])^2)
This is very similar to the function you wrote, but does it in vectorised form, rather than through a loop, which is often a preferable way of doing things in R.
回答2:
The best and fastst way is using h2o package:
#load library
library(h2o)
#initialize the node
h2o.init()
#transform the df to h2o type
known_data<-as.h2o(known_data)
unknown_data<-as.h2o(unknown_data)
#create a matrix in which the distances are going to be record
matrix1<-h2o.createFrame(rows=nrow(known_data),cols=unknown_data)
#do a loop to calculate the distance between all the rows of both df
for(i in 1:nrow(unknown_data)){
matrix[,i]<-as.data.frame(h2o.distance(known_data, unknown_data[i,],"l2"))
}
来源:https://stackoverflow.com/questions/45780199/function-to-calculate-euclidean-distance-in-r