How to view the nearest neighbors in R?

耶瑟儿~ 2021-01-31 20:44

Let me start by saying I have no experience with R, KNN, or data science in general. I recently found Kaggle and have been playing around with the Digit Recognition competition.

1 Answer
  • 2021-01-31 21:40

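    1) You can get the nearest neighbors of each test row like so (this is knn from the FNN package, which stores the neighbor indices as an attribute of its result):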

    k <- knn(train, test, labels, k = 10, algorithm="cover_tree")
    indices <- attr(k, "nn.index")
    

    Then if you want the training-set indices of the 10 nearest neighbors to row 20 of the test set:

    print(indices[20, ])
    

    (You get 10 nearest neighbors because you selected k=10.) For example, if you run with only the first 1000 rows of the training and test sets (to make it computationally easier):

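    library(FNN)  # provides the knn() that accepts algorithm= and returns nn.index
    
    # Use only the first 1000 rows of each file to keep the run fast
    train <- read.csv("train.csv", header=TRUE)[1:1000, ]
    test <- read.csv("test.csv", header=TRUE)[1:1000, ]
    
    # The first column of the training data holds the labels
    labels <- train[,1]
    train <- train[,-1]
    
    k <- knn(train, test, labels, k = 10, algorithm="cover_tree")
    indices <- attr(k, "nn.index")
    
    # Training-set indices of the 10 nearest neighbors of test row 20
    print(indices[20, ])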
    # output:
    #  [1] 829 539 784 487 293 882 367 268 201 277
    

    Those are the indices within the training set of 1000 that are closest to the 20th row of the test set.
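
    To make the layout of that index matrix concrete, here is a minimal base-R sketch that builds the same kind of matrix by brute force on toy data. The toy matrices and the helper name nearest_rows are made up for illustration, and the sketch assumes plain Euclidean distance; it is not meant to show how the cover-tree search works internally.

```r
# Toy data (hypothetical): 5 training points and 2 test points in 2-D.
train_toy <- matrix(c(0, 0,
                      1, 0,
                      0, 1,
                      5, 5,
                      6, 5), ncol = 2, byrow = TRUE)
test_toy <- matrix(c(0.2, 0.1,
                     5.4, 5.0), ncol = 2, byrow = TRUE)

# Hypothetical helper: for one test row, return the indices of the k
# training rows with the smallest Euclidean distance, nearest first.
nearest_rows <- function(train, test_row, k) {
  diffs <- sweep(train, 2, test_row)   # subtract test_row from every training row
  d <- sqrt(rowSums(diffs^2))          # Euclidean distance to each training row
  order(d)[seq_len(k)]
}

# One row per test point, one column per neighbor.
toy_indices <- t(apply(test_toy, 1, function(r) nearest_rows(train_toy, r, k = 3)))
print(toy_indices)
```

    Row i of toy_indices lists training-row numbers for test row i, which is how indices[20, ] should be read in the answer above.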

    2) It depends what you mean by "modify". For starters, you can get the labels of the 10 closest training points for each test row like this:

    closest.labels = apply(indices, 2, function(col) labels[col])
    

    You can then see the labels of the 10 closest points to the 20th test point like this:

    closest.labels[20, ]
    # [1] 0 0 0 0 0 0 0 0 0 0
    

    This indicates that all 10 of the closest points to row 20 are in the group labeled 0. knn simply chooses the label by majority vote (with ties broken at random), but you could apply some kind of weighting scheme if you prefer.
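
    As a rough illustration of the majority vote, the sketch below tallies neighbor labels row by row in base R. The label matrix and the helper name majority_vote are hypothetical, and unlike knn this version resolves a tie by taking the first label in sorted order rather than picking at random.

```r
# Hypothetical labels of the k = 5 nearest neighbors for three test rows.
closest_toy <- matrix(c(0, 0, 0, 7, 0,
                        3, 3, 8, 8, 3,
                        1, 1, 1, 1, 1), nrow = 3, byrow = TRUE)

# Majority vote for one row: count each label and keep the most frequent one.
# which.max() returns the first maximum, so ties are NOT broken at random here.
majority_vote <- function(row_labels) {
  counts <- table(row_labels)
  names(counts)[which.max(counts)]
}

# One predicted label (as a character string) per test row.
predicted <- apply(closest_toy, 1, majority_vote)
print(predicted)
```

    Swapping majority_vote for another function is the natural place to plug in a custom voting rule.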

    ETA: If you're interested in weighting the closer elements more heavily in your voting scheme, note that you can also get the distances to each of the k neighbors like this:

    dists = attr(k, "nn.dist")
    dists[20, ]
    # output:
    # [1] 1238.777 1243.581 1323.538 1398.060 1503.371 1529.660 1538.128 1609.730
    # [9] 1630.910 1667.014
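
    One possible weighting scheme, sketched here under my own assumptions, is inverse-distance voting: each neighbor contributes 1/distance to its label's score. The matrices, the helper name weighted_vote, and the small eps guard against zero distances are all illustrative choices, not something knn provides.

```r
# Hypothetical neighbor labels and distances for two test rows, k = 3.
# Distances are sorted nearest first within each row.
nbr_labels <- matrix(c(4, 9, 9,
                       2, 2, 5), nrow = 2, byrow = TRUE)
nbr_dists  <- matrix(c(1.0, 4.0, 5.0,
                       1.0, 1.5, 3.0), nrow = 2, byrow = TRUE)

# Inverse-distance vote for one row: sum the weights per label and
# return the label with the largest total weight.
weighted_vote <- function(row_labels, row_dists, eps = 1e-8) {
  w <- 1 / (row_dists + eps)                  # closer neighbors get larger weights
  scores <- tapply(w, row_labels, sum)        # total weight per distinct label
  names(scores)[which.max(scores)]
}

weighted_pred <- sapply(seq_len(nrow(nbr_labels)), function(i) {
  weighted_vote(nbr_labels[i, ], nbr_dists[i, ])
})
print(weighted_pred)
```

    In the first toy row a plain majority vote would pick 9, but the single very close neighbor labeled 4 outweighs the two distant ones. With the real data you would pass rows of closest.labels and dists instead of the toy matrices.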
    