Question
I would like to calculate the similarity (a numerical measure of how alike two data objects are; in this case, how alike two rows are) between every pair of rows in a table. The table looks like this:
vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc
vhigh,vhigh,2,2,big,low,unacc
vhigh,vhigh,2,2,big,med,unacc
vhigh,vhigh,2,2,big,high,unacc
I have tried many approaches found online, but most of them compute similarity for a matrix. Clearly, the first and second rows are very similar because they differ in only one variable, but I need a way to compare every row of this table with every other row in one pass.
The outcome might look like: the similarity of the first and second rows is 0.983.
Answer 1:
For each pair of rows, this essentially calculates the proportion of columns whose values are the same. First, I create the data frame:
# Create data frame
data <- read.table(text = "vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc
vhigh,vhigh,2,2,big,low,unacc
vhigh,vhigh,2,2,big,med,unacc
vhigh,vhigh,2,2,big,high,unacc", sep = ",")
Next, I load dplyr.
# Load dplyr library
library(dplyr)
This is the function that does all the work: it returns the fraction of columns in which rows x and y of df hold the same value.
# Function for comparing rows
row_cf <- function(x, y, df) {
  sum(df[x, ] == df[y, ]) / ncol(df)
}
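As a quick check of what the function returns for a single pair, here is a minimal sketch. The three-row table is an abbreviated copy of the sample data, re-created only so the snippet runs on its own; the names sim_12 and sim_13 are illustrative and not part of the original answer.

```r
# Assumption: abbreviated copy of the sample table (rows 1, 2 and 5 of the original)
data <- read.table(text = "vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,med,med,unacc", sep = ",")

# Same comparison function as above
row_cf <- function(x, y, df) {
  sum(df[x, ] == df[y, ]) / ncol(df)
}

# Rows 1 and 2 differ only in column 6, so 6 of 7 columns match
sim_12 <- row_cf(1, 2, data)
print(sim_12)  # 0.8571429

# Rows 1 and 3 differ in columns 5 and 6, so 5 of 7 columns match
sim_13 <- row_cf(1, 3, data)
print(sim_13)  # 0.7142857
```

With seven columns, the only possible values are multiples of 1/7, so a result such as 0.983 cannot occur with this measure on this table.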
And here it is applied.
# 1) Create all possible row combinations
# 2) Rename the columns for readability
# 3) Run through each row
# 4) Calculate similarity
res <- expand.grid(1:nrow(data), 1:nrow(data)) %>%
  rename(row_1 = Var1, row_2 = Var2) %>%
  rowwise() %>%
  mutate(similarity = row_cf(row_1, row_2, data))
# Results
# row_1 row_2 similarity
# 1 1 1 1.0000000
# 2 2 1 0.8571429
# 3 3 1 0.8571429
# 4 4 1 0.8571429
# 5 5 1 0.7142857
# 6 6 1 0.7142857
# 7 7 1 0.8571429
# 8 8 1 0.7142857
# 9 9 1 0.7142857
# 10 1 2 0.8571429
# 11 2 2 1.0000000
# 12 3 2 0.8571429
# 13 4 2 0.7142857
# 14 5 2 0.8571429
# 15 6 2 0.7142857
# 16 7 2 0.7142857
# 17 8 2 0.8571429
# 18 9 2 0.7142857
# 19 1 3 0.8571429
# 20 2 3 0.8571429
# 21 3 3 1.0000000
# 22 4 3 0.7142857
# 23 5 3 0.7142857
# 24 6 3 0.8571429
# 25 7 3 0.7142857
# 26 8 3 0.7142857
# 27 9 3 0.8571429
# 28 1 4 0.8571429
# 29 2 4 0.7142857
# 30 3 4 0.7142857
# 31 4 4 1.0000000
# 32 5 4 0.8571429
# 33 6 4 0.8571429
# 34 7 4 0.8571429
# 35 8 4 0.7142857
# 36 9 4 0.7142857
# 37 1 5 0.7142857
# 38 2 5 0.8571429
# 39 3 5 0.7142857
# 40 4 5 0.8571429
# 41 5 5 1.0000000
# 42 6 5 0.8571429
# 43 7 5 0.7142857
# 44 8 5 0.8571429
# 45 9 5 0.7142857
# 46 1 6 0.7142857
# 47 2 6 0.7142857
# 48 3 6 0.8571429
# 49 4 6 0.8571429
# 50 5 6 0.8571429
# 51 6 6 1.0000000
# 52 7 6 0.7142857
# 53 8 6 0.7142857
# 54 9 6 0.8571429
# 55 1 7 0.8571429
# 56 2 7 0.7142857
# 57 3 7 0.7142857
# 58 4 7 0.8571429
# 59 5 7 0.7142857
# 60 6 7 0.7142857
# 61 7 7 1.0000000
# 62 8 7 0.8571429
# 63 9 7 0.8571429
# 64 1 8 0.7142857
# 65 2 8 0.8571429
# 66 3 8 0.7142857
# 67 4 8 0.7142857
# 68 5 8 0.8571429
# 69 6 8 0.7142857
# 70 7 8 0.8571429
# 71 8 8 1.0000000
# 72 9 8 0.8571429
# 73 1 9 0.7142857
# 74 2 9 0.7142857
# 75 3 9 0.8571429
# 76 4 9 0.7142857
# 77 5 9 0.7142857
# 78 6 9 0.8571429
# 79 7 9 0.8571429
# 80 8 9 0.8571429
# 81 9 9 1.0000000
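The long table lists every pair twice, once as (i, j) and once as (j, i). If a square similarity matrix is easier to read, the same proportion-of-matching-columns measure can be sketched in base R without dplyr. This is an alternative layout, not part of the original answer; the names m, n and sim are illustrative, and the sketch assumes the table is small enough that an n-by-n matrix fits comfortably in memory.

```r
# Same sample table as above, re-created so the snippet runs on its own
data <- read.table(text = "vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc
vhigh,vhigh,2,2,big,low,unacc
vhigh,vhigh,2,2,big,med,unacc
vhigh,vhigh,2,2,big,high,unacc", sep = ",")

# Convert to a character matrix so each row is a plain vector
m <- as.matrix(data)
n <- nrow(m)

# sim[i, j] = share of columns in which rows i and j hold the same value
sim <- outer(seq_len(n), seq_len(n),
             Vectorize(function(i, j) mean(m[i, ] == m[j, ])))

# Print with three decimals; the diagonal is 1 and the matrix is symmetric
print(round(sim, 3))
```

Here sim[1, 2] is 6/7 because rows 1 and 2 differ in one of seven columns. Taking sim[upper.tri(sim)] gives each unordered pair exactly once.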
Source: https://stackoverflow.com/questions/52650932/how-to-calculate-the-similarity-for-all-the-rows-in-a-table-in-r