Question
I built a data frame of parameter strings (paramStrings) covering several records:
idName Str
1 Аэрофлот_Эконом 95111000210102121111010100111000100110101001
2 Аэрофлот_Комфорт 95111000210102121111010100111000100110101001
3 Аэрофлот_Бизнес 96111000210102121111010100111000100110101001
4 Трансаэро_Дисконт 26111000210102120000010100001010000010001000
5 Трансаэро_Туристический 26111000210002120000010100001010000010001000
6 Трансаэро_Эконом 26111000210002120000010100001010000010001000
Now I need to compare each one against the others with levenshtainDist, which works as function(str1, str2), so the obvious approach is a double loop. However, I am pretty sure there must be a neat vectorised (apply/lapply/sapply) way of doing this, but I couldn't find any similar solutions.
Answer 1:
The function adist computes a generalized Levenshtein distance. Is that what you need?
Assuming you have your data in a data.frame, adist(mydf$Str) will return a matrix with the distances between each pair of strings in the Str column.
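As a minimal sketch of this answer, assuming the question's data sits in a data frame called mydf with columns idName and Str (labelling the matrix with dimnames is an addition for readability, not part of the original answer):

```r
# Data copied from the question; stringsAsFactors = FALSE keeps Str as character.
mydf <- data.frame(
  idName = c("Аэрофлот_Эконом", "Аэрофлот_Комфорт", "Аэрофлот_Бизнес",
             "Трансаэро_Дисконт", "Трансаэро_Туристический", "Трансаэро_Эконом"),
  Str = c("95111000210102121111010100111000100110101001",
          "95111000210102121111010100111000100110101001",
          "96111000210102121111010100111000100110101001",
          "26111000210102120000010100001010000010001000",
          "26111000210002120000010100001010000010001000",
          "26111000210002120000010100001010000010001000"),
  stringsAsFactors = FALSE
)

# adist() lives in the utils package, which is attached by default.
# With a single character vector it compares every element with every other.
d <- adist(mydf$Str)

# Label rows and columns so each distance can be traced back to its records.
dimnames(d) <- list(mydf$idName, mydf$idName)
d
```

Identical strings (rows 1 and 2, or rows 5 and 6) get a distance of 0, and rows 1 and 3, which differ in a single character, get 1. The matrix is symmetric, so the upper triangle carries all the information.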
Answer 2:
Since you have a data.frame, I think the best way to do a double loop is a nested lapply/sapply, which works well with data.frames.
For example:
df1 <- data.frame(a = 1:20, b = 1:20)  # example data frame
a <- data.frame(lapply(1:nrow(df1), function(x) {
  sapply(1:nrow(df1), function(y) {
    # I just add the two cells (only the second column is used for the demonstration);
    # replace sum() with your own function
    sum(df1[x, 2], df1[y, 2])
  })
}))
colnames(a) <- 1:20  # change names
The outer lapply returns a list of nrow(df1) elements, and each element is a vector of nrow(df1) values (the results of evaluating the function). In other words, you get nrow(df1) x nrow(df1) values held in a list, which is very convenient to convert into a data.frame, as I do above. Thus you end up with an nrow(df1) x nrow(df1) data.frame.
The output of the above:
> str(a)
'data.frame': 20 obs. of 20 variables:
$ 1 : int 2 3 4 5 6 7 8 9 10 11 ...
$ 2 : int 3 4 5 6 7 8 9 10 11 12 ...
$ 3 : int 4 5 6 7 8 9 10 11 12 13 ...
$ 4 : int 5 6 7 8 9 10 11 12 13 14 ...
$ 5 : int 6 7 8 9 10 11 12 13 14 15 ...
$ 6 : int 7 8 9 10 11 12 13 14 15 16 ...
$ 7 : int 8 9 10 11 12 13 14 15 16 17 ...
$ 8 : int 9 10 11 12 13 14 15 16 17 18 ...
$ 9 : int 10 11 12 13 14 15 16 17 18 19 ...
$ 10: int 11 12 13 14 15 16 17 18 19 20 ...
$ 11: int 12 13 14 15 16 17 18 19 20 21 ...
$ 12: int 13 14 15 16 17 18 19 20 21 22 ...
$ 13: int 14 15 16 17 18 19 20 21 22 23 ...
$ 14: int 15 16 17 18 19 20 21 22 23 24 ...
$ 15: int 16 17 18 19 20 21 22 23 24 25 ...
$ 16: int 17 18 19 20 21 22 23 24 25 26 ...
$ 17: int 18 19 20 21 22 23 24 25 26 27 ...
$ 18: int 19 20 21 22 23 24 25 26 27 28 ...
$ 19: int 20 21 22 23 24 25 26 27 28 29 ...
$ 20: int 21 22 23 24 25 26 27 28 29 30 ...
You could even add that to a function and make a generic way of double-looping.
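One way such a generic helper might look is sketched below. The name pairwise_apply and its arguments are hypothetical, and it assumes the supplied function returns a single number for each pair; it uses vapply instead of sapply only to make the result shape predictable.

```r
# Hypothetical helper: apply a two-argument function to every ordered pair
# of elements of x and return the results as an n x n matrix.
pairwise_apply <- function(x, fun) {
  n <- length(x)
  vals <- vapply(seq_len(n), function(i) {
    # Inner loop: compare element i with every element j.
    vapply(seq_len(n), function(j) fun(x[[i]], x[[j]]), numeric(1))
  }, numeric(n))
  # Column i of vals holds fun(x[i], x[j]) for all j, so transpose to get
  # res[i, j] == fun(x[i], x[j]).
  t(matrix(vals, nrow = n, ncol = n))
}

# Same demonstration as above: add the cells of the second column pairwise.
df1 <- data.frame(a = 1:20, b = 1:20)
res <- pairwise_apply(df1$b, function(p, q) p + q)
res[1:3, 1:3]
```

For the original problem, the function argument would be replaced by your own distance function applied to the Str column.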
P.S. Please keep in mind that functions of the apply family are not truly vectorised (they still loop internally), although they are usually tidier than an explicit for-loop.
Answer 3:
Another way is to compute the combinations of rows you want to compare and then use mapply. I am assuming that you want to compare two rows at a time from your data:
# get all combinations of row indices, taken 2 at a time
cbn <- combn(nrow(your_data), 2)
# dist_function stands for your own two-argument distance function
ans <- mapply(dist_function,
              your_data[cbn[1, ], 1],
              your_data[cbn[2, ], 1])
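A small runnable sketch of this approach, under a few assumptions: the first three Str values from the question stand in for your data, adist wrapped in as.vector plays the role of dist_function, and the names strs and pairs are hypothetical. Collecting the results next to their row indices is an addition for readability.

```r
# First three strings from the question, used as stand-in data.
strs <- c("95111000210102121111010100111000100110101001",
          "95111000210102121111010100111000100110101001",
          "96111000210102121111010100111000100110101001")

# Each column of cbn is one unordered pair of indices: (1,2), (1,3), (2,3).
cbn <- combn(length(strs), 2)

# adist() returns a 1 x 1 matrix for two single strings; as.vector() drops
# the dimensions so mapply() can simplify the results to a plain vector.
ans <- mapply(function(s1, s2) as.vector(adist(s1, s2)),
              strs[cbn[1, ]],
              strs[cbn[2, ]],
              USE.NAMES = FALSE)

# Keep each distance together with the pair of rows it belongs to.
pairs <- data.frame(i = cbn[1, ], j = cbn[2, ], dist = ans)
pairs
```

Unlike the full matrix, this computes each pair only once, which roughly halves the number of calls when the distance function is symmetric.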
Source: https://stackoverflow.com/questions/28088160/smartest-way-to-double-loop-over-a-data-frame-comparing-rows-to-each-other-with