Levenshtein type algorithm with numeric vectors

后端 未结 1 1395
星月不相逢
星月不相逢 2020-12-11 05:44

I have two vectors with numeric values. Such as

v1 <- c(1, 3, 4, 5, 6, 7, 8)
v2 <- c(54, 23, 12, 53, 7, 8)

I would like to compute th

相关标签:
1条回答
  • 2020-12-11 05:59

    An integer vector can be seen as a single string encoded in UTF-32 (in which one Unicode code point is represented as a single 32-bit integer). You may obtain an "ordinary" string, just by converting such a vector to UTF-8 with intToUtf8.

    intToUtf8(c(65, 97))
    ## [1] "Aa"
    

    By the way, adist does utf8ToInt (reverse op) by default on its inputs anyway. So internally, it computes the results according to integer vectors. No big hack.

    This is the solution.

    adist(intToUtf8(c(1, 3, 4, 5, 6, 7, 8)), intToUtf8(c(54, 23, 12, 53, 7, 8)), counts=TRUE)
    ##      [,1]
    ## [1,]    5
    ## attr(,"counts")
    ## , , ins
    ## 
    ##      [,1]
    ## [1,]    0
    ## 
    ## , , del
    ## 
    ##      [,1]
    ## [1,]    1
    ## 
    ## , , sub
    ## 
    ##      [,1]
    ## [1,]    4
    ## 
    ## attr(,"trafos")
    ##      [,1]     
    ## [1,] "SSSSDMM"
    

    The above code should work if at least all the numbers are strictly greater than 0. R treats Unicode code points quite liberally (in fact, too liberally, but in this case you're a winner), even the largest possible integer is accepted:

    utf8ToInt(intToUtf8(c(2147483647)))
    ## 2147483647
    

    If you have a vector with negative values, you may transform it somehow, e.g. with x <- x-min(x)+1.

    If you need different costs for insertion, removal, replacement, check out the adist's costs argument. There is also a package called stringdist, which included many other string metrics. The above scheme should also work there.

    0 讨论(0)
提交回复
热议问题