R strsplit with multiple unordered split arguments?

说谎 2020-12-01 02:53

Given a character string

test_1 <- "abc def,ghi klm"
test_2 <- "abc, def ghi klm"

I wish to obtain

"abc"
"def"
"ghi"
"klm"

4 Answers
  • 2020-12-01 03:09
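     library(stringr)  # provides str_c() and str_extract_all()

     test_1 <- "abc def,ghi klm"
     test_2 <- "abc, def ghi klm"
     key_words <- c("abc", "def", "ghi")
     # Build the alternation "abc|def|ghi" and extract every match.
     # Only the listed key words are returned, so "klm" is not extracted.
     matches <- str_c(key_words, collapse = "|")
     str_extract_all(test_1, matches)
     str_extract_all(test_2, matches)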
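
    If you want every word rather than a fixed list of key words, the same extract-instead-of-split idea also works in base R. This is only a minimal sketch, assuming the tokens consist purely of letters; the pattern "[[:alpha:]]+" and the names words_1 and words_2 are illustrative choices, not part of the answer above.

    ```r
    test_1 <- "abc def,ghi klm"
    test_2 <- "abc, def ghi klm"

    # gregexpr() locates every run of letters; regmatches() pulls them out.
    words_1 <- regmatches(test_1, gregexpr("[[:alpha:]]+", test_1))[[1]]
    words_2 <- regmatches(test_2, gregexpr("[[:alpha:]]+", test_2))[[1]]

    print(words_1)  # "abc" "def" "ghi" "klm"
    print(words_2)  # "abc" "def" "ghi" "klm"
    ```

    Extraction avoids the empty-string problem entirely, because separators are never matched in the first place.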
    
  • 2020-12-01 03:22

    Actually, strsplit interprets its split argument as a regular expression, just like grep. (A comma is not a regex metacharacter outside a {n,m} quantifier, and neither is a space, so the double-escaped "\\," below is harmless but not required. Likewise, using "\\s" is more about readability than necessity):

    > strsplit(test_1, "\\, |\\,| ")
    [[1]]
    [1] "abc" "def" "ghi" "klm"
    
    > strsplit(test_2, "\\, |\\,| ")
    [[1]]
    [1] "abc" "def" "ghi" "klm"
    

    The first alternative is "\\, " with a trailing space (which SO does not show). Without it, the comma and the space in test_2 would be split separately and leave an empty string ("") in the result. It might have been clearer if I had written:

    > strsplit(test_2, "\\,\\s|\\,|\\s")
    [[1]]
    [1] "abc" "def" "ghi" "klm"
    

    @Fojtasek is so right: Using character classes often simplifies the task because it creates an implicit logical OR:

    > strsplit(test_2, "[, ]+")
    [[1]]
    [1] "abc" "def" "ghi" "klm"
    
    > strsplit(test_1, "[, ]+")
    [[1]]
    [1] "abc" "def" "ghi" "klm"
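
    To see why the "+" after the character class matters, here is a small side-by-side sketch. The variable names single and run are just for illustration.

    ```r
    test_2 <- "abc, def ghi klm"

    # Single-character alternatives: the comma and the space after it count
    # as two separate delimiters, so an empty string is left between them.
    single <- strsplit(test_2, ",| ")[[1]]
    print(single)  # "abc" "" "def" "ghi" "klm"

    # Character class with "+": a whole run of commas and spaces is treated
    # as one delimiter, so no empty strings appear.
    run <- strsplit(test_2, "[, ]+")[[1]]
    print(run)     # "abc" "def" "ghi" "klm"
    ```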
    
  • 2020-12-01 03:25

    In case you don't like regular expressions, you can call strsplit() multiple times:

    strsplits <- function(x, splits, ...)
    {
        for (split in splits)
        {
            x <- unlist(strsplit(x, split, ...))
        }
        return(x[x != ""]) # Remove empty strings
    }
    
    strsplits(test_1, c(" ", ","))
    # "abc" "def" "ghi" "klm"
    strsplits(test_2, c(" ", ","))
    # "abc" "def" "ghi" "klm"
    

    Updated for the added example

    strsplits(test_1, c("[[:punct:]]","[[:space:]]"))
    # "abc" "def" "ghi" "klm"
    strsplits(test_2, c("[[:punct:]]","[[:space:]]"))
    # "abc" "def" "ghi" "klm"
    

    But if you are going to use regular expressions, you might as well go with @DWin's approach:

    strsplit(test_1, "[[:punct:][:space:]]+")[[1]]
    # "abc" "def" "ghi" "klm"
    strsplit(test_2, "[[:punct:][:space:]]+")[[1]]
    # "abc" "def" "ghi" "klm"
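
    Because strsplits() forwards "..." to strsplit(), you can also pass fixed = TRUE when the separators themselves are regex metacharacters. A rough sketch, assuming a made-up input string piped; the function is repeated here only so the example runs on its own.

    ```r
    strsplits <- function(x, splits, ...) {
        for (split in splits) {
            x <- unlist(strsplit(x, split, ...))
        }
        x[x != ""]  # drop empty strings
    }

    # Hypothetical input whose separators are "." and "|", both of which
    # have a special meaning in a regular expression.
    piped <- "abc.def|ghi klm"

    # fixed = TRUE makes strsplit() treat each separator literally.
    res <- strsplits(piped, c(".", "|", " "), fixed = TRUE)
    print(res)  # "abc" "def" "ghi" "klm"
    ```

    Without fixed = TRUE, "." would match any character and "|" would be read as an alternation operator, so the separators would need escaping instead.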
    
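  • You could go with strsplit(test_1, "\\W"). For test_2, where the comma is followed by a space, use "\\W+" instead so that no empty string ends up in the result.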
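
    One thing to keep in mind with "\\W": it splits on anything that is not a word character, and word characters include digits and the underscore. A short sketch; the third input string is a made-up example.

    ```r
    test_2 <- "abc, def ghi klm"

    # Each non-word character is its own delimiter here.
    one_char <- strsplit(test_2, "\\W")[[1]]
    print(one_char)    # "abc" "" "def" "ghi" "klm"

    # With "+", consecutive non-word characters collapse into one delimiter.
    collapsed <- strsplit(test_2, "\\W+")[[1]]
    print(collapsed)   # "abc" "def" "ghi" "klm"

    # Underscores and digits are word characters, so they are not split on.
    with_underscore <- strsplit("abc_1, def", "\\W+")[[1]]
    print(with_underscore)  # "abc_1" "def"
    ```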
