How to get the first 10 words in a string in R?

清歌不尽 2020-12-17 00:00

I have a string in R as

x <- "The length of the word is going to be of nice use to me"

I want the first 10 words of the above specified string.

4 Answers
  • 2020-12-17 00:03

    Here is a small function that splits the string into words, keeps the first ten, and pastes them back together.

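    string_fun <- function(x) {
      # Split on runs of whitespace; head() keeps at most ten words
      # and avoids NA padding when the input has fewer than ten
      ul <- head(unlist(strsplit(x, split = "\\s+")), 10)
      paste(ul, collapse = " ")
    }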
    
    string_fun(x)
    
    df <- read.table(text = "Keyword,City(Column Header)
    The length of the string should not be more than 10 is or are in,New York
    The Keyword should be of specific length is or are in,Los Angeles
                     This is an experimental basis program string is or are in,Seattle
                     Please help me with getting only the first ten words is or are in,Boston", sep = ",", header = TRUE)
    
    df <- as.data.frame(df)
    

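    Using apply (each row is passed in as a character vector of both columns, but the first column already supplies ten words, so the second column never contributes to the result):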

    df$Keyword <- apply(df[,1:2], 1, string_fun)
    

    EDIT: This is probably a more general way to use the function.

    df[,1] <- as.character(df[,1])
    df$Keyword <- unlist(lapply(df[,1], string_fun))
    
    print(df)
    #                      Keyword                            City.Column.Header.
    # 1    The length of the string should not be more than            New York
    # 2  The Keyword should be of specific length is or are         Los Angeles
    # 3  This is an experimental basis program string is or             Seattle
    # 4      Please help me with getting only the first ten              Boston
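Rows 3 and 4 in the output above contain only nine words. That appears to be because `read.table` with `sep = ","` keeps the leading spaces of those two input lines by default, so `strsplit` yields an empty string as the first token. A minimal sketch of a variant that trims first, assuming that behaviour is unwanted (`first_words` and its `n` argument are hypothetical names, not part of the answer above):

```r
# Hypothetical helper: trim surrounding whitespace, split on runs of
# whitespace, and keep at most the first n words of a single string.
first_words <- function(x, n = 10) {
  words <- strsplit(trimws(x), "\\s+")[[1]]
  paste(head(words, n), collapse = " ")
}

# Leading spaces no longer use up one of the ten slots
first_words("   This is an experimental basis program string is or are in")

# Strings with fewer than n words come back whole
first_words("only three words")
```

For a data frame column, something like `vapply(df$Keyword, first_words, character(1), USE.NAMES = FALSE)` should apply it row by row.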
    
  • 2020-12-17 00:03

    Regular expression (regex) answer using \w (word character) and its negation \W:

    gsub("^((\\w+\\W+){9}\\w+).*$","\\1",x)
    
    1. ^ Beginning of the string (zero-width)
    2. ((\\w+\\W+){9}\\w+) Ten words separated by not-words.
      1. (\\w+\\W+){9} A word followed by not-a-word, 9 times
        1. \\w+ One or more word characters (i.e. a word)
        2. \\W+ One or more non-word characters (i.e. a space)
        3. {9} Nine repetitions
      2. \\w+ The tenth word
    3. .* Anything else, including other following words
    4. $ End of the string (zero-width)
    5. \\1 When the pattern matches, replace the match with the first captured group (the 10 words)
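As a quick check of the breakdown above, here is a small sketch that applies the same pattern to the question's string and to an input that is too short to match. The variable names `pattern`, `first_ten` and `short` are only for illustration, and `sub` is used in place of `gsub` because the anchored pattern can match at most once.

```r
x <- "The length of the word is going to be of nice use to me"

# Same pattern as in the answer: nine "word + separator" pairs, then a tenth word
pattern <- "^((\\w+\\W+){9}\\w+).*$"

# Keep only the first captured group (the ten words)
first_ten <- sub(pattern, "\\1", x)
print(first_ten)

# With fewer than ten words the pattern does not match,
# so the input is returned unchanged
short <- sub(pattern, "\\1", "only three words")
print(short)
```

Because the pattern starts with `^\\w+`, a string that begins with whitespace or punctuation will not match either and is likewise returned as is.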
  • 2020-12-17 00:11
    x <- "The length of the word is going to be of nice use to me"
    # strsplit() returns a list, so take its first element before calling head()
    head(strsplit(x, split = " ")[[1]], 10)
    
  • 2020-12-17 00:18

    How about using the word function from Hadley Wickham's stringr package?

    library(stringr)
    word(string = x, start = 1, end = 10, sep = fixed(" "))
