Split parts of strings into a list column and then make a vector column

自闭症患者 2021-01-22 16:49

I'm trying to run a function on every row fulfilling a certain criterion, which returns a data frame. The idea is then to take the list of data frames and rbindlist them together...

1 Answer
  • 2021-01-22 17:21

    We can rewrite this much more compactly, without the function. We'll do it in two steps: first we create a new column that holds a list (data.table columns can hold almost anything, even embedded data.tables), and then we extract the list entries into a new data.table.

    url_pattern <- "http[^([:blank:]|\\\"|<|&|#\n\r)]+"
    
    db[(has_url), urls := str_match_all(text, url_pattern)]
    urls <- db[(has_url), list(url=unlist(urls)), by=id]
    

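    Note that we use (has_url) instead of has_url == T. This subsets with the logical column directly and skips the extra comparison, so it is faster (although in this case most of the time is taken up by str_match_all, so it won't make much difference). Make sure you keep the parentheses, though: a bare has_url in the first argument is looked up as a variable in the calling scope rather than as a column of db, so it won't work.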
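
    As a minimal sketch of the parenthesised subset, the snippet below uses a hypothetical toy table (toy, with a logical column flag standing in for has_url; neither name is from the original post) to show that the two forms select the same rows:

```r
library(data.table)

# Hypothetical toy data: `flag` plays the role of has_url
toy <- data.table(id = 1:4, flag = c(TRUE, FALSE, TRUE, FALSE))

# Parentheses make data.table evaluate `flag` as a column of toy
by_parens <- toy[(flag)]

# Equivalent subset written as an explicit comparison
by_compare <- toy[flag == TRUE]

# Both forms keep the rows where flag is TRUE
identical(by_parens$id, by_compare$id)
```

    Any timing difference between the two forms depends on the data and the data.table version, so it is worth measuring on your own table rather than assuming.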

    The second line creates db$urls, a list column holding the URLs matched in each post. The third line generates a new data.table with one row per URL, with the id field linking each URL back to the forum post it came from.

    db has 146k rows, db[(has_url),] has 11k rows, and urls has 30k rows (some posts have several urls).

    Sample output from head(urls):

    id  url
    14  http://reganmian.net/blog
    44  http://vg.no
    59  http://koran.co.id
    

    Update: simple reproducible example

    Let's first generate some data

    library(data.table)
    library(stringr)
    
    texts <- c("Stian fruit:apple, fruit:banana and fruit:pear",
               "Peter fruit:apple",
               "fruit:banana is delicious",
               "I don't agree")
    DT <- data.table(text = texts, id = seq_along(texts))
    
    DT
                                                 text id
    1: Stian fruit:apple, fruit:banana and fruit:pear  1
    2:                              Peter fruit:apple  2
    3:                      fruit:banana is delicious  3
    4:                                  I don't agree  4
    

    We want to grab all the "fruits" from the text column (each row might have one, several or no fruits). We first use str_match_all to put a list of individual fruits into a new column.

    pattern <- "fruit:\\S*"
    
    DT[, fruit_list := str_match_all(text, pattern)]
    

    Now the fruit_list column looks like this:

    > DT[1]$fruit_list
    [[1]]
         [,1]          
    [1,] "fruit:apple,"
    [2,] "fruit:banana"
    [3,] "fruit:pear"  
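
    Each entry is a character matrix because str_match_all also reserves columns for capture groups. Since this pattern has no groups, a possible variation (not part of the original answer) is stringr's str_extract_all, which returns plain character vectors instead. A small sketch of the difference, using a shortened copy of the example texts:

```r
library(stringr)

texts <- c("Stian fruit:apple, fruit:banana and fruit:pear",
           "I don't agree")
pattern <- "fruit:\\S*"

# str_match_all: one character matrix per input string
as_matrix <- str_match_all(texts, pattern)

# str_extract_all: one character vector per input string
as_vector <- str_extract_all(texts, pattern)

is.matrix(as_matrix[[1]])   # TRUE
as_vector[[1]]              # the three matches from the first text
length(as_vector[[2]])      # 0, the second text has no matches
```

    Either form works with the unlist() step that follows, because unlist() flattens the single-column matrices as well.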
    

    Now we want to extract the fruits into a new table, with one row per fruit, keeping the link back to the ID

    fruits <- DT[, list(fruit=unlist(fruit_list)), by=id]
    

    And the result

    > fruits
       id        fruit
    1:  1 fruit:apple,
    2:  1 fruit:banana
    3:  1   fruit:pear
    4:  2  fruit:apple
    5:  3 fruit:banana
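
    Two details in this result are worth noting: id 4 is absent because that row had no matches, and the first fruit keeps its trailing comma because \\S* matches any non-whitespace character. If the fruit names are assumed to consist only of letters, digits and underscores (an assumption, not something stated in the original question), a tighter pattern such as "fruit:\\w+" avoids the comma. A sketch of the same pipeline under that assumption, using str_extract_all in place of str_match_all:

```r
library(data.table)
library(stringr)

texts <- c("Stian fruit:apple, fruit:banana and fruit:pear",
           "Peter fruit:apple",
           "fruit:banana is delicious",
           "I don't agree")
DT <- data.table(text = texts, id = seq_along(texts))

# Assumed pattern: fruit names are made of word characters only
tight_pattern <- "fruit:\\w+"

# List column with one character vector of matches per row
DT[, fruit_list := str_extract_all(text, tight_pattern)]

# One row per match; rows without matches contribute nothing
fruits <- DT[, list(fruit = unlist(fruit_list)), by = id]
fruits
```

    The list-column assignment relies on the table having more than one row, as in the original example.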
    

    (Thank you to Matthew Dowle and Ricardo Saporta on the data.table-help mailing list.)
