I'm trying to run a function, which returns a data frame, on every row fulfilling a certain criterion. The idea is then to take the list of data frames and rbindlist them together.
We can rewrite this much more compactly, without the helper function. We'll do it in two steps: first we create a new column that holds a list (data.table columns can hold almost anything, even embedded data.tables), and then we extract these into a new data.table. The code needs the data.table and stringr packages (str_match_all comes from stringr).
url_pattern <- "http[^([:blank:]|\\\"|<|&|#\n\r)]+"
db[(has_url), urls := str_match_all(text, url_pattern)]
urls <- db[(has_url), list(url=unlist(urls)), by=id]
Note that we use (has_url) instead of has_url == TRUE. This uses the logical column directly as the row index, which avoids an extra element-wise comparison and is faster (although in this case most of the time is taken up by str_match_all, so it won't make much difference). Make sure you keep the parentheses, though: without them, data.table treats a bare has_url in i as a variable from the calling scope rather than as a column, so it won't work.
The second line creates db$urls, a list column holding the URLs matched in each post. The third line generates a new data.table with one row per URL, with the id column linking it back to the forum post it came from.
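To make the row-filter point concrete, here is a minimal sketch on a made-up table (toy and its values are invented for illustration; only the has_url column name comes from the example above). It just checks that both forms select the same rows.

```r
library(data.table)

# Hypothetical toy table; has_url is a logical column, as in the example above.
toy <- data.table(id = 1:4,
                  has_url = c(TRUE, FALSE, TRUE, FALSE))

# Parentheses make data.table evaluate has_url as a column expression.
direct <- toy[(has_url)]

# The explicit comparison selects the same rows via an extra comparison step.
compared <- toy[has_url == TRUE]

# Both filters keep the same ids.
same_rows <- identical(direct$id, compared$id)
```

On a table this small the difference in speed is not measurable; the sketch is only meant to show that the two filters are interchangeable in what they return.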
db has 146k rows, db[(has_url),] has 11k rows, and urls has 30k rows (some posts have several urls).
Sample output from head(urls):
id url
14 http://reganmian.net/blog
44 http://vg.no
59 http://koran.co.id
Update: a simple reproducible example
Let's first generate some data:
library(data.table)
library(stringr)

texts <- c("Stian fruit:apple, fruit:banana and fruit:pear",
           "Peter fruit:apple",
           "fruit:banana is delicious",
           "I don't agree")
DT <- data.table(text = texts, id = seq_along(texts))
DT
text id
1: Stian fruit:apple, fruit:banana and fruit:pear 1
2: Peter fruit:apple 2
3: fruit:banana is delicious 3
4: I don't agree 4
We want to grab all the "fruits" from the text column (each row might have one, several or no fruits). We first use str_match_all to put a list of individual fruits into a new column.
pattern <- "fruit:\\S*"
DT[, fruit_list := str_match_all(text, pattern)]
Now the fruit_list column looks like this:
> DT[1]$fruit_list
[[1]]
[,1]
[1,] "fruit:apple,"
[2,] "fruit:banana"
[3,] "fruit:pear"
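Each entry of fruit_list is a one-column character matrix wrapped in a list. As a rough base-R sketch of why unlist() is enough to flatten it (entry is a hand-built stand-in for DT[1]$fruit_list, not produced by str_match_all here):

```r
# Hand-built stand-in for one fruit_list entry: a list holding a
# one-column character matrix.
entry <- list(matrix(c("fruit:apple,", "fruit:banana", "fruit:pear"), ncol = 1))

# unlist() drops both the list wrapper and the matrix dimensions,
# leaving a plain character vector with one element per match.
flat <- unlist(entry)

is.character(flat)
length(flat)
```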
Now we want to extract the fruits into a new table, with one row per fruit, keeping the link back to the id:
fruits <- DT[, list(fruit=unlist(fruit_list)), by=id]
And the result:
> fruits
id fruit
1: 1 fruit:apple,
2: 1 fruit:banana
3: 1 fruit:pear
4: 2 fruit:apple
5: 3 fruit:banana
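If the intermediate list column is not needed, the two steps can probably be folded into one. The sketch below is a variant rather than the approach shown above: it swaps in stringr's str_extract_all, which returns plain character vectors instead of matrices, and the name fruits_direct is made up for this illustration.

```r
library(data.table)
library(stringr)

texts <- c("Stian fruit:apple, fruit:banana and fruit:pear",
           "Peter fruit:apple",
           "fruit:banana is delicious",
           "I don't agree")
DT <- data.table(text = texts, id = seq_along(texts))

pattern <- "fruit:\\S*"

# Extract and flatten per id in a single grouped call.
# Rows without any match contribute no rows to the result.
fruits_direct <- DT[, list(fruit = unlist(str_extract_all(text, pattern))), by = id]
```

Note that the pattern still captures the trailing comma in "fruit:apple,". If that is unwanted, a narrower pattern such as "fruit:[[:alpha:]]+" should drop it, assuming fruit names contain only letters.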
(thank you to Matthew Dowle and Ricardo Saporta on the data.table-help mailing list)