I need a function that extracts any type of bracket, i.e. (), [], {}, and the information in between. I created one and got it to do what I want, but I get an annoying w
Suppose that brackets are not nested and that we have this test data:
x <- c("a (bb) [ccc]{d}e", "x[a]y")
Then, using strapply in gsubfn, we have this two-line solution, which first translates all parentheses and square brackets to braces and then processes the result:
library(gsubfn)
xx <- chartr("[]()", "{}{}", x)
s <- strapply(xx, "{([^}]*)}", c)
The result of the above is the following list:
> s
[[1]]
[1] "bb" "ccc" "d"
[[2]]
[1] "a"
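For readers who do not have gsubfn installed, the same translate-then-extract idea can be sketched in base R. This is only an illustration, not the approach above: it swaps strapply for regmatches and gregexpr with a Perl-style lookbehind and lookahead, and it shares the same limitation that chartr makes all bracket types interchangeable (so a mismatched pair such as "(a]" would also match).

```r
x <- c("a (bb) [ccc]{d}e", "x[a]y")

# Translate every bracket type to braces, as in the gsubfn solution
xx <- chartr("[]()", "{}{}", x)

# Keep only the text between "{" and the next "}" (brackets excluded)
m <- regmatches(xx, gregexpr("(?<=\\{)[^}]*(?=\\})", xx, perl = TRUE))

m[[1]]
# [1] "bb"  "ccc" "d"
m[[2]]
# [1] "a"
```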
Maybe this function is a little more straightforward? Or at least more compact.
bracketXtract <-
    function(txt, br = c("(", "[", "{", "all"), with=FALSE)
{
    br <- match.arg(br)
    left <-        # what pattern are we looking for on the left?
        if ("all" == br) "\\(|\\{|\\["
        else sprintf("\\%s", br)
    map <-         # what's the corresponding pattern on the right?
        c(`\\(`="\\)", `\\[`="\\]", `\\{`="\\}",
          `\\(|\\{|\\[`="\\)|\\}|\\]")
    fmt <-         # create the appropriate regular expression
        if (with) "(%s).*?(%s)"
        else "(?<=%s).*?(?=%s)"
    re <- sprintf(fmt, left, map[left])
    regmatches(txt, gregexpr(re, txt, perl=TRUE))    # do it!
}
No need to lapply; the regular expression functions are already vectorized over their input. This fails with nested parentheses; regular expressions are likely not a good solution if nesting matters. Here it is in action:
> txt <- c("I love chicken [unintelligible]!",
+ "Me too! (laughter) It's so good.[interupting]",
+ "Yep it's awesome {reading}.",
+ "Agreed.")
> bracketXtract(txt, "all")
[[1]]
[1] "unintelligible"
[[2]]
[1] "laughter" "interupting"
[[3]]
[1] "reading"
[[4]]
character(0)
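On the nesting caveat: if nested brackets do matter, one option is to scan the characters and track the nesting depth instead of using a regular expression. The following is a minimal sketch under that assumption; extract_nested is a hypothetical helper name (not part of any package), it handles one bracket type and one string at a time, and it returns only the outermost groups.

```r
# Hypothetical helper: contents of each top-level open/close pair in one string
extract_nested <- function(s, open = "(", close = ")") {
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  depth <- 0L            # current nesting level
  start <- NA_integer_   # position just after the outermost opening bracket
  out <- character(0)
  for (i in seq_along(chars)) {
    if (chars[i] == open) {
      if (depth == 0L) start <- i + 1L
      depth <- depth + 1L
    } else if (chars[i] == close && depth > 0L) {
      depth <- depth - 1L
      # Back at level 0: the outermost group has just closed
      if (depth == 0L) out <- c(out, substr(s, start, i - 1L))
    }
  }
  out
}

extract_nested("f(a (b) c) and (d)")
# [1] "a (b) c" "d"
```

To apply it to a character vector, wrap it in lapply, since the loop itself is not vectorized.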
The result of bracketXtract fits without trouble into a data.frame.
> examp2 <- data.frame(var1=1:4)
> examp2$text <- c("I love chicken [unintelligible]!",
+ "Me too! (laughter) It's so good.[interupting]",
+ "Yep it's awesome {reading}.", "Agreed.")
> examp2$text2<-bracketXtract(examp2$text, 'all')
> examp2
  var1                                          text                 text2
1    1              I love chicken [unintelligible]!        unintelligible
2    2 Me too! (laughter) It's so good.[interupting] laughter, interupting
3    3                   Yep it's awesome {reading}.               reading
4    4                                       Agreed.
The warning you were seeing has to do with trying to stick a matrix into a data frame. I think the answer is "don't do that".
> df = data.frame(x=1:2)
> df$y = matrix(list(), 2, 2)
> df
x y
1 1 NULL
2 2 NULL
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs
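One way to avoid the warning, sketched on the assumption that you want one entry per row: assign a plain list (one element per row) rather than a matrix, or collapse each element into a single string first. The data below are made up for illustration.

```r
df <- data.frame(x = 1:2)

# Made-up matches, one list element per row of df
hits <- list(c("bb", "ccc"), character(0))

df$y <- hits                                                  # list column, no matrix involved
df$z <- vapply(hits, paste, character(1), collapse = ", ")   # or flatten to plain strings

df$z
# [1] "bb, ccc" ""
```

The list column keeps the individual matches available for later processing; the collapsed character column is easier to print and export.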
My thought had been to make 6 (implicitly vectorized) helper functions, but I will be studying Martin's code instead, since he is much better at this than I:
rm.curlybkt.no <-function(x) gsub("(\\{).*(\\})", "\\1\\2", x, perl=TRUE)
rm.rndbkt.no <- function(x) gsub("(\\().*(\\))", "\\1\\2", x, perl=TRUE)
rm.sqrbkt.no <- function(x) gsub("(\\[).*(\\])", "\\1\\2", x, perl=TRUE)
rm.rndbkt.in <- function(x) gsub("\\(.*\\)", "", x)
rm.curlybkt.in <- function(x) gsub("\\{.*\\}", "", x)
rm.sqrbkt.in <- function(x) gsub("\\[.*\\]", "", x)
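One thing to watch with these helpers is that .* is greedy, so a string containing two bracketed groups loses everything between the first opening and the last closing bracket. A small sketch of the difference, using a negated character class as one possible alternative (an illustration only, not the original author's code):

```r
x <- "a (b) c (d) e"

# Greedy: matches from the first "(" to the last ")"
greedy <- gsub("\\(.*\\)", "", x)

# Negated class: each match stops at the first ")"
per_group <- gsub("\\([^)]*\\)", "", x)

greedy
# [1] "a  e"
per_group
# [1] "a  c  e"
```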
Give this a shot. I prefer the stringr package! :)
bracketXtract <- function(string, bracket = "all", include.bracket = TRUE){
  # Load stringr package
  require(stringr)
  # Regular expressions for your brackets
  # (\\w* matches word characters only, so contents with spaces are skipped)
  rgx = list(square = "\\[\\w*\\]", curly = "\\{\\w*\\}", round = "\\(\\w*\\)")
  rgx['all'] = sprintf('(%s)|(%s)|(%s)', rgx$square, rgx$curly, rgx$round)
  # Ensure you have the correct bracket name
  stopifnot(bracket %in% names(rgx))
  # Find your matches (only the first element of string is used)
  matches = str_extract_all(string, pattern = rgx[[bracket]])[[1]]
  # Remove brackets from results if needed
  if(!include.bracket)
    matches = sapply(matches, function(m) substr(m, 2, nchar(m)-1))
  unname(matches)
}
j <- "What kind of cheese isn't your cheese? {wonder} Nacho cheese! [groan] (Laugh)"
bracketXtract(j)
# [1] "{wonder}" "[groan]" "(Laugh)"
bracketXtract(j, bracket = "square")
# [1] "[groan]"
bracketXtract(j, include.bracket = FALSE)
# [1] "wonder" "groan" "Laugh"
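The stringr function above looks only at the first string and only at word characters inside the brackets. If you need it to work on a whole vector and on contents with spaces, a variant along these lines might do; bracket_extract_all is a hypothetical name, and the pattern uses negated character classes in place of \\w*, so this is a sketch of one possible design and not a drop-in replacement.

```r
library(stringr)

# Hypothetical vectorized variant: returns a list with one element per input string
bracket_extract_all <- function(string, include.bracket = TRUE) {
  # Any of [...], {...}, (...) with no closing bracket of the same kind inside
  rgx <- "\\[[^\\]]*\\]|\\{[^}]*\\}|\\([^)]*\\)"
  matches <- str_extract_all(string, rgx)
  if (!include.bracket) {
    # Drop the first and last character of every match
    matches <- lapply(matches, function(m) str_sub(m, 2, -2))
  }
  matches
}

txt <- c("Me too! (laughter) It's so good.[cross talk]", "Agreed.")
res <- bracket_extract_all(txt, include.bracket = FALSE)
res[[1]]
# [1] "laughter"   "cross talk"
```

Like the regular-expression answers above, this still does not handle nested brackets.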