I have searched but I'm not getting anywhere, probably because I'm very new to R and don't yet understand (and am intimidated by) the logic/syntax for pattern matching and re
Twitter provides a set of libraries for working with tweet text. There is a reason for that: entities (the idiomatic term for the non-textual components of a tweet as specified by Twitter) are pretty "ugh", Twitter hashtags have some esoteric rules, and URLs are also kinda "ugh" to regex away. Plus there are infrequently used "cashtags" ($XYZ) for stock quotes.
Unfortunately, Twitter does not have an R library, Python library or proper C[++] library, but we can use rJava for this:
Gather dependencies:
) -> deps
# download if necessary
# the files are saved under their base names, so test for the local copy
if (!file.exists(basename(deps[1]))) { # assume we need them all if one is missing
  download.file(deps, basename(deps))
}
Initialize the JVM:
library(rJava) # provides .jinit(), .jaddClassPath(), J() and new()
.jinit(force.init = TRUE)
Add dependent classes:
for (cp in basename(deps)) .jaddClassPath(cp)
Your sample data:
tweet <- ("\"If you like your doctor, you can keep your doctor.\" - #Obama
#GunControl #GunControlNow pic.twitter.com/JpLpkj2LHB I don't know if
{Michelle #Obama} noticed, but I am not White & I am Not Male.
pic.twitter.com/TPplBj8ovg . @Eminem being honored by #Obama for his
rap battle win against @POTUS pic.twitter.com/YaYIuYWGlc")
Make the Java extractor function usable from R:
extractor <- new(J("com.twitter.twittertext.Extractor"))
We're eventually going to want to iterate over the start/end indices for all the identified entities so extract them all and make them something we can iterate over in R:
entities <- extractor$extractEntitiesWithIndices(tweet)$toArray()
Since we're working with indices of the entities we'll need a vector of the length of the tweet to create markers for extraction, defaulting to extracting all of them:
to_extract <- rep(TRUE, nchar(tweet))
Negate index ranges of the found entities:
for (i in seq_along(entities)) {
  # mark every position covered by this entity as "do not keep"
  to_extract[entities[[i]]$getStart():entities[[i]]$getEnd()] <- FALSE
}
Now, remove them (this kind of character manipulation is not a strong point of R):
cat(paste0(strsplit(tweet, "")[[1]][to_extract], collapse=""))
## "If you like your doctor, you can keep your doctor." - I don't know if
## {Michelle} noticed, but I am not White & I am Not Male. . being honored by for his
## rap battle win against
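The masking step does not depend on Java at all, so here is a minimal base R sketch of the same idea. The helper name `drop_spans` and the use of `gregexpr()` to produce the spans are assumptions made for illustration; in this sketch the spans are 1-based and inclusive, which is not necessarily how the Java extractor reports its indices.

```r
# Hypothetical helper: drop every character covered by a set of spans.
# `starts` and `ends` are 1-based, inclusive positions (an assumption of this sketch).
drop_spans <- function(text, starts, ends) {
  chars <- strsplit(text, "")[[1]]
  keep <- rep(TRUE, length(chars))      # keep everything by default
  for (i in seq_along(starts)) {
    keep[starts[i]:ends[i]] <- FALSE    # negate the range of each span
  }
  paste0(chars[keep], collapse = "")
}

# Stand-in for the extractor: locate hashtag-like spans with a regex
txt <- "rap battle #Obama win"
m <- gregexpr("#[A-Za-z0-9]+", txt)[[1]]
hit <- m > 0                            # gregexpr() returns -1 when nothing matches
starts <- as.integer(m)[hit]
ends <- starts + attr(m, "match.length")[hit] - 1L

drop_spans(txt, starts, ends)
# [1] "rap battle  win"
```

The space on each side of the removed hashtag survives, which is why the result contains a double space.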
If you're new to R then the approach above is likely not the path for you. If you're on a crippled, legacy operating system like Windows, where getting Java to work with R is fraught with peril, it is likely not the path for you either.
However, naive regex-ing will likely end up mangling as well as extracting.
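As a small illustration of that mangling, consider a simple mention pattern applied to text that contains an e-mail address. The pattern and the sample string are made up for this sketch:

```r
# A naive "mention" pattern: an @ followed by letters or digits
naive_mention <- "@[A-Za-z0-9]+"

# The address is not a mention, but part of it matches anyway
mangled <- gsub(naive_mention, "", "write to bob@example.com")
mangled
# [1] "write to bob.com"
```

The regex has no notion of what precedes the `@`, so it cannot tell a mention from the middle of an address.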
Here is a solution which seems to be working (see below for caveats):
# x is your input text
gsub("#[A-Za-z0-9]+|@[A-Za-z0-9]+|\\w+(?:\\.\\w+)*/\\S+", "", x)
[1] "\"If you like your doctor, you can keep your doctor.\" -
I don't know if {Michelle } noticed, but I am not White & I am Not Male. .
being honored by for his rap battle win against "
Note that this assumes that your URLs will always be of the form pic.twitter.com/TPplBj8ovg. That is, there would be one or more domain components, one item in the path, and no leading protocol. In general, to match any URL, we would have to use a much more complicated pattern.
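For what it's worth, here is a slightly looser sketch that also accepts an optional `http://` or `https://` prefix and an optional path, then collapses the whitespace left behind. The function name `strip_urls` and the pattern are assumptions for illustration, and this is still nowhere near a complete URL matcher: it will, for example, also eat text like `end.Next` where a space is missing after a full stop.

```r
# Hypothetical pattern: optional protocol, dotted host name ending in letters, optional path
url_pattern <- "(?:https?://)?(?:[A-Za-z0-9-]+\\.)+[A-Za-z]{2,}(?:/\\S*)?"

strip_urls <- function(x) {
  out <- gsub(url_pattern, "", x, perl = TRUE)  # PCRE handles the (?:...) groups
  out <- gsub("\\s+", " ", out)                 # collapse runs of whitespace
  trimws(out)
}

strip_urls("see https://t.co/abc123 and pic.twitter.com/xyz now")
# [1] "see and now"
```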