Removing hashtags, hyperlinks and Twitter handles from a dataset in R using gsub


I have searched but I'm not getting anywhere, probably because I'm very new to R and don't understand (and am intimidated by) how the logic/syntax for pattern matching and re

2 Answers

鱼传尺愫

2021-01-25 15:55

Twitter provides a set of libraries for working with tweet text. There is a reason for it since entities (the idiomatic term for the non-textual components of a tweet as specified by Twitter) are pretty "ugh" and Twitter hashtags have some esoteric rules and URLs are also kinda "ugh" to regex away. Plus there are infrequently used "cashtags" ($XYZ) for stock quotes.

Unfortunately, Twitter does not provide an R library, a Python library or a proper C[++] library, but it does provide a Java one, which we can call through rJava:

```
library(rJava)
```

Gather dependencies:

```
# if central.maven.org no longer resolves, the same paths are usually
# available under https://repo1.maven.org/maven2/
c(
  "http://central.maven.org/maven2/com/twitter/twittertext/twitter-text/2.0.10/twitter-text-2.0.10.jar",
  "http://central.maven.org/maven2/com/fasterxml/jackson/dataformat/jackson-dataformat-yaml/2.9.1/jackson-dataformat-yaml-2.9.1.jar",
  "http://central.maven.org/maven2/com/fasterxml/jackson/core/jackson-databind/2.8.7/jackson-databind-2.8.7.jar",
  "http://central.maven.org/maven2/com/fasterxml/jackson/core/jackson-core/2.8.1/jackson-core-2.8.1.jar",
  "http://central.maven.org/maven2/com/fasterxml/jackson/core/jackson-annotations/2.8.1/jackson-annotations-2.8.1.jar"
) -> deps

# download only the JARs that are missing locally; file.exists() has to be
# given the local file names, not the URLs
for (u in deps[!file.exists(basename(deps))]) {
  download.file(u, basename(u), mode = "wb") # "wb" keeps binary files intact on Windows
}
```

Initialize the JVM:

```
.jinit(force.init = TRUE)
```

Add dependent classes:

```
for (cp in basename(deps)) .jaddClassPath(cp)
```

Your sample data:

```
tweet <- ("\"If you like your doctor, you can keep your doctor.\" - #Obama 
#GunControl #GunControlNow pic.twitter.com/JpLpkj2LHB I don't know if
{Michelle #Obama} noticed, but I am not White & I am Not Male.
pic.twitter.com/TPplBj8ovg . @Eminem being honored by #Obama for his
rap battle win against @POTUS pic.twitter.com/YaYIuYWGlc")
```

Make the Java extractor function usable from R:

```
extractor <- new(J("com.twitter.twittertext.Extractor"))
```

We're eventually going to want to iterate over the start/end indices for all the identified entities so extract them all and make them something we can iterate over in R:

```
entities <- extractor$extractEntitiesWithIndices(tweet)$toArray()
```

Since we're working with indices of the entities we'll need a vector of the length of the tweet to create markers for extraction, defaulting to extracting all of them:

```
to_extract <- rep(TRUE, nchar(tweet))
```

Negate index ranges of the found entities:

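```
for (i in seq_along(entities)) {
  # the Java indices appear to be 0-based with an exclusive end, so in R's 1-based
  # indexing start:end also covers the character just before each entity (usually a space)
  to_extract[entities[[i]]$getStart():entities[[i]]$getEnd()] <- FALSE
}
```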

Now, remove them (this character manipulation is not a strong point of R)

```
cat(paste0(strsplit(tweet, "")[[1]][to_extract], collapse=""))
## "If you like your doctor, you can keep your doctor." -  I don't know if
## {Michelle} noticed, but I am not White & I am Not Male. . being honored by for his
## rap battle win against
```
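
The mask-and-drop step can also be tried without a JVM. The following is a minimal sketch in base R, assuming the spans arrive as 0-based start indices with exclusive end indices (the usual Java convention). `remove_spans` and its arguments are hypothetical names used only for illustration; they are not part of rJava or twitter-text.

```r
# Hypothetical helper: drop 0-based, end-exclusive spans from a single string
remove_spans <- function(txt, starts, ends) {
  chars <- strsplit(txt, "")[[1]]
  keep <- rep(TRUE, length(chars))
  for (i in seq_along(starts)) {
    # shift the 0-based start to R's 1-based indexing; the exclusive end
    # already equals the 1-based position of the last character
    if (ends[i] > starts[i]) keep[(starts[i] + 1):ends[i]] <- FALSE
  }
  paste0(chars[keep], collapse = "")
}

# "#tag" occupies 0-based positions 3 to 6, i.e. the span [3, 7)
remove_spans("hi #tag there", starts = 3, ends = 7)
```

Under these assumptions only the entity itself is removed, so the spaces on either side of it are both kept.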

If you're new to R, then the approach above is likely not the path for you. If you're on a crippled, legacy operating system like Windows, where getting Java to work with R is fraught with peril, it is likely not the path for you either.

However, naive regex-ing will likely end up mangling as well as extracting.
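
To make the mangling concrete, here is a small sketch with a deliberately simple pattern. The two input strings are hypothetical examples chosen to show over-matching; they are not from the question's data.

```r
# Hypothetical inputs: an e-mail address and a URL with a fragment
x <- c("mail me at jane@example.com", "see http://example.com/page#intro")

# A naive pattern: "#" or "@" followed by letters/digits
naive <- gsub("#[A-Za-z0-9]+|@[A-Za-z0-9]+", "", x)
print(naive)
```

The e-mail address loses `@example` and the URL loses its `#intro` fragment, even though neither is a Twitter handle or hashtag.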


眼角桃花

2021-01-25 16:16
Here is a solution which seems to be working (see below for caveats):
```
# x is your input text
gsub("#[A-Za-z0-9]+|@[A-Za-z0-9]+|\\w+(?:\\.\\w+)*/\\S+", "", x)

[1] "\"If you like your doctor, you can keep your doctor.\" -
    I don't know if {Michelle } noticed, but I am not White & I am Not Male.  .
    being honored by  for his rap battle win against  "
```
Note that this assumes that your URLs will always be of the form pic.twitter.com/TPplBj8ovg. That is, there would be one or more domain components, one item in the path, and no leading protocol. In general, to match any URL, we would have to use a much more complicated pattern.
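
One possible way to loosen those assumptions slightly is sketched below. Note that in the pattern above the `*` makes the dotted part optional, so plain text such as `and/or` would match as well. The sketch requires at least one dot, allows an optional `http://` or `https://` prefix, and tidies the leftover whitespace. `strip_entities` is a hypothetical helper name, and the pattern is still far from a complete URL matcher (for example, a URL with no path is left untouched).

```r
# Hypothetical helper, not part of the original answer
strip_entities <- function(x) {
  url     <- "(?:https?://)?\\w+(?:\\.\\w+)+/\\S+"  # optional protocol, at least one dot, a path
  hashtag <- "#\\w+"
  handle  <- "@\\w+"
  out <- gsub(paste(url, hashtag, handle, sep = "|"), "", x, perl = TRUE)
  # collapse the runs of whitespace left behind by the removals
  trimws(gsub("\\s+", " ", out))
}

strip_entities("Win against @POTUS and/or #Obama https://pic.twitter.com/YaYIuYWGlc now")
```

Here `and/or` survives because it contains no dot, while the handle, the hashtag and the URL are removed.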