Removing hashtags, hyperlinks and Twitter handles from a dataset in R using gsub

萌比男神i 2021-01-25 15:14

I have searched but I'm not getting anywhere, probably because I'm very new to R and don't understand (and am getting intimidated by) how the logic/syntax for pattern matching and re

2 Answers
  • 2021-01-25 15:55

    Twitter provides a set of libraries for working with tweet text, and with good reason: entities (Twitter's term for the non-textual components of a tweet) are pretty "ugh" to handle by hand. Hashtags follow some esoteric rules, URLs are also kinda "ugh" to regex away, and there are infrequently used "cashtags" ($XYZ) for stock quotes.

    Unfortunately, Twitter does not provide an R, Python or proper C[++] library, but we can call its Java library from R via rJava:

    library(rJava)
    

    Gather dependencies:

    c(
      "http://central.maven.org/maven2/com/twitter/twittertext/twitter-text/2.0.10/twitter-text-2.0.10.jar", 
      "http://central.maven.org/maven2/com/fasterxml/jackson/dataformat/jackson-dataformat-yaml/2.9.1/jackson-dataformat-yaml-2.9.1.jar",
      "http://central.maven.org/maven2/com/fasterxml/jackson/core/jackson-databind/2.8.7/jackson-databind-2.8.7.jar",
      "http://central.maven.org/maven2/com/fasterxml/jackson/core/jackson-core/2.8.1/jackson-core-2.8.1.jar",
      "http://central.maven.org/maven2/com/fasterxml/jackson/core/jackson-annotations/2.8.1/jackson-annotations-2.8.1.jar"
    ) -> deps
    
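    # download any jars that are not already in the working directory
    needed <- !file.exists(basename(deps))
    if (any(needed)) {
      # "libcurl" accepts vectors of URLs; mode = "wb" keeps the jars intact on Windows
      download.file(deps[needed], basename(deps)[needed], method = "libcurl", mode = "wb")
    }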
    

    Initialize the JVM:

    .jinit(force.init = TRUE)
    

    Add dependent classes:

    for (cp in basename(deps)) .jaddClassPath(cp)
    

    Your sample data:

    tweet <- ("\"If you like your doctor, you can keep your doctor.\" - #Obama 
    #GunControl #GunControlNow pic.twitter.com/JpLpkj2LHB I don't know if
    {Michelle #Obama} noticed, but I am not White & I am Not Male.
    pic.twitter.com/TPplBj8ovg . @Eminem being honored by #Obama for his
    rap battle win against @POTUS pic.twitter.com/YaYIuYWGlc")
    

    Make the Java extractor function usable from R:

    extractor <- new(J("com.twitter.twittertext.Extractor"))
    

    We're eventually going to want to iterate over the start/end indices for all the identified entities so extract them all and make them something we can iterate over in R:

    entities <- extractor$extractEntitiesWithIndices(tweet)$toArray()
    

    Since we're working with the indices of the entities, we need a logical vector as long as the tweet to mark which characters to keep, defaulting to keeping all of them:

    to_extract <- rep(TRUE, nchar(tweet))
    

    Negate index ranges of the found entities:

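    for (i in seq_along(entities)) {
      # Java indices are 0-based with an exclusive end, so in R's 1-based
      # indexing start:end also drops the character just before each entity
      to_extract[entities[[i]]$getStart():entities[[i]]$getEnd()] <- FALSE
    }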
    

    Now remove them (this kind of character manipulation is not a strong point of R):

    cat(paste0(strsplit(tweet, "")[[1]][to_extract], collapse=""))
    ## "If you like your doctor, you can keep your doctor." -  I don't know if
    ## {Michelle} noticed, but I am not White & I am Not Male. . being honored by for his
    ## rap battle win against
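
    The masking step itself needs nothing from Java, so it can be tried in isolation. Here is a minimal base R sketch, assuming spans are given as 0-based start indices with exclusive ends (the Java convention). The helper name remove_spans and the hand-computed span for @POTUS are illustrative assumptions. Unlike the loop above, it shifts the start by one, so the character before each entity is kept:

    ```r
    # Hypothetical helper: drop character ranges given as 0-based [start, end)
    # pairs, converting them to R's 1-based inclusive positions.
    remove_spans <- function(text, starts, ends) {
      chars <- strsplit(text, "")[[1]]
      keep <- rep(TRUE, length(chars))
      for (i in seq_along(starts)) {
        if (ends[i] > starts[i]) {
          # 0-based [start, end) is positions (start + 1)..end in R
          keep[(starts[i] + 1):ends[i]] <- FALSE
        }
      }
      paste0(chars[keep], collapse = "")
    }

    # "@POTUS" occupies 0-based indices 14..19, i.e. the span [14, 20)
    print(remove_spans("rap battle vs @POTUS today", starts = 14, ends = 20))
    ## [1] "rap battle vs  today"
    ```

    Both surrounding spaces survive here, which is why the result contains a double space; collapsing whitespace afterwards is a separate decision.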
    

    If you're new to R, the approach above is likely not the path for you. The same goes if you're on a crippled, legacy operating system like Windows, where getting Java to work with R is fraught with peril.

    However, naive regex-ing will likely end up mangling as well as extracting.
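
    To make the mangling concrete, here is a small sketch. The pattern and both sample strings are illustrative assumptions, chosen to show two ways a simple character-class regex can go wrong:

    ```r
    # A deliberately naive pattern (illustrative, not a recommendation)
    naive <- "#[A-Za-z0-9]+|@[A-Za-z0-9]+"

    samples <- c(
      "Mail me at user@example.com",  # an email address, not a handle
      "#gun_control now"              # underscores are legal in hashtags
    )

    mangled <- gsub(naive, "", samples)
    print(mangled)
    ## [1] "Mail me at user.com" "_control now"
    ```

    The email address loses its middle and the hashtag is only half removed, which is the kind of damage an entity-aware extractor avoids.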

  • 2021-01-25 16:16

    Here is a solution which seems to be working (see below for caveats):

    # x is your input text
    gsub("#[A-Za-z0-9]+|@[A-Za-z0-9]+|\\w+(?:\\.\\w+)*/\\S+", "", x)
    
    [1] "\"If you like your doctor, you can keep your doctor.\" -
        I don't know if {Michelle } noticed, but I am not White & I am Not Male.  .
        being honored by  for his rap battle win against  "
    

    Note that this assumes that your URLs would always be of the form pic.twitter.com/TPplBj8ovg. That is, there would be one or more domain components followed by a path, and no leading protocol. In general, to match any URL, we would have to use a much more complicated pattern.
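
    As one possible extension, the sketch below also handles URLs that do carry a leading http:// or https://, allows underscores in hashtags and handles, and tidies up the leftover whitespace. The helper name strip_entities and the extra alternatives are assumptions for illustration. It is still far from a general URL matcher: it would also strip things like "and/or" or the domain part of an email address.

    ```r
    # Hypothetical helper; the name and the pattern are illustrative assumptions.
    strip_entities <- function(x) {
      pattern <- paste(
        "#\\w+",                  # hashtags, underscores included
        "@\\w+",                  # handles
        "https?://\\S+",          # URLs with a leading protocol
        "\\w+(?:\\.\\w+)*/\\S+",  # protocol-less URLs such as pic.twitter.com/...
        sep = "|"
      )
      out <- gsub(pattern, "", x, perl = TRUE)
      # collapse the whitespace left behind by the removed entities
      trimws(gsub("\\s+", " ", out, perl = TRUE))
    }

    print(strip_entities("Read https://t.co/abc123 and pic.twitter.com/XyZ by @user #R_stats!"))
    ## [1] "Read and by !"
    ```

    The protocol alternative is listed before the protocol-less one so that a URL such as https://t.co/abc123 is removed whole rather than leaving "https://" behind.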
