Remove all punctuation from text including apostrophes for tm package

南旧 2021-01-24 00:01

I have a vector of Tweets (just the message text) that I am cleaning for text mining purposes. I have used removePunctuation from the tm

1 answer
  •  小蘑菇
    小蘑菇 (original poster)
    2021-01-24 00:06

    To remove all punctuation (including apostrophes and single quotes), you can just use gsub():

    x <- c("expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap",
           "who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https…",
           "rt oddly enough some republicans think climate change is real oddly enough… httpstcomtlfx1mnuf uniteblue https…",
           "better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok",
           "i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok",
           "why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl",
           "ted cruz ‘climate change is not science it’s religion’ httpstco0qqtbofe0h via glennbeck",
           "unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe",
           "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc",
           "what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o")
    
    gsub("[[:punct:]]", "", x)
    #>  [1] "expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap"                                                
    #>  [2] "who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https"           
    #>  [3] "rt oddly enough some republicans think climate change is real oddly enough httpstcomtlfx1mnuf uniteblue https"            
    #>  [4] "better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"              
    #>  [5] "i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"                  
    #>  [6] "why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl"                      
    #>  [7] "ted cruz climate change is not science its religion httpstco0qqtbofe0h via glennbeck"                                     
    #>  [8] "unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe"                                  
    #>  [9] "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc"       
    #> [10] "what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o"
    

    gsub() replaces every match of its first argument (a regular expression) found in its third argument with its second argument (see help("gsub")). Here, that means every character in our vector x that belongs to the class [[:punct:]] is replaced with the empty string "", which removes it.
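
    To make the argument order concrete, here is a minimal sketch on a single made-up string (the variable names tweet and cleaned are only for illustration and are not part of the original data):

```r
# Hypothetical one-line input containing only ASCII punctuation
tweet <- "it's 100% real, isn't it?"

# pattern = "[[:punct:]]", replacement = "", x = tweet
cleaned <- gsub("[[:punct:]]", "", tweet)

cleaned
#> [1] "its 100 real isnt it"
```

    Because the replacement is the empty string, each matched character is simply dropped; spaces, letters and digits are left alone.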

    What characters does that remove? From help("regex"):

    [:punct:]

        Punctuation characters:
        ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
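
    Note that this list contains only ASCII characters. Tweets often contain typographic characters such as the curly quotes in element 7 of x, which look similar but have different code points. A small sketch to check which character you actually have (the names straight and curly are illustrative):

```r
straight <- "'"        # U+0027, the ASCII apostrophe from the list above
curly    <- "\u2019"   # U+2019, right single quotation mark

# The two characters have different Unicode code points
utf8ToInt(straight)
#> [1] 39
utf8ToInt(curly)
#> [1] 8217

# The ASCII apostrophe is matched by [[:punct:]]
grepl("[[:punct:]]", straight)
#> [1] TRUE
```

    Whether [[:punct:]] also matches non-ASCII punctuation such as the curly apostrophe can depend on your locale, so it is worth testing on your own system.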

    Update

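    It appears this occurs because your apostrophes are like ’ (a curly, non-ASCII apostrophe) instead of like ' (the ASCII apostrophe). By default, tm::removePunctuation() only removes the ASCII [:punct:] characters listed above. So, if you want to stick with tm::removePunctuation(), you can set ucp = TRUE, which makes it use Unicode character properties to decide what counts as punctuation: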

    tm::removePunctuation(x, ucp = TRUE)
    #>  [1] "expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap"                                                
    #>  [2] "who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https"           
    #>  [3] "rt oddly enough some republicans think climate change is real oddly enough httpstcomtlfx1mnuf uniteblue https"            
    #>  [4] "better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"              
    #>  [5] "i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"                  
    #>  [6] "why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl"                      
    #>  [7] "ted cruz climate change is not science its religion httpstco0qqtbofe0h via glennbeck"                                     
    #>  [8] "unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe"                                  
    #>  [9] "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc"       
    #> [10] "what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o"
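
    If you would rather not depend on tm for this step, a roughly equivalent base R sketch is to match the Unicode punctuation property \p{P} with perl = TRUE. The helper name strip_unicode_punct and the sample string are assumptions made for illustration only:

```r
# Hypothetical helper: drop every character in Unicode category P (punctuation)
strip_unicode_punct <- function(text) {
  gsub("\\p{P}+", "", text, perl = TRUE)
}

# Sample text with curly quotes written as Unicode escapes
sample_text <- "ted cruz \u2018climate change is not science it\u2019s religion\u2019"

result <- strip_unicode_punct(sample_text)
result
#> [1] "ted cruz climate change is not science its religion"
```

    One caveat: \p{P} covers punctuation only, so ASCII symbols such as $, + and ~ (which Unicode classifies as symbols, category S) are kept. If you want those gone as well, a pattern like "[\\p{P}\\p{S}]+" may be closer to what [[:punct:]] removes.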
    
