Text Mining R Package & Regex to Replace Smart (Curly) Quotes

夕颜 2020-12-04 00:37

I've got a bunch of texts like the one below, with different smart quotes for both single and double quotes. With the packages I'm aware of, all I have managed to do is remove those

3 answers
  • 2020-12-04 00:56

    There's a function in {proustr} for normalizing punctuation, called pr_normalize_punc():

    https://github.com/ColinFay/proustr#pr_normalize_punc

    It turns:

     => ″‶«  »“”`´„“ into "
     => ՚ ’ into ' 
     => … into ...
    

    For example:

    library(proustr)
    a <- data.frame(text = "Il l՚a dit : « La ponctuation est chelou » !")
    pr_normalize_punc(a, text)
    # A tibble: 1 x 1
                                                text
    *                                          <chr>
    1 "Il l'a dit : \"La ponctuation est chelou\" !"
    

    For your text (the ‘ in don‘t is left as is, since U+2018 is not among the characters listed above):

    pr_normalize_punc(data.frame( text = "You don‘t get “your” money’s worth"), text)
    # A tibble: 1 x 1
                                        text
    *                                  <chr>
    1 "You don‘t get \"your\" money's worth"
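    If installing {proustr} is not an option, part of the mapping listed above can be approximated in base R. The sketch below is not proustr's own code: normalize_punc_base() is a hypothetical helper name, it covers only a subset of the listed characters, it also maps ‘ (U+2018), and it does not trim the spaces inside « » the way the proustr output above does.

    ```r
    # Hypothetical base R helper approximating part of the mapping above.
    # Unicode escapes are used so the code does not depend on file encoding.
    normalize_punc_base <- function(x) {
      # Double-quote-like characters: « » “ ” „ ″ ‶
      x <- gsub("[\u00ab\u00bb\u201c\u201d\u201e\u2033\u2036]", "\"", x)
      # Apostrophe-like characters: ‘ ’ and the Armenian apostrophe
      x <- gsub("[\u2018\u2019\u055a]", "'", x)
      # Horizontal ellipsis becomes three ASCII dots
      gsub("\u2026", "...", x, fixed = TRUE)
    }

    out <- normalize_punc_base("You don\u2018t get \u201cyour\u201d money\u2019s worth\u2026")
    cat(out, sep = "\n")
    # You don't get "your" money's worth...
    ```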
    

  • 2020-12-04 01:03

    Use two gsub operations: 1) to replace double curly quotes, 2) to replace single quotes:

    > text <- "You don‘t get “your” money’s worth"
    > gsub("[“”]", "\"", gsub("[‘’]", "'", text))
    [1] "You don't get \"your\" money's worth"
    

    See the online R demo. Tested on both Linux and Windows; it works the same on both.

    The [“”] construct is a positive character class that matches any single char defined in the class.
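    As a small illustration of how a character class behaves, the sketch below counts the characters matched by [“”] and contrasts it with the negated class [^“”]. The variable names are only illustrative, and Unicode escapes stand in for the literal quote characters.

    ```r
    text <- "You don\u2018t get \u201cyour\u201d money\u2019s worth"

    # The class matches one character at a time: either U+201C or U+201D
    m <- regmatches(text, gregexpr("[\u201c\u201d]", text))[[1]]
    n_matches <- length(m)
    n_matches
    # [1] 2

    # A negated class matches every character NOT listed, so deleting those
    # matches leaves only the two curly double quotes
    only_quotes <- gsub("[^\u201c\u201d]", "", text)
    ```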

    To normalize a wider range of characters that resemble single or double quotes, you might want to use

    > sngl_quot_rx = "[ʻʼʽ٬‘’‚‛՚︐]"
    > dbl_quot_rx = "[«»״“”„‟≪≫《》〝〞〟＂″‶]"
    > res = gsub(dbl_quot_rx, "\"", gsub(sngl_quot_rx, "'", `Encoding<-`(text, "UTF-8")))
    > cat(res, sep="\n")
    You don't get "your" money's worth
    

    Here, [«»״“”„‟≪≫《》〝〞〟＂″‶] matches

    «   00AB  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    »   00BB  RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    ״   05F4  HEBREW PUNCTUATION GERSHAYIM
    “   201C  LEFT DOUBLE QUOTATION MARK
    ”   201D  RIGHT DOUBLE QUOTATION MARK
    „   201E  DOUBLE LOW-9 QUOTATION MARK
    ‟   201F  DOUBLE HIGH-REVERSED-9 QUOTATION MARK
    ≪  226A  MUCH LESS-THAN
    ≫  226B  MUCH GREATER-THAN
    《  300A  LEFT DOUBLE ANGLE BRACKET
    》  300B  RIGHT DOUBLE ANGLE BRACKET
    〝  301D  REVERSED DOUBLE PRIME QUOTATION MARK
    〞  301E  DOUBLE PRIME QUOTATION MARK
    〟  301F  LOW DOUBLE PRIME QUOTATION MARK
    ＂  FF02  FULLWIDTH QUOTATION MARK
    ″   2033  DOUBLE PRIME
    ‶   2036  REVERSED DOUBLE PRIME
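    Many of these characters look almost identical on screen, so it can help to check which code points a string really contains before adding them to a character class. A minimal sketch using base R's utf8ToInt() and intToUtf8(); code_points() is a hypothetical helper name, not part of any package.

    ```r
    # Hypothetical helper: one row per character, with its code point in hex
    code_points <- function(s) {
      cp <- utf8ToInt(enc2utf8(s))
      data.frame(char = intToUtf8(cp, multiple = TRUE),
                 code = sprintf("%04X", cp),
                 stringsAsFactors = FALSE)
    }

    cp_table <- code_points("\u00ab\u201c\u201d\u2033")
    cp_table$code
    # [1] "00AB" "201C" "201D" "2033"
    ```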
    

    The [ʻʼʽ٬‘’‚‛՚︐] is used to normalize some chars similar to single quotes:

    ʻ  02BB  MODIFIER LETTER TURNED COMMA
    ʼ  02BC  MODIFIER LETTER APOSTROPHE
    ʽ  02BD  MODIFIER LETTER REVERSED COMMA
    ٬  066C  ARABIC THOUSANDS SEPARATOR
    ‘  2018  LEFT SINGLE QUOTATION MARK
    ’  2019  RIGHT SINGLE QUOTATION MARK
    ‚  201A  SINGLE LOW-9 QUOTATION MARK
    ‛  201B  SINGLE HIGH-REVERSED-9 QUOTATION MARK
    ՚   055A  ARMENIAN APOSTROPHE
    ︐  FE10  PRESENTATION FORM FOR VERTICAL COMMA
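    The two character classes can be wrapped in a small reusable function. This is only a sketch: normalize_quotes() is a hypothetical name, and the classes are written with Unicode escapes (taken from the code points in the two tables) so the source file's encoding does not matter.

    ```r
    # Hypothetical wrapper around the two character classes described above
    normalize_quotes <- function(x) {
      sngl <- "[\u02bb\u02bc\u02bd\u066c\u2018\u2019\u201a\u201b\u055a\ufe10]"
      dbl  <- paste0("[\u00ab\u00bb\u05f4\u201c\u201d\u201e\u201f\u226a\u226b",
                     "\u300a\u300b\u301d\u301e\u301f\uff02\u2033\u2036]")
      x <- enc2utf8(x)  # make sure the input is handled as UTF-8
      gsub(dbl, "\"", gsub(sngl, "'", x))
    }

    res <- normalize_quotes(c("\u00abbonjour\u00bb", "it\u2019s", "plain"))
    cat(res, sep = "\n")
    # "bonjour"
    # it's
    # plain
    ```

    gsub() is vectorized, so the same function works on a whole column of texts.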
    
  • 2020-12-04 01:08

    We can use gsub here for a base R option, replacing one curly-quoted term at a time.

    text <- "You don‘t get “your” money’s worth"
    new_text <- gsub("“(.*?)”", "\"\\1\"", text)
    new_text <- gsub("[‘’]", "'", new_text)
    new_text
    [1] "You don't get \"your\" money's worth"
    

    I have assumed here that your curly quotes are always balanced, i.e. they always wrap a word. If not, then you might have to do more work.

    Doing a blanket replacement of opening/closing double curly quotes may not play out as intended, if you want them to remain as is when not quoting a word.
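    To make the difference concrete, here is a small sketch with a made-up input in which one closing curly quote does not belong to a quoted word. The input string and variable names are assumptions for illustration only.

    ```r
    # A stray ” (used here like an inch mark) followed by a balanced “...” pair
    x <- "a 5\u201d pipe and a \u201cquoted\u201d word"

    # Balanced replacement: only a “...” pair is rewritten, the stray ” stays
    balanced <- gsub("\u201c(.*?)\u201d", "\"\\1\"", x, perl = TRUE)

    # Blanket replacement: every curly double quote becomes a straight one
    blanket <- gsub("[\u201c\u201d]", "\"", x)

    cat(balanced, blanket, sep = "\n")
    # a 5” pipe and a "quoted" word
    # a 5" pipe and a "quoted" word
    ```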

