Sanitize text for Mechanical Turk?

Submitted by 一世执手 on 2019-12-07 17:49:14

Question


Is there a pre-existing function to sanitize a data.frame's character columns for Mechanical Turk? Here's an example of a line that it's getting hung up on:

x <- "Duke\U3e32393cs B or C, no concomittant malignancy, ulcerative colitis, Crohn\U3e32393cs disease, renal, heart or liver failure"

I assume those are Unicode characters, but MT won't let me proceed with them in there. I can obviously regex them out easily enough, but I use MT a fair bit and was hoping for a more generic solution that removes all non-ASCII characters.
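For reference, one generic approach (a sketch, not from the original post) is a regex that keeps only printable ASCII. The `\u2019` below stands in for the problem character, since `\U3e32393c` is not a valid Unicode code point in current R:

```r
# A minimal sketch: keep only printable ASCII (space 0x20 through tilde 0x7E).
# \u2019 (curly apostrophe) stands in for the unreproducible character above.
x <- "Duke\u2019s B or C, ulcerative colitis, Crohn\u2019s disease"
gsub("[^ -~]", "", x)
# [1] "Dukes B or C, ulcerative colitis, Crohns disease"
```

Note this also strips legitimate accented characters, which may or may not be what you want.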

Edit

I can remove the encoding as follows:

> iconv(x,from="UTF-8",to="latin1",sub=".")
[1] "Duke......s B or C, no concomittant malignancy, ulcerative colitis, Crohn......s disease, renal, heart or liver failure"

But that still leaves me without a more generic solution for vectors whose elements use non-UTF-8 encodings.

> dput(vec)
c("Colorectal cancer patients Duke\U3e32393cs B or C, no concomittant malignancy, ulcerative colitis, Crohn\U3e32393cs disease, renal, heart or liver failure", 
"Patients with Parkinson\U3e32393cs Disease not already on levodopa", 
"hi")

Note that regular text has encoding "unknown", which has no conversion to "latin1", so simple solutions that use iconv fail. I have one attempt at a more nuanced solution below, but I'm not very happy with it.
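One way around the "unknown"-encoding snag (a sketch, not part of the original question) is to normalize everything with `enc2utf8()` first, since it treats unmarked strings as native-encoded, and then convert from UTF-8 in a single pass. The function name `strip_non_ascii` is made up for illustration:

```r
# Sketch: normalize mixed/unknown encodings to UTF-8, then drop anything
# with no ASCII representation. strip_non_ascii is a hypothetical name.
strip_non_ascii <- function(v) {
  iconv(enc2utf8(v), from = "UTF-8", to = "ASCII", sub = "")
}
# \u2019 stands in for the problematic character in the original vector.
strip_non_ascii(c("Crohn\u2019s disease", "hi"))
# [1] "Crohns disease" "hi"
```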


Answer 1:


I'm going to take a stab at answering my own question; hopefully someone has a better way, because I'm not convinced this will handle all funky text:

sanitize.text <- function(x) {
  stopifnot(is.character(x))
  sanitize.each.element <- function(elem) {
    ifelse(
      Encoding(elem)=="unknown",
      elem,
      iconv(elem,from=as.character(Encoding(elem)),to="latin1",sub="")
    )
  }
  x <- sapply(x, sanitize.each.element)
  names(x) <- NULL
  x
}

> sanitize.text(vec)
[1] "Colorectal cancer patients Dukes B or C, no concomittant malignancy, ulcerative colitis, Crohns disease, renal, heart or liver failure"
[2] "Patients with Parkinsons Disease not already on levodopa"                                                                              
[3] "hi"   

And a function to handle MT's other import quirks:

library(taRifx)
write.sanitized.csv <- function( x, file="", ... ) {
  sanitize.text <- function(x) {
    stopifnot(is.character(x))
    sanitize.each.element <- function(elem) {
      ifelse(
        Encoding(elem)=="unknown",
        elem,
        iconv(elem,from=as.character(Encoding(elem)),to="latin1",sub="")
      )
    }
    x <- sapply(x, sanitize.each.element)
    names(x) <- NULL
    x
  }
  x <- japply( df=x, sel=sapply(x,is.character), FUN=sanitize.text)
  colnames(x) <- gsub("[^a-zA-Z0-9_]", "_", colnames(x) )
  write.csv( x, file, row.names=FALSE, ... )
}
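The `colnames()` substitution above can be tried in isolation; anything outside letters, digits, and underscores becomes an underscore:

```r
# Sketch of the header-sanitizing step on a toy data.frame.
df <- data.frame(`A col!` = 1:2, check.names = FALSE)
colnames(df) <- gsub("[^a-zA-Z0-9_]", "_", colnames(df))
colnames(df)
# [1] "A_col_"
```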

Edit

For lack of a better place to put this code: you can figure out which elements of a character vector are causing problems that even the function above won't fix with something like:

#' Locate elements of a character vector that can't be processed
#' @param txt A character vector
#' @return A logical vector of length(txt): TRUE for problematic elements
locateBadString <- function(txt) {
  vapply(txt, function(x) {
    inherits(try(substr(x, 1, nchar(x)), silent = TRUE), "try-error")
  }, TRUE)
}

Edit2

I think that this should work:

iconv(x, to = "latin1", sub="")

Thanks to @Masoud in this answer: https://stackoverflow.com/a/20250920/636656



Source: https://stackoverflow.com/questions/11345212/sanitize-text-for-mechanical-turk
