Sanitize text for Mechanical Turk?

Submitted by 一世执手 on 2019-12-07 17:49:14

Question


Is there a pre-existing function to sanitize a data.frame's character columns for Mechanical Turk? Here's an example of a line that it's getting hung up on:

x <- "Duke\U3e32393cs B or C, no concomittant malignancy, ulcerative colitis, Crohn\U3e32393cs disease, renal, heart or liver failure"

I assume those are Unicode characters, but MT won't let me proceed with them in there. I can obviously regex them out easily enough, but I use MT a fair bit and was hoping for a more generic solution that removes all non-ASCII characters.
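For reference, one generic approach (a sketch, not from the original post) is a regex that keeps only printable ASCII. The `\u2019` below stands in for the problem character, since `\U3e32393c` is not a valid Unicode code point in current R:

```r
# A minimal sketch: keep only printable ASCII (space 0x20 through tilde 0x7E).
# \u2019 (curly apostrophe) stands in for the unreproducible character above.
x <- "Duke\u2019s B or C, ulcerative colitis, Crohn\u2019s disease"
gsub("[^ -~]", "", x)
# [1] "Dukes B or C, ulcerative colitis, Crohns disease"
```

Note this also strips legitimate accented characters, which may or may not be what you want.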

Edit

I can remove the encoding as follows:

> iconv(x,from="UTF-8",to="latin1",sub=".")
[1] "Duke......s B or C, no concomittant malignancy, ulcerative colitis, Crohn......s disease, renal, heart or liver failure"

But that still leaves me without a more generic solution for vectors whose elements use non-UTF-8 encodings.

> dput(vec)
c("Colorectal cancer patients Duke\U3e32393cs B or C, no concomittant malignancy, ulcerative colitis, Crohn\U3e32393cs disease, renal, heart or liver failure", 
"Patients with Parkinson\U3e32393cs Disease not already on levodopa", 
"hi")

Note that regular text has encoding "unknown", which has no conversion to "latin1", so simple solutions that use iconv fail. I have one attempt at a more nuanced solution below, but I'm not very happy with it.
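One way around the "unknown"-encoding snag (a sketch, not part of the original question) is to normalize everything with `enc2utf8()` first, since it treats unmarked strings as native-encoded, and then convert from UTF-8 in a single pass. The function name `strip_non_ascii` is made up for illustration:

```r
# Sketch: normalize mixed/unknown encodings to UTF-8, then drop anything
# with no ASCII representation. strip_non_ascii is a hypothetical name.
strip_non_ascii <- function(v) {
  iconv(enc2utf8(v), from = "UTF-8", to = "ASCII", sub = "")
}
# \u2019 stands in for the problematic character in the original vector.
strip_non_ascii(c("Crohn\u2019s disease", "hi"))
# [1] "Crohns disease" "hi"
```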


Answer 1:


I'm going to take a stab at answering my own question; hopefully someone has a better way, because I'm not convinced this will handle all funky text:

sanitize.text <- function(x) {
  stopifnot(is.character(x))
  sanitize.each.element <- function(elem) {
    ifelse(
      Encoding(elem)=="unknown",
      elem,
      iconv(elem,from=as.character(Encoding(elem)),to="latin1",sub="")
    )
  }
  x <- sapply(x, sanitize.each.element)
  names(x) <- NULL
  x
}

> sanitize.text(vec)
[1] "Colorectal cancer patients Dukes B or C, no concomittant malignancy, ulcerative colitis, Crohns disease, renal, heart or liver failure"
[2] "Patients with Parkinsons Disease not already on levodopa"                                                                              
[3] "hi"   

And a function to handle MT's other import quirks:

library(taRifx)
write.sanitized.csv <- function( x, file="", ... ) {
  sanitize.text <- function(x) {
    stopifnot(is.character(x))
    sanitize.each.element <- function(elem) {
      ifelse(
        Encoding(elem)=="unknown",
        elem,
        iconv(elem,from=as.character(Encoding(elem)),to="latin1",sub="")
      )
    }
    x <- sapply(x, sanitize.each.element)
    names(x) <- NULL
    x
  }
  x <- japply( df=x, sel=sapply(x,is.character), FUN=sanitize.text)
  colnames(x) <- gsub("[^a-zA-Z0-9_]", "_", colnames(x) )
  write.csv( x, file, row.names=FALSE, ... )
}
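The `colnames()` substitution above can be tried in isolation; anything outside letters, digits, and underscores becomes an underscore:

```r
# Sketch of the header-sanitizing step on a toy data.frame.
df <- data.frame(`A col!` = 1:2, check.names = FALSE)
colnames(df) <- gsub("[^a-zA-Z0-9_]", "_", colnames(df))
colnames(df)
# [1] "A_col_"
```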

Edit

For lack of a better place to put this code: you can figure out which elements of a character vector are causing problems that even the function above won't fix with something like:

#' Locate elements of a character vector that can't be processed
#' @param txt A character vector
#' @return A logical vector of length(txt): TRUE for problematic elements
locateBadString <- function(txt) {
  vapply(txt, function(x) {
    inherits(try(substr(x, 1, nchar(x)), silent = TRUE), "try-error")
  }, TRUE)
}

Edit2

I think that this should work:

iconv(x, to = "latin1", sub="")

Thanks to @Masoud in this answer: https://stackoverflow.com/a/20250920/636656



Source: https://stackoverflow.com/questions/11345212/sanitize-text-for-mechanical-turk
