Converting a \u escaped Unicode string to ASCII

Question

After reading all about iconv and Encoding, I am still confused.

I am scraping the source of a web page and have a string that looks like this: 'pretty\u003D\u003Ebig' (displayed in the R console as 'pretty\\u003D\\u003Ebig'). I want to convert this to the ASCII string, which should be 'pretty=>big'.

More simply, if I set

x <- 'pretty\\u003D\\u003Ebig'

How do I perform a conversion on x to yield pretty=>big?

Any suggestions?

Answer 1:

Use parse, but don't evaluate the results:

x1 <- 'pretty\\u003D\\u003Ebig'
x2 <- parse(text = paste0("'", x1, "'"))
x3 <- x2[[1]]
x3
# [1] "pretty=>big"
is.character(x3)
# [1] TRUE
length(x3)
# [1] 1

Answer 2:

With the stringi package:

> x <- 'pretty\\u003D\\u003Ebig'
> stringi::stri_unescape_unicode(x)
[1] "pretty=>big"

Answer 3:

Although I have accepted Hong Ooi's answer, I can't help thinking that parse and eval is a heavyweight solution. Also, as pointed out, it is not secure, although for my application I can be confident that I will not get dangerous quotes.

So, I have devised an alternative, somewhat brutal, approach:

udecode <- function(string){
  uconv <- function(chars) intToUtf8(strtoi(chars, 16L))
  ufilter <- function(string) {
    if (substr(string, 1, 1)=="|") uconv(substr(string, 2, 5)) else string
  }
  string <- gsub("\\\\u([[:xdigit:]]{4})", ",|\\1,", string, perl=TRUE)
  strings <- unlist(strsplit(string, ","))
  string <- paste(sapply(strings, ufilter), collapse='')
  return(string)
}

Any simplifications welcomed!
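
One possible simplification, offered as a sketch rather than a drop-in replacement: let gregexpr locate the escapes and assign their decoded values back through regmatches. The name udecode2 is made up here. Unlike the version above, it does not rely on "," and "|" as markers, so commas in the input survive. It assumes each escape is exactly four hex digits and makes no attempt to combine surrogate pairs.

```r
# Hypothetical alternative to udecode(); base R only.
udecode2 <- function(string) {
  # Find every literal \uXXXX (backslash, 'u', four hex digits).
  m <- gregexpr("\\\\u[[:xdigit:]]{4}", string)
  # Swap each match for the character with that code point.
  regmatches(string, m) <- lapply(regmatches(string, m), function(esc) {
    # Drop the leading "\u", read the rest as hexadecimal.
    intToUtf8(strtoi(substring(esc, 3), 16L), multiple = TRUE)
  })
  string
}

udecode2('pretty\\u003D\\u003Ebig')
# [1] "pretty=>big"
```

Because gregexpr and regmatches are vectorised, the same function can be applied to a character vector of scraped strings.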

Answer 4:

A use for eval(parse)!

eval(parse(text=paste0("'", x, "'")))

This has its own problems of course, such as having to manually escape any quote marks within the string. But it should work for any valid Unicode sequences that may appear.
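
The quoting problem is easy to reproduce. In the sketch below, the helper name try_unescape and the second sample string are made up for illustration; the helper returns NA instead of stopping when the constructed text fails to parse.

```r
# Hypothetical helper: try the parse-based unescape, return NA on a parse error.
try_unescape <- function(x) {
  tryCatch(
    eval(parse(text = paste0("'", x, "'"))),
    error = function(e) NA_character_
  )
}

try_unescape('pretty\\u003D\\u003Ebig')  # parses cleanly
try_unescape("it's\\u003Dbig")           # the embedded ' ends the literal early, so parse() fails
```

This only shows the failure mode. It does not make the approach safe for untrusted input.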

Answer 5:

I sympathise; I have struggled with R and Unicode text in the past, and not always successfully. If your data is in x, then first try a global replace, something like this:

x <- gsub("\\u003D\\u003E", "=>", x, fixed = TRUE)  # match the literal escape text, not "="

I sometimes use a construction like

lapply(x, utf8ToInt)

to see where the high code points are e.g. anything over 150. This helps me locate problems caused by non-breaking spaces, for example, which seem to pop up every now and again.
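
A small sketch of that inspection step, using made-up sample strings. The cut-off of 127 (the top of the ASCII range) is an assumption chosen for this example rather than the figure mentioned above.

```r
# Sample input: the second string contains a non-breaking space (U+00A0).
x <- c("plain text", "non\u00A0breaking")

# One integer vector of code points per string.
codes <- lapply(x, utf8ToInt)

# Positions of characters outside the ASCII range in each string.
high <- lapply(codes, function(cp) which(cp > 127))
high
```

An empty result for a string means every character in it is plain ASCII.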

Answer 6:

> iconv('pretty\u003D\u003Ebig', "UTF-8", "ASCII")
[1] "pretty=>big"

But you appear to have an extra escape, so your string contains a literal backslash rather than the character itself.
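
To illustrate the difference, here is a quick check (a sketch using the string from the question): when the backslashes are literal, the input is already pure ASCII, so iconv has nothing to convert and hands it back unchanged.

```r
x <- 'pretty\\u003D\\u003Ebig'    # literal backslashes, as in the question
y <- iconv(x, "UTF-8", "ASCII")

identical(x, y)   # TRUE: the escape text is untouched
```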

Answer 7:

The trick here is that '\\u003D' is actually 6 characters while you want '\u003D' which is only one character. The further trick is that to match those backslashes you need to use doubly escaped backslashes in the pattern:

gsub("\\\\u003D\\\\u003E", "\u003D\u003E", x)
#[1] "pretty=>big"
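
The character counts can be checked directly with nchar. A minimal sketch; the variable names are chosen only for this example.

```r
escaped   <- '\\u003D'  # backslash, u, 0, 0, 3, D
unescaped <- '\u003D'   # R's parser turns this into "=" when reading the literal

nchar(escaped)             # 6
nchar(unescaped)           # 1
identical(unescaped, "=")  # TRUE
```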

To replace multiple characters with one character you need to target the entire pattern. You cannot simply delete a backslash. (Since you have indicated this is a more general problem, I think the answer might lie in modifications to your as yet undescribed method for downloading this text.)

When I load your functions and the dependencies, this code works:

> freq <- ngram(c('pretty\u003D\u003Ebig'), year_start = 1950)
> 
> str(freq)
'data.frame':   59 obs. of  4 variables:
 $ Year     : num  1950 1951 1952 1953 1954 ...
 $ Phrase   : Factor w/ 1 level "pretty=>big": 1 1 1 1 1 1 1 1 1 1 ...
 $ Frequency: num  1.52e-10 6.03e-10 5.98e-10 8.27e-10 8.13e-10 ...
 $ Corpus   : Factor w/ 1 level "eng_2012": 1 1 1 1 1 1 1 1 1 1 ...

(So I guess I am still not clear on the use case.)

Source: https://stackoverflow.com/questions/17761858/converting-a-u-escaped-unicode-string-to-ascii

Tags

unicode

text-processing

iconv

unicode-string