Question:
I wish to be able to have htmlParse work well with Hebrew, but it keeps scrambling the Hebrew text in pages I feed into it.
For example:
# why can't I parse the Hebrew correctly?
library(RCurl)
library(XML)
u = "http://humus101.com/?p=2737"
a = getURL(u)
a # Here - the hebrew is fine.
a2 <- htmlParse(a)
a2 # Here it is a mess...
None of these seem to fix it:
htmlParse(a, encoding = "utf-8")
htmlParse(a, encoding = "iso8859-8")
This is my locale:
> Sys.getlocale()
[1] "LC_COLLATE=Hebrew_Israel.1255;LC_CTYPE=Hebrew_Israel.1255;LC_MONETARY=Hebrew_Israel.1255;LC_NUMERIC=C;LC_TIME=Hebrew_Israel.1255"
>
Any suggestions?
Answer 1:
Specify UTF-8 encoding in both the call to getURL and the call to htmlParse.
a <- getURL(u, .encoding = "UTF-8")
htmlParse(a, encoding = "UTF-8")
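To see what a correct parse looks like without depending on the network, here is a minimal offline sketch using an in-memory HTML string containing Hebrew (the string and the `id` attribute are invented for illustration). With an explicit encoding, htmlParse round-trips the Hebrew and xpathSApply can pull it back out intact:

```r
library(XML)

# A tiny stand-in for the fetched page, containing Hebrew text
html <- '<html><head><meta charset="UTF-8"/></head><body><p id="greet">שלום</p></body></html>'

# asText = TRUE tells htmlParse the argument is HTML content, not a file name;
# the explicit encoding stops libxml2 from guessing
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")
txt <- xpathSApply(doc, "//p[@id='greet']", xmlValue)
txt
```

If txt prints as readable Hebrew rather than `<U+...>` escapes, the parse is fine and any remaining garbling is a display/locale issue.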
These locale issues are always a pain to get to the bottom of. When I type cat(a) (after specifying UTF-8 encoding in getURL), I see that the he.wordpress.org page claims to be UTF-8 -- <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> -- but the Hebrew bits print as Unicode escapes, that is, they look like <U+05D3><U+05E6><U+05DE><U+05D1><U+05E8>. So it could be a problem caused by mixed encoding of that web page.
Comparing several encodings, the only one that doesn't generate gibberish on my machine is UTF-8.
(trees <- lapply(c("UTF-8", "UTF-16", "latin1"), function(enc)
{
a <- getURL(u, .encoding = enc)
htmlParse(a, encoding = enc)
}))
If you get desperate, pass iconvlist() to lapply in the above code and see whether any of the available encodings works for you.
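That brute-force search could be sketched as follows. This is an assumption-laden sketch, not a tested recipe: it fetches the same URL once per encoding, and wraps each attempt in tryCatch because many encodings in iconvlist() will simply error out.

```r
library(RCurl)
library(XML)

u <- "http://humus101.com/?p=2737"

# Try every encoding the local iconv supports; convert failures to NULL
trees <- lapply(iconvlist(), function(enc) {
  tryCatch({
    a <- getURL(u, .encoding = enc)
    htmlParse(a, encoding = enc)
  }, error = function(e) NULL)
})
names(trees) <- iconvlist()

# Keep only the encodings that parsed at all, then eyeball the survivors
trees <- Filter(Negate(is.null), trees)
names(trees)
```

Inspecting a few of the surviving trees with cat(xpathSApply(tree, "//title", xmlValue)) should quickly show which encoding renders the Hebrew correctly.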
Source: https://stackoverflow.com/questions/9061619/getting-htmlparse-to-work-with-hebrew