R: extracting “clean” UTF-8 text from a web page scraped with RCurl

旧巷少年郎 2020-12-01 14:50

Using R, I am trying to scrape a web page and save the text, which is in Japanese, to a file. Ultimately this needs to be scaled to tackle hundreds of pages on a daily basis.

2 Answers
  • 2020-12-01 15:13

    Hi, I have written a scraping engine that can scrape data from web pages that are deeply nested beneath a main listing page. I wonder whether it might be helpful to use it as an aggregator for your web data before importing it into R?

    The engine is located at http://ec2-204-236-207-28.compute-1.amazonaws.com/scrap-gm

    The sample parameter I created to scrape the page you had in mind is below.

    {
      origin_url: 'http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203',
      columns: [
        {
          col_name: 'links_name',
          dom_query: 'a'   
        }, {
          col_name: 'links',
          dom_query: 'a' ,
          required_attribute: 'href'
        }]
    };
    
  • 2020-12-01 15:17

    I seem to have found an answer and nobody else has yet posted one, so here goes.

    Earlier @kohske commented that the code worked for him once the Encoding() call was removed. That got me thinking that he probably has a Japanese locale, which in turn suggested that a locale issue on my machine somehow affects R, even though Perl avoids the problem. I recalibrated my search and found this question on sourcing a UTF-8 file, in which the original poster had run into a similar problem. The answer involved switching the locale. I experimented and found that switching my locale to Japanese seems to solve the problem, as this screenshot shows:

    [Screenshot: output from updated R code]

    Updated R code follows.

    library(RCurl)
    
    # Pages to fetch
    links <- c("http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203",
               "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201")
    
    # Remember the current character-type locale, then switch to Japanese
    original_ctype <- Sys.getlocale(category = "LC_CTYPE")
    print(original_ctype)
    Sys.setlocale("LC_CTYPE", "japanese")
    
    # Fetch the pages, treating the responses as UTF-8
    txt <- getURL(links, .encoding = "UTF-8")
    
    # Save the raw HTML as UTF-8, then restore the original locale
    write.table(txt, "c:/geturl_r.txt", quote = FALSE, row.names = FALSE, sep = "\t", fileEncoding = "UTF-8")
    Sys.setlocale("LC_CTYPE", original_ctype)
    

    So we have to programmatically mess around with the locale. Frankly I'm a bit embarrassed that we apparently need such a kludge for R on Windows in the year 2012. As I note above, Perl on the same version of Windows and in the same locale gets round the issue somehow, without requiring me to change my system settings.
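
    If the locale has to be switched anyway, one option is to wrap the switch in a small helper so that the original locale is restored even when the fetch fails. The following is only a sketch: `with_ctype` is a hypothetical name of my own, not something from RCurl or base R, and the demo uses the "C" locale purely because it exists on every platform.

    ```r
    # Hypothetical helper (not part of the code above): evaluate `expr` with
    # LC_CTYPE temporarily set to `locale`, then restore the previous value.
    with_ctype <- function(locale, expr) {
      old <- Sys.getlocale("LC_CTYPE")
      # Restore the previous locale on exit, even if `expr` raises an error.
      on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
      # Sys.setlocale() returns "" (with a warning) if the locale is unavailable.
      if (!nzchar(suppressWarnings(Sys.setlocale("LC_CTYPE", locale)))) {
        stop("locale not available on this system: ", locale)
      }
      # Lazy evaluation: `expr` is only evaluated here, under the new locale.
      force(expr)
    }

    # Demo with "C", which every platform provides; on Windows the code
    # above would pass "japanese" instead.
    inside <- with_ctype("C", Sys.getlocale("LC_CTYPE"))
    ```

    With this in place, the fetch could presumably be written as `txt <- with_ctype("japanese", getURL(links, .encoding = "UTF-8"))`, with no separate restore call at the end.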

    The output of the updated R code above is HTML, of course. For those interested, the following code succeeds fairly well in stripping out the HTML and saving raw text, although the result needs quite a lot of tidying up.

    library(RCurl)
    library(XML)
    
    # Pages to fetch
    links <- c("http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203",
               "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201")
    
    # Remember the current character-type locale, then switch to Japanese
    original_ctype <- Sys.getlocale(category = "LC_CTYPE")
    print(original_ctype)
    Sys.setlocale("LC_CTYPE", "japanese")
    
    txt <- getURL(links, .encoding = "UTF-8")
    
    # Parse the HTML and keep only the text nodes under <body>,
    # skipping anything inside <script>, <style> or <noscript>
    myhtml <- htmlTreeParse(txt, useInternalNodes = TRUE)
    cleantxt <- xpathApply(myhtml, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
    
    # xpathApply() returns a list with one element per text node
    write.table(cleantxt, "c:/geturl_r.txt", col.names = FALSE, quote = FALSE, row.names = FALSE, sep = "\t", fileEncoding = "UTF-8")
    Sys.setlocale("LC_CTYPE", original_ctype)
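
    As for the tidying up, a first pass might look something like the sketch below. It uses only base R; `tidy_text` is a hypothetical helper name, and `sample_nodes` is made-up data standing in for the list that xpathApply() returns, so treat it as an illustration rather than a tested recipe for the Yahoo pages.

    ```r
    # Hypothetical helper: flatten the list of text nodes and clean up whitespace.
    tidy_text <- function(nodes) {
      x <- unlist(nodes, use.names = FALSE)  # list of strings -> character vector
      x <- gsub("[[:space:]]+", " ", x)      # collapse whitespace runs, including newlines
      x <- trimws(x)                         # strip leading and trailing blanks
      x[nzchar(x)]                           # drop nodes that were whitespace only
    }

    # Made-up sample standing in for the xpathApply() result.
    sample_nodes <- list("\n  \u30c8\u30e8\u30bf\u81ea\u52d5\u8eca  ", "\n\t", "7203\n")
    clean <- tidy_text(sample_nodes)

    # Write one node per line as UTF-8 bytes, independent of the session locale.
    out <- tempfile(fileext = ".txt")
    con <- file(out, open = "wb")
    writeLines(enc2utf8(clean), con, useBytes = TRUE)
    close(con)
    ```

    Writing through a binary connection with `useBytes = TRUE` avoids re-encoding the strings into the native encoding on output; whether that removes the need for the locale switch when fetching is something I have not verified.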
    