Extract text from search result URLs using R

礼貌的吻别 · 2020-12-22 09:58

I know R a bit, but I am not a pro. I am working on a text-mining project using R.

I searched the Federal Reserve website with a keyword, say 'inflation'. The second page of the search results has the URL https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation. I want to extract the text from each linked result page and save it as a .txt file.

2 Answers
  • 2020-12-22 10:12

    This is a basic idea of how to go about scraping these pages, though it might be slow in R if there are many pages to scrape. Now, your question is a bit ambiguous: you want the end results to be .txt files, but what about the web pages that link to PDFs? You can still use this code and change the file extension to .pdf for those pages; a rough sketch of saving the PDF links separately follows the code below.

     library(xml2)
     library(rvest)

     urll <- "https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"

     # Read the search page, pull the result links, drop duplicates,
     # then parse each linked page and write its <body> to a temp .txt file
     urll %>%
       read_html() %>%
       html_nodes("div#results a") %>%
       html_attr("href") %>%
       .[!duplicated(.)] %>%
       lapply(function(x) read_html(x) %>% html_nodes("body")) %>%
       Map(function(x, y) write_html(x, tempfile(y, fileext = ".txt"), options = "format"),
           ., paste("tmp", 1:length(.)))


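    As an aside on the PDF point above, here is only a rough sketch of how one might save the PDF links separately instead of writing them out as HTML. The .pdf suffix check and the file names are assumptions on my part; download.file() is base R and simply saves the raw file to disk:

     library(xml2)
     library(rvest)

     urll <- "https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"

     # Collect the de-duplicated result links as before
     allurls <- urll %>%
       read_html() %>%
       html_nodes("div#results a") %>%
       html_attr("href") %>%
       .[!duplicated(.)]

     # Links assumed to end in ".pdf" are treated as PDF documents
     pdfurls <- allurls[grepl("\\.pdf$", allurls, ignore.case = TRUE)]

     if (length(pdfurls) > 0) {
       # Save each PDF unchanged; mode = "wb" keeps the binary content intact
       pdffiles <- tempfile(paste0("pdf", seq_along(pdfurls)), fileext = ".pdf")
       Map(function(u, f) download.file(u, f, mode = "wb"), pdfurls, pdffiles)
     }
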
    This is the breakdown of the main scraping code. The URL you want to scrape from:

     urll <- "https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"
    

    Get all the URLs that you need:

      allurls <- urll %>%
        read_html() %>%
        html_nodes("div#results a") %>%
        html_attr("href") %>%
        .[!duplicated(.)]
    

    Where do you want to save your text files? Create the temp files:

     tmps <- tempfile(paste("tmp", 1:length(allurls)), fileext = ".txt")
    

    At this point, allurls is of class character. You have to read each URL into an XML document in order to be able to scrape it. Then, finally, write the parsed pages into the temp files created above:

      allurls %>%
        lapply(function(x) read_html(x) %>% html_nodes("body")) %>%
        Map(function(x, y) write_html(x, y, options = "format"), ., tmps)
    

    Please do not leave anything out. For example, after ..."format"), there is a period; take that into consideration. Your files have now been written to the temp directory. To find out where they are, just type the command tempdir() at the console and it will give you the location of your files. You can also change where the files are written by adjusting the tempfile() call when scraping.
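    For example, here is a minimal sketch of writing the same output to a folder of your own choosing instead of tempdir(), reusing allurls from above; the folder name is just a placeholder:

     # Hypothetical output folder instead of the session temp directory
     outdir <- "scraped_pages"
     dir.create(outdir, showWarnings = FALSE)

     # Build the target file names inside that folder
     outfiles <- file.path(outdir, paste0("tmp", seq_along(allurls), ".txt"))

     allurls %>%
       lapply(function(x) read_html(x) %>% html_nodes("body")) %>%
       Map(function(x, y) write_html(x, y, options = "format"), ., outfiles)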

    Hope this helps.

  • 2020-12-22 10:19

    Here you go. For the main search page, you can use a regular expression, as the URLs are easily identifiable in the source code.

    (with the help of https://statistics.berkeley.edu/computing/r-reading-webpages)

    library('RCurl')
    library('stringr')
    library('XML')
    
    # Read the raw HTML of the search results page
    pageToRead <- readLines('https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation')
    # Keep only the source lines that contain a result URL
    urlPattern <- 'URL: <a href="(.+)">'
    urlLines <- grep(urlPattern, pageToRead, value=TRUE)
    
    # Pull out each matched substring, then strip the surrounding markup
    getexpr <- function(s, g) substring(s, g, g + attr(g, 'match.length') - 1)
    gg <- gregexpr(urlPattern, urlLines)
    matches <- mapply(getexpr, urlLines, gg)
    result <- gsub(urlPattern, '\\1', matches)
    names(result) <- NULL
    
    
    # Loop over the result URLs, scrape the .htm pages and write out their text
    for (i in 1:length(result)) {
      subURL <- result[i]
    
      if (str_sub(subURL, -4, -1) == ".htm") {
        content <- readLines(subURL)
        doc <- htmlParse(content, asText=TRUE)
        # Keep only the visible text nodes (drop scripts, styles, forms)
        doc <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
        writeLines(doc, paste("inflationText_", i, ".txt", sep=""))
    
      }
    }
    

    However, as you probably noticed, this parses only the .htm pages. For the .pdf documents that are linked in the search results, I would advise you to have a look here: http://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
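    For what it is worth, here is a rough sketch along the lines of that tutorial, reusing the result vector from the code above; pdftools::pdf_text() returns one character string per page, and the file naming is an assumption on my part:

    library(pdftools)   # as used in the linked tutorial
    library(stringr)
    
    for (i in 1:length(result)) {
      subURL <- result[i]
    
      if (str_sub(subURL, -4, -1) == ".pdf") {
        # Download the PDF first, then extract its text page by page
        tmpPdf <- tempfile(fileext = ".pdf")
        download.file(subURL, tmpPdf, mode = "wb")
        pages <- pdf_text(tmpPdf)
        writeLines(pages, paste("inflationText_pdf_", i, ".txt", sep=""))
      }
    }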
