Web scraping PDF files from HTML

臣服心动 2021-01-15 02:07

How can I scrape the PDF documents linked from an HTML page? I am using R, and so far I can only extract the text from the HTML. An example of the website that I am going to scrape is as follows.

1 answer
  • 2021-01-15 02:27

    When you say you want to scrape the PDF files from HTML pages, I think the first problem you face is to actually identify the location of those PDF files.

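    library(XML)
    library(RCurl)
    
    url <- "https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx"
    page   <- getURL(url)      # fetch the raw HTML as a string
    parsed <- htmlParse(page)  # parse it into a document tree
    # collect the href of every <a> element that actually has one,
    # so the result is a character vector rather than a list with NULLs
    links  <- xpathSApply(parsed, path="//a[@href]", xmlGetAttr, "href")
    # keep only the links that end in a literal ".pdf" (case-insensitive)
    inds   <- grep("\\.pdf$", links, ignore.case=TRUE)
    links  <- links[inds]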
    

    links contains all the URLs to the PDF-files you are trying to download.
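
    One caveat: hrefs scraped from a page are often site-relative (starting with /), and download.file() needs absolute URLs. The following is a minimal sketch of the filtering step plus that fix-up, using base R only. The hrefs vector and the example.org site root are made up for illustration; for a real site you would substitute its own scheme and host.

```r
# Hypothetical hrefs, standing in for what xpathSApply() might return.
hrefs <- c("/docs/report_2020.pdf",
           "https://example.org/a/Summary.PDF",
           "/pages/about.aspx",
           "/docs/pdf-guide.html")

# Match a literal ".pdf" at the very end of the string, ignoring case.
is_pdf    <- grepl("\\.pdf$", hrefs, ignore.case = TRUE)
pdf_links <- hrefs[is_pdf]

# Prefix site-relative links (those starting with "/") with the site root.
# The root below is an assumption made for this example.
site_root <- "https://example.org"
relative  <- startsWith(pdf_links, "/")
pdf_links[relative] <- paste0(site_root, pdf_links[relative])

print(pdf_links)
```

    Anchoring the pattern with $ keeps pages such as /docs/pdf-guide.html out of the result, which a looser pattern like "pdf" would let through.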

    Beware: many websites don't like it very much when you automatically scrape their documents, and you may get blocked.

    With the links in place, you can loop through them and download the files one by one, saving each in your working directory under the name given in destination. I decided to extract reasonable document names for your PDFs from the links themselves, taking the final piece after the last / in each URL:

    regex_match <- regexpr("[^/]+$", links, perl=TRUE)
    destination <- regmatches(links, regex_match)
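
    The regular expression above keeps everything after the last /, which can include a query string or percent-encoded characters. As a hedged alternative, the sketch below strips those first and then uses basename(). The sample_links vector is invented for illustration, and whether you want decoded names (which may contain spaces) is a matter of taste.

```r
# Hypothetical links, invented to show a query string and an encoded space.
sample_links <- c("https://example.org/docs/report%202020.pdf",
                  "https://example.org/files/q1.pdf?download=1")

# Drop any query string or fragment before taking the file name.
no_query <- sub("[?#].*$", "", sample_links)

# basename() returns the part after the last "/".
file_names <- basename(no_query)

# Decode percent-escapes such as %20; URLdecode() is applied element-wise.
file_names <- vapply(file_names, utils::URLdecode, character(1),
                     USE.NAMES = FALSE)

print(file_names)
```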
    

    To avoid overloading the website's servers, I have heard it is considered polite to pause between requests, so I use Sys.sleep() to wait a random time between 1 and 5 seconds after each download:

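    for(i in seq_along(links)){
      # mode="wb" writes the file as binary, so PDFs are not corrupted on Windows
      download.file(links[i], destfile=destination[i], mode="wb")
      # wait a random 1 to 5 seconds before the next request
      Sys.sleep(runif(1, 1, 5))
    }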
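
    A single broken link makes download.file() throw an error and stops the whole loop. One possible way around that, sketched below, is to wrap each download in tryCatch() and record which ones succeeded. The function name download_all and its fetch and pause arguments are hypothetical and not part of any package; fetch defaults to download.file and exists only so the loop can be tried out without touching the network.

```r
# Download each link to its destination, continuing past failures.
# Returns a logical vector: TRUE where the download succeeded.
download_all <- function(links, destination,
                         fetch = download.file, pause = c(1, 5)) {
  ok <- logical(length(links))
  for (i in seq_along(links)) {
    ok[i] <- tryCatch({
      fetch(links[i], destfile = destination[i], mode = "wb")
      TRUE
    }, error = function(e) {
      # Report the failure and carry on with the next link.
      message("Failed: ", links[i], " (", conditionMessage(e), ")")
      FALSE
    })
    # Random pause between requests, in seconds.
    Sys.sleep(runif(1, pause[1], pause[2]))
  }
  ok
}
```

    Calling ok <- download_all(links, destination) then leaves links[!ok] as the list of files to retry or inspect by hand.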
    