Web scraping pdf files from HTML

Question


How can I scrape the PDF documents from HTML? I am using R, and so far I can only extract the text from HTML. An example of the website that I want to scrape is as follows.

https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx

Regards


Answer 1:


When you say you want to scrape the PDF files from HTML pages, I think the first problem you face is to actually identify the location of those PDF files.

library(XML)
library(RCurl)

url <- "https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx"

# Fetch the page and parse the HTML
page   <- getURL(url)
parsed <- htmlParse(page)

# Collect the href attribute of every <a> tag, then keep only links ending in .pdf
links  <- xpathSApply(parsed, path="//a", xmlGetAttr, "href")
inds   <- grep("\\.pdf$", links)
links  <- links[inds]

links now contains all the URLs of the PDF files you are trying to download.

Beware: many websites do not take kindly to having their documents scraped automatically, and you may get blocked.
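One common courtesy, and a way to make your requests identifiable, is to send a descriptive User-Agent header when fetching the page. This is not part of the original answer; the header value below is a made-up example:

# Hypothetical example: identify your scraper via a User-Agent header
page <- getURL(url,
               httpheader = c("User-Agent" = "my-r-scraper/0.1 (contact: me@example.com)"))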

With the links in place, you can loop over them, download each file, and save it in your working directory under the name given in destination. I decided to derive reasonable document names for your PDFs from the links themselves, by extracting the final piece after the last / in each URL:

# Match everything after the last "/" in each link to use as a file name
regex_match <- regexpr("[^/]+$", links, perl=TRUE)
destination <- regmatches(links, regex_match)
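Note that, depending on the page, some of the extracted hrefs may be relative paths rather than absolute URLs. If that is the case, you would need to prepend the site root before downloading; a minimal sketch, assuming the root shown here is correct for this site:

# Hypothetical fix-up: prepend the site root to any relative links
links <- ifelse(grepl("^http", links),
                links,
                paste0("https://www.bot.or.th", links))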

To avoid overloading the website's servers, I have heard it is friendly to pause your scraping every once in a while, so I use Sys.sleep() to pause between downloads for a random time between 1 and 5 seconds:

for(i in seq_along(links)){
  # mode="wb" ensures the PDF is written as a binary file (important on Windows)
  download.file(links[i], destfile=destination[i], mode="wb")
  Sys.sleep(runif(1, 1, 5))  # pause 1-5 seconds between downloads
}
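The answer above stops at downloading. If your end goal is to extract the text from the downloaded PDFs, the way you already do with HTML, one option not covered in the original answer is the pdftools package:

# Hypothetical follow-up: read the text of each downloaded PDF
library(pdftools)
texts <- lapply(destination, pdf_text)  # one character vector per PDF, one element per page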


Source: https://stackoverflow.com/questions/46523977/web-scraping-pdf-files-from-html
