How can I grab PDF links from a website with a Python script

Front-end · Unresolved · 3 answers · 1372 views

我寻月下人不归 · 2020-12-29 00:41

Quite often I have to download PDFs from websites, but sometimes they are not all on one page. The links are split across paginated pages, and I have to click on every page of

3 Answers
  •  礼貌的吻别
    2020-12-29 01:22

    (Typing this on my phone, so it may not be very readable.)

    If you are going to grab things from a website that is all static pages, you can easily fetch the HTML with requests:

    import requests

    # .text gives the decoded HTML of the response
    page_content = requests.get(url).text
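Since the original question is about PDF links spread across paginated pages, here is a rough sketch of how that could look with requests plus the standard-library HTML parser. The `?page=N` query parameter and the function names are assumptions for illustration; check how the real site paginates in your browser first.

```python
import requests
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect href values of <a> tags that end in .pdf."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".pdf"):
                    # resolve relative links against the page URL
                    self.pdf_links.append(urljoin(self.base_url, value))


def pdf_links_from_html(html, base_url=""):
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links


def collect_pdf_links(listing_url, pages):
    # the ?page=N pattern is a guess -- the real site might use
    # /page/2/ or an offset parameter instead
    links = []
    for n in range(1, pages + 1):
        resp = requests.get(listing_url, params={"page": n}, timeout=10)
        links.extend(pdf_links_from_html(resp.text, listing_url))
    return links
```

With that in place you download each collected link with another `requests.get` and write the response body to a file.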
    

    But if you are grabbing from something like a social or forum site, there will be anti-scraping measures in place. (How to get past those obstacles becomes the real problem.)

    • First way: make your requests look more like a browser (a human). Add the headers (you can use Chrome DevTools or Fiddler to copy them) and post the form with the right fields — copy the way your browser posts the form. Get the cookies and add them to requests.
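A minimal sketch of that first way, using a requests session so cookies persist between calls. The header values, URLs, and form field names below are placeholders — copy the real ones from your own browser's DevTools Network tab for the target site.

```python
import requests

# Example headers only -- copy the real values from DevTools
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://example.com/login",
}

session = requests.Session()      # keeps cookies across requests
session.headers.update(headers)

# Replicate the login form the browser posts; field names vary per site
login_form = {"username": "me", "password": "secret"}
# session.post("https://example.com/login", data=login_form)

# After logging in, the session sends its cookies automatically:
# page = session.get("https://example.com/protected/list?page=1")
```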

    • Second way: use Selenium and a browser driver. Selenium drives a real browser (in my case, chromedriver). Remember to add chromedriver to the PATH, or load the driver executable in code: driver = webdriver.Chrome(path) — I'm not sure that's the exact setup code.

      driver.get(url) really visits the URL in a browser, so it lowers the difficulty of grabbing things.

      Get the page HTML: page = driver.page_source

      Some websites redirect through several pages, which can cause errors. Make your script wait for a certain element to show up:

      try:
          certain_element = WebDriverWait(driver, 10).until(
              EC.presence_of_element_located((By.ID, 'youKnowThereIsAnElementId')))
      except TimeoutException:
          pass  # the element never appeared

      Or use an implicit wait, with whatever timeout you like:

    driver.implicitly_wait(5)  # seconds
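Putting the Selenium steps above together, a hedged sketch of one function that opens the page, waits, and pulls out the PDF links. The element id and function name are placeholders for your own site; it needs `pip install selenium` and chromedriver on the PATH. The imports sit inside the function so the sketch can be read without a browser installed.

```python
def collect_pdf_links_with_browser(url, element_id="content", timeout=10):
    """Open `url` in a real Chrome window, wait until the element with
    id=`element_id` appears, then return every link ending in .pdf.

    `element_id` is a placeholder -- use an id you know exists on the
    target page.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()  # chromedriver must be on the PATH
    try:
        driver.get(url)
        # explicit wait: block until the element shows up (or time out)
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, element_id)))
        links = [a.get_attribute("href")
                 for a in driver.find_elements(By.TAG_NAME, "a")]
        return [h for h in links if h and h.lower().endswith(".pdf")]
    finally:
        driver.quit()
```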

    And you can control the page through WebDriver. I won't describe that here — you can look up the module's documentation.
