Question
I'm new to Python. I'm trying to scrape a page whose URL doesn't change when I move to the next page, with this code:
from time import sleep

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--headless")

path = r'F:\python latian\webdriver\chromedriver.exe'
driver = webdriver.Chrome(options=chrome_options, executable_path=path)

driver.get('http://lpse.maroskab.go.id/eproc4/lelang')
sleep(5)  # give the JavaScript-rendered table time to load

page = bs(driver.page_source, "html.parser")
code = page.find_all(class_="sorting_1")
for xx in code:
    codex = xx.contents[0]
    print(codex)
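As a side note, instead of the fixed sleep(5) the wait can be made explicit, so the script only continues once the table cells are actually present. This is just an alternative I'd consider, using Selenium's WebDriverWait with the same sorting_1 class as above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get('http://lpse.maroskab.go.id/eproc4/lelang')
# wait up to 15 s for the DataTables rows to be rendered instead of sleeping blindly
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, "sorting_1"))
)
page = bs(driver.page_source, "html.parser")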
But with this I can only scrape the first page. Then I found this link, but there the method is POST, while in my case the method is GET, so I changed the code to this:
import requests

my_session = requests.session()
# visit the listing page first so the session picks up the cookies it needs
for_cookies = my_session.get("http://lpse.maroskab.go.id/eproc4/lelang")
cookies = for_cookies.cookies
url='http://lpse.maroskab.go.id/eproc4/dt/lelang?draw=2&columns%5B0%5D%5Bdata%5D=0&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=true&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=1&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=2&columns%5B2%5D%5Bname%5D=&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=3&columns%5B3%5D%5Bname%5D=&columns%5B3%5D%5Bsearchable%5D=false&columns%5B3%5D%5Borderable%5D=false&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=4&columns%5B4%5D%5Bname%5D=&columns%5B4%5D%5Bsearchable%5D=true&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&order%5B0%5D%5Bcolumn%5D=0&order%5B0%5D%5Bdir%5D=desc&start=0&length=25&search%5Bvalue%5D=&search%5Bregex%5D=false&authenticityToken=ac27718c7a028e969867fd9efc0813c62033654b&_=1586612443966'
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}
# reuse the session that already holds the cookies from the first request
response = my_session.get(url, headers=header, cookies=cookies)
print(response)
With this URL, by changing the "length" parameter and opening it in the browser, I get the data I want, but when I run it from the code I only get "Response [403]". Is there any workaround to avoid this block?
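My current guess is that the /dt/lelang endpoint only answers requests that look like the browser's own AJAX call, so I was planning to retry with an X-Requested-With header and a Referer added. (The authenticityToken I copied from the browser may also not belong to the cookies my script gets, which could be another reason for the 403.) A sketch of what I mean, where the extra headers and the JSON layout are only my assumptions about how the server behaves:

# same my_session, url and cookies as above; the extra headers mimic the browser's AJAX request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",   # DataTables fetches its rows via AJAX
    "Referer": "http://lpse.maroskab.go.id/eproc4/lelang",
}
response = my_session.get(url, headers=headers, cookies=cookies)
print(response.status_code)
if response.ok:
    data = response.json()        # assuming the usual DataTables JSON with a "data" list
    for row in data["data"]:
        print(row[0])             # first column of each row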
Sorry for my English, and thank you.
Source: https://stackoverflow.com/questions/61179774/scraping-multiple-page-with-urllib-get-response-403