Question
I need to scrape PDFs from the website https://secc.gov.in/lgdStateList. There are three drop-down menus: one for the state, one for the district, and one for the block.
There are several states; each state contains districts, and each district contains blocks.
I tried to implement the following code. I was able to select the state, but there seems to be some error when I select the district.
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import time
from bs4 import BeautifulSoup

browser = webdriver.Chrome()
url = ("https://secc.gov.in/lgdStateList")
browser.get(url)
html_source = browser.page_source
browser.quit()

soup = BeautifulSoup(html_source, 'html.parser')
for name_list in soup.find_all(class_ ='dropdown-row'):
    print(name_list.text)

driver = webdriver.Chrome()
driver.get('https://secc.gov.in/lgdStateList')

selectState = Select(driver.find_element_by_id("lgdState"))
for state in selectState.options:
    state.click()
    selectDistrict = Select(driver.find_element_by_id("lgdDistrict"))
    for district in selectDistrict.options:
        district.click()
        selectBlock = Select(driver.find_element_by_id("lgdBlock"))
        for block in selectBlock.options():
            block.click()
The error I ran into is:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="lgdDistrict"]"}
(Session info: chrome=83.0.4103.106)
I need help crawling through the 3 menus.
Any help or suggestions would be really appreciated. Let me know in the comments if anything needs clarification.
Answer 1:
The state codes are the values of the options in the state dropdown. The district and block codes can be found the same way in their own dropdowns.
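As a rough sketch of how those values could be read programmatically, the snippet below parses a dropdown with BeautifulSoup and maps each option's label to its `value` attribute. The HTML fragment, the labels in it, and the helper name `option_values` are made up for illustration; only the `lgdState` id comes from the question, and the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a stand-in for the real dropdown, not copied from the site.
SAMPLE_HTML = """
<select id="lgdState">
  <option value="">Select State</option>
  <option value="10">STATE A</option>
  <option value="20">STATE B</option>
</select>
"""


def option_values(html, select_id):
    """Return {label: value} for every option of the given <select> that has a value."""
    soup = BeautifulSoup(html, "html.parser")
    options = soup.select(f"select#{select_id} option")
    return {
        opt.get_text(strip=True): opt["value"]
        for opt in options
        if opt.get("value")  # skip the empty "Select State" placeholder
    }


state_codes = option_values(SAMPLE_HTML, "lgdState")
print(state_codes)
```

The same helper could be pointed at the district and block dropdowns by passing their ids, provided their options are present in the HTML being parsed.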
Use those values in the payload of a POST request to get the table you want to grab data from:
import urllib3
import requests
from bs4 import BeautifulSoup

# Certificate verification is turned off below, so silence the resulting warning.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

link = "https://secc.gov.in/lgdGpList"

# Codes taken from the state, district and block dropdowns.
payload = {
    'stateCode': '10',
    'districtCode': '188',
    'blockCode': '1624'
}

r = requests.post(link, data=payload, verify=False)
soup = BeautifulSoup(r.text, "html.parser")

# Print every row of the results table with whitespace collapsed in each cell.
for items in soup.select("table#example tr"):
    data = [' '.join(item.text.split()) for item in items.select("th,td")]
    print(data)
Output the script produces:
['Select State', 'Select District', 'Select Block']
['', 'Select District', 'Select Block']
['ARARIA BASTI (93638)', 'BANGAMA (93639)', 'BANSBARI (93640)']
['BASANTPUR (93641)', 'BATURBARI (93642)', 'BELWA (93643)']
['BOCHI (93644)', 'CHANDRADEI (93645)', 'CHATAR (93646)']
['CHIKANI (93647)', 'DIYARI (93648)', 'GAINRHA (93649)']
['GAIYARI (93650)', 'HARIA (93651)', 'HAYATPUR (93652)']
['JAMUA (93653)', 'JHAMTA (93654)', 'KAMALDAHA (93655)']
['KISMAT KHAWASPUR (93656)', 'KUSIYAR GAWON (93657)', 'MADANPUR EAST (93658)']
['MADANPUR WEST (93659)', 'PAIKTOLA (93660)', 'POKHARIA (93661)']
['RAMPUR KODARKATTI (93662)', 'RAMPUR MOHANPUR EAST (93663)', 'RAMPUR MOHANPUR WEST (93664)']
['SAHASMAL (93665)', 'SHARANPUR (93666)', 'TARAUNA BHOJPUR (93667)']
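Each cell in the output above pairs a name with a code in brackets. A minimal sketch of separating the two with a regular expression follows. The helper name `split_name_and_code` is hypothetical, and the pattern assumes the code is always a run of digits inside trailing brackets.

```python
import re

# Matches "<name> (<digits>)" with the bracketed code at the end of the string.
CELL_PATTERN = re.compile(r"^(?P<name>.+?)\s*\((?P<code>\d+)\)\s*$")


def split_name_and_code(cell):
    """Return (name, code) for a cell like 'BANGAMA (93639)', or None if no code is found."""
    match = CELL_PATTERN.match(cell)
    if match is None:
        return None
    return match.group("name"), match.group("code")


# Two rows copied from the output above.
rows = [
    ['Select State', 'Select District', 'Select Block'],
    ['ARARIA BASTI (93638)', 'BANGAMA (93639)', 'BANSBARI (93640)'],
]

codes = {}
for row in rows:
    for cell in row:
        parsed = split_name_and_code(cell)
        if parsed:  # header cells carry no bracketed code and are skipped
            name, code = parsed
            codes[name] = code

print(codes)
```

Returning None for cells without a code means the header rows can be skipped without any special casing.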
You need to scrape the number in brackets next to each result above, add it to the payload as gpCode, and send another POST request to download the corresponding PDF file. Put the script in its own folder before running it so that all the downloaded files end up there.
import urllib3
import requests
from bs4 import BeautifulSoup

# Certificate verification is turned off below, so silence the resulting warning.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

link = "https://secc.gov.in/lgdGpList"
download_link = "https://secc.gov.in/downloadLgdwisePdfFile"

payload = {
    'stateCode': '10',
    'districtCode': '188',
    'blockCode': '1624'
}

r = requests.post(link, data=payload, verify=False)
soup = BeautifulSoup(r.text, "html.parser")

# Each download link's text looks like "NAME (12345)"; the bracketed number is the gpCode.
for item in soup.select("table#example td > a[onclick^='downloadLgdFile']"):
    gp_code = item.text.strip().split("(")[1].split(")")[0]
    payload['gpCode'] = gp_code
    # Save the response body as <gpCode>.pdf in the current working directory.
    with open(f'{gp_code}.pdf', 'wb') as f:
        f.write(requests.post(download_link, data=payload, verify=False).content)
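The advice about running the script from its own folder can also be handled in code by writing into a dedicated directory. The sketch below assumes a hypothetical folder name `secc_pdfs` and a hypothetical helper `save_pdf`; the bytes it writes are placeholder data standing in for the downloaded response content, and a temporary directory is used only to keep the demonstration tidy.

```python
import tempfile
from pathlib import Path


def save_pdf(content, gp_code, out_dir):
    """Write the bytes to <out_dir>/<gp_code>.pdf, creating the folder if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{gp_code}.pdf"
    target.write_bytes(content)
    return target


# Hypothetical output location; any writable path would do.
output_folder = Path(tempfile.mkdtemp()) / "secc_pdfs"

# Placeholder bytes; in the script above this would be the downloaded response content.
saved = save_pdf(b"%PDF-placeholder", "93638", output_folder)
print(saved)
```

With a helper like this, the files land in one known folder regardless of where the script is started from.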
Source: https://stackoverflow.com/questions/62500126/scraping-multiple-select-options-using-selenium