Question
I am new to Python and working on a scraping project. I use Firebug to copy the CSS path of the links I need. As a learning exercise, I am trying to collect the links under the "UPCOMING EVENTS" tab on http://kiascenehai.pk/.
I am looking for a fix for the code below, and also for suggestions on how to retrieve specific links using CSS selectors.
from bs4 import BeautifulSoup
import requests

url = "http://kiascenehai.pk/"
r = requests.get(url)
data = r.text
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data, "html.parser")
# Full CSS path as copied from Firebug
for link in soup.select("html body div.body-outer-wrapper div.body-wrapper.boxed-mode div.main-outer-wrapper.mt30 div.main-wrapper.container div.row.row-wrapper div.page-wrapper.twelve.columns.b0 div.row div.page-wrapper.twelve.columns div.row div.eight.columns.b0 div.content.clearfix section#main-content div.row div.six.columns div.small-post-wrapper div.small-post-content h2.small-post-title a"):
    print(link.get('href'))
Answer 1:
First of all, that page requires a city selection, which it stores in a cookie. Use a Session object to handle this:
s = requests.Session()
s.post('http://kiascenehai.pk/select_city/submit_city', data={'city': 'Lahore'})
response = s.get('http://kiascenehai.pk/')
Now the response contains the actual page content instead of a redirect to the city selection page.
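The session steps above can be wrapped in a small helper that also checks for HTTP errors. This is only a sketch: the name `fetch_home_page` is hypothetical, and it assumes the endpoint and the `city` form field shown above still behave the same way. It accepts any object with `post` and `get` methods, such as a `requests.Session`.

```python
def fetch_home_page(session, base_url, city):
    """Select a city, then fetch the home page with the same session.

    `session` is expected to behave like requests.Session, so the cookie
    set in response to the POST is sent again with the GET.
    """
    # Endpoint path and form field are taken from the answer above.
    post_response = session.post(base_url + 'select_city/submit_city',
                                 data={'city': city})
    post_response.raise_for_status()  # fail early on an HTTP error

    response = session.get(base_url)
    response.raise_for_status()
    return response.text
```

It could be called as `fetch_home_page(requests.Session(), 'http://kiascenehai.pk/', 'Lahore')`, and the returned HTML passed on to BeautifulSoup.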
Next, keep your CSS selector no larger than needed. In this page there isn't much to go on as it uses a grid layout, so we first need to zoom in on the right rows:
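# Parse the page fetched through the session above
soup = BeautifulSoup(response.text, 'html.parser')
# Find the "featured-event" header, then the first row that follows it
upcoming_events_header = soup.find('div', class_='featured-event')
upcoming_events_row = upcoming_events_header.find_next(class_='row')
# A short selector is enough once the search is limited to that row
for link in upcoming_events_row.select('h2 a[href]'):
    print(link['href'])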
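To try the same approach without hitting the live site, here is a minimal sketch that runs against an inline HTML string. The markup in `SAMPLE_HTML` is made up and only mimics the structure described above, and `upcoming_event_links` is a hypothetical helper name. The real page may be laid out differently.

```python
from bs4 import BeautifulSoup

# Made-up markup: a header row, the row of events, then an unrelated row.
SAMPLE_HTML = """
<div class="row"><div class="featured-event">UPCOMING EVENTS</div></div>
<div class="row">
  <div class="six columns"><h2 class="small-post-title"><a href="/events/1">One</a></h2></div>
  <div class="six columns"><h2 class="small-post-title"><a href="/events/2">Two</a></h2></div>
  <div class="six columns"><h2>No link here</h2></div>
</div>
<div class="row"><h2><a href="/other">Other section</a></h2></div>
"""

def upcoming_event_links(html):
    """Return the hrefs of h2 links in the row following the header."""
    soup = BeautifulSoup(html, 'html.parser')
    header = soup.find('div', class_='featured-event')
    if header is None:
        return []  # e.g. the city selection page came back instead
    row = header.find_next(class_='row')
    if row is None:
        return []
    # Only anchors that actually carry an href attribute are selected.
    return [a['href'] for a in row.select('h2 a[href]')]

links = upcoming_event_links(SAMPLE_HTML)
print(links)
```

Returning an empty list when the header is missing avoids an AttributeError on `None`, which is what the unguarded version would raise if the page structure changed.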
Answer 2:
This is the co-founder of KiaSceneHai.pk. Please don't scrape websites; a lot of effort goes into collecting the data. We offer access through our API, and you can use the contact form to request access. Thank you.
Source: https://stackoverflow.com/questions/24789094/css-selectors-to-be-used-for-scraping-specific-links