Question
Here is a snippet of code that I am trying to use to retrieve all the links from a website, given the URL of its homepage.
import requests
from BeautifulSoup import BeautifulSoup

url = "https://www.udacity.com"
response = requests.get(url)
page = str(BeautifulSoup(response.content))

def getURL(page):
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote

while True:
    url, n = getURL(page)
    page = page[n:]
    if url:
        print url
    else:
        break
The result is
/uconnect
#
/
/
/
/nanodegree
/courses/all
#
/legal/tos
/nanodegree
/courses/all
/nanodegree
uconnect
/
/course/machine-learning-engineer-nanodegree--nd009
/course/data-analyst-nanodegree--nd002
/course/ios-developer-nanodegree--nd003
/course/full-stack-web-developer-nanodegree--nd004
/course/senior-web-developer-nanodegree--nd802
/course/front-end-web-developer-nanodegree--nd001
/course/tech-entrepreneur-nanodegree--nd007
http://blog.udacity.com
http://support.udacity.com
/courses/all
/veterans
https://play.google.com/store/apps/details?id=com.udacity.android
https://itunes.apple.com/us/app/id819700933?mt=8
/us
/press
/jobs
/georgia-tech
/business
/employers
/success
#
/contact
/catalog-api
/legal
http://status.udacity.com
/sitemap/guides
/sitemap
https://twitter.com/udacity
https://www.facebook.com/Udacity
https://plus.google.com/+Udacity/posts
https://www.linkedin.com/company/udacity
Process finished with exit code 0
I want to get only the URL of the "About us" page of a website, which differs from site to site. For example:
for Udacity it is https://www.udacity.com/us
for artscape-inc it is https://www.artscape-inc.com/about-decorative-window-film/
I could try searching for keywords like "about" in the URLs, but as the Udacity case shows, that approach would miss some sites. Could anyone suggest a good approach?
Answer 1:
It would not be easy to cover every possible variation of an "About us" page link, but here is an initial idea that would work in both cases you've shown: check for "about" inside the href attribute and the text of <a> elements:
def about_links(elm):
    # Use .get() so that <a> elements without an href do not raise KeyError
    return elm.name == "a" and ("about" in elm.get("href", "").lower() or
                                "about" in elm.get_text().lower())
Usage:
soup.find_all(about_links) # or soup.find(about_links)
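To try the predicate end to end, here is a minimal sketch that assumes Python 3 and BeautifulSoup 4. The sample markup, the example.com base URL, and the helper name find_about_urls are made up for illustration. It also resolves relative links such as /us against the homepage URL with urllib.parse.urljoin, since that is the form the question asks for.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded homepage.
SAMPLE_HTML = """
<html><body>
  <a name="top">Top</a>
  <a href="/courses/all">Catalog</a>
  <a href="/us">About Us</a>
  <a href="/about-decorative-window-film/">Our story</a>
</body></html>
"""


def about_links(elm):
    """Match <a> elements whose href or visible text mentions "about"."""
    if elm.name != "a":
        return False
    href = elm.get("href") or ""
    return "about" in href.lower() or "about" in elm.get_text().lower()


def find_about_urls(html, base_url):
    """Return absolute URLs of candidate "about" links, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        urljoin(base_url, a["href"])
        for a in soup.find_all(about_links)
        if a.get("href")  # skip anchors that match by text but have no href
    ]


if __name__ == "__main__":
    for link in find_about_urls(SAMPLE_HTML, "https://www.example.com"):
        print(link)
```

Matching on the link text is what catches a case like Udacity's /us, where the href itself never mentions "about". The substring check is deliberately loose, so expect some false positives on real pages.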
To decrease the number of false positives, you can also check only the "footer" part of the page, e.g. find a footer element, an element with id="footer", or an element with a footer class.
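The footer restriction could be sketched roughly as follows, again assuming Python 3 and BeautifulSoup 4. The sample markup and the helper names find_footer and footer_about_links are hypothetical, and falling back to the whole page when no footer is found is my own assumption rather than part of the suggestion above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one "about" link in the navigation, one in the footer.
SAMPLE_HTML = """
<html><body>
  <div id="nav"><a href="/blog/about-our-new-course">About our new course</a></div>
  <div class="site-footer footer">
    <a href="/us">About Us</a>
    <a href="/jobs">Jobs</a>
  </div>
</body></html>
"""


def find_footer(soup):
    """Return the first element that looks like a footer, or None."""
    candidates = (
        soup.find("footer"),          # <footer> element
        soup.find(id="footer"),       # element with id="footer"
        soup.find(class_="footer"),   # element with a "footer" class
    )
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return None


def footer_about_links(html):
    """Return hrefs of "about" links, searching only the footer if one exists."""
    soup = BeautifulSoup(html, "html.parser")
    footer = find_footer(soup)
    scope = footer if footer is not None else soup  # assumed fallback: whole page
    return [
        a["href"]
        for a in scope.find_all("a", href=True)
        if "about" in a["href"].lower() or "about" in a.get_text().lower()
    ]


if __name__ == "__main__":
    print(footer_about_links(SAMPLE_HTML))
```

In this sample the navigation link to a blog post is ignored because the search is limited to the footer block.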
Another idea, which sort of "outsources" the definition of an "about us" page, would be to google (from your script, of course) "about" + "webpage url" and grab the first search result.
As a side note, I've noticed you are still using BeautifulSoup version 3. It is no longer developed or maintained, and you should switch to BeautifulSoup 4 as soon as possible. Install it via:
pip install --upgrade beautifulsoup4
And change your import to:
from bs4 import BeautifulSoup
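With BeautifulSoup 4 in place, the manual string search from the question is not strictly needed, because the links can be read straight from the parse tree. A minimal sketch, assuming Python 3; the function name extract_hrefs and the sample markup are made up for illustration:

```python
from bs4 import BeautifulSoup


def extract_hrefs(html):
    """Return the href of every <a> element that has one, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only anchors that actually carry an href attribute
    return [a["href"] for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    sample = '<a href="/us">About</a><a name="top">Top</a><a href="#">Menu</a>'
    for href in extract_hrefs(sample):
        print(href)
```

For a live page, the html argument would be the response.content value already fetched with requests in the question.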
Source: https://stackoverflow.com/questions/36816302/how-can-i-make-sure-that-i-am-on-about-us-page-of-a-particular-website