Question
I'm working on a project that requires extracting all the links from a website. With this code I can get all the links from a single URL:
import requests
from bs4 import BeautifulSoup, SoupStrainer

source_code = requests.get('https://stackoverflow.com/')
soup = BeautifulSoup(source_code.content, 'lxml')

links = []
for link in soup.find_all('a'):
    links.append(str(link))
The problem is that if I want to extract all URLs, I have to write another for loop, and then another one, and so on. I want to extract all URLs that exist on this website and on its subdomains. Is there any way to do this without writing nested for loops? And even with nested for loops, I don't know how many I would need to get all the URLs.
Answer 1:
Wow, it took about 30 minutes to find a solution. I found a simple and efficient way to do this. As @αԋɱҽԃ-αмєяιcαη mentioned, if your website links to a big site like Google, the crawl won't stop until your memory is full of data, so there are steps you should consider:
- make a while loop that works through your website to extract all of the URLs
- use exception handling to prevent crashes
- remove duplicates and separate out the real URLs
- set a limit on the number of URLs, e.g. stop when 1000 URLs have been found
- stop the while loop to prevent your PC's memory from filling up
Here is a sample; it should work fine. I actually tested it, and it was fun for me:
import requests
from bs4 import BeautifulSoup
import re

source_code = requests.get('https://stackoverflow.com/')
soup = BeautifulSoup(source_code.content, 'lxml')
data = []
links = []

def remove_duplicates(l):  # keep only URL strings and skip ones already collected
    for item in l:
        match = re.search(r"(?P<url>https?://[^\s]+)", item)
        if match is not None and match.group("url") not in links:
            links.append(match.group("url"))

for link in soup.find_all('a', href=True):
    data.append(str(link.get('href')))

flag = True
remove_duplicates(data)

while flag:
    try:
        for link in links:
            source_code = requests.get(link)
            soup = BeautifulSoup(source_code.content, 'lxml')
            for j in soup.find_all('a', href=True):
                temp = [str(j.get('href'))]
                remove_duplicates(temp)
                if len(links) > 162:  # set a limit on the number of URLs
                    break
            if len(links) > 162:
                break
        if len(links) > 162:
            break
    except Exception as e:
        print(e)
        if len(links) > 162:
            break

for url in links:
    print(url)
and the output will be:
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f
https://stackoverflow.com/users/signup?ssrc=head&returnurl=%2fusers%2fstory%2fcurrent
https://stackoverflow.com
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
https://stackoverflow.com/users/signup?ssrc=site_switcher&returnurl=%2fusers%2fstory%2fcurrent
https://stackoverflow.com/users/login?ssrc=site_switcher&returnurl=https%3a%2f%2fstackoverflow.com%2f
https://stackexchange.com/sites
https://stackoverflow.blog
https://stackoverflow.com/legal/cookie-policy
https://stackoverflow.com/legal/privacy-policy
https://stackoverflow.com/legal/terms-of-service/public
https://stackoverflow.com/teams
https://stackoverflow.com/teams
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://www.g2.com/products/stack-overflow-for-teams/
https://www.g2.com/products/stack-overflow-for-teams/
https://www.fastcompany.com/most-innovative-companies/2019/sectors/enterprise
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/questions/55884514/what-is-the-incentive-for-curl-to-release-the-library-for-free/55885729#55885729
https://insights.stackoverflow.com/
https://stackoverflow.com
https://stackoverflow.com
https://stackoverflow.com/jobs
https://stackoverflow.com/jobs/directory/developer-jobs
https://stackoverflow.com/jobs/salary
https://www.stackoverflowbusiness.com
https://stackoverflow.com/teams
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/enterprise
https://stackoverflow.com/company/about
https://stackoverflow.com/company/about
https://stackoverflow.com/company/press
https://stackoverflow.com/company/work-here
https://stackoverflow.com/legal
https://stackoverflow.com/legal/privacy-policy
https://stackoverflow.com/company/contact
https://stackexchange.com
https://stackoverflow.com
https://serverfault.com
https://superuser.com
https://webapps.stackexchange.com
https://askubuntu.com
https://webmasters.stackexchange.com
https://gamedev.stackexchange.com
https://tex.stackexchange.com
https://softwareengineering.stackexchange.com
https://unix.stackexchange.com
https://apple.stackexchange.com
https://wordpress.stackexchange.com
https://gis.stackexchange.com
https://electronics.stackexchange.com
https://android.stackexchange.com
https://security.stackexchange.com
https://dba.stackexchange.com
https://drupal.stackexchange.com
https://sharepoint.stackexchange.com
https://ux.stackexchange.com
https://mathematica.stackexchange.com
https://salesforce.stackexchange.com
https://expressionengine.stackexchange.com
https://pt.stackoverflow.com
https://blender.stackexchange.com
https://networkengineering.stackexchange.com
https://crypto.stackexchange.com
https://codereview.stackexchange.com
https://magento.stackexchange.com
https://softwarerecs.stackexchange.com
https://dsp.stackexchange.com
https://emacs.stackexchange.com
https://raspberrypi.stackexchange.com
https://ru.stackoverflow.com
https://codegolf.stackexchange.com
https://es.stackoverflow.com
https://ethereum.stackexchange.com
https://datascience.stackexchange.com
https://arduino.stackexchange.com
https://bitcoin.stackexchange.com
https://sqa.stackexchange.com
https://sound.stackexchange.com
https://windowsphone.stackexchange.com
https://stackexchange.com/sites#technology
https://photo.stackexchange.com
https://scifi.stackexchange.com
https://graphicdesign.stackexchange.com
https://movies.stackexchange.com
https://music.stackexchange.com
https://worldbuilding.stackexchange.com
https://video.stackexchange.com
https://cooking.stackexchange.com
https://diy.stackexchange.com
https://money.stackexchange.com
https://academia.stackexchange.com
https://law.stackexchange.com
https://fitness.stackexchange.com
https://gardening.stackexchange.com
https://parenting.stackexchange.com
https://stackexchange.com/sites#lifearts
https://english.stackexchange.com
https://skeptics.stackexchange.com
https://judaism.stackexchange.com
https://travel.stackexchange.com
https://christianity.stackexchange.com
https://ell.stackexchange.com
https://japanese.stackexchange.com
https://chinese.stackexchange.com
https://french.stackexchange.com
https://german.stackexchange.com
https://hermeneutics.stackexchange.com
https://history.stackexchange.com
https://spanish.stackexchange.com
https://islam.stackexchange.com
https://rus.stackexchange.com
https://russian.stackexchange.com
https://gaming.stackexchange.com
https://bicycles.stackexchange.com
https://rpg.stackexchange.com
https://anime.stackexchange.com
https://puzzling.stackexchange.com
https://mechanics.stackexchange.com
https://boardgames.stackexchange.com
https://bricks.stackexchange.com
https://homebrew.stackexchange.com
https://martialarts.stackexchange.com
https://outdoors.stackexchange.com
https://poker.stackexchange.com
https://chess.stackexchange.com
https://sports.stackexchange.com
https://stackexchange.com/sites#culturerecreation
https://mathoverflow.net
https://math.stackexchange.com
https://stats.stackexchange.com
https://cstheory.stackexchange.com
https://physics.stackexchange.com
https://chemistry.stackexchange.com
https://biology.stackexchange.com
https://cs.stackexchange.com
https://philosophy.stackexchange.com
https://linguistics.stackexchange.com
https://psychology.stackexchange.com
https://scicomp.stackexchange.com
https://stackexchange.com/sites#science
https://meta.stackexchange.com
https://stackapps.com
https://api.stackexchange.com
https://data.stackexchange.com
https://stackoverflow.blog?blb=1
https://www.facebook.com/officialstackoverflow/
https://twitter.com/stackoverflow
https://linkedin.com/company/stack-overflow
https://creativecommons.org/licenses/by-sa/4.0/
https://stackoverflow.blog/2009/06/25/attribution-required/
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
Process finished with exit code 0
I set the limit to 162; you can increase it as much as you want and as much as your RAM allows.
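For reference, the same steps can be written without the nested break checks by keeping a queue of pages to visit and a set of collected URLs. This is only a sketch of the idea, not the original answer's code; MAX_URLS, the timeout and the 'http' filter are illustrative assumptions.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

MAX_URLS = 162                     # same idea as the limit above; adjust as needed
start_url = 'https://stackoverflow.com/'

seen = set()                       # collected URLs, so duplicates are skipped
queue = [start_url]                # pages that still need to be fetched

while queue and len(seen) < MAX_URLS:
    url = queue.pop(0)
    try:
        response = requests.get(url, timeout=10)
    except Exception as e:         # exception handling prevents crashes
        print(e)
        continue
    soup = BeautifulSoup(response.content, 'lxml')
    for a in soup.find_all('a', href=True):
        absolute = urljoin(url, a['href'])          # resolve relative links
        if absolute.startswith('http') and absolute not in seen:
            seen.add(absolute)
            queue.append(absolute)

for u in seen:
    print(u)

Using a set for the collected URLs also takes care of the duplicate removal automatically.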
Answer 2:
How's this?
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc

source_code = requests.get('https://stackoverflow.com/')
doc = SimplifiedDoc(source_code.content.decode('utf-8'))  # incoming HTML string
lst = doc.listA(url='https://stackoverflow.com/')  # get all links
for a in lst:
    if a['url'].find('stackoverflow.com') > 0:  # sub domains
        print(a['url'])
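If you would rather stay with the requests and BeautifulSoup combination from the question, a roughly equivalent sub-domain filter could look like the sketch below; the urljoin/urlparse handling and the netloc check are my assumptions about what should count as a subdomain.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

source_code = requests.get('https://stackoverflow.com/')
soup = BeautifulSoup(source_code.content, 'lxml')

for a in soup.find_all('a', href=True):
    full_url = urljoin('https://stackoverflow.com/', a['href'])  # resolve relative hrefs
    netloc = urlparse(full_url).netloc
    # keep stackoverflow.com itself and any of its subdomains
    if netloc == 'stackoverflow.com' or netloc.endswith('.stackoverflow.com'):
        print(full_url)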
You can also use this crawling framework, which can help you do many things:
from simplified_scrapy.spider import Spider, SimplifiedDoc
from simplified_scrapy.simplified_main import SimplifiedMain

class DemoSpider(Spider):
    name = 'demo-spider'
    start_urls = ['http://www.example.com/']
    allowed_domains = ['example.com/']

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        lstA = doc.listA(url=url["url"])
        return [{"Urls": lstA, "Data": None}]

SimplifiedMain.startThread(DemoSpider())
Answer 3:
Well, what you are asking for is actually possible, but it means an effectively infinite loop that will keep running until your memory goes BoOoOoOm.
Anyway, the idea is the following.
You will use for item in soup.findAll('a') and then item.get('href').
Add each URL to a set to get rid of duplicate URLs, and use an if condition with is not None to get rid of None objects.
Then keep looping over and over until your set of pending URLs becomes empty, checking something like len(urls).
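A minimal sketch of that idea might look like the following; the starting URL, the domain filter and the 200-URL safety cap are assumptions added for illustration.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

visited = set()
pending = {'https://stackoverflow.com/'}           # a set removes duplicate URLs

while len(pending) > 0 and len(visited) < 200:     # the 200 cap is an added safeguard
    url = pending.pop()
    visited.add(url)
    try:
        soup = BeautifulSoup(requests.get(url, timeout=10).content, 'lxml')
    except Exception:
        continue
    for item in soup.findAll('a'):
        href = item.get('href')
        if href is not None:                       # get rid of None objects
            href = urljoin(url, href)
            if href.startswith('https://stackoverflow.com') and href not in visited:
                pending.add(href)

for u in visited:
    print(u)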
Source: https://stackoverflow.com/questions/59347372/how-extract-all-urls-in-a-website-using-beautifulsoup