Question
I am using Python 3.5 and trying to scrape a list of URLs (from the same website), code as follows:
import urllib.request
from bs4 import BeautifulSoup

url_list = ['URL1', 'URL2', 'URL3']

def soup():
    for url in url_list:
        sauce = urllib.request.urlopen(url)
        for things in sauce:
            soup_maker = BeautifulSoup(things, 'html.parser')
            return soup_maker
# Scraping
def getPropNames():
    for propName in soup.findAll('div', class_="property-cta"):
        for h1 in propName.findAll('h1'):
            print(h1.text)

def getPrice():
    for price in soup.findAll('p', class_="room-price"):
        print(price.text)

def getRoom():
    for theRoom in soup.findAll('div', class_="featured-item-inner"):
        for h5 in theRoom.findAll('h5'):
            print(h5.text)

for soups in soup():
    getPropNames()
    getPrice()
    getRoom()
So far, if I print soup, getPropNames, getPrice or getRoom, they seem to work. But I can't seem to get it to go through each of the URLs and print getPropNames, getPrice and getRoom.
I've only been learning Python for a few months, so I would greatly appreciate some help with this please!
Answer 1:
Just think about what this code does:
def soup():
    for url in url_list:
        sauce = urllib.request.urlopen(url)
        for things in sauce:
            soup_maker = BeautifulSoup(things, 'html.parser')
            return soup_maker
Let me show you an example:
def soup2():
    for url in url_list:
        print(url)
        for thing in ['a', 'b', 'c']:
            print(url, thing)
            maker = 2 * thing
            return maker
And the output for url_list = ['one', 'two', 'three'] is:
one
one a
Do you see now what is going on?
Basically, your soup function returns at the first return statement: it does not return an iterator or a list, only the first BeautifulSoup object. You are lucky (or not) that this object happens to be iterable :)
So change the code:
def soup3():
    soups = []
    for url in url_list:
        print(url)
        for thing in ['a', 'b', 'c']:
            print(url, thing)
            maker = 2 * thing
            soups.append(maker)
    return soups
And then the output is:
one
one a
one b
one c
two
two a
two b
two c
three
three a
three b
three c
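The same fix can also be written with yield instead of collecting into a list; a minimal sketch reusing the toy data above (soup4 is a made-up name for illustration):

```python
def soup4(url_list):
    # Yield each value as it is produced instead of appending to a list;
    # the caller can still loop over the results one by one.
    for url in url_list:
        for thing in ['a', 'b', 'c']:
            yield 2 * thing

print(list(soup4(['one', 'two'])))  # ['aa', 'bb', 'cc', 'aa', 'bb', 'cc']
```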
But I believe this will still not work :) Just wonder what is returned by sauce = urllib.request.urlopen(url), and what your code is actually iterating over in for things in sauce, i.e. what things is.
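To illustrate that question: urlopen() returns a file-like object, and iterating over it yields the body one raw bytes line at a time, so each things is a single line of HTML rather than the whole page. A sketch using io.BytesIO to stand in for the response, so no network is needed:

```python
import io

# Stand-in for the urlopen() response: a file-like object over bytes
fake_response = io.BytesIO(b"<html>\n<body>\n<p>hi</p>\n</body>\n</html>\n")

# Iterating yields one bytes line per step, just like `for things in sauce`
lines = list(fake_response)
print(lines[0])  # b'<html>\n' -- one line, not the full document

# Reading instead of iterating returns the whole body at once,
# which is what you would actually want to hand to BeautifulSoup
fake_response.seek(0)
whole_page = fake_response.read()
print(len(whole_page))
```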
Happy coding.
Answer 2:
Each of the get* functions uses a global variable soup which is not set correctly anywhere. Even if it were, it would not be a good approach. Make soup a function argument instead, e.g.:
def getRoom(soup):
    for theRoom in soup.findAll('div', class_="featured-item-inner"):
        for h5 in theRoom.findAll('h5'):
            print(h5.text)

for soup in soups():
    getPropNames(soup)
    getPrice(soup)
    getRoom(soup)
Secondly, soups() should yield instead of return, which turns it into a generator. Otherwise you would need to return a list of BeautifulSoup objects.
def soups():
    for url in url_list:
        sauce = urllib.request.urlopen(url)
        for things in sauce:
            soup_maker = BeautifulSoup(things, 'html.parser')
            yield soup_maker
I'd also suggest using XPath or CSS selectors to extract HTML elements: https://stackoverflow.com/a/11466033/2997179.
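As a sketch of the CSS-selector approach with BeautifulSoup's select() (the HTML fragment below is made up to mimic the class names in the question):

```python
from bs4 import BeautifulSoup

# A made-up fragment mimicking the structure implied by the question
html = """
<div class="property-cta"><h1>Sunny Flat</h1></div>
<p class="room-price">£500 pcm</p>
<div class="featured-item-inner"><h5>Double room</h5></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# One CSS selector replaces each pair of nested findAll() loops
prop_names = [h1.text for h1 in soup.select('div.property-cta h1')]
prices = [p.text for p in soup.select('p.room-price')]
rooms = [h5.text for h5 in soup.select('div.featured-item-inner h5')]
print(prop_names, prices, rooms)
```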
Source: https://stackoverflow.com/questions/42299268/scraping-a-list-of-urls