Question
I am trying to parse this HTML to get the item title (e.g. Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW):
<div style="" class="">
<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about </span>Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW</h1>
<h2 id="subTitle" class="it-sttl">
Brand New + Free Shipping, Satisfaction Guaranteed! </h2>
<!-- DO NOT change linkToTagId="rwid" as the catalog response has this ID set -->
<div class="vi-hdops-three-clmn-fix">
<div style="" class="vi-notify-new-bg-wrapper">
<div class="vi-notify-new-bg-dTop" style=""> </div>
<div id="vi_notification_new" class="vi-notify-new-bg-dBtm" style="top: -28px;">
<img src="https://ir.ebaystatic.com/rs/v/tnj4p1myre1mpff12w4j1llndmc.png" width="11" height="12" class="vi-notify-new-img" alt="Popular">
<span style="font-weight:bold;">5 sold in last 24 hours</span>
</div>
</div>
</div>
</div>
I am using the following code to parse the page:
import requests
from bs4 import BeautifulSoup

url1 = "https://www.ebay.com/itm/Big-Boss-Air-Fryer-Healthy-1300-Watt-Super-Sized-16-Quart-Fryer-5-Colors-NEW/122454150244? epid=2254405949&hash=item1c82d60c64:m:mqfT2XbgveSevmN5MV1iysg"

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item in soup.findAll('h1', {'class':'it-ttl'}):
        print(item.string)  # prints None

get_single_item_data(url1)
When I do this, BeautifulSoup returns 'None'.
One solution I found is to use print(item.text) instead, but then I get 'Details about Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW' (I do not want 'Details about ').
Is there an efficient way to get the item title without having to get the text and then strip off the 'Details about ' prefix?
Answer 1:
This is because of this caveat of the .string attribute:
If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None
Since the header element contains multiple children (the span and the text node after it), .string cannot be determined and defaults to None.
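To make the caveat concrete, here is a minimal sketch that compares a heading with a single child against one with two children. The shortened markup and the variable names single and mixed are made up for illustration; only the .string and .text attributes come from the discussion above.

```python
from bs4 import BeautifulSoup

# One child only: a single text node.
single = BeautifulSoup("<h1>Title</h1>", "html.parser").h1

# Two children: a <span> followed by a text node, like the heading in the question.
mixed = BeautifulSoup("<h1><span>Details about </span>Title</h1>", "html.parser").h1

print(single.string)  # the lone text node
print(mixed.string)   # None, because it is ambiguous which child to return
print(mixed.text)     # every descendant string joined together
```
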
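To avoid having to cut off the "Details about" part, you can get the first text node in non-recursive mode, which only looks at direct children of the h1 and therefore skips the text inside the span: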
soup.find('h1', {'class':'it-ttl'}).find(text=True, recursive=False)
Demo:
In [3]: soup = BeautifulSoup(data, "html.parser")
In [4]: print(soup.find('h1', {'class':'it-ttl'}).find(text=True, recursive=False))
Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW
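For anyone who wants to run this without fetching the live page, the following is a self-contained sketch under a few assumptions: data is a trimmed copy of the markup from the question, get_title is a hypothetical helper name, and string=True is used as the newer spelling of the text=True argument (available since Beautiful Soup 4.4.0). The helper also returns None when the heading is missing, which the original demo does not handle.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (not fetched from the site).
data = """
<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about </span>Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW</h1>
"""

def get_title(html):
    """Return the first direct text node of the h1.it-ttl heading, or None."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1", {"class": "it-ttl"})
    if h1 is None:
        return None
    # recursive=False limits the search to direct children, so the text
    # nested inside the <span> is never considered.
    node = h1.find(string=True, recursive=False)
    return node.strip() if node is not None else None

title = get_title(data)
print(title)
```

Because only the direct children are scanned, this avoids building the full concatenated text and then slicing the prefix off.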
Answer 2:
You should use .text instead of .string:
from bs4 import BeautifulSoup
import requests

url1 = "https://www.ebay.com/itm/Big-Boss-Air-Fryer-Healthy-1300-Watt-Super-Sized-16-Quart-Fryer-5-Colors-NEW/122454150244? epid=2254405949&hash=item1c82d60c64:m:mqfT2XbgveSevmN5MV1iysg"

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text,'html.parser')
    for item in soup.findAll('h1', {'class':'it-ttl'}):
        print(item.text)  # Use item.text

get_single_item_data(url1)
Output:
Details about Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW
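The .text output still carries the 'Details about ' prefix that the question wanted to avoid. One possible way to keep the .text approach and drop the prefix is to remove the hidden span from the parsed tree first. This is only a sketch, not part of the original answer: the shortened html string is made up, and it assumes the prefix always lives in a span with class g-hdn, as in the markup from the question. Note that decompose() modifies the tree in place.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the heading in the question.
html = '<h1 class="it-ttl" id="itemTitle"><span class="g-hdn">Details about </span>Big Boss Air Fryer</h1>'

soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1", {"class": "it-ttl"})

# Remove the hidden prefix span from the tree, if it is present.
prefix = h1.find("span", {"class": "g-hdn"})
if prefix is not None:
    prefix.decompose()

# With the span gone, the remaining text is just the title.
title = h1.get_text(strip=True)
print(title)
```
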
Source: https://stackoverflow.com/questions/48079366/using-beautifulsoup-to-parse-string-efficiently