How to print paragraphs and headings simultaneously while scraping in Python?

坚强是说给别人听的谎言 posted on 2021-02-08 11:52:06

Question


I am a beginner in Python. I am currently using BeautifulSoup to scrape a website.

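import urllib.request
import bs4 as bs

url = ''  # my_url
source = urllib.request.urlopen(url)
soup = bs.BeautifulSoup(source, 'lxml')
match = soup.find('article', class_='xyz')
text = ''
# Only the <p> tags are collected here, so the headings are missing
for paragraph in match.find_all('p'):
    text += paragraph.text + "\n"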

My tag structure:

<article class="xyz" >
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>         
</article>


This is the output I get (only the paragraphs are extracted):

 efkl
 efkl
 efkl
 efkl

This is the output I want (the headings as well as the paragraphs):

 dr
 efkl
 dr
 efkl
 dr
 efkl
 dr
 efkl     

I want my output to contain the headings along with the paragraphs. How can I modify the code so that each heading appears before its paragraph, as in the original HTML?


Answer 1:


You can peel the same apple in different ways to serve the purpose. Here are a few of them:

Using .find_next():

from bs4 import BeautifulSoup

content="""
<article class="xyz" >
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>         
</article>
"""
soup = BeautifulSoup(content,"lxml")

for items in soup.find_all(class_="xyz"):
    data = '\n'.join(['\n'.join([item.text,item.find_next("p").text]) for item in items.find_all("h4")])
    print(data)

Using .find_previous_sibling():

for items in soup.find_all(class_="xyz"):
    data = '\n'.join(['\n'.join([item.find_previous_sibling("h4").text,item.text]) for item in items.find_all("p")])
    print(data)

Commonly used approach: passing several tag names to .find_all() as a list:

for items in soup.find_all(class_="xyz"):
    data = '\n'.join([item.text for item in items.find_all(["h4","p"])])
    print(data)

All three approaches produce the same result:

dr
efkl
dr
efkl
dr
efkl
dr
efkl
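
The approaches agree on the sample markup because every heading there is followed by exactly one paragraph. As a rough sketch of what may happen otherwise, the snippet below wraps the list-based lookup in a helper and runs it on markup where one heading has two paragraphs. The names headings_and_paragraphs, container_class and irregular are illustrative assumptions, not part of the original question, and "html.parser" is swapped in for "lxml" only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup


def headings_and_paragraphs(html, container_class="xyz"):
    """Return the text of every <h4> and <p> in the container, in document order."""
    # "html.parser" ships with Python; the original answer uses "lxml" instead.
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("article", class_=container_class)
    if container is None:
        return ""
    # A list of tag names makes find_all() match any of them, in document order.
    tags = container.find_all(["h4", "p"])
    return "\n".join(tag.get_text(strip=True) for tag in tags)


# Hypothetical markup: the first heading is followed by two paragraphs.
irregular = """
<article class="xyz">
<h4>first</h4>
<p>one</p>
<p>two</p>
<h4>second</h4>
<p>three</p>
</article>
"""

print(headings_and_paragraphs(irregular))

# For comparison: pairing each heading with find_next("p") keeps only the
# first paragraph after each heading, so "two" is skipped.
soup = BeautifulSoup(irregular, "html.parser")
paired = []
for heading in soup.find_all("h4"):
    paired.append(heading.get_text(strip=True))
    paired.append(heading.find_next("p").get_text(strip=True))
print("\n".join(paired))
```

With this input the helper keeps all three paragraphs, while the find_next() pairing drops the second one. The find_previous_sibling() variant would instead repeat the first heading before each of its two paragraphs, so the list-based lookup is probably the safest choice when the page layout is not strictly alternating.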


Source: https://stackoverflow.com/questions/49083438/how-to-print-paragraphs-and-headings-simultaneously-while-scraping-in-python
