Scraping text in h3 and div tags using BeautifulSoup, Python

忘掉有多难 2020-12-31 20:51

I have no experience with Python, BeautifulSoup, Selenium, etc., but I'm eager to scrape data from a website and store it as a CSV file. A single sample of the data I need is coded

3 Answers
  • 2020-12-31 21:23

    Try this:

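    import csv

    import requests
    from bs4 import BeautifulSoup

    MAX = 2  # upper bound for the disabled loop below

    # CSV output and the loop are left disabled, as in the original:
    # with open("lg.csv", "a") as f:
    #     w = csv.writer(f)
    # for i in range(1, MAX + 1): ...
    url = "http://www.example_site.com"

    # requests.get returns a Response object; pass its text to BeautifulSoup
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")

    print(soup.text)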
    
  • 2020-12-31 21:25

    This worked quite nicely for me:

        #  -*- coding: utf-8 -*-
        # by Faguiro #
        # run using Python 3.8.6  on Linux#
        import requests
        from bs4 import BeautifulSoup
    
        # insert your site here
        url= input("Enter the url-->")
    
        #use requests
        r = requests.get(url)
        content = r.content
    
        #soup!
        soup = BeautifulSoup(content, "html.parser")
    
        # find all h3 tags in the soup
        headings = soup.find_all("h3")
    
        # print(headings) would show the raw tags
    
        # Pythonic iteration: print the stripped text of each heading
        for heading in headings:
            print(heading.text.strip())
    

    Dependencies: in a terminal (Linux):

    sudo apt-get install python3-bs4
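
A variation on the script above, offered only as a sketch: it separates parsing from downloading so the h3 extraction can be tried on an HTML string without network access. The function names `heading_texts` and `fetch_headings`, the 10-second timeout, and the sample markup are assumptions made for illustration, not part of the original answer.

```python
import requests
from bs4 import BeautifulSoup


def heading_texts(html):
    """Return the stripped text of every <h3> in the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [h3.get_text(strip=True) for h3 in soup.find_all("h3")]


def fetch_headings(url, timeout=10):
    """Download url and return its <h3> texts; raises on HTTP errors."""
    response = requests.get(url, timeout=timeout)  # timeout value is an assumption
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return heading_texts(response.text)


# Hypothetical sample markup, used here instead of a live site.
sample = "<div><h3> First </h3><p>skip</p><h3>\n Second\n</h3></div>"
print(heading_texts(sample))  # ['First', 'Second']
```

Calling `fetch_headings(url)` with a real address would do the same over the network.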

  • 2020-12-31 21:26

    You can use CSS selectors to find the data you need. In your case, div > h3 ~ div finds all div elements that are directly inside a div element and are preceded by an h3 sibling.

    import bs4
    
    page= """
    <div class="box effect">
    <div class="row">
    <div class="col-lg-10">
        <h3>HEADING</h3>
        <div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>
        <div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>
        <div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>
        <div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>
    </div>
    </div>
    </div>
    """
    
    soup = bs4.BeautifulSoup(page, 'lxml')
    
    # find all div elements that are directly inside a div element
    # and are preceded by an h3 sibling
    selector = 'div > h3 ~ div'
    
    # find elements that contain the data we want
    found = soup.select(selector)
    
    # Extract the text; &nbsp; is decoded to a non-breaking space,
    # which strip=True removes along with other surrounding whitespace
    data = [x.get_text(strip=True) for x in found]
    
    for x in data:
        print(x)
    

    Edit: To scrape the text in the heading:

    heading = soup.find('h3') 
    heading_data = heading.text
    print(heading_data)
    

    Edit: Or you can get the heading and the other div elements at once by using a selector like this: div.col-lg-10 > *. This finds all elements that are direct children of a div element with the col-lg-10 class.

    soup = bs4.BeautifulSoup(page, 'lxml')
    
    # find all elements inside a div element of class col-lg-10
    selector = 'div.col-lg-10 > *'
    
    # find elements that contain the data we want
    found = soup.select(selector)
    
    # Extract the text; &nbsp; is decoded to a non-breaking space,
    # which strip=True removes along with other surrounding whitespace
    data = [x.get_text(strip=True) for x in found]
    
    for x in data:
        print(x)
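
Since the question asks for a CSV file, the following sketch shows one way to turn each div.col-lg-10 block into a CSV row. The column names, the `output.csv` filename, and the helper names `extract_rows` and `write_csv` are hypothetical. It also assumes every block holds the h3 followed by the four divs in the same order as the sample, and it uses the built-in html.parser so that lxml is not required.

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample markup copied from the answer above; a real page would
# presumably repeat the div.col-lg-10 block once per record.
PAGE = """
<div class="box effect">
<div class="row">
<div class="col-lg-10">
    <h3>HEADING</h3>
    <div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>
    <div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>
    <div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>
    <div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>
</div>
</div>
</div>
"""

# Assumed column names; adjust them to the real data.
COLUMNS = ["heading", "name", "mobile", "number", "address"]


def extract_rows(html):
    """Return one list per div.col-lg-10 block: the h3 text, then each div's text."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.select("div.col-lg-10"):
        # recursive=False keeps only direct children, in document order.
        children = block.find_all(True, recursive=False)
        rows.append([child.get_text(strip=True) for child in children])
    return rows


def write_csv(rows, fileobj):
    """Write a header line followed by the extracted rows to an open text file."""
    writer = csv.writer(fileobj)
    writer.writerow(COLUMNS)
    writer.writerows(rows)


if __name__ == "__main__":
    # newline="" lets the csv module control line endings itself.
    with open("output.csv", "w", newline="", encoding="utf-8") as f:
        write_csv(extract_rows(PAGE), f)
```

Opening the file with "a" instead of "w" would append rows across runs, but the header line would then need to be written only once.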
    