Beautiful Soup and extracting a div and its contents by ID

死守一世寂寞 2020-11-30 19:54
soup.find("tagName", { "id" : "articlebody" })

Why does this NOT return the <div id="articlebody"> ... </div> tags?
13 Answers
  • 2020-11-30 20:22

    An id is supposed to be unique within a document, so you can search on it directly without specifying the tag name. If your elements have ids, that makes them much easier to pick out when parsing.

    divEle = soup.find(id = "articlebody")
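
A minimal, self-contained sketch of the same lookup. The sample HTML and the variable names are made up for illustration (only the id "articlebody" comes from the question), and the built-in "html.parser" is assumed so that no extra parser is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; only the id "articlebody" is from the question.
html = """
<html><body>
  <div id="header">Site header</div>
  <div id="articlebody"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Search on the id alone; no tag name is required.
div_ele = soup.find(id="articlebody")

# find() returns None when nothing matches, so check before using the result.
if div_ele is not None:
    print(div_ele.name)        # tag name of the matched element
    print(div_ele.get_text())  # text of the div and everything inside it
```

If this prints nothing for a real page, the element is probably absent from the markup that was actually parsed, which is worth checking before changing the query.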
    
  • 2020-11-30 20:25

    Beautiful Soup 4 supports most CSS selectors through the .select() method, so you can use an id selector:

    soup.select('#articlebody')
    

    If you need to specify the element's type, you can add a type selector before the id selector:

    soup.select('div#articlebody')
    

    The .select() method returns a list of every matching element, so it gives the same results as the following .find_all() calls:

    soup.find_all('div', id="articlebody")
    # or
    soup.find_all(id="articlebody")
    

    If you only want a single element, use the .find() method, which returns the first match (or None if there is none):

    soup.find('div', id="articlebody")
    # or
    soup.find(id="articlebody")
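
To see how these calls relate, here is a small sketch run against a made-up snippet. The HTML and variable names are assumptions for illustration, and select_one() is not mentioned in the answer; it is included only as the single-element counterpart of select():

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one matching div and one non-matching div.
html = '<div id="articlebody"><p>Body text</p></div><div id="sidebar">Links</div>'
soup = BeautifulSoup(html, "html.parser")

selected = soup.select("div#articlebody")            # list of all CSS matches
found_all = soup.find_all("div", id="articlebody")   # list of all matches
single = soup.find("div", id="articlebody")          # first match, or None
single_css = soup.select_one("#articlebody")         # first CSS match, or None

print(len(selected), len(found_all))
print(selected[0] is found_all[0])   # both lists hold the same Tag object
print(single is single_css)
```

The list-returning calls need an index (or a loop) before you can read the tag, while find() and select_one() hand back the tag itself.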
    
  • 2020-11-30 20:26
    soup.find("tagName",attrs={ "id" : "articlebody" })
    
  • 2020-11-30 20:30
    from bs4 import BeautifulSoup
    from requests_html import HTMLSession
    
    url = 'your_url'
    session = HTMLSession()
    resp = session.get(url)
    
    # Render JavaScript only if the element with id "articlebody" is generated dynamically; otherwise skip this step
    resp.html.render()
    
    soup = BeautifulSoup(resp.html.html, "lxml")
    soup.find("div", {"id": "articlebody"})
    
  • 2020-11-30 20:33

    This happened to me too while trying to scrape Google.
    I ended up using pyquery instead.
    Install:

    pip install pyquery
    

    Use:

    from pyquery import PyQuery
    pq = PyQuery('<html><body><div id="articlebody"> ... </div></body></html>')
    tag = pq('div#articlebody')
    
  • 2020-11-30 20:35

    Have you tried soup.findAll("div", {"id": "articlebody"})? (In Beautiful Soup 4 the preferred spelling is find_all.)

    It sounds crazy, but if you're scraping pages from the wild, you can't rule out multiple divs with the same id.
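
A short sketch of that situation, using a made-up page that reuses the same id twice. The HTML and names are hypothetical, and find_all is used as the Beautiful Soup 4 spelling of findAll:

```python
from bs4 import BeautifulSoup

# Hypothetical malformed page: the same id appears on two divs.
html = """
<div id="articlebody">first copy</div>
<div id="articlebody">second copy</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every match, in document order.
matches = soup.find_all("div", {"id": "articlebody"})
print(len(matches))
for div in matches:
    print(div.get_text())

# find() would silently give you only the first one.
first = soup.find("div", {"id": "articlebody"})
```

Checking the length of the result is a cheap way to notice when a page breaks the "ids are unique" assumption.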
