Beautiful Soup and extracting a div and its contents by ID

死守一世寂寞 2020-11-30 19:54
soup.find("tagName", { "id" : "articlebody" })

Why does this NOT return the <div id="articlebody"> ... </div> tags?
13 answers
  • 2020-11-30 20:15

    Here is a code fragment

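    from bs4 import BeautifulSoup
    # open the file and hand the file object to the parser
    with open("index.html") as f:
        soup = BeautifulSoup(f, "html.parser")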
    titleList = soup.findAll('title')
    divList = soup.findAll('div', attrs={ "class" : "article story"})
    

    As you can see, I find all <title> tags and then all <div> tags with class="article story".
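
    A small sketch of how a two-word class value is matched, assuming bs4 with the built-in "html.parser". The markup is hypothetical and only for illustration. Passing the string "article story" compares against the whole class attribute, so the order of the class names matters, while a CSS selector requires both classes in any order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only.
html = """
<div class="article story">first</div>
<div class="story article">second</div>
<div class="article">third</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact string match against the whole class attribute value.
exact = soup.find_all("div", attrs={"class": "article story"})

# A single class name matches every div that carries that class.
any_article = soup.find_all("div", class_="article")

# CSS selector: both classes are required, order is irrelevant.
both = soup.select("div.article.story")

exact_texts = [d.get_text() for d in exact]
any_texts = [d.get_text() for d in any_article]
both_texts = [d.get_text() for d in both]
print(exact_texts, any_texts, both_texts)
```

    If the class names can appear in either order on the page, the selector form is probably the safer choice.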

  • 2020-11-30 20:17

    I think there is a problem when 'div' tags are nested too deeply. I am trying to parse some contacts from a Facebook HTML file, and BeautifulSoup is not able to find "div" tags with class "fcontent".

    This happens with other classes as well. When I search for divs in general, it returns only those that are not so deeply nested.

    The HTML source can be any Facebook page showing the friends list of a friend of yours (not your own friends list). If someone can test it and give some advice, I would really appreciate it.

    This is my code, where I just try to print the number of "div" tags with class "fcontent":

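    # BeautifulSoup 3 import; with bs4 this would be: from bs4 import BeautifulSoup
    from BeautifulSoup import BeautifulSoup
    with open('/Users/myUserName/Desktop/contacts.html') as f:
        soup = BeautifulSoup(f)
    # named div_list to avoid shadowing the built-in `list`
    div_list = soup.findAll('div', attrs={'class':'fcontent'})
    print(len(div_list))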
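
    One way to check whether nesting depth alone is the cause is to build a deeply nested document and count the matches. This is only a sketch, assuming bs4 with the built-in "html.parser"; the helper `nested_html` and the depth of 50 are made up for the test and say nothing about how a real Facebook page is structured.

```python
from bs4 import BeautifulSoup


def nested_html(depth, inner_class="fcontent"):
    """Build `depth` plain divs wrapped around one div with class `inner_class`."""
    inner = f'<div class="{inner_class}">x</div>'
    return "<div>" * depth + inner + "</div>" * depth


soup = BeautifulSoup(nested_html(50), "html.parser")

# Same search as in the answer above, written with the bs4 method name.
matches = soup.find_all("div", attrs={"class": "fcontent"})
count = len(matches)

# Total number of divs: the 50 wrappers plus the innermost one.
all_divs = len(soup.find_all("div"))
print(count, all_divs)
```

    If the innermost div is found here but not in the real file, malformed markup or the choice of parser is a more likely suspect than the nesting itself.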
    
  • 2020-11-30 20:17

    I used:

    soup.findAll('tag', attrs={'attrname':"attrvalue"})
    

    as my syntax for find/findAll. That said, unless there are other optional parameters between the tag name and the attribute dict, this shouldn't behave any differently from passing the dict positionally.

  • 2020-11-30 20:19

    To find an element by its id:

    div = soup.find(id="articlebody")
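
    To get the div together with its contents, something like the following may help. It is a minimal sketch assuming bs4 and the built-in "html.parser"; the sample markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented sample markup, for illustration only.
html = '<html><body><div id="articlebody"><p>Hello <b>world</b></p></div></body></html>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find(id="articlebody")

# find() returns None when nothing matches, so check before using the result.
if div is None:
    raise SystemExit("no element with id='articlebody'")

outer = str(div)               # the div tag itself plus everything inside it
inner = div.decode_contents()  # only the markup inside the div
text = div.get_text()          # only the text, with tags stripped

missing = soup.find(id="does-not-exist")  # no match, so this is None
print(outer, inner, text, missing, sep="\n")
```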
    
  • 2020-11-30 20:19

    In the BeautifulSoup source, this line allows divs to be nested within divs, so the concern you raised in lukas' comment wouldn't be valid.

    NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
    

    What I think you need to do is specify the attrs you want, such as:

    source.find('div', attrs={'id':'articlebody'})
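
    The different spellings of this lookup should all return the same element. The sketch below assumes bs4 with "html.parser" and uses invented markup. It also shows that a literal tag name of "tagName", as written in the question, matches nothing; whether that was the actual cause there is only a guess.

```python
from bs4 import BeautifulSoup

# Invented markup, for illustration only.
html = '<div id="sidebar">s</div><div id="articlebody">a</div>'
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find("div", attrs={"id": "articlebody"})   # explicit attrs keyword
by_positional = soup.find("div", {"id": "articlebody"})    # dict as 2nd positional arg
by_keyword = soup.find("div", id="articlebody")            # attribute as keyword
by_css = soup.select_one("div#articlebody")                # CSS selector

# All four lookups should hand back the very same Tag object from the tree.
same = all(r is by_attrs for r in (by_positional, by_keyword, by_css))

# A tag name that does not occur in the document yields None.
wrong_tag = soup.find("tagName", {"id": "articlebody"})
print(same, wrong_tag)
```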
    
  • 2020-11-30 20:20

    Most probably this is a problem with the default BeautifulSoup parser. Switch to a different parser, such as 'lxml', and try again.
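
    A quick way to compare parsers is to run the same lookup with each one that happens to be installed. This is a sketch assuming bs4; the helper name `find_id_with_parsers` and the sample markup are made up, and 'lxml' and 'html5lib' are simply skipped when they are not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def find_id_with_parsers(html, element_id,
                         parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: found?} for every parser that is installed."""
    results = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # This parser is not installed; skip it instead of failing.
            continue
        results[parser] = soup.find(id=element_id) is not None
    return results


# Slightly malformed markup (unclosed <p> and <div>), invented for illustration.
html = '<div><p>Unclosed paragraph<div id="articlebody">body text</div>'
results = find_id_with_parsers(html, "articlebody")
print(results)
```

    If one parser reports False where another reports True, the markup is probably being repaired differently, and the parser that finds the element is the one to keep.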
