Beautiful Soup and extracting a div and its contents by ID


soup.find("tagName", { "id" : "articlebody" })

Why does this NOT return the

...


13 Answers

别那么骄傲

2020-11-30 20:15
Here is a code fragment:
```
soup = BeautifulSoup(open("index.html"))
titleList = soup.findAll('title')
divList = soup.findAll('div', attrs={ "class" : "article story"})
```
As you can see, I find all <title> tags and then all <div> tags with class="article story".
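
For readers on the newer bs4 package, the same two searches can be sketched as below. This is only an illustration under stated assumptions: the markup is made up to stand in for index.html, and "html.parser" is assumed as the parser. It also shows that a class value containing a space is compared against the whole class string, while a single class name matches any tag carrying that class.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for index.html
html = """
<html><head><title>Demo page</title></head>
<body>
  <div class="article story"><p>First story</p></div>
  <div class="story article"><p>Second story</p></div>
  <div class="sidebar">Links</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

title_list = soup.find_all("title")
# A value containing a space is compared against the full class string
exact = soup.find_all("div", attrs={"class": "article story"})
# A single class name matches any div that carries that class
any_order = soup.find_all("div", class_="article")

print(len(title_list), len(exact), len(any_order))
```

If the order of the class names in the page is not guaranteed, searching for one class name at a time is the safer choice.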

离开以前

2020-11-30 20:17
I think the problem appears when div tags are nested too deeply. I am trying to parse contacts from a Facebook HTML file, and BeautifulSoup cannot find div tags with class "fcontent".

This happens with other classes as well. When I search for divs in general, it returns only those that are not deeply nested.

The HTML source can be any Facebook page showing the friends list of one of your friends (not your own friends list). If someone can test it and offer some advice, I would really appreciate it.

This is my code, where I just try to print the number of div tags with class "fcontent":
```
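# Python 2 with the legacy BeautifulSoup 3 package
from BeautifulSoup import BeautifulSoup
f = open('/Users/myUserName/Desktop/contacts.html')
soup = BeautifulSoup(f)
# Avoid naming the result "list", which would shadow the built-in
divs = soup.findAll('div', attrs={'class': 'fcontent'})
print len(divs)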
```
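
One way to check whether nesting depth alone is the cause is a synthetic test. The sketch below makes several assumptions: it uses the newer bs4 package rather than the BeautifulSoup 3 import above, it uses "html.parser", and the nesting depth of 30 is arbitrary. It therefore only shows how the newer library behaves on well-formed input; if a real page still fails, malformed markup or the parser choice may be worth investigating instead.

```python
from bs4 import BeautifulSoup

depth = 30  # arbitrary nesting depth chosen for this demo
# Build <div><div>...<div class="fcontent">contact</div>...</div></div>
html = "<div>" * depth + '<div class="fcontent">contact</div>' + "</div>" * depth

soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all("div", attrs={"class": "fcontent"})
all_divs = soup.find_all("div")

print(len(matches))   # divs with class "fcontent"
print(len(all_divs))  # every div, regardless of depth
```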

名媛妹妹

2020-11-30 20:17
I used:
```
soup.findAll('tag', attrs={'attrname':"attrvalue"})
```
That is the syntax I use for find/findAll. That said, unless there are other optional parameters between the tag name and the attribute dict, this shouldn't behave any differently from your call.

遥遥无期

2020-11-30 20:19
To find an element by its id:
```
div = soup.find(id="articlebody")
```
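
A slightly fuller sketch of the id lookup, assuming the bs4 package and "html.parser"; the markup is invented for the example. It also shows that find returns None when nothing matches, which is worth checking before using the result.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = """
<div id="header">Site name</div>
<div id="articlebody"><p>Hello <b>world</b></p></div>
"""
soup = BeautifulSoup(html, "html.parser")

div = soup.find(id="articlebody")              # any tag with this id
same_div = soup.find("div", id="articlebody")  # restrict the search to div tags
missing = soup.find(id="no-such-id")           # None when nothing matches

if div is not None:
    print(div.name, div["id"])
```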

情深已故

2020-11-30 20:19
In the BeautifulSoup source, this line allows divs to be nested within divs, so your concern in lukas' comment wouldn't be valid.
```
NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
```
What I think you need to do is specify the attrs you want, such as:
```
source.find('div', attrs={'id':'articlebody'})
```
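
Once the div is found, its contents can be pulled out in a few ways. The following is a minimal sketch assuming the bs4 package and "html.parser", with made-up markup; the variable name source simply mirrors the snippet above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = '<div id="articlebody"><p>First</p><p>Second <b>bold</b></p></div>'
source = BeautifulSoup(html, "html.parser")

div = source.find("div", attrs={"id": "articlebody"})

outer_html = str(div)                  # the div tag plus everything inside it
inner_html = div.decode_contents()     # only the markup inside the div
text = div.get_text(" ", strip=True)   # text pieces, stripped and joined by a space

print(outer_html)
print(inner_html)
print(text)
```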

夕颜

2020-11-30 20:20

Most probably the default BeautifulSoup parser is the problem. Switch to a different parser, such as 'lxml', and try again.
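
Switching parsers could look roughly like the sketch below, assuming the bs4 package. The helper name make_soup and the sample markup are hypothetical; lxml is an optional third-party install, so the sketch falls back to the built-in "html.parser" when it is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, slightly malformed markup (the <p> is never closed)
html = '<div id="articlebody"><p>Unclosed paragraph<div>inner</div></div>'

def make_soup(markup):
    """Parse with lxml when available, else fall back to the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python
        return BeautifulSoup(markup, "html.parser")

soup = make_soup(html)
div = soup.find("div", id="articlebody")
print(div is not None)
```

Different parsers repair broken markup in different ways, so comparing the results from two of them on the same file is a quick way to tell whether the parser is the culprit.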


