Using Beautiful Soup Python module to replace tags with plain text

花落未央 2021-01-07 05:14

I am using Beautiful Soup to extract 'content' from web pages. I know some people have asked this question before and they were all pointed to Beautiful Soup and that's h

2 answers

借酒劲吻你 (original poster)

2021-01-07 05:35
An approach that works for your specific example is:
```
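# Python 2 with BeautifulSoup 3 (the `BeautifulSoup` package, not `bs4`).
from BeautifulSoup import BeautifulSoup

# The markup inside this string was lost when the post was copied; the <div>
# and <a href="#"> tags are a reconstruction that matches the output shown.
ht = '''
<div>
    some long text goes <a href="#"> here </a> and hopefully it 
    will get picked up by the parser as content
</div>
'''
soup = BeautifulSoup(ht)

# Fold each anchor's text into the text node that precedes it.
anchors = soup.findAll('a')
for a in anchors:
    a.previousSibling.replaceWith(a.previousSibling + a.string)

# Keep only text nodes longer than 20 characters.
results = soup.findAll(text=lambda x: len(x) > 20)

print results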
```
which emits
```
$ python bs.py
[u'\n    some long text goes  here ', u' and hopefully it \n    will get picked up by the parser as content\n']
```
Of course, you'll probably need to take a bit more care: what if there's no `a.string`, or if `a.previousSibling` is None? You'll need suitable if statements to take care of such corner cases. But I hope this general idea can help you. (In fact, you may also want to merge the next sibling if it's a string. I'm not sure how that plays with your heuristic `len(x) > 20`, but say, for example, that you have two 9-character strings with an `<a>` containing a 5-character string in the middle; perhaps you'd want to pick up the lot as a 23-character string? I can't tell, because I don't understand the motivation for your heuristic.)
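
Those guards could look roughly like the sketch below. It assumes Python 3 and the newer `bs4` package (Beautiful Soup 4) instead of the BeautifulSoup 3 import used above, and the helper name `merge_anchor_text` is made up for illustration. Each anchor is replaced by one string built from its own text plus whichever neighbouring siblings are plain strings, so a missing sibling or an anchor without text is not an error.

```python
from bs4 import BeautifulSoup, NavigableString


def merge_anchor_text(soup):
    """Replace every <a> with its text, merged into adjacent text nodes."""
    for a in soup.find_all("a"):
        prev_node, next_node = a.previous_sibling, a.next_sibling
        parts = []
        # Siblings may be None or another tag; only plain strings are merged.
        if isinstance(prev_node, NavigableString):
            parts.append(str(prev_node.extract()))
        # get_text() returns "" when the anchor holds no text (e.g. only an image).
        parts.append(a.get_text())
        if isinstance(next_node, NavigableString):
            parts.append(str(next_node.extract()))
        a.replace_with(NavigableString("".join(parts)))
    return soup


html = '<div>some long text goes <a href="#">here</a> and hopefully it gets picked up</div>'
soup = merge_anchor_text(BeautifulSoup(html, "html.parser"))
# Same length heuristic as in the answer, written with the bs4 `string=` argument.
print(soup.find_all(string=lambda s: len(s) > 20))
```

Because the merged string takes the anchor's place in the tree, a later anchor in the same parent sees it as its previous sibling, so chains of links separated by short text collapse into a single string.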

I imagine that besides `<a>` tags you'll also want to remove other tags...? I guess this, too, depends on what the actual idea behind your heuristics is!
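
To see how the heuristic behaves once a whole family of tags is ignored, here is a small sketch that deliberately swaps Beautiful Soup for the standard library's `html.parser`, so it runs without any third-party package. The set `INLINE_TAGS`, the class `TextBlocks`, and the function `long_text_blocks` are all assumptions made for this illustration; which tags belong in the set depends on your pages.

```python
from html.parser import HTMLParser

# Assumption: this list of "inline" tags is illustrative, not taken from the answer.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span"}


class TextBlocks(HTMLParser):
    """Collect runs of text, treating INLINE_TAGS as if they were not there."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # finished text runs
        self._buf = []    # pieces of the run currently being built

    def _flush(self):
        if self._buf:
            self.blocks.append("".join(self._buf))
            self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_TAGS:
            self._flush()  # any other tag ends the current run

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()  # keep trailing text that no tag terminated


def long_text_blocks(html, min_len=20):
    """Return the text runs longer than min_len characters."""
    parser = TextBlocks()
    parser.feed(html)
    parser.close()
    return [block for block in parser.blocks if len(block) > min_len]


print(long_text_blocks('<p>nine char<a href="#">fivec</a>nine char</p><p>short</p>'))
```

With the link ignored, the two 9-character strings and the 5-character link text form one 23-character run that passes the `len(x) > 20` test, while the separate short paragraph is dropped.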