Using BeautifulSoup to search HTML for a string

I am using BeautifulSoup to look for user-entered strings on a specific page. For example, I want to see if the string 'Python' is located on the page: http://python.org

Output

[u'exact text']
[u'exact text', u'almost exact text']

"To see if the string 'Python' is located on the page http://python.org":

import urllib2
html = urllib2.urlopen('http://python.org').read()
print 'Python' in html # -> True

If you need the position of the substring within the page, use html.find('Python'), which returns the index of the first occurrence, or -1 if the substring is absent.
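
On Python 3, where `urllib2` no longer exists, the same membership check and position lookup can be sketched roughly as below. This is only an illustration, not the answerer's code: the `page_contains` helper and the inline `sample_html` string are hypothetical stand-ins so that the example runs without network access. For a live page you would pass in the decoded result of `urllib.request.urlopen(url).read()` instead.

```python
def page_contains(html, needle):
    """Return (found, index) for a plain substring search in html."""
    index = html.find(needle)  # str.find returns -1 when needle is absent
    return index != -1, index


# Hypothetical stand-in for a downloaded page.
sample_html = "<html><body><p>Welcome to Python.org</p></body></html>"

found, index = page_contains(sample_html, "Python")
print(found, index)
```

Keep in mind that this searches the raw markup, so it also matches text inside tag names and attribute values, not only the visible page text.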


广开言路

2020-11-30 01:30

In addition to the accepted answer, you can use a lambda instead of a regex:

from bs4 import BeautifulSoup

html = """<p>test python</p>"""

soup = BeautifulSoup(html, "html.parser")

print(soup(text="python"))
print(soup(text=lambda t: "python" in t))

Output:

[]
['test python']


無奈伤痛

2020-11-30 01:44
The following line is looking for the exact NavigableString 'Python':
```
>>> soup.body.findAll(text='Python')
[]
```
Note that the following NavigableString is found:
```
>>> soup.body.findAll(text='Python Jobs') 
[u'Python Jobs']
```
Note that a regexp anchored to match the whole string finds nothing either:
```
>>> import re
>>> soup.body.findAll(text=re.compile('^Python$'))
[]
```
So your regexp matches any NavigableString that contains 'Python', whereas text='Python' only matches a NavigableString that is exactly 'Python'.
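
The difference between the three kinds of filter can be reproduced offline with a small sketch. The HTML snippet here is a made-up stand-in for the python.org markup, and the example assumes the `bs4` package, where `find_all(string=...)` is the newer spelling of `findAll(text=...)`.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the real page.
html = "<body><a>Python</a><a>Python Jobs</a><a>Jobs</a></body>"
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole NavigableString.
exact = soup.find_all(string="Python")
# A compiled regex is searched for anywhere inside each string.
contains = soup.find_all(string=re.compile("Python"))
# Anchoring the regex forces it to cover the whole string again.
anchored = soup.find_all(string=re.compile("^Python$"))

print(exact)
print(contains)
print(anchored)
```

With this snippet the exact and anchored filters pick out only the link whose text is just 'Python', while the unanchored regex also picks up 'Python Jobs'.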

無奈伤痛

2020-11-30 01:46
I have not used BeautifulSoup, but maybe the following can help in some small way.
```
import re
import urllib2
stuff = urllib2.urlopen(your_url_goes_here).read()  # stuff will contain the *entire* page

# Replace the string Python with your desired regex
results = re.findall('(Python)',stuff)

for i in results:
    print i
```
I'm not suggesting this as a replacement, but maybe you can glean some value from the concept until a direct answer comes along.
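
A Python 3 variant of the same idea might look like the sketch below. It is an illustration under stated assumptions rather than a drop-in replacement: the `stuff` string is a hypothetical stand-in for the downloaded page, the `re.IGNORECASE` flag is an addition not present in the snippet above, and `finditer` is used instead of `findall` so that each match also reports its position.

```python
import re

# Hypothetical stand-in for the page source. On Python 3 a real page would
# come from urllib.request.urlopen(url).read(), decoded to str.
stuff = "<title>Python</title><p>Python Jobs and python news</p>"

# Replace the pattern with your desired regex.
pattern = re.compile(r"Python", re.IGNORECASE)

# Collect (start offset, matched text) for every occurrence.
matches = [(m.start(), m.group()) for m in pattern.finditer(stuff)]

for position, text in matches:
    print(position, text)
```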