Using BeautifulSoup to search HTML for a string

予麋鹿 2020-11-30 01:20

I am using BeautifulSoup to look for user-entered strings on a specific page. For example, I want to see if the string 'Python' is located on the page: http://python.org

4 Answers
  • 2020-11-30 01:27

    text='Python' searches for elements that have the exact text you provided:

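    import re
    from bs4 import BeautifulSoup
    
    html = """<p>exact text</p>
       <p>almost exact text</p>"""
    soup = BeautifulSoup(html, "html.parser")
    # A plain string matches only text nodes that are equal to it
    print(soup(text='exact text'))
    # A compiled regex matches every text node in which it can be found
    print(soup(text=re.compile('exact text')))
    

    Output

    ['exact text']
    ['exact text', 'almost exact text']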
    

    "To see if the string 'Python' is located on the page http://python.org":

    from urllib.request import urlopen
    # read() returns bytes, so decode before doing a string test
    html = urlopen('http://python.org').read().decode('utf-8')
    print('Python' in html)  # -> True
    

    If you need the position of the substring within the page source, you can use html.find('Python'), which returns the index of the first occurrence or -1 if there is none.
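
    Note that a substring test on the raw HTML also matches tag names and attribute values. As a minimal sketch, assuming you only care about the text of the page, you could extract the text with get_text() first. The helper name page_contains and the sample markup are made up for illustration:

```python
from bs4 import BeautifulSoup


def page_contains(html, needle):
    """Return True if needle occurs in the page text, ignoring markup."""
    soup = BeautifulSoup(html, "html.parser")
    return needle in soup.get_text()


# Hypothetical sample page: "python-jobs" appears only inside an attribute
sample = (
    '<html><body><a href="/python-jobs">Jobs board</a>'
    "<p>Learn Python today</p></body></html>"
)

print("python-jobs" in sample)               # substring test on the raw HTML
print(page_contains(sample, "python-jobs"))  # substring test on the text only
print(page_contains(sample, "Python"))
```

    The raw test reports a hit for the attribute value, while the text-only test does not.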

  • 2020-11-30 01:30

    In addition to the accepted answer, you can use a lambda instead of a regex:

    from bs4 import BeautifulSoup
    
    html = """<p>test python</p>"""
    
    soup = BeautifulSoup(html, "html.parser")
    
    print(soup(text="python"))
    print(soup(text=lambda t: "python" in t))
    

    Output:

    []
    ['test python']
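
    The same idea extends to a case-insensitive search for a user-entered string. This is only a sketch: the helper name find_text_nodes and the sample markup are assumptions, and it uses the string= argument, which is the name Beautiful Soup 4.4.0 and later use for what older versions called text=:

```python
from bs4 import BeautifulSoup


def find_text_nodes(soup, needle):
    """Return every text node that contains needle, ignoring case."""
    needle = needle.lower()
    return soup.find_all(string=lambda t: t is not None and needle in t.lower())


html = """<p>test python</p><p>Python Jobs</p><p>unrelated</p>"""
soup = BeautifulSoup(html, "html.parser")

matches = find_text_nodes(soup, "PYTHON")
print(matches)
```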
    
  • 2020-11-30 01:44

    The following line is looking for the exact NavigableString 'Python':

    >>> soup.body.findAll(text='Python')
    []
    

    Note that the following NavigableString is found:

    >>> soup.body.findAll(text='Python Jobs') 
    [u'Python Jobs']
    

    Note this behaviour:

    >>> import re
    >>> soup.body.findAll(text=re.compile('^Python$'))
    []
    

    So your regexp is looking for an occurrence of 'Python', not an exact match to the NavigableString 'Python'.
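
    The difference can be reproduced on a small inline document rather than the live page. This sketch uses bs4 with made-up markup. A text node with surrounding whitespace is included because exact matching compares the whole string:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: one exact node, one longer node, one padded node
html = "<body><p>Python</p><p>Python Jobs</p><p> Python </p></body>"
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node
exact = soup.find_all(string="Python")
# An anchored regex behaves the same way here
anchored = soup.find_all(string=re.compile(r"^Python$"))
# An unanchored regex matches any node in which the pattern occurs
contains = soup.find_all(string=re.compile(r"Python"))

print(exact)
print(anchored)
print(contains)
```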

  • 2020-11-30 01:46

    I have not used BeautifulSoup, but maybe the following can help in some small way.

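    import re
    from urllib.request import urlopen
    
    url = 'http://python.org'  # replace with the page you want to search
    # stuff will contain the *entire* page, decoded from bytes to text
    stuff = urlopen(url).read().decode('utf-8')
    
    # Replace the string Python with your desired regex
    results = re.findall('(Python)', stuff)
    
    for match in results:
        print(match)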
    

    I'm not suggesting this is a replacement but maybe you can glean some value in the concept until a direct answer comes along.
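
    Since the question is about user-entered strings, one caveat with the regex approach is that the input may contain characters that are special in a pattern. For example, 'C++' is not a valid regular expression on its own. A minimal sketch, assuming you want literal matching and the offsets of each hit, escapes the input first. The helper name find_occurrences and the sample text are made up:

```python
import re


def find_occurrences(page_text, user_input):
    """Return the start offset of every literal occurrence of user_input."""
    # re.escape makes characters such as + or . match themselves
    pattern = re.compile(re.escape(user_input))
    return [m.start() for m in pattern.finditer(page_text)]


page_text = "Python is fun. C++ is fast. Python and C++ interoperate."

print(find_occurrences(page_text, "Python"))
print(find_occurrences(page_text, "C++"))
```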
