Issue with regular expressions in Python

轮回少年 2021-01-21 20:59

OK, so I'm working on a regular expression to find all the header information in a site.

I've compiled the regular expression:

regex = re.compile         


        
6 Answers
  •  一生所求
    2021-01-21 21:40

    I used BeautifulSoup to parse your HTML. I saved the HTML above in a file called foo.html and read it back through a file object.

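    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)


    H_TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']

    def extract_data():
        """Extract the text of all headers in an HTML page."""
        # Read-only access is enough; the with block closes the file.
        with open('foo.html', 'r') as f:
            html = f.read()
        soup = BeautifulSoup(html)
        # One list of matches per header level, skipping levels with no matches.
        headers = [soup.findAll(h) for h in H_TAGS if soup.findAll(h)]
        lst = []
        for x in headers:
            for y in x:
                if y.string:
                    lst.append(y.string)
                else:
                    # The header wraps other tags: take the text of its first child.
                    lst.append(y.contents[0].string)
        return lst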
    

    The above function returns:

    >>> extract_data()
    [u'Dog ', u'Tall cup of lemons', u'Dog thing', u'Cat ', u'Fancy ']
    

    You can put any header tags you like in the H_TAGS list; I have included all six header levels. If BeautifulSoup solves the problem easily, then it's better to use it. :)
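
    For anyone on Python 3, here is a rough sketch of the same idea using the newer bs4 package. It is an adaptation, not the code above: the function name extract_headers and the sample HTML string are made up for illustration, it takes the HTML as a string instead of reading foo.html, it returns headers in document order instead of grouped by level, and it trims surrounding whitespace.

```python
from bs4 import BeautifulSoup

H_TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']


def extract_headers(html):
    """Return the text of every h1-h6 element, in document order.

    extract_headers is a hypothetical name used for this sketch.
    """
    soup = BeautifulSoup(html, 'html.parser')
    # find_all accepts a list of tag names and yields matches in document order.
    # get_text() joins the text of nested tags, so headers that wrap links or
    # spans do not need a special case.
    return [tag.get_text().strip() for tag in soup.find_all(H_TAGS)]


# Made-up sample input, not the HTML from the question.
sample = "<h1>Dog </h1><p>text</p><h2><a href='#'>Tall cup</a> of lemons</h2>"
print(extract_headers(sample))  # ['Dog', 'Tall cup of lemons']
```

    Passing the HTML in as a string keeps the function easy to test; reading foo.html first and handing the contents over works the same way.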
