How can I use the Python HTMLParser library to extract data from a specific div tag?

名媛妹妹 2020-11-27 13:43

I am trying to extract a value from an HTML page using the Python HTMLParser library. The value I want is inside this HTML element:

...
4 Answers
  • 2020-11-27 13:46

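    Have you tried BeautifulSoup?

    from bs4 import BeautifulSoup
    # Name the parser explicitly; 'html.parser' ships with Python.
    soup = BeautifulSoup('<div id="remository">20</div>', 'html.parser')
    tag = soup.div  # the first <div> in the document
    print(tag.string)


    This prints 20.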

  • 2020-11-27 13:54
    class LinksParser(HTMLParser.HTMLParser):
      def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.recording = 0
        self.data = []
    
      def handle_starttag(self, tag, attributes):
        if tag != 'div':
          return
        if self.recording:
          self.recording += 1
          return
        for name, value in attributes:
          if name == 'id' and value == 'remository':
            break
        else:
          return
        self.recording = 1
    
      def handle_endtag(self, tag):
        if tag == 'div' and self.recording:
          self.recording -= 1
    
      def handle_data(self, data):
        if self.recording:
          self.data.append(data)
    

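    The code assumes Python 2 with import HTMLParser at the top of the module (in Python 3 the class lives in html.parser). self.recording counts how deeply we are nested in div tags, starting from a "triggering" one. While we are inside the subtree rooted at a triggering tag, text is accumulated in self.data.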

    When parsing finishes, the collected text is in self.data (a list of strings, empty if no triggering tag was found). Code outside the class can read the list directly from the instance, or you can add accessor methods, depending on your goal.
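    As a rough illustration of reading the result from outside the class, here is a Python 3 port of the parser above, fed a small made-up document that contains a nested div. The text() accessor and the sample HTML are assumptions added for this sketch; they are not part of the original answer.

```python
from html.parser import HTMLParser


class LinksParser(HTMLParser):
    """Python 3 port of the parser above, plus a hypothetical text() accessor."""

    def __init__(self):
        super().__init__()
        self.recording = 0   # nesting depth of <div> tags inside the trigger
        self.data = []       # text chunks seen while recording

    def handle_starttag(self, tag, attributes):
        if tag != 'div':
            return
        if self.recording:
            self.recording += 1   # a nested div inside the triggering one
            return
        # attributes is a list of (name, value) tuples
        if ('id', 'remository') in attributes:
            self.recording = 1    # triggering div found

    def handle_endtag(self, tag):
        if tag == 'div' and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)

    def text(self):
        # Hypothetical accessor: join the chunks and trim whitespace.
        return ''.join(self.data).strip()


# Made-up sample input with a div nested inside the triggering one.
sample = ('<p>skip</p>'
          '<div id="remository">20 <div>files</div></div>'
          '<div>after</div>')
parser = LinksParser()
parser.feed(sample)
parser.close()
print(parser.data)     # raw list of text chunks
print(parser.text())   # joined text
```

    The inner div raises the counter to 2, so its closing tag does not end recording early; text from the unrelated div after the triggering one is ignored.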

    The class could easily be made more general by replacing the literal strings 'div', 'id', and 'remository' with instance attributes self.tag, self.attname, and self.attvalue, set by __init__ from its arguments. I skipped that generalization in the code above to avoid obscuring the core points: keep a count of nested tags, and accumulate data into a list while recording is active.
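    The generalization described here might look roughly like the following in Python 3. The class name TagTextParser and the sample markup are made up for the sketch; only the attribute names self.tag, self.attname, and self.attvalue come from the answer.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect text inside <tag attname="attvalue">, including nested tags."""

    def __init__(self, tag, attname, attvalue):
        super().__init__()
        self.tag = tag
        self.attname = attname
        self.attvalue = attvalue
        self.recording = 0   # nesting depth of self.tag inside the trigger
        self.data = []

    def handle_starttag(self, tag, attributes):
        if tag != self.tag:
            return
        if self.recording:
            self.recording += 1
            return
        # Exact match on the attribute value (no multi-class handling).
        if (self.attname, self.attvalue) in attributes:
            self.recording = 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)


# Made-up sample: pick out the span whose class is exactly "price".
span_parser = TagTextParser('span', 'class', 'price')
span_parser.feed('<span class="name">Tea</span><span class="price">4.50</span>')
span_parser.close()
print(span_parser.data)
```

    Calling TagTextParser('div', 'id', 'remository') reproduces the behaviour of the hard-coded version.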

  • 2020-11-27 13:54

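    A small correction to line 3 of the code above:

    HTMLParser.HTMLParser.__init__(self)

    should be

    HTMLParser.__init__(self)

    when the HTMLParser class itself is imported (rather than the module), as in the code below. The following worked for me: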

    # Python 3 module names; in Python 2 these were urllib2 and HTMLParser.
    from urllib.request import urlopen
    from html.parser import HTMLParser

    class MyHTMLParser(HTMLParser):

      def __init__(self):
        HTMLParser.__init__(self)
        self.recording = 0
        self.data = []

      def handle_starttag(self, tag, attrs):
        if tag == 'required_tag':
          for name, value in attrs:
            if name == 'somename' and value == 'somevalue':
              print(name, value)
              print("Encountered the beginning of a %s tag" % tag)
              self.recording = 1

      def handle_endtag(self, tag):
        # Only decrement while recording; otherwise the counter goes
        # negative, which is truthy and would switch recording on.
        if tag == 'required_tag' and self.recording:
          self.recording -= 1
          print("Encountered the end of a %s tag" % tag)

      def handle_data(self, data):
        if self.recording:
          self.data.append(data)

    p = MyHTMLParser()
    f = urlopen('http://www.someurl.com')
    # feed() expects str, not bytes; UTF-8 is assumed here.
    html = f.read().decode('utf-8', errors='replace')
    p.feed(html)
    p.close()
    print(p.data)
    


  • 2020-11-27 14:05

    This works for me, where soup is a BeautifulSoup object and 'the tag' stands for the name of the tag you want:

    print(soup.find('the tag').text)
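    For the div in the question, a slightly more defensive variant might look like this. It is only a sketch: the sample HTML is made up, and it assumes the target is a div with id="remository", as in the other answers. find() returns None when nothing matches, so the result is checked before reading the text.

```python
from bs4 import BeautifulSoup

# Made-up sample page with two divs; only the second one is wanted.
html = '<div id="other">1</div><div id="remository">20</div>'
soup = BeautifulSoup(html, 'html.parser')

# Match on both the tag name and the id attribute.
tag = soup.find('div', id='remository')

# find() returns None if there is no match, so guard before reading text.
value = tag.get_text(strip=True) if tag is not None else None
print(value)
```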
    