lxml classic: Get text content except for that of nested tags?

无人及你 2021-01-14 15:49

This must be an absolute classic, but I can't find the answer here. I'm parsing the following tag with lxml cssselect:

    2 answers
    • 2021-01-14 16:29

      For your example, I think going with XPath is cleaner and easier than CSS:

      >>> from lxml import etree
      >>> xml = '<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>'
      >>> root = etree.fromstring(xml)
      >>> print(root.xpath('/li/a/text()'))
      [' Detroit']
      
      >>> xml = '<li><a href="/stations/1">I <span>FooBar!</span> love <span class="num">3</span> Detroit</a></li>'
      >>> root = etree.fromstring(xml)
      >>> print(root.xpath('/li/a/text()'))
      ['I ', ' love ', ' Detroit']
      
      >>> ' '.join([x.strip() for x in root.xpath('/li/a/text()')])
      'I love Detroit'
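
The XPath text() step returns only the text nodes that are direct children of <a>. As a rough illustration of where those pieces live in the element tree, the sketch below rebuilds the same result from .text and .tail without XPath. The helper name own_text is hypothetical (it is not part of lxml), and the fallback import only exists so the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Fallback: the standard library exposes the same .text/.tail API.
    import xml.etree.ElementTree as etree


def own_text(element):
    """Return the text directly inside element, skipping text of nested tags.

    Hypothetical helper, not an lxml function.
    """
    # Text before the first child lives in element.text (None if absent).
    parts = [element.text or ""]
    # Text that follows each child lives in that child's .tail.
    for child in element:
        parts.append(child.tail or "")
    return "".join(parts)


xml = ('<li><a href="/stations/1">I <span>FooBar!</span> love '
       '<span class="num">3</span> Detroit</a></li>')
a = etree.fromstring(xml).find("a")

raw = own_text(a)              # pieces joined as-is, inner spacing preserved
clean = " ".join(raw.split())  # collapse runs of whitespace
print(clean)
```

Collapsing whitespace with split() and join() gives the same 'I love Detroit' as the strip-and-join one-liner above.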
      
    • 2021-01-14 16:35

      The itertext() method of an element returns an iterator over the element's text content, including the text of nested elements, in document order. For your <a> tag, ' Detroit' would be the second value returned by the iterator. If the structure of your document always conforms to a known specification, you could skip specific text nodes to get what you need.

      from lxml import html
      
      doc = html.fromstring("""<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>""")
      stop_nodes = doc.cssselect('li a')
      stop_names = []
      for start in stop_nodes:
          node_text = start.itertext()
          next(node_text)  # Skip '3', the text of the nested <span>
          stop_names.append(next(node_text).lstrip())
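
To see why skipping by position only works when the structure is fixed, it may help to list everything itertext() yields. This is a small sketch and not part of the answer's code. It parses the fragments with etree.fromstring instead of html.fromstring, which is fine here because both fragments are well-formed XML. The fallback import is only there so it runs without lxml.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Fallback: ElementTree elements also provide itertext().
    import xml.etree.ElementTree as etree

simple = '<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>'
mixed = ('<li><a href="/stations/1">I <span>FooBar!</span> love '
         '<span class="num">3</span> Detroit</a></li>')

# itertext() walks the subtree in document order, nested text included.
simple_pieces = list(etree.fromstring(simple).find("a").itertext())
mixed_pieces = list(etree.fromstring(mixed).find("a").itertext())

print(simple_pieces)  # ['3', ' Detroit']
print(mixed_pieces)   # ['I ', 'FooBar!', ' love ', '3', ' Detroit']
```

In the second fragment the wanted pieces are no longer at a fixed position, so skipping the first value would return 'FooBar!' instead of the station name.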
      

      If you're more comfortable with CSS selectors than with XPath, you can also combine a CSS selector with the XPath text() function mentioned in Zachary's answer, like this:

      stop_names = [''.join(a.xpath('text()')).lstrip() for a in doc.cssselect('li a')]
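
As an end-to-end sketch of the same idea over several list items, the snippet below selects the links with XPath instead of cssselect, on the assumption that you'd rather avoid the separate cssselect package that lxml's cssselect() depends on. The function name station_names and the second list item (Ann Arbor) are made up for illustration.

```python
from lxml import html


def station_names(markup):
    """Return each link's own text, ignoring text inside nested tags.

    Hypothetical helper, not part of lxml.
    """
    doc = html.fromstring(markup)
    names = []
    for a in doc.xpath('.//li/a'):
        # text() yields only the text nodes that are direct children of <a>.
        own = ''.join(a.xpath('text()'))
        # Collapse whitespace so leading/trailing blanks disappear.
        names.append(' '.join(own.split()))
    return names


# The second <li> is an invented example, not from the question.
markup = (
    '<ul>'
    '<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>'
    '<li><a href="/stations/2"><span class="num">7</span> Ann Arbor</a></li>'
    '</ul>'
)
names = station_names(markup)
print(names)  # ['Detroit', 'Ann Arbor']
```

Because xpath('text()') always returns a list, joining it before stripping also handles links whose text is split around more than one nested tag.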
      