For your example, I think going with XPath is cleaner and easier than CSS:
>>> from lxml import etree
>>> xml = '<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>'
>>> root = etree.fromstring(xml)
>>> print(root.xpath('/li/a/text()'))
[' Detroit']
>>> xml = '<li><a href="/stations/1">I <span>FooBar!</span> love <span class="num">3</span> Detroit</a></li>'
>>> root = etree.fromstring(xml)
>>> print(root.xpath('/li/a/text()'))
['I ', ' love ', ' Detroit']
>>> ' '.join([x.strip() for x in root.xpath('/li/a/text()')])
'I love Detroit'
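If you need this in more than one place, the strip-and-join step can be wrapped in a small helper. This is only a sketch: the name direct_text is made up for illustration, and it assumes you want empty chunks dropped and the rest joined with single spaces.

```python
from lxml import etree


def direct_text(root, path):
    """Join the text nodes selected by `path` (hypothetical helper).

    Each chunk is stripped, empty chunks are dropped, and the rest
    are joined with single spaces.
    """
    pieces = (chunk.strip() for chunk in root.xpath(path))
    return ' '.join(piece for piece in pieces if piece)


xml = ('<li><a href="/stations/1">I <span>FooBar!</span> love '
       '<span class="num">3</span> Detroit</a></li>')
root = etree.fromstring(xml)
name = direct_text(root, '/li/a/text()')
```

Because text() only selects text nodes that are direct children of <a>, the contents of both <span> elements are left out.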
The itertext method of an element returns an iterator over the node's text data. For your <a> tag, ' Detroit' would be the second value returned by the iterator. If the structure of your document always conforms to a known specification, you could skip specific text elements to get what you need.
from lxml import html

doc = html.fromstring("""<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>""")
stop_nodes = doc.cssselect('li a')  # needs the cssselect package installed
stop_names = []
for start in stop_nodes:
    node_text = start.itertext()
    next(node_text)  # Skip '3'
    stop_names.append(next(node_text).lstrip())
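The same skip-the-first-chunk idea can be written with itertools.islice, which avoids the manual next() calls. Treat this as a sketch: text_after is a made-up name, the second station ("/stations/2", "Ann Arbor") is invented sample data, and it still assumes every <a> starts with exactly one numbered <span>.

```python
from itertools import islice

from lxml import html


def text_after(node, skip=1):
    """Join the text chunks of `node` after skipping the first `skip` (hypothetical helper)."""
    return ''.join(islice(node.itertext(), skip, None)).strip()


# Invented sample markup with two stations, for illustration only.
doc = html.fromstring(
    '<ul>'
    '<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>'
    '<li><a href="/stations/2"><span class="num">7</span> Ann Arbor</a></li>'
    '</ul>'
)
stop_names = [text_after(a) for a in doc.xpath('.//li/a')]
```

If the markup can vary (for example, a link without the number span), this positional skip will drop the wrong chunk, so it is only safe when the structure is fixed.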
You can combine a CSS selector with the XPath text() node test mentioned in Zachary's answer, like this (if you're more comfortable using CSS selectors than XPath):
# xpath('text()') returns a list of strings, so join it before stripping
stop_names = [''.join(a.xpath('text()')).lstrip() for a in doc.cssselect('li a')]