Python lxml - returns null list

Posted by 孤街浪徒 on 2020-01-25 05:58:02

Question


I can't figure out what is wrong with my XPath when trying to extract a value from a table on a web page. The approach seems sound, since I can extract the page title and the link attributes, but the third query always returns an empty list.

from lxml import html
import requests

test_url = 'SC312226'

page = ('https://www.opencompany.co.uk/company/'+test_url)

print 'Now searching URL: '+page

data = requests.get(page)
tree = html.fromstring(data.text)

print tree.xpath('//title/text()') # Get page title  
print tree.xpath('//a/@href') # Get href attribute of all links  
print tree.xpath('//*[@id="financial"]/table/tbody/tr/td[1]/table/tbody/tr[2]/td[1]/div[2]/text()')

Unless I'm missing something, the XPath appears to be correct:

[Chrome screenshot]

I also checked it in the Chrome console, where it returns the expected value, so I'm at a loss:

$x ('//*[@id="financial"]/table/tbody/tr/td[1]/table/tbody/tr[2]/td[1]/div[2]/text()')
[
"£432,272"
]

Answer 1:


You should specify an element name. If you don't want to name a specific tag, you can use *:

print tree.xpath('//*[@id="financial"]/...')
                    ^

UPDATE

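The raw HTML that the server sends (what lxml parses, before any rendering in the browser) contains no tbody tags. Browsers such as Chrome insert tbody into the DOM when they build a table, which is why the expression matches in the Chrome console but not in lxml. So you need to remove tbody from the expression: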

//*[@id="financial"]/table/tr/td[1]/table/tr[2]/td[1]/div[2]/text()
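
The difference can be reproduced offline. The sketch below is only an illustration: the markup is a simplified, hypothetical stand-in for the real page (one table instead of the nested tables in the question), written without tbody the way raw HTML source often is.

```python
from lxml import html

# Hypothetical, simplified markup standing in for the raw page source.
# Note that the table has no <tbody>, and lxml does not add one.
RAW_HTML = """
<html><body>
<div id="financial">
  <table>
    <tr>
      <td>
        <div>Total Assets</div>
        <div>£432,272</div>
      </td>
    </tr>
  </table>
</div>
</body></html>
"""

tree = html.fromstring(RAW_HTML)

# Path as copied from the browser's DOM: includes tbody, so nothing matches.
with_tbody = tree.xpath('//*[@id="financial"]/table/tbody/tr/td[1]/div[2]/text()')

# Same path without tbody: matches the raw markup that lxml actually parsed.
without_tbody = tree.xpath('//*[@id="financial"]/table/tr/td[1]/div[2]/text()')

print(with_tbody)
print(without_tbody)
```

With this sample markup the first query comes back empty and the second returns the value, which mirrors the behaviour described in the question.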

An alternative is to use the following-sibling axis:

//div[text()="Total Assets"]/following-sibling::div/text()
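
As a rough sketch of how that expression could be used from lxml, assuming the label and its value sit in adjacent div elements (the fragment here is hypothetical, not the real page):

```python
from lxml import html

# Hypothetical fragment: a label <div> directly followed by its value <div>.
fragment = html.fromstring(
    '<table><tr><td>'
    '<div>Total Assets</div><div>£432,272</div>'
    '</td></tr></table>'
)

# Find the div whose text is exactly "Total Assets", then take the text
# of the div elements that follow it under the same parent.
values = fragment.xpath('//div[text()="Total Assets"]/following-sibling::div/text()')

# xpath() returns a list, so guard against an empty result.
total_assets = values[0] if values else None
print(total_assets)
```

This version does not depend on the table nesting at all, so it keeps working if the layout around the label changes. If more than one div followed the label, following-sibling::div[1] would restrict the match to the nearest one.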


Source: https://stackoverflow.com/questions/25367339/python-lxml-returns-null-list
