Extract href values containing a keyword using XPath in Python

Question

I know variants of this question have been asked a number of times but I've not been able to crack it and get what I want.

I have a website which has a few tables in it. The table of interest contains a column where each row contains the word Text hyperlinked to a different page. Here is a specific example from the first row on the above linked page:

<a href="_alexandria_RIC_VI_099b_K-AP.txt">Text</a>

This is the general pattern:

<a href="_something_something-blah-blah.txt">Text</a>

Right now I'm doing this:

import requests
import lxml.html as lh
page = requests.get("http://www.wildwinds.com/coins/ric/constantine/t.html")
doc = lh.fromstring(page.content)
href_elements = doc.xpath('/html/body/center/table/tbody/tr/td/a/@href')
print(href_elements)

The desired result is a list of items that look like this: _something_something-blah-blah.txt. What I get instead is an empty list.

Since the page has other links I'm not interested in, I also want to modify the query to grab only the href values that contain .txt.

Any help you can provide is much appreciated!

Answer 1:

Try something like:

href_elements = doc.xpath('//center//table//a[contains(@href,".txt")]/@href')
for href in href_elements:
    print(href)

Output:

_alexandria_RIC_VI_099b_K-AP.txt
_alexandria_RIC_VI_100.txt
_alexandria_RIC_VI_136.txt
_alexandria_RIC_VI_156.txt

etc.
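
For anyone wondering why the original query came back empty: a likely cause (an assumption, since it depends on the page's raw markup) is the tbody step in the absolute path. Browsers add tbody to tables when they build the DOM, so a path copied from developer tools includes it, but lxml parses the raw HTML and does not insert it. Here is a minimal sketch that reproduces this on a small, made-up HTML snippet (SAMPLE_HTML is hypothetical and only mimics the pattern from the question) and shows the contains() filter at work:

```python
import lxml.html as lh

# Hypothetical sample markup, loosely modelled on the pattern in the question.
# The raw HTML has no <tbody>, which is common for hand-written tables.
SAMPLE_HTML = """
<html><body><center>
<table>
  <tr><td><a href="_alexandria_RIC_VI_099b_K-AP.txt">Text</a></td></tr>
  <tr><td><a href="_alexandria_RIC_VI_100.txt">Text</a></td></tr>
  <tr><td><a href="_alexandria_RIC_VI_100.jpg">Image</a></td></tr>
</table>
</center></body></html>
"""

doc = lh.fromstring(SAMPLE_HTML)

# Absolute path that includes tbody: matches nothing, because lxml
# does not add a tbody element that is missing from the source.
with_tbody = doc.xpath('/html/body/center/table/tbody/tr/td/a/@href')

# Relative path plus a contains() predicate: keeps only the links
# whose href attribute contains ".txt".
txt_hrefs = doc.xpath('//table//a[contains(@href, ".txt")]/@href')

print(with_tbody)
print(txt_hrefs)
```

Using // instead of a full absolute path also makes the query less sensitive to small changes in the page layout.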

Source: https://stackoverflow.com/questions/63876794/extract-href-values-containing-keyword-using-xpath-in-python

Tags

python

xpath

lxml

href
