Regex in lxml for Python

旧时模样 submitted on 2021-01-28 04:10:08

Question


I am having trouble using a regex inside an XPath expression. My goal is to download the HTML of the main page, as well as the contents of every page it links to. However, the program throws exceptions because some of the href values do not point to anything that can be fetched (e.g. '//:javascript' or '#'). How would I use a regex in XPath? Is there an easier way to exclude non-absolute hrefs?

from lxml import html
import requests
main_pg = requests.get("http://gazetaolekma.ru/")
with open("Sample.html","w", encoding='utf-8') as doc:
    doc.write(main_pg.text)
tree = html.fromstring(main_pg.content)
hrefs = tree.xpath('//a[re:findall("^(http|https|ftp):.*")]/@href')
for href in hrefs:
    link_page = requests.get(href)
    with open("%s.html"%href[0:9], "w", encoding ='utf-8') as href_doc:
        href_doc.write(link_page.text)

Answer 1:


With XPath 1.0, you can always combine starts-with() tests with the or operator in your predicate:

hrefs = tree.xpath('//a/@href[starts-with(., "http") or starts-with(., "ftp")]')
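
The predicate above can be tried offline with a minimal sketch like the one below. The inline markup and the name SAMPLE are made up for illustration (they stand in for the downloaded page), and lxml is assumed to be installed.

```python
from lxml import html

# Hypothetical markup standing in for the downloaded main page.
SAMPLE = """
<html><body>
  <a href="http://example.com/a">absolute http</a>
  <a href="https://example.com/b">absolute https</a>
  <a href="ftp://example.com/c">ftp</a>
  <a href="#">fragment only</a>
  <a href="javascript:void(0)">script</a>
  <a href="/relative/path">relative</a>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# The predicate is applied to the @href attribute nodes themselves;
# starts-with(., "http") also covers values beginning with "https".
hrefs = tree.xpath('//a/@href[starts-with(., "http") or starts-with(., "ftp")]')

for href in hrefs:
    print(href)
```

Note that the prefix test is loose: a relative link such as "http-guide.html" would also pass, so testing for "http://" and "https://" explicitly is stricter if that matters for your pages.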



Answer 2:


According to the documentation, lxml supports the EXSLT extensions, which in turn provide regular expression support:

lxml supports XPath 1.0, XSLT 1.0 and the EXSLT extensions through libxml2 and libxslt in a standards compliant way.

For example, using the EXSLT re:test() function:

....
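# Bind the "re" prefix to the EXSLT regular-expressions namespace and pass it to xpath().
ns = {'re': 'http://exslt.org/regular-expressions'}
# Raw string, so that \b reaches the regex engine as a word boundary rather than a backspace.
hrefs = tree.xpath(r'//a[re:test(@href, "^(http|https|ftp):.*\b", "i")]/@href', namespaces=ns)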
.....
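
A self-contained sketch of the re:test() approach, run against inline markup rather than a live page, might look like this. The sample HTML and the names SAMPLE, RE_NS and regex_hrefs are hypothetical, and the pattern is a slightly stricter variant (it requires "://" after the scheme) rather than the exact one from the answer.

```python
from lxml import html

# Hypothetical markup standing in for the downloaded main page.
SAMPLE = """
<html><body>
  <a href="http://example.com/a">absolute http</a>
  <a href="HTTPS://Example.com/b">upper-case scheme</a>
  <a href="ftp://example.com/c">ftp</a>
  <a href="#">fragment only</a>
  <a href="javascript:void(0)">script</a>
  <a>no href at all</a>
</body></html>
"""

# The "re" prefix has to be mapped to the EXSLT regular-expressions
# namespace, and the mapping has to be passed to xpath().
RE_NS = {"re": "http://exslt.org/regular-expressions"}

tree = html.fromstring(SAMPLE)

# The third argument "i" makes the match case-insensitive.
regex_hrefs = tree.xpath(
    r'//a[re:test(@href, "^(https?|ftp)://", "i")]/@href',
    namespaces=RE_NS,
)

for href in regex_hrefs:
    print(href)
```

Without the namespaces argument the re: prefix is undefined and lxml raises an XPath evaluation error, which is the likely cause of the failure in the question's re:findall attempt.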


Source: https://stackoverflow.com/questions/34850280/regex-in-lxml-for-python
