Regex in lxml for Python

Question

I'm having trouble using a regex inside an XPath expression. My goal is to download the HTML of the main page, as well as the contents of every hyperlink on the main page. However, the program throws exceptions because some of the href values do not point to anything fetchable (e.g. '//:javascript' or '#'). How would I use a regex in XPath? Is there an easier way to exclude non-absolute hrefs?

from lxml import html
import requests
main_pg = requests.get("http://gazetaolekma.ru/")
with open("Sample.html","w", encoding='utf-8') as doc:
    doc.write(main_pg.text)
tree = html.fromstring(main_pg.content)
hrefs = tree.xpath('//a[re:findall("^(http|https|ftp):.*")]/@href')
for href in hrefs:
    link_page = requests.get(href)
    with open("%s.html"%href[0:9], "w", encoding ='utf-8') as href_doc:
        href_doc.write(link_page.text)

Answer 1:

With XPath 1.0 you can always use "or" in your predicate:

hrefs = tree.xpath('//a/@href[starts-with(., "http") or starts-with(., "ftp")]')
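To make the effect of that predicate concrete, here is a minimal sketch run against an inline HTML snippet instead of the live site. The SAMPLE markup and its example.com URLs are made up for illustration; only the XPath expression comes from the answer.

```python
from lxml import html

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE = """
<html><body>
  <a href="http://example.com/a">absolute http</a>
  <a href="https://example.com/b">absolute https</a>
  <a href="ftp://example.com/c">ftp</a>
  <a href="#">fragment only</a>
  <a href="javascript:void(0)">script pseudo-link</a>
  <a href="/relative/path">relative</a>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Select only the href attributes whose value begins with "http" or "ftp".
# "http" also covers "https", so no third branch is needed.
hrefs = tree.xpath('//a/@href[starts-with(., "http") or starts-with(., "ftp")]')

print(hrefs)
# ['http://example.com/a', 'https://example.com/b', 'ftp://example.com/c']
```

Note that starts-with(., "http") would also accept a relative link such as "httpfoo/page"; testing for "http://" and "https://" separately is stricter if that matters for your pages.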

Answer 2:

According to the documentation, lxml supports the EXSLT extensions, which in turn support regular expressions:

lxml supports XPath 1.0, XSLT 1.0 and the EXSLT extensions through libxml2 and libxslt in a standards compliant way.

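For example, using the EXSLT re:test() function (note the raw string, so that \b reaches the regex engine as a word boundary, and the namespaces argument that binds the re: prefix):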

....
ns = {'re': 'http://exslt.org/regular-expressions'}
hrefs = tree.xpath(r'//a[re:test(@href, "^(http|https|ftp):.*\b", "i")]/@href', namespaces=ns)
.....
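As a self-contained illustration of the re:test() approach, the sketch below runs it on an inline snippet. The SAMPLE markup, the EXSLT_RE variable name, and the slightly simplified pattern "^(https?|ftp):" are assumptions for the example, not part of the original answer. The key detail is that the re prefix only resolves when the mapping is passed through the namespaces keyword.

```python
from lxml import html

# Hypothetical sample markup; the upper-case URL demonstrates the "i" flag.
SAMPLE = """
<html><body>
  <a href="http://example.com/a">absolute http</a>
  <a href="HTTPS://EXAMPLE.COM/B">upper-case scheme</a>
  <a href="ftp://example.com/c">ftp</a>
  <a href="#">fragment only</a>
  <a href="javascript:void(0)">script pseudo-link</a>
  <a href="/relative/path">relative</a>
</body></html>
"""

# Prefix-to-URI mapping for the EXSLT regular-expressions module.
EXSLT_RE = {'re': 'http://exslt.org/regular-expressions'}

tree = html.fromstring(SAMPLE)

# Raw string so backslashes (if any are added later) reach the regex intact.
# "i" makes the match case-insensitive.
regex_hrefs = tree.xpath(
    r'//a[re:test(@href, "^(https?|ftp):", "i")]/@href',
    namespaces=EXSLT_RE,
)

print(regex_hrefs)
# ['http://example.com/a', 'HTTPS://EXAMPLE.COM/B', 'ftp://example.com/c']
```

Without namespaces=EXSLT_RE, lxml cannot resolve the re: prefix and the xpath() call raises an evaluation error instead of returning matches.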

Source: https://stackoverflow.com/questions/34850280/regex-in-lxml-for-python

Tags

python

regex

xpath

html-parsing