How to extract URL from HTML anchor element using Python3? [closed]

余生颓废 提交于 2019-12-13 22:50:55

问题


I want to extract URL from web page HTML source.
Example:

xyz.com source code:
<a rel="nofollow" href="example/hello/get/9f676bac2bb3.zip">Download XYZ</a>

I want to extract:

example/hello/get/9f676bac2bb3.zip

How to extract this URL?

I don't understand regex. Also I don't know how to install Beautiful Soup 4 or lxml on Windows. I'm getting errors when I try to install this libraries.

I've tried:

C:\Users\admin\Desktop>python
Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (In
tel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> url = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> r = re.compile('(?<=href=").*?(?=")')
>>> r.findall(url)
['/example/hello/get/9f676bac2bb3.zip']
>>> url
'<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">Download XYZ</a>'
>>> r.findall(url)[0]
'/example/hello/get/9f676bac2bb3.zip'
>>> a = "https://xyz.com"
>>> print(a + r.findall(url)[0])
https://xyz.com/example/hello/get/9f676bac2bb3.zip
>>>

But just it's a hardcoded HTML sample. How to get the web page source and run my code against it?


回答1:


You can use built-in xml.etree.ElementTree instead:

>>> import xml.etree.ElementTree as ET
>>> url = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> ET.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'

This works on this particular example, but xml.etree.ElementTree is not an HTML parser. Consider using BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(url).a.get('href')
'/example/hello/get/9f676bac2bb3.zip'

Or, lxml.html:

>>> import lxml.html
>>> lxml.html.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'

Personally, I prefer BeautifulSoup - it makes html-parsing easy, transparent and fun.


To follow the link and download the file, you need to make a full url including the schema and domain (urljoin() would help) and then use urlretrieve(). Example:

>>> BASE_URL = 'http://example.com'
>>> from urllib.parse import urljoin
>>> from urllib.request import urlretrieve
>>> href = BeautifulSoup(url).a.get('href')
>>> urlretrieve(urljoin(BASE_URL, href))

UPD (for the different html posted in comments):

>>> from bs4 import BeautifulSoup
>>> data = '<html> <head> <body><example><example2> <a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a> </example2></example></body></head></html>'
>>> href = BeautifulSoup(data).find('a', text='XYZ').get('href')
'/example/hello/get/9f676bac2bb3.zip'


来源:https://stackoverflow.com/questions/25121127/how-to-extract-url-from-html-anchor-element-using-python3

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!