Reconstructing absolute URLs from relative URLs on a page


逝去的感伤

Given an absolute url of a page, and a relative link found within that page, would there be a way to a) definitively reconstruct or b) best


2 Answers

梦谈多话

2020-12-30 23:45
Use urllib.parse.urljoin to resolve a (possibly relative) URL against a base URL.

However, the base URL of a web page is not necessarily the URL you fetched the document from, because HTML lets a page declare its preferred base URL via the BASE element. The logic you need is as follows:
```
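import urllib.parse

# `document` is assumed to be a DOM-style document object for the page,
# and `page_url` the URL the page was fetched from.
base_url = page_url
head = document.getElementsByTagName('head')[0]
for base in head.getElementsByTagName('base'):
    if base.hasAttribute('href'):
        # The href may itself be relative, so resolve it against the page URL.
        base_url = urllib.parse.urljoin(base_url, base.getAttribute('href'))
        # HTML5 4.2.3 "if there are multiple base elements with href
        # attributes, all but the first are ignored."
        break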
```
(If you are parsing XHTML then in theory you ought to take into account the rather hairy XML Base specification instead. But you can probably get away without worrying about that, since no-one really uses XHTML.)
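For a version that runs on raw HTML text without a DOM, here is a minimal sketch using only the standard library. The names BaseHrefFinder and resolve_link are made up for illustration, and for brevity the sketch takes the first BASE element with an href anywhere in the document rather than checking that it sits inside HEAD.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Hypothetical helper: records the href of the first <base> that has one."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lowercase.
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href is not None:
                self.base_href = href


def resolve_link(page_url, html_text, link):
    """Hypothetical helper: resolve `link` against the page's effective base URL."""
    finder = BaseHrefFinder()
    finder.feed(html_text)
    base_url = page_url
    if finder.base_href is not None:
        # The base href may itself be relative to the page URL.
        base_url = urljoin(page_url, finder.base_href)
    return urljoin(base_url, link)
```

With a page at http://example.com/a/b.html that declares a base href of /static/, resolve_link should turn img/logo.png into a URL under /static/ rather than under /a/.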

被撕碎了的回忆

2020-12-30 23:50

Very simple (Python 3 shown; on Python 2 the import is "from urlparse import urljoin"):

>>> from urllib.parse import urljoin
>>> urljoin('http://mysite.com/foo/bar/x.html', '../../images/img.png')
'http://mysite.com/images/img.png'
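
The same call also handles the other kinds of link you are likely to find on a page. A small sketch, where the link values and the cdn.example.com and other.example hosts are made up for illustration:

```python
from urllib.parse import urljoin

base = "http://mysite.com/foo/bar/x.html"

# Relative to the directory of the base document.
same_dir = urljoin(base, "y.html")
# Root-relative: keeps scheme and host, replaces the whole path.
root_relative = urljoin(base, "/images/img.png")
# Scheme-relative: keeps only the scheme of the base.
scheme_relative = urljoin(base, "//cdn.example.com/img.png")
# Already absolute: returned unchanged.
already_absolute = urljoin(base, "https://other.example/page")

for url in (same_dir, root_relative, scheme_relative, already_absolute):
    print(url)
```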
