Question
I cannot figure out how to get the full address of a link on a web page: for example, I get "/wiki/Main_Page" instead of "https://en.wikipedia.org/wiki/Main_Page". I cannot simply prepend the page URL to the link, as that would give "https://en.wikipedia.org/wiki/WKIK/wiki/Main_Page", which is incorrect. My goal is to make it work for any website, so I am looking for a general solution.
Here is the code:
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/WKIK"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
# Print the raw href of every anchor that has one
for link in soup.find_all('a', href=True):
    print("Found the URL:", link['href'])
Here is part of what it returns:
>Found the URL: /wiki/WKIK_(AM)
>Found the URL: /wiki/WKIK-FM
>Found the URL: /wiki/File:Disambig_gray.svg
>Found the URL: /wiki/Help:Disambiguation
>Found the URL: //en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/WKIK&namespace=0
Answer 1:
When you read the href attribute of an a element, you will very often get a relative link like /wiki/Main_Page.
On this site the base URL is always the same, 'https://en.wikipedia.org', so what you need to do is prepend it:
from bs4 import BeautifulSoup
import requests

base_url = 'https://en.wikipedia.org'
search_url = 'https://en.wikipedia.org/wiki/WKIK'
r = requests.get(search_url)
data = r.content
soup = BeautifulSoup(data, 'html.parser')
for link in soup.find_all('a', href=True):
    href = link['href']
    # Skip empty links and bare '#' anchors
    if href != '#' and href.strip() != '':
        final_url = base_url + href
        print("Found the URL:", final_url)
Answer 2:
Maybe something like this will suit you:
# Assumes `soup` was built as in the question's code
for link in soup.find_all('a', href=True):
    href = link['href']
    if href.startswith('//'):
        # Protocol-relative link: only the scheme is missing
        print("Found the URL:", 'https:' + href)
    elif href.startswith('/'):
        # Root-relative link: prepend the site's base URL
        print("Found the URL:", 'https://en.wikipedia.org' + href)
    else:
        print("Found the URL:", href)
Answer 3:
The other answers here may run into issues with certain relative URLs, such as ones that include periods (../page).
Python's requests library exposes a function called urljoin (re-exported from the standard library's urllib.parse) that returns the full URL:
requests.compat.urljoin(currentPage, link)
So if you're on https://en.wikipedia.org/wiki/WKIK and there's a link on the page with an href of /wiki/Main_Page, that function would return https://en.wikipedia.org/wiki/Main_Page.
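As a rough illustration of how urljoin handles the kinds of href values shown in the question, here is a minimal sketch that uses the standard library's urllib.parse.urljoin directly, so it needs no network access. The helper name resolve_links and the example.com link are made up for the example; in real code the hrefs would come from soup.find_all('a', href=True).

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on.

    `resolve_links` is a hypothetical helper name used for illustration.
    """
    return [urljoin(page_url, href) for href in hrefs]


page_url = "https://en.wikipedia.org/wiki/WKIK"
hrefs = [
    "/wiki/Main_Page",  # root-relative
    "//en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/WKIK&namespace=0",  # protocol-relative
    "../page",  # relative path with dots
    "https://example.com/x",  # already absolute (made-up URL)
]

for full_url in resolve_links(page_url, hrefs):
    print("Found the URL:", full_url)
```

Because the join is done against the page's own URL rather than a hard-coded base, the same loop should work for other sites as well, which is what the question asks for.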
Source: https://stackoverflow.com/questions/44746021/how-to-get-full-web-address-with-beautifulsoup