I have a URL, say http://www.bbc.com/news/world/asia/. From just this page I want to extract all the links that contain India or INDIA or india (the match should be case insensitive).
You need to search for the word india in the displayed text. To do this you'll need a custom function instead:
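from bs4 import BeautifulSoup
import requests

url = "http://www.bbc.com/news/world/asia/"
r = requests.get(url)
# Pass the raw bytes and name a parser explicitly; BeautifulSoup handles decoding.
soup = BeautifulSoup(r.content, "html.parser")

# Match <a> tags that have an href and contain "india" (any case) in their text.
india_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                           'href' in tag.attrs and
                           'india' in tag.get_text().lower())
results = soup.find_all(india_links)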
The india_links lambda finds all <a> tags that have an href attribute and contain india (case insensitive) somewhere in the displayed text.
Note that I used the requests response object's .content attribute, which holds the undecoded bytes; leave decoding to BeautifulSoup!
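To try the filter without hitting the network, here is a minimal sketch that runs the same test against a static snippet. The SAMPLE_HTML markup and the is_india_link name are made up for illustration and are not part of the BBC page or the code above; the named function is just a more readable equivalent of the lambda.

```python
from bs4 import BeautifulSoup

# Hypothetical static markup standing in for the downloaded page.
SAMPLE_HTML = b"""
<ul>
  <li><a href="/news/india-1">INDIA votes</a></li>
  <li><a href="/news/china-1">China trade</a></li>
  <li><a href="/news/india-2"><span>Floods hit india</span></a></li>
  <li><a name="india">India anchor without href</a></li>
</ul>
"""


def is_india_link(tag):
    """Return True for <a> tags with an href whose visible text mentions india."""
    return (
        tag.name == "a"
        and tag.has_attr("href")
        and "india" in tag.get_text().lower()
    )


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
matches = [tag["href"] for tag in soup.find_all(is_india_link)]
print(matches)
```

The anchor without an href and the China link are skipped, while the link whose text sits inside a nested <span> is still found because get_text() collects text from all descendants.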
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> url = "http://www.bbc.com/news/world/asia/"
>>> r = requests.get(url)
>>> soup = BeautifulSoup(r.content, "html.parser")
>>> india_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'href' in tag.attrs and 'india' in tag.get_text().lower()
>>> results = soup.find_all(india_links)
>>> from pprint import pprint
>>> pprint(results)
[<a href="/news/world/asia/india/">India</a>,
<a class="story" href="/news/world-asia-india-30647504" rel="published-1420102077277">India scheme to monitor toilet use </a>,
<a class="story" href="/news/world-asia-india-30640444" rel="published-1420022868334">India to scrap tax breaks on cars</a>,
<a class="story" href="/news/world-asia-india-30640436" rel="published-1420012598505">India shock over Dhoni retirement</a>,
<a href="/news/world/asia/india/">India</a>,
<a class="headline-anchor" href="/news/world-asia-india-30630274" rel="published-1419931669523"><img alt="A Delhi police officer with red flag walks amidst morning fog in Delhi, India, Monday, Dec 29, 2014. " src="http://news.bbcimg.co.uk/media/images/79979000/jpg/_79979280_79979240.jpg"/><span class="headline heading-13">India fog continues to cause chaos</span></a>,
<a class="headline-anchor" href="/news/world-asia-india-30632852" rel="published-1419940599384"><span class="headline heading-13">Court boost to India BJP chief</span></a>,
<a class="headline-anchor" href="/sport/0/cricket/30632182" rel="published-1419930930045"><span class="headline heading-13">India captain Dhoni quits Tests</span></a>,
<a class="story" href="http://www.bbc.co.uk/news/world-radio-and-tv-15386555" rel="published-1392018507550"><img alt="A woman riding a scooter waits for a traffic signal along a street in Mumbai February 5, 2014." src="http://news.bbcimg.co.uk/media/images/72866000/jpg/_72866856_020889093.jpg"/>Special report: India Direct</a>,
<a href="/2/hi/south_asia/country_profiles/1154019.stm">India</a>]
Note the http://www.bbc.co.uk/news/world-radio-and-tv-15386555 link here. We had to use the lambda search because a search with a text regular expression would not have found that element: the contained text (Special report: India Direct) is not the only child of the tag (an <img> precedes it), so a text search would not match the link.
A similar problem applies to the /news/world-asia-india-30632852 link; because of the nested <span> element, the Court boost to India BJP chief headline text is not a direct child of the link tag.
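The mixed-content case can be reproduced on a small scale. This is a rough sketch on invented markup modelled on the Special report link; it assumes a Beautiful Soup version that accepts the string argument (the newer name for text), and the /plain and /mixed hrefs are placeholders.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: one text-only link, one link with an image plus text.
SAMPLE_HTML = (
    '<a href="/plain">India</a>'
    '<a href="/mixed"><img alt="" src="x.jpg"/>Special report: India Direct</a>'
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
pattern = re.compile("india", re.IGNORECASE)

# string= is compared against tag.string, which is None when a tag
# has more than one child, so the image-plus-text link is missed.
by_string = [a["href"] for a in soup.find_all("a", string=pattern)]

# A function filter can use get_text(), which joins all nested text.
by_function = [
    a["href"]
    for a in soup.find_all(lambda t: t.name == "a" and pattern.search(t.get_text()))
]

print(by_string)
print(by_function)
```

Only the function-based search returns both links, which is why the answer filters on get_text() rather than on a text pattern.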
You can extract just the links with:
from urllib.parse import urljoin
result_links = [urljoin(url, tag['href']) for tag in results]
where all relative URLs are resolved relative to the original URL:
>>> from urllib.parse import urljoin
>>> result_links = [urljoin(url, tag['href']) for tag in results]
>>> pprint(result_links)
['http://www.bbc.com/news/world/asia/india/',
'http://www.bbc.com/news/world-asia-india-30647504',
'http://www.bbc.com/news/world-asia-india-30640444',
'http://www.bbc.com/news/world-asia-india-30640436',
'http://www.bbc.com/news/world/asia/india/',
'http://www.bbc.com/news/world-asia-india-30630274',
'http://www.bbc.com/news/world-asia-india-30632852',
'http://www.bbc.com/sport/0/cricket/30632182',
'http://www.bbc.co.uk/news/world-radio-and-tv-15386555',
'http://www.bbc.com/2/hi/south_asia/country_profiles/1154019.stm']
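For reference, this small sketch shows how urljoin treats the kinds of href seen above. The first two hrefs come from the demo output; the third, india/, is a hypothetical path-relative link added only to show that case.

```python
from urllib.parse import urljoin

base = "http://www.bbc.com/news/world/asia/"

# Root-relative path: replaces the entire path of the base URL.
root_relative = urljoin(base, "/news/world-asia-india-30647504")

# Absolute URL: returned unchanged, the base is ignored.
absolute = urljoin(base, "http://www.bbc.co.uk/news/world-radio-and-tv-15386555")

# Path-relative link (hypothetical): resolved against the base directory.
path_relative = urljoin(base, "india/")

print(root_relative)
print(absolute)
print(path_relative)
```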