How to extract HTML links containing a matching word from a website using Python

野趣味 2021-01-14 06:29

I have a URL, say http://www.bbc.com/news/world/asia/. From just this page I want to extract all the links that contain India or INDIA or india (the match should be case insensitive).

1 answer
  • 2021-01-14 06:49

    You need to search for the word india in each link's displayed text. To do this you'll need a custom filter function instead of a plain text or regular-expression search:

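    from bs4 import BeautifulSoup
    import requests
    
    url = "http://www.bbc.com/news/world/asia/"
    r = requests.get(url)
    # Name the parser explicitly; pass bytes so BeautifulSoup handles decoding
    soup = BeautifulSoup(r.content, "html.parser")
    
    # Match <a> tags that have an href and whose full text contains "india"
    india_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                               'href' in tag.attrs and
                               'india' in tag.get_text().lower())
    results = soup.find_all(india_links)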
    

    The india_links lambda finds all tags that are <a> links with an href attribute and contain india (case insensitive) somewhere in the displayed text.
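
    The same filter can also be written as a named function and tried offline. This is only a minimal sketch: SAMPLE_HTML and is_india_link are made-up names for illustration, and the markup is invented rather than taken from the BBC page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page
SAMPLE_HTML = """
<a href="/news/india-1">INDIA election results</a>
<a href="/news/china-1">China trade figures</a>
<a name="india-anchor">India (no href)</a>
<a href="/news/india-2"><span>Court boost to India BJP chief</span></a>
"""


def is_india_link(tag):
    """Return True for <a> tags with an href whose text mentions india."""
    return (
        tag.name == "a"
        and tag.has_attr("href")
        and "india" in tag.get_text().lower()
    )


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
matches = soup.find_all(is_india_link)
hrefs = [tag["href"] for tag in matches]
print(hrefs)
```

    The anchor without an href and the link that never mentions india are both skipped, while the upper-case and nested-span links are kept.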

    Note that I passed the requests response object's .content attribute (the raw bytes) rather than .text; leave decoding to BeautifulSoup!
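
    To see what leaving decoding to BeautifulSoup means, here is a small sketch with a hand-made byte string. The markup is hypothetical and assumes the page declares its encoding in a <meta> tag, which BeautifulSoup can pick up when it is given bytes.

```python
from bs4 import BeautifulSoup

# Hypothetical Latin-1 encoded page; \xe9 is "é" in ISO-8859-1
raw = (b'<html><head><meta charset="iso-8859-1"></head>'
       b'<body><p>caf\xe9</p></body></html>')

soup = BeautifulSoup(raw, "html.parser")
text = soup.p.get_text()
print(text)                    # decoded text of the paragraph
print(soup.original_encoding)  # the encoding BeautifulSoup settled on
```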

    Demo:

    >>> from bs4 import BeautifulSoup
    >>> import requests
    >>> url = "http://www.bbc.com/news/world/asia/"
    >>> r = requests.get(url)
    >>> soup = BeautifulSoup(r.content, "html.parser")
    >>> india_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'href' in tag.attrs and 'india' in tag.get_text().lower()
    >>> results = soup.find_all(india_links)
    >>> from pprint import pprint
    >>> pprint(results)
    [<a href="/news/world/asia/india/">India</a>,
     <a class="story" href="/news/world-asia-india-30647504" rel="published-1420102077277">India scheme to monitor toilet use </a>,
     <a class="story" href="/news/world-asia-india-30640444" rel="published-1420022868334">India to scrap tax breaks on cars</a>,
     <a class="story" href="/news/world-asia-india-30640436" rel="published-1420012598505">India shock over Dhoni retirement</a>,
     <a href="/news/world/asia/india/">India</a>,
     <a class="headline-anchor" href="/news/world-asia-india-30630274" rel="published-1419931669523"><img alt="A Delhi police officer with red flag walks amidst morning fog in Delhi, India, Monday, Dec 29, 2014. " src="http://news.bbcimg.co.uk/media/images/79979000/jpg/_79979280_79979240.jpg"/><span class="headline heading-13">India fog continues to cause chaos</span></a>,
     <a class="headline-anchor" href="/news/world-asia-india-30632852" rel="published-1419940599384"><span class="headline heading-13">Court boost to India BJP chief</span></a>,
     <a class="headline-anchor" href="/sport/0/cricket/30632182" rel="published-1419930930045"><span class="headline heading-13">India captain Dhoni quits Tests</span></a>,
     <a class="story" href="http://www.bbc.co.uk/news/world-radio-and-tv-15386555" rel="published-1392018507550"><img alt="A woman riding a scooter waits for a traffic signal along a street in Mumbai February 5, 2014." src="http://news.bbcimg.co.uk/media/images/72866000/jpg/_72866856_020889093.jpg"/>Special report: India Direct</a>,
     <a href="/2/hi/south_asia/country_profiles/1154019.stm">India</a>]
    

    Note the http://www.bbc.co.uk/news/world-radio-and-tv-15386555 link here. We had to use the lambda search because a search with a text regular expression would not have found that element: the contained text (Special report: India Direct) is not the only child of the tag (an <img> element precedes it), so a text search does not match it.

    A similar problem applies to the /news/world-asia-india-30632852 link; because of the nested <span> element, the Court boost to India BJP chief headline text is not a direct child of the link tag.
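
    The mixed-content case can be reproduced on a tiny document. This is a sketch under assumptions: the markup is invented, and it uses the string= keyword of find_all, which older BeautifulSoup releases spell text=.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: one plain link, one link mixing an image with text
SAMPLE_HTML = """
<a href="/plain">India</a>
<a href="/mixed"><img alt="street scene" src="pic.jpg"/>Special report: India Direct</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
pattern = re.compile("india", re.IGNORECASE)

# String filter: compares against tag.string, which is None when the tag
# has more than one child (here an <img> plus a text node)
by_string = [a["href"] for a in soup.find_all("a", string=pattern)]

# Function filter: get_text() joins every text node under the tag
by_function = [
    a["href"]
    for a in soup.find_all(
        lambda t: t.name == "a" and pattern.search(t.get_text()) is not None
    )
]

print(by_string)
print(by_function)
```

    Only the function filter returns the /mixed link, because that link's text sits next to an image instead of being the tag's single child.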

    You can extract just the links with:

    from urllib.parse import urljoin
    
    result_links = [urljoin(url, tag['href']) for tag in results]
    

    where all relative URLs are resolved relative to the original URL:

    >>> from urllib.parse import urljoin
    >>> result_links = [urljoin(url, tag['href']) for tag in results]
    >>> pprint(result_links)
    ['http://www.bbc.com/news/world/asia/india/',
     'http://www.bbc.com/news/world-asia-india-30647504',
     'http://www.bbc.com/news/world-asia-india-30640444',
     'http://www.bbc.com/news/world-asia-india-30640436',
     'http://www.bbc.com/news/world/asia/india/',
     'http://www.bbc.com/news/world-asia-india-30630274',
     'http://www.bbc.com/news/world-asia-india-30632852',
     'http://www.bbc.com/sport/0/cricket/30632182',
     'http://www.bbc.co.uk/news/world-radio-and-tv-15386555',
     'http://www.bbc.com/2/hi/south_asia/country_profiles/1154019.stm']
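
    The way urljoin treats each kind of href can be checked without fetching anything. The sketch below uses made-up href values, and the order-preserving de-duplication step is an optional extra, not part of the answer above; it may be useful because the same India index URL appears twice in the listing.

```python
from urllib.parse import urljoin

base = "http://www.bbc.com/news/world/asia/"

# Hypothetical href values covering the cases seen in the results
hrefs = [
    "/news/world/asia/india/",                                # root-relative
    "india/",                                                 # relative to the base path
    "http://www.bbc.co.uk/news/world-radio-and-tv-15386555",  # already absolute
]

absolute = [urljoin(base, href) for href in hrefs]

# dict.fromkeys keeps the first occurrence of each URL and preserves order
unique = list(dict.fromkeys(absolute))
print(unique)
```

    Absolute URLs pass through unchanged, while root-relative and path-relative hrefs are resolved against the base URL.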
    