Extracting contents from specific meta tags that are not closed using BeautifulSoup

孤街浪徒 2020-12-28 09:34

I'm trying to parse out content from specific meta tags. Here's the structure of the meta tags. The first two are closed with a trailing slash, but the rest don't have any clo

6 answers
  • 2020-12-28 10:05

    Edited: Added regex for case sensitivity as suggested by @Albert Chen.

    Python 3 Edit:

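    from bs4 import BeautifulSoup
    import re
    import urllib.request

    page3 = urllib.request.urlopen("https://angel.co/uber").read()
    # Name the parser explicitly so the result does not depend on what is installed.
    soup3 = BeautifulSoup(page3, "html.parser")

    # Match the name attribute case-insensitively ("description", "Description", ...).
    desc = soup3.find_all(attrs={"name": re.compile(r"description", re.I)})
    print(desc[0]['content'])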
    

    The original Python 2 version, although I'm not sure it will work for every page:

    from bs4 import BeautifulSoup
    import re
    import urllib
    
    page3 = urllib.urlopen("https://angel.co/uber").read()
    soup3 = BeautifulSoup(page3)
    
    desc = soup3.findAll(attrs={"name": re.compile(r"description", re.I)}) 
    print(desc[0]['content'].encode('utf-8'))
    

    Yields:

    Learn about Uber's product, founders, investors and team. Everyone's Private
    Driver - Request a car from any mobile phone—text message, iPhone and Android
    apps. Within minutes, a professional driver in a sleek black car will arrive
    curbside. Automatically charged to your credit card on file, tip included.
    
  • 2020-12-28 10:06

    As suggested by ingo, you could use a less strict parser such as html5lib.

    soup3 = BeautifulSoup(page3, 'html5lib')
    

    Make sure the html5lib parser is available on the system first (for example via pip install html5lib, or the python-html5lib system package).

  • Try the following (based on a blog post):

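    from bs4 import BeautifulSoup
    ...
    desc = ""
    for meta in soup.find_all("meta"):
        # Attribute values keep their original case, so normalise before comparing.
        metaname = meta.get('name', '').lower()
        metaprop = meta.get('property', '').lower()
        # "in" also matches a property that starts with "description";
        # the original find(...) > 0 test skipped a match at index 0.
        if metaname == 'description' or 'description' in metaprop:
            desc = meta.get('content', '').strip()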
    

    Tested against the following variants:

    • <meta name="description" content="blah blah" />
    • <meta id="MetaDescription" name="DESCRIPTION" content="blah blah" />
    • <meta property="og:description" content="blah blah" />

    Used BeautifulSoup version 4.4.1
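
For comparison, the same three variants can be handled with the standard library alone, which may help when no third-party parser can be installed. This is only a sketch that swaps BeautifulSoup for html.parser.HTMLParser; the class name MetaDescriptionParser and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Collect the content of description-like meta tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.descriptions = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not attribute values.
        if tag != "meta":
            return
        attr_map = {key: (value or "") for key, value in attrs}
        name = attr_map.get("name", "").lower()
        prop = attr_map.get("property", "").lower()
        if name == "description" or "description" in prop:
            self.descriptions.append(attr_map.get("content", "").strip())


# Made-up sample covering the three variants, plus one tag that should be ignored.
SAMPLE_HTML = """
<head>
<meta name="description" content="plain" />
<meta id="MetaDescription" name="DESCRIPTION" content="upper case">
<meta property="og:description" content=" open graph ">
<meta name="keywords" content="ignored">
</head>
"""

parser = MetaDescriptionParser()
parser.feed(SAMPLE_HTML)
descriptions = parser.descriptions
print(descriptions)
```

Self-closed tags such as the first one go through handle_startendtag, whose default implementation calls handle_starttag, so both closed and unclosed meta tags end up in the same handler.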

  • 2020-12-28 10:14
    soup3 = BeautifulSoup(page3, 'html5lib')
    

    XHTML requires the meta tag to be closed properly; HTML5 does not. The html5lib parser is more permissive.
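
A small check along these lines can show whether a given parser copes with unclosed meta tags. The sketch below uses made-up sample markup and the built-in "html.parser" so that it runs without extra packages; "html5lib" can be passed instead if it is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the first tag is self-closed, the other two are not.
SAMPLE_HTML = """
<html><head>
<meta name="keywords" content="cars, drivers" />
<meta name="description" content="Request a car from any mobile phone.">
<meta property="og:title" content="Uber">
</head><body><p>Hello</p></body></html>
"""

# "html.parser" ships with Python; swap in "html5lib" or "lxml" if available.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every meta tag should be found, whether or not it was closed.
metas = soup.find_all("meta")
contents = [meta.get("content") for meta in metas]
print(contents)

description = soup.find("meta", attrs={"name": "description"})
print(description["content"])
```

If one parser fails to return all the tags on a real page, running the same check with another parser is a quick way to narrow the problem down.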

  • 2020-12-28 10:19

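    I think using a regular expression is better here. For example:

    import re
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get('url')  # 'url' is a placeholder for the page address
    soup = BeautifulSoup(resp.text, 'html.parser')
    # re.I makes the match on the name attribute case-insensitive.
    desc = soup.find_all(attrs={"name": re.compile(r'Description', re.I)})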
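
One thing to keep in mind with this approach: BeautifulSoup applies a compiled pattern with its search() method, so an unanchored pattern also matches names that merely contain the word. The sketch below, using made-up markup, compares the unanchored pattern with an anchored one.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup with one exact name and one name that only contains the word.
SAMPLE_HTML = """
<head>
<meta name="Description" content="Main page description">
<meta name="twitter:description" content="Twitter card description">
</head>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Unanchored: matches any name attribute containing "description".
loose = soup.find_all("meta", attrs={"name": re.compile(r"description", re.I)})
# Anchored: matches only a name that is exactly "description", in any case.
strict = soup.find_all("meta", attrs={"name": re.compile(r"^description$", re.I)})

loose_names = [tag["name"] for tag in loose]
strict_names = [tag["name"] for tag in strict]
print(loose_names)
print(strict_names)
```

Which of the two is preferable depends on whether tags like twitter:description should count as a description for the page.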
    
  • 2020-12-28 10:29

    The value of the name attribute is matched case-sensitively, so we need to look for both 'Description' and 'description'.

    Case 1: 'Description' on Flipkart.com

    Case 2: 'description' on Snapdeal.com

    from bs4 import BeautifulSoup
    import requests

    url = 'https://www.flipkart.com'
    page3 = requests.get(url)
    soup3 = BeautifulSoup(page3.text, 'html.parser')
    # Try the capitalised spelling first, then fall back to lowercase.
    desc = soup3.find(attrs={'name': 'Description'})
    if desc is None:
        desc = soup3.find(attrs={'name': 'description'})
    try:
        print(desc['content'])
    except Exception as e:
        # Raised when no tag was found (desc is None) or it has no content attribute.
        print('%s (%s)' % (e, type(e)))
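
The two lookups can also be folded into one by passing a function as the attribute filter, which covers other spellings such as 'DESCRIPTION' as well. This is a minimal sketch; the helper names is_description and find_description are made up for illustration.

```python
from bs4 import BeautifulSoup


def is_description(value):
    """Return True when a name attribute equals 'description', ignoring case."""
    # The filter may be called with None for tags that have no name attribute.
    return value is not None and value.lower() == "description"


def find_description(html):
    """Return the content of the first description meta tag, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # attrs= is needed because the keyword "name" refers to the tag name in find().
    tag = soup.find("meta", attrs={"name": is_description})
    if tag is None:
        return None
    return tag.get("content")


print(find_description('<meta name="Description" content="Flipkart style">'))
print(find_description('<meta name="description" content="Snapdeal style">'))
print(find_description('<meta name="keywords" content="no description here">'))
```

Returning None instead of raising keeps the missing-tag case explicit for the caller, so no try/except is needed around the lookup.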
    