Extracting contents from specific meta tags that are not closed using BeautifulSoup

前端未结


I'm trying to parse out content from specific meta tags. Here's the structure of the meta tags. The first two are closed with a forward slash, but the rest don't have any clo…


6 Answers

轻奢々

2020-12-28 10:05

Edited: Added regex for case sensitivity as suggested by @Albert Chen.

Python 3 Edit:

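```
from bs4 import BeautifulSoup
import re
import urllib.request

page3 = urllib.request.urlopen("https://angel.co/uber").read()
# Name the parser explicitly so the result does not depend on what is installed.
soup3 = BeautifulSoup(page3, "html.parser")

# Match the name attribute case-insensitively ("description", "Description", ...).
desc = soup3.find_all(attrs={"name": re.compile(r"description", re.I)})
print(desc[0]['content'])
```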

The original Python 2 version, although I'm not sure it will work for every page:

```
# Python 2 only: urllib.urlopen moved to urllib.request.urlopen in Python 3.
from bs4 import BeautifulSoup
import re
import urllib

page3 = urllib.urlopen("https://angel.co/uber").read()
soup3 = BeautifulSoup(page3, "html.parser")

desc = soup3.findAll(attrs={"name": re.compile(r"description", re.I)})
print(desc[0]['content'].encode('utf-8'))
```

Yields:

```
Learn about Uber's product, founders, investors and team. Everyone's Private Driver - Request a car from any mobile phone—text message, iPhone and Android apps. Within minutes, a professional driver in a sleek black car will arrive curbside. Automatically charged to your credit card on file, tip included.
```


刺人心

2020-12-28 10:06
As suggested by ingo, you could use a less strict parser such as html5lib.
```
soup3 = BeautifulSoup(page3, 'html5lib')
```
But be sure the html5lib parser is available on the system (for example, the python-html5lib package, or pip install html5lib).
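One way to cope with a system where html5lib may be missing is to fall back to the parser that ships with Python. This is a minimal sketch, assuming bs4 raises `FeatureNotFound` when the requested parser is not installed; the helper name `make_soup` and the sample markup are illustrative only.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with html5lib when it is installed, otherwise use html.parser."""
    try:
        return BeautifulSoup(html, "html5lib")
    except FeatureNotFound:
        # html5lib is not installed; the built-in parser needs no extra package.
        return BeautifulSoup(html, "html.parser")


# Hypothetical markup with an unclosed meta tag.
html = ('<html><head><meta name="description" content="blah blah">'
        '<title>t</title></head><body></body></html>')
soup = make_soup(html)
print(soup.find("meta", attrs={"name": "description"})["content"])
```

Either parser should find the tag in this small sample; the fallback only matters on pages where the two parsers build different trees.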

不要未来只要你来

2020-12-28 10:13
Try this (based on this blog post):
```
from bs4 import BeautifulSoup
...
desc = ""
for meta in soup.find_all("meta"):
    metaname = meta.get('name', '').lower()
    metaprop = meta.get('property', '').lower()
    # Match name="description" exactly, or any property that contains
    # "description" (e.g. og:description).
    if metaname == 'description' or 'description' in metaprop:
        desc = meta.get('content', '').strip()
```
Tested against the following variants:
- <meta name="description" content="blah blah" />
- <meta id="MetaDescription" name="DESCRIPTION" content="blah blah" />
- <meta property="og:description" content="blah blah" />
Tested with BeautifulSoup version 4.4.1.
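To check the matching logic without fetching a live page, the same kind of loop can be run over inline HTML that contains the variants listed above. This is a rough sketch rather than the answer author's code: the helper name `find_descriptions` and the sample markup are made up for illustration, the meta tags are deliberately left unclosed, and the built-in `html.parser` backend is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; none of the meta tags is closed.
SAMPLE_HTML = """
<html><head>
<meta name="description" content="plain">
<meta id="MetaDescription" name="DESCRIPTION" content="upper">
<meta property="og:description" content="open graph">
<meta name="keywords" content="ignored">
<title>Sample</title>
</head><body></body></html>
"""


def find_descriptions(html):
    """Return the content of every description-like meta tag, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for meta in soup.find_all("meta"):
        name = meta.get("name", "").lower()
        prop = meta.get("property", "").lower()
        # Exact match on name, substring match on property (og:description).
        if name == "description" or "description" in prop:
            found.append(meta.get("content", "").strip())
    return found


print(find_descriptions(SAMPLE_HTML))
```

Collecting every match instead of keeping only the last one makes it easier to see which variants a given page actually uses.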

感动是毒

2020-12-28 10:14
```
soup3 = BeautifulSoup(page3, 'html5lib')
```
XHTML requires the meta tag to be closed properly; HTML5 does not. The html5lib parser is more "permissive".
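If installing html5lib is not an option, the point that closing a meta tag is optional can also be illustrated with the standard library alone. The sketch below swaps BeautifulSoup for `html.parser.HTMLParser`, an event-based parser that reports each meta start tag as it is encountered; the class name `MetaCollector`, the helper `meta_content`, and the sample markup are illustrative assumptions, not part of any answer here.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the attributes of every <meta> tag, closed or not."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags also arrive here, because the default
        # handle_startendtag calls handle_starttag first.
        if tag == "meta":
            self.metas.append(dict(attrs))


def meta_content(html, name):
    """Return the content of the first meta tag whose name matches, ignoring case."""
    parser = MetaCollector()
    parser.feed(html)
    for attrs in parser.metas:
        if (attrs.get("name") or "").lower() == name.lower():
            return attrs.get("content")
    return None


# Hypothetical markup: the first tag is closed, the others are not.
SAMPLE = ('<head><meta name="title" content="closed" />'
          '<meta name="Description" content="not closed">'
          '<meta name="keywords" content="a, b"></head>')

print(meta_content(SAMPLE, "description"))
```

This avoids the tree-building step entirely, so it cannot be affected by how a parser nests unclosed tags, but it also gives up BeautifulSoup's search API.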


你的背包

2020-12-28 10:19

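I think a regular expression works better here. For example:

```
import re
import requests
from bs4 import BeautifulSoup
resp = requests.get('url')  # 'url' is a placeholder for the page address
soup = BeautifulSoup(resp.text, 'html.parser')
desc = soup.find_all(attrs={"name": re.compile(r'Description', re.I)})
```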


时光说笑

2020-12-28 10:29

The name attribute value is matched case-sensitively, so we need to look for both 'Description' and 'description'.

Case1: 'Description' in Flipkart.com

Case2: 'description' in Snapdeal.com

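```
from bs4 import BeautifulSoup
import requests

url = 'https://www.flipkart.com'
page3 = requests.get(url)
soup3 = BeautifulSoup(page3.text, 'html.parser')
# Try the capitalized spelling first, then fall back to lowercase.
desc = soup3.find(attrs={'name': 'Description'})
if desc is None:
    desc = soup3.find(attrs={'name': 'description'})
try:
    print(desc['content'])
except (TypeError, KeyError) as e:
    # TypeError: no tag matched (desc is None); KeyError: no content attribute.
    print('%s (%s)' % (e, type(e)))
```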
