Using BeautifulSoup with html5lib, it puts the html, head and body tags in automatically:
BeautifulSoup('<h1>FOO</h1>', 'html5lib')  # => <html><head></head><body><h1>FOO</h1></body></html>
If you want it to look better, try this:
BeautifulSoup([contents you want to analyze]).prettify()
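For example, a minimal sketch (the sample markup here is only for illustration):
from bs4 import BeautifulSoup

soup = BeautifulSoup('<h1>FOO</h1>', 'html5lib')
print(soup.prettify())
# <html>
#  <head>
#  </head>
#  <body>
#   <h1>
#    FOO
#   </h1>
#  </body>
# </html>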
Your only option is to not use html5lib to parse the data. That's a feature of the html5lib library: it fixes HTML that is lacking, such as adding back in missing required elements.
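A minimal sketch of that difference, comparing html5lib with Python's built-in parser (the markup is just an example):
from bs4 import BeautifulSoup

print(BeautifulSoup('<h1>FOO</h1>', 'html5lib'))     # <html><head></head><body><h1>FOO</h1></body></html>
print(BeautifulSoup('<h1>FOO</h1>', 'html.parser'))  # <h1>FOO</h1>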
This aspect of BeautifulSoup has always annoyed the hell out of me.
Here's how I deal with it:
from bs4 import BeautifulSoup

# Parse the initial html-formatted string
soup = BeautifulSoup(html, 'lxml')
# Do stuff here
# Extract a string repr of the parsed html object, without the <html> or <body> tags
html = "".join([str(x) for x in soup.body.children])
A quick breakdown:
# Iterator object of all tags within the <body> tag (your html before parsing)
soup.body.children
# Turn each element into a string object, rather than a BS4.Tag object
# Note: inclusive of html tags
str(x)
# Get a List of all html nodes as string objects
[str(x) for x in soup.body.children]
# Join all the string objects together to recreate your original html
"".join()
I still don't like this, but it gets the job done. I always run into this when I use BS4 to filter certain elements and/or attributes out of HTML documents and then need the whole thing back as a string repr rather than a BS4 parsed object.
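For example, here is a minimal sketch of that filter-then-reserialize workflow (the input markup and the choice to strip <script> tags are made up for illustration):
from bs4 import BeautifulSoup

html = '<p>keep me</p><script>drop me</script><p>keep me too</p>'
soup = BeautifulSoup(html, 'lxml')

# Filter step: remove every <script> element from the parsed tree
for tag in soup.find_all('script'):
    tag.decompose()

# Re-serialize only the children of <body>, so the <html>/<body>
# wrappers that lxml added don't end up in the output string
html = "".join(str(x) for x in soup.body.children)
print(html)  # <p>keep me</p><p>keep me too</p>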
Hopefully, the next time I Google this, I'll find my answer here.
Let's first create a soup sample:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<head></head><body><p>content</p></body>", "html5lib")
You can get a child of html or body by specifying soup.body.<tag>:
# Python 3: get body's first child
print(next(soup.body.children))
# or, if the first child's tag were rss, access it by name
print(soup.body.rss)
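With the sample soup above, whose body only contains a <p> tag, the first print shows <p>content</p> and soup.body.rss is None; the name-based lookup that actually matches this sample would be:
print(soup.body.p)  # <p>content</p>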
You could also use unwrap() to remove body, head, and html:
soup.html.body.unwrap()
if soup.html.select('> head'):
    soup.html.head.unwrap()
soup.html.unwrap()
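Applied to the sample soup above (a minimal sketch, assuming it was parsed with html5lib so that html, head, and body are all present), printing before and after shows the effect:
print(soup)  # <html><head></head><body><p>content</p></body></html>
# ... run the three unwrap() calls above ...
print(soup)  # <p>content</p>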
If you are loading an XML file, bs4.diagnose(data) will tell you to use lxml-xml, which will not wrap your soup with html+body:
>>> from bs4 import BeautifulSoup as BS
>>> BS('<foo>xxx</foo>', 'lxml-xml')
<foo>xxx</foo>
In [35]: import bs4 as bs
In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
Out[36]: <h1>FOO</h1>
This parses the HTML with Python's built-in HTML parser. Quoting the docs:
Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn't even bother to add an <html> tag.
Alternatively, you could use the html5lib parser and just select the element after <body>:
In [61]: soup = bs.BeautifulSoup('<h1>FOO</h1>', 'html5lib')
In [62]: soup.body.next
Out[62]: <h1>FOO</h1>
Here is how I do it:
from bs4 import BeautifulSoup

a = BeautifulSoup(features='html.parser')
a.append(a.new_tag('section'))
# this will give you <section></section>
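A short follow-up sketch (the attribute name and text here are just an example): new_tag() also accepts attributes, and you can set a string on the new tag before serializing:
b = BeautifulSoup(features='html.parser')
section = b.new_tag('section', id='intro')
section.string = 'hello'
b.append(section)
print(b)  # <section id="intro">hello</section>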