Don't put html, head and body tags automatically, beautifulsoup


When parsing with BeautifulSoup and html5lib, the html, head and body tags are added automatically:

BeautifulSoup('FOO', 'html5lib') # => <html><head></head><body>FOO</body></html>


                      
8 Answers

孤城傲影
2020-12-03 10:13
                                                                       
If you want the output to look better, try this:


  BeautifulSoup([contents you want to analyze]).prettify()
                
孤街浪徒
2020-12-03 10:17
                                                                       
Your only option is to not use html5lib to parse the data.

This is a feature of the html5lib library: it repairs incomplete HTML, for example by adding back missing required elements.
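
To see how the choice of parser changes the result, here is a minimal sketch that parses the same fragment with each parser that happens to be installed. The `render` helper and the `MARKUP` constant are names made up for this illustration; lxml and html5lib are optional packages, so the helper returns None when one is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

MARKUP = "<h1>FOO</h1>"  # sample fragment, chosen for illustration


def render(markup, parser):
    """Return the parsed document as a string, or None if the parser is unavailable."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # Raised by bs4 when no installed tree builder matches the requested name
        return None


for parser in ("html.parser", "lxml", "html5lib"):
    print(f"{parser:12} -> {render(MARKUP, parser)}")
```

With html.parser the fragment comes back unchanged, while lxml and html5lib (if installed) wrap it in the extra document tags.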
                                                                        
                                                        
            
            
              
                
一个人的身影
2020-12-03 10:17
                                                                       
This aspect of BeautifulSoup has always annoyed the hell out of me.

Here's how I deal with it:

from bs4 import BeautifulSoup

# Parse the initial HTML-formatted string
soup = BeautifulSoup(html, 'lxml')

# Do stuff here

# Extract a string repr of the parsed HTML object, without the <html> or <body> tags
html = "".join([str(x) for x in soup.body.children])


A quick breakdown:

# Iterator over all direct children of the <body> tag (your HTML before parsing)
soup.body.children

# Turn each node into a string, rather than a bs4 Tag or NavigableString object
# Note: inclusive of HTML tags
str(x)

# Get a list of all HTML nodes as strings
[str(x) for x in soup.body.children]

# Join all the strings together to recreate your original HTML
"".join(...)


I still don't like this, but it gets the job done. I always run into this when I use BS4 to filter certain elements and/or attributes out of HTML documents and then need the whole result back as a string rather than as a BS4 parsed object.

Hopefully, the next time I Google this, I'll find my answer here.
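
The same round trip can be wrapped in a small helper. This is only a sketch: `filter_fragment` is a hypothetical name, removing script tags stands in for the "do stuff here" step, and it uses `decode_contents()` instead of joining the children by hand. It falls back to the soup itself when the parser did not create a body tag, which is the case with html.parser.

```python
from bs4 import BeautifulSoup


def filter_fragment(html, parser="html.parser"):
    """Parse a fragment, drop <script> tags, and return the result as a string."""
    soup = BeautifulSoup(html, parser)

    # Stand-in for "do stuff here": remove every <script> element
    for tag in soup.find_all("script"):
        tag.decompose()

    # lxml and html5lib create <body>; html.parser does not, so fall back to the soup
    root = soup.body if soup.body is not None else soup

    # decode_contents() renders the children only, without the enclosing tag
    return root.decode_contents()


print(filter_fragment("<p>keep me</p><script>alert(1)</script>"))
```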
                                                                        
                                                        
            
            
              
                
广开言路
2020-12-03 10:23
                                                                       
Let's first create a sample soup:

# lxml adds the <html> wrapper that the examples below rely on
soup = BeautifulSoup("<head></head><body><p>content</p></body>", "lxml")


You can reach the children of html and body directly, for example via soup.body.<tag>:

# python3: get body's first child
print(next(soup.body.children))

# if first child's tag is rss
print(soup.body.rss)


You can also use unwrap() to remove body, head, and html:

soup.html.body.unwrap()
if soup.html.head is not None:
    soup.html.head.unwrap()
soup.html.unwrap()
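
A slightly more defensive version of the unwrap approach, as a sketch: `strip_wrappers` is a hypothetical helper that skips any wrapper tag the parser did not create. The demo feeds an already wrapped document to html.parser so it runs without lxml installed. Note that anything inside head (a title, for instance) would stay in the output, since unwrap() keeps the children.

```python
from bs4 import BeautifulSoup


def strip_wrappers(soup):
    """Unwrap <body>, <head> and <html> in place if they exist, keeping their children."""
    for name in ("body", "head", "html"):
        tag = soup.find(name)
        if tag is not None:
            tag.unwrap()
    return soup


doc = BeautifulSoup(
    "<html><head></head><body><p>content</p></body></html>", "html.parser"
)
print(strip_wrappers(doc))
```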


If you load an XML file, bs4.diagnose.diagnose(data) will tell you to use lxml-xml, which does not wrap your soup in html and body tags:

>>> BS('<foo>xxx</foo>', 'lxml-xml')
<foo>xxx</foo>

                                                                        
                                                        
            
            
              
                
名媛妹妹
2020-12-03 10:28
                                                                       
In [35]: import bs4 as bs

In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
Out[36]: <h1>FOO</h1>


This parses the HTML with Python's built-in HTML parser.
Quoting the docs:


  Unlike html5lib, this parser makes no attempt to create a well-formed
  HTML document by adding a <body> tag. Unlike lxml, it doesn’t even
  bother to add an <html> tag.




Alternatively, you could use the html5lib parser and just select the element after <body>:

In [61]: soup = bs.BeautifulSoup('<h1>FOO</h1>', 'html5lib')

In [62]: soup.body.next_element
Out[62]: <h1>FOO</h1>
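
Selecting the element after body returns only the first node. If the fragment has several top-level elements, one option is to collect everything directly under body. In this sketch `body_nodes` is a hypothetical helper, and html5lib is assumed to be installed for the default parser; the demo call is guarded in case it is not.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def body_nodes(markup, parser="html5lib"):
    """Return every direct child of <body> as a list of strings."""
    soup = BeautifulSoup(markup, parser)
    # .contents is the list of direct children (tags and text nodes)
    return [str(node) for node in soup.body.contents]


try:
    print(body_nodes("<h1>FOO</h1><p>bar</p>"))
except FeatureNotFound:
    # html5lib is an optional dependency of bs4
    print("html5lib is not installed")
```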

                                                                        
                                                        
            
            
              
                
臣服心动
2020-12-03 10:31
                                                                       
Here is how I do it: start from an empty soup and append new tags to it.

from bs4 import BeautifulSoup

a = BeautifulSoup()
a.append(a.new_tag('section'))
# this will give you <section></section>
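
The same idea extends to nested content. A minimal sketch, with html.parser named explicitly (an assumption, the answer above does not pick a parser) and made-up variable names:

```python
from bs4 import BeautifulSoup

# Start from an empty document so no wrapper tags exist
fragment = BeautifulSoup("", "html.parser")

section = fragment.new_tag("section")
heading = fragment.new_tag("h1")
heading.string = "FOO"  # set the text content of the new tag

section.append(heading)
fragment.append(section)

print(fragment)
```

Because the tree is built by hand rather than parsed, nothing is ever added beyond what you append.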

                                                                        
                                                        
            
            
              
                