BeautifulSoup output to .txt file

前端未结

关注

 2  1447

I am trying to export my data as a .txt file

from bs4 import BeautifulSoup
import requests
import os

import os

os.getcwd()
\'/home/folder\'
os.mkdir(\"Prob


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  一整个雨季        
                
              
                            
                2021-01-14 12:05
              
            
            
                                                                       
I was working on a webscraping project, and this issue gave me tons of problems. I tried almost every solution out there that dealt with Python encoding (convert to UTF using string.encode(), convert to ASCII, convert using the 'unicodedata' module, use .decode() and then .encode(), blood sacrifice to Tim Peters, etc etc). 

None of the solutions worked all the time, which struck me as very un-Pythonic. 

So what I ended up using was the following:

html = bs.prettify()  #bs is your BeautifulSoup object
with open("out.txt","w") as out:
    for i in range(0, len(html)):
        try:
            out.write(html[i])
        except Exception:
            1+1


It's not perfect, but it gave me the best results. When I opened it in a browser, it was able to parse the page properly almost every time.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  囚心锁ツ        
                
              
                            
                2021-01-14 12:15
              
            
            
                                                                       
You should put Inside file.write your content. I'll probably do something like:

#!/usr/bin/python3
#

from bs4 import BeautifulSoup
import requests

url = 'http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html'
file_name=url.rsplit('/',1)[1].rsplit('.')[0]

r  = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
data = soup.find_all('article', {'class': 'article'})


content=''.join('''{}\n{}\n\n{}\n{}'''.format( item.contents[0].find_all('time', {'datetime': '2016-03-16T09:50:30+0100'})[0].text,
                                               item.contents[0].find_all('a', {'class': 'link-grey'})[0].text,
                                               item.contents[0].find_all('img', {'class': 'media-full'})[0],
                                               item.contents[1].find_all('div', {'class': 'article_textwrap'})[0].text,
                                             ) for item in data)

with open('./{}.txt'.format(file_name), mode='wt', encoding='utf-8') as file:
    file.write(content)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复