BeautifulSoup output to .txt file

情深已故 2021-01-14 11:41

I am trying to export my data as a .txt file

from bs4 import BeautifulSoup
import requests
import os

os.getcwd()
'/home/folder'
os.mkdir("Prob


        
2 Answers
  • 2021-01-14 12:05

    I was working on a webscraping project, and this issue gave me tons of problems. I tried almost every solution out there that dealt with Python encoding (convert to UTF using string.encode(), convert to ASCII, convert using the 'unicodedata' module, use .decode() and then .encode(), blood sacrifice to Tim Peters, etc etc).

    None of the solutions worked all the time, which struck me as very un-Pythonic.

    So what I ended up using was the following:

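    html = bs.prettify()  # bs is your BeautifulSoup object
    with open("out.txt", "w") as out:
        # Write one character at a time so a single bad character
        # does not abort the whole write.
        for ch in html:
            try:
                out.write(ch)
            except UnicodeEncodeError:
                # Skip characters the file's default encoding cannot represent.
                pass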
    

    It's not perfect, but it gave me the best results. When I opened it in a browser, it was able to parse the page properly almost every time.
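
An alternative to skipping characters one at a time, offered as a sketch rather than a guaranteed fix: write failures of this kind often come from the platform's default file encoding, so passing an explicit `encoding="utf-8"` to `open()` may let the whole string be written in one call. The helper name `save_text` and the sample string are hypothetical; in the real script `html` would come from `bs.prettify()`.

```python
import os
import tempfile


def save_text(path, text):
    """Write text to path as UTF-8, independent of the platform default encoding."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)


# Hypothetical stand-in for bs.prettify(); any str with non-ASCII characters will do.
html = "<p>caf\u00e9 \u2013 \u60c5\u6df1</p>\n"

# Write to the temp directory so the example does not clutter the working directory.
out_path = os.path.join(tempfile.gettempdir(), "out.txt")
save_text(out_path, html)
```

When reading the file back, pass the same `encoding="utf-8"` so the characters round-trip unchanged.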

  • 2021-01-14 12:15

    You need to pass your content to file.write(). I would probably do something like this:

    #!/usr/bin/python3
    #
    
    from bs4 import BeautifulSoup
    import requests
    
    url = 'http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html'
    # Use the last path segment of the URL, minus its extension, as the file name.
    file_name = url.rsplit('/', 1)[1].rsplit('.')[0]

    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    data = soup.find_all('article', {'class': 'article'})

    # Build one text block per article: timestamp, link text, image tag, body text.
    content = ''.join('{}\n{}\n\n{}\n{}'.format(
        item.contents[0].find_all('time', {'datetime': '2016-03-16T09:50:30+0100'})[0].text,
        item.contents[0].find_all('a', {'class': 'link-grey'})[0].text,
        item.contents[0].find_all('img', {'class': 'media-full'})[0],
        item.contents[1].find_all('div', {'class': 'article_textwrap'})[0].text,
    ) for item in data)

    # Open in text mode with an explicit UTF-8 encoding so non-ASCII text is written safely.
    with open('./{}.txt'.format(file_name), mode='wt', encoding='utf-8') as file:
        file.write(content)
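
The file name above is derived by splitting the URL string by hand. As a hedged sketch of the same step using only the standard library, the helper below (the name `file_name_from_url` is hypothetical, not part of the answer) parses the URL first, so a query string or fragment would not end up in the file name.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def file_name_from_url(url):
    """Return the last path segment of url without its final extension."""
    path = urlsplit(url).path  # drops scheme, host, query string and fragment
    return PurePosixPath(path).stem


url = 'http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html'
file_name = file_name_from_url(url)
print(file_name)
```

One behavioural difference to be aware of: `.stem` removes only the final extension, whereas `rsplit('.')[0]` cuts at the first dot in the segment, so the two can disagree for names that contain several dots.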
    