I am trying to export my data as a .txt file.
from bs4 import BeautifulSoup
import requests
import os
os.getcwd()
'/home/folder'
os.mkdir("Prob
I was working on a webscraping project, and this issue gave me tons of problems. I tried almost every solution out there that dealt with Python encoding (convert to UTF using string.encode(), convert to ASCII, convert using the 'unicodedata' module, use .decode() and then .encode(), blood sacrifice to Tim Peters, etc etc).
None of the solutions worked all the time, which struck me as very un-Pythonic.
So what I ended up using was the following:
html = bs.prettify()  # bs is your BeautifulSoup object
with open("out.txt", "w") as out:
    for ch in html:
        try:
            out.write(ch)
        except Exception:
            pass  # skip any character that can't be written
It's not perfect, but it gave me the best results. When I opened it in a browser, it was able to parse the page properly almost every time.
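If you're on Python 3, a simpler route (a sketch, not the method above) is to let open() handle the encoding itself: pass encoding= and an errors= policy, and the whole string can be written in one call instead of character by character. The html string here is a stand-in for your bs.prettify() output.

```python
# Stand-in for bs.prettify() output, including non-ASCII characters.
html = "caf\u00e9 <p>na\u00efve</p>"

# errors="replace" substitutes unencodable characters instead of raising,
# so the write never fails partway through the document.
with open("out.txt", "w", encoding="utf-8", errors="replace") as out:
    out.write(html)
```

With UTF-8 every Python string is encodable, so errors= rarely triggers; it matters if you write with a narrower codec such as ASCII.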
You should put your content inside file.write(). I'd probably do something like:
#!/usr/bin/python3
#
from bs4 import BeautifulSoup
import requests
url = 'http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html'
file_name=url.rsplit('/',1)[1].rsplit('.')[0]
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
data = soup.find_all('article', {'class': 'article'})
content = ''.join(
    '{}\n{}\n\n{}\n{}'.format(
        item.contents[0].find_all('time', {'datetime': '2016-03-16T09:50:30+0100'})[0].text,
        item.contents[0].find_all('a', {'class': 'link-grey'})[0].text,
        item.contents[0].find_all('img', {'class': 'media-full'})[0],
        item.contents[1].find_all('div', {'class': 'article_textwrap'})[0].text,
    )
    for item in data
)
with open('./{}.txt'.format(file_name), mode='wt', encoding='utf-8') as file:
    file.write(content)
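If all you need in the .txt file is the readable text rather than hand-picked tags, soup.get_text() is a shorter route. A sketch under that assumption — the inline HTML below is a stand-in for r.content so it runs without the network, and html.parser replaces lxml so no extra parser is needed:

```python
from bs4 import BeautifulSoup

# Inline HTML stands in for requests.get(url).content.
html = "<article class='article'><h1>Titel</h1><p>Steeds meer nekklachten.</p></article>"
soup = BeautifulSoup(html, 'html.parser')

# get_text() strips all tags; separator/strip keep the output readable.
text = soup.get_text(separator='\n', strip=True)
with open('article.txt', mode='wt', encoding='utf-8') as f:
    f.write(text)
```

This loses the structure the answer above preserves (timestamp, byline, image), so use it only when plain text is enough.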