Question
This is a follow-up to "Focusing in on specific results while scraping Twitter with Python and Beautiful Soup 4?" and "Using Python to Scrape Nested Divs and Spans in Twitter?".
I'm not using the Twitter API because it doesn't return tweets by hashtag this far back.
EDIT: The error described here only occurs on Windows 7. The code runs as intended on Linux (as reported by bernie; see the comment below), and I am able to run it without encoding errors on OS X 10.10.2.
The encoding error occurs when I try to run the tweet-scraping code in a loop.
This first snippet scrapes only the first tweet and gets everything in the <p> tags, as intended.
amessagetext = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
amessage = amessagetext[0]
However, when I attempt to use a loop to scrape all the tweets using this second snippet,
messagetexts = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
messages = [messagetext for messagetext in messagetexts]
I get this well-known cp437.py encoding error.
File "C:\Anaconda3\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 4052: character maps to <undefined>
So why is the first tweet scraped successfully, while scraping multiple tweets causes encoding problems? Is it just because the first tweet happens to include no problematic characters? I've tried scraping the first tweet on several different searches successfully, so I'm not sure if that is the cause.
How do I go about fixing this? I've read a few posts and book sections about this, and I understand why it happens, but I am not sure how to correct it within the BeautifulSoup code.
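For reference, the failure can be reproduced without BeautifulSoup at all; the problem is the console encoding rather than the parser. A minimal sketch of the mechanism, assuming a Windows console that defaults to cp437:
import sys
print(sys.stdout.encoding)  # reports 'cp437' on a default Windows 7 console
print('\u2014')             # em dash: cp437 has no mapping for it, so print() raises UnicodeEncodeError
The full list of messages only has to contain one such character (an em dash, curly quote, emoji, etc.) to fail.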
Here is the complete code for reference.
from bs4 import BeautifulSoup
import requests
import sys
import csv #Will be exporting to csv
url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get(url, headers=headers)
data = r.text.encode('utf-8')
soup = BeautifulSoup(data, "html.parser")
names = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
usernames = [name.contents[0] for name in names]
handles = soup('span', {'class': 'username js-action-profile-name'})
userhandles = [handle.contents[1].contents[0] for handle in handles]
athandles = [('@')+abhandle for abhandle in userhandles]
links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
fullurls = [('http://www.twitter.com')+permalink for permalink in urls]
timestamps = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
datetime = [timestamp["title"] for timestamp in timestamps]
messagetexts = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
messages = [messagetext for messagetext in messagetexts]
amessagetext = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
amessage = amessagetext[0]
retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
retweetcounts = [retweet.contents[3].contents[1].contents[1].string for retweet in retweets]
favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
favcounts = [favorite.contents[3].contents[1].contents[1].string for favorite in favorites]
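# NOTE: the print below is what actually raises the UnicodeEncodeError on Windows;
# messages holds every scraped tweet, so any tweet containing a character outside
# cp437 (such as the em dash '\u2014') fails when written to the console.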
print (usernames, "\n", "\n", athandles, "\n", "\n", fullurls, "\n", "\n", datetime, "\n", "\n",retweetcounts, "\n", "\n", favcounts, "\n", "\n", amessage, "\n", "\n", messages)
Answer 1:
I've solved this to my own satisfaction by eliminating the print statements I was using for error checking, and by specifying the encoding for both the HTML file being scraped and the csv output file: adding encoding="utf-8" to both with open calls.
from bs4 import BeautifulSoup
import requests
import sys
import csv
import re
from datetime import datetime
from pytz import timezone
url = input("Enter the name of the file to be scraped:")
with open(url, encoding="utf-8") as infile:
    soup = BeautifulSoup(infile, "html.parser")
#url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
#headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
#r = requests.get(url, headers=headers)
#data = r.text.encode('utf-8')
#soup = BeautifulSoup(data, "html.parser")
names = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
usernames = [name.contents for name in names]
handles = soup('span', {'class': 'username js-action-profile-name'})
userhandles = [handle.contents[1].contents[0] for handle in handles]
athandles = [('@')+abhandle for abhandle in userhandles]
links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
fullurls = [permalink for permalink in urls]
timestamps = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
datetime = [timestamp["title"] for timestamp in timestamps]
messagetexts = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
messages = [messagetext for messagetext in messagetexts]
retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
retweetcounts = [retweet.contents[3].contents[1].contents[1].string for retweet in retweets]
favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
favcounts = [favorite.contents[3].contents[1].contents[1].string for favorite in favorites]
images = soup('div', {'class': 'content'})
imagelinks = [src.contents[5].img if len(src.contents) > 5 else "No image" for src in images]
#print (usernames, "\n", "\n", athandles, "\n", "\n", fullurls, "\n", "\n", datetime, "\n", "\n",retweetcounts, "\n", "\n", favcounts, "\n", "\n", messages, "\n", "\n", imagelinks)
rows = zip(usernames,athandles,fullurls,datetime,retweetcounts,favcounts,messages,imagelinks)
rownew = list(rows)
#print (rownew)
newfile = input("Enter a filename for the table:") + ".csv"
with open(newfile, 'w', newline='', encoding='utf-8') as f:  # newline='' keeps csv from inserting blank rows on Windows
    writer = csv.writer(f, delimiter=",")
    writer.writerow(['Usernames', 'Handles', 'Urls', 'Timestamp', 'Retweets', 'Favorites', 'Message', 'Image Link'])
    for row in rownew:
        writer.writerow(row)
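If the diagnostic prints are still wanted on Windows, another option (not what I did above) is to make stdout tolerant instead of removing the prints. A minimal sketch, assuming Python 3.7+ for sys.stdout.reconfigure:
import sys

# Assumption: Python 3.7+, where TextIOWrapper.reconfigure() was added.
# Keep the console's own encoding but replace unencodable characters,
# so '\u2014' prints as '?' instead of raising UnicodeEncodeError.
sys.stdout.reconfigure(errors='replace')
print('\u2014')

# On older interpreters, the same effect is available via an environment
# variable set before launching the script:
#   set PYTHONIOENCODING=cp437:replace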
Source: https://stackoverflow.com/questions/35051723/fix-encoding-error-with-loop-in-beautifulsoup4