I'm scraping some webpages using selenium and beautifulsoup. I'm iterating through a bunch of links, grabbing info, and then dumping it into a JSON:
for e
You might need to set PYTHONIOENCODING before running your Python script in the shell. For example, I got the same error when redirecting the script's output to a log file:
$ your_python_script > output.log
'ascii' codec can't encode characters in position xxxxx-xxxxx: ordinal not in range(128)
After setting PYTHONIOENCODING to utf8 in the shell, the script ran without the ASCII codec error:
$ export PYTHONIOENCODING=utf8
$ your_python_script > output.log
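The effect of the variable can be reproduced without a shell. The following is a minimal sketch, not part of the original answer: it launches a child Python interpreter with its stdout redirected to a pipe (similar to redirecting into a log file) and compares the result for two values of PYTHONIOENCODING. The names CHILD_CODE and run_child are made up for this illustration.

```python
import os
import subprocess
import sys

# The child program prints one non-ASCII character (U+00EF); its source is pure ASCII.
CHILD_CODE = "print(u'\\u00ef')"


def run_child(io_encoding):
    """Run a child Python with PYTHONIOENCODING set and stdout redirected to a pipe."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()
    return proc.returncode, out, err


# With utf8 the character is encoded and written to the pipe.
utf8_code, utf8_out, _ = run_child("utf8")

# With ascii the child cannot encode the character and exits with an error.
ascii_code, _, ascii_err = run_child("ascii")
```

Here utf8_out holds the UTF-8 bytes of the character, while the ascii run exits with a non-zero status and a UnicodeEncodeError traceback on stderr.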
Your problem is that, in Python 2, a file object (as returned by open()) can only write str objects, not unicode objects. Passing ensure_ascii=False to json.dump() makes it attempt to write Unicode strings to the file directly as unicode objects, which will fail.
json.dump(item, writeJSON, ensure_ascii=False).encode('utf-8')
This attempted fix doesn't work because json.dump() doesn't return anything; instead, it writes content directly to the file. (If there weren't any Unicode text in item, this would still crash after json.dump() completed: json.dump() returns None, which can't have .encode() called on it.)
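The difference between the two functions can be checked directly. This is a small sketch rather than something from the original answer: item is a hypothetical stand-in for one scraped record and the file path is a temporary one. It shows that json.dump() returns None, while json.dumps() returns the serialized text, which can be encoded explicitly and written to a file opened in binary mode.

```python
import json
import os
import tempfile

item = {"name": u"na\u00efve"}  # hypothetical stand-in for one scraped record

path = os.path.join(tempfile.mkdtemp(), "example.json")

# json.dump() writes to the file object and returns None.
with open(path, "w") as handle:
    dump_result = json.dump(item, handle)

# json.dumps() returns the serialized text, so .encode() has something to act on.
encoded = json.dumps(item, ensure_ascii=False).encode("utf-8")

# Encoded bytes go to a file opened in binary mode.
with open(path, "wb") as handle:
    handle.write(encoded)
```

Encoding the result of json.dumps() is probably what the attempted fix was aiming for; it is shown here only to illustrate the return values, alongside the fixes discussed in this answer.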
There are three ways to go about fixing this:
1. Use Python 3. The unification of str and unicode in Python 3 makes your existing code work as-is; no code changes are necessary.
2. Remove ensure_ascii=False from your call to json.dump. Non-ASCII characters will be written to the file in escaped form; for instance, ï will be written as \u00ef. This is a perfectly valid way of representing Unicode characters, and most JSON libraries will handle it just fine.
3. Wrap the file object in a UTF-8 StreamWriter:
import codecs
import json

with codecs.getwriter("utf8")(open("testScrape.json", "w")) as writeJSON:
    # The StreamWriter encodes unicode objects to UTF-8 before they reach the file
    json.dump(item, writeJSON, ensure_ascii=False)
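For the second option, a quick round trip shows that the escaped form loses nothing. This is an illustrative sketch; item is a hypothetical record, not data from the question.

```python
import json

item = {"word": u"na\u00efve"}  # hypothetical record containing a non-ASCII character

# ensure_ascii defaults to True, so non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# The serialized text contains only ASCII characters.
is_ascii_only = all(ord(ch) < 128 for ch in escaped)

# Parsing the escaped text restores the original character.
restored = json.loads(escaped)
```

After this runs, escaped contains the six-character sequence \u00ef in place of ï, and restored compares equal to item.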