问题
What I want to do: extract text information from a pdf file and redirect that to a txt file.
What I did:
pip install pdfminor
pdf2txt.py file.pdf > output.txt
What I got:
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 0: illegal multibyte sequence
My observation:
\u2022
is bullet point, •
.
pdf2txt.py
works well without redirection: the bullet point character is written to stdout without any error.
My question:
Why does redirection cause a python error? As far as I know, redirection is a O.S. job, and it is simply copying things after the program is finished.
How can I fix this error? I cannot do any modification to pdf2txt.py
as it's not my code.
回答1:
Redirection causes an error because the default encoding used by Python does not support one of the characters you're trying to output. In your case you're trying to output the bullet character •
using the GBK codec. This probably means you're using a Chinese version of Windows.
A version of Python 3.6 or later will work fine outputting to the terminal window on Windows, because character encoding is bypassed completely using Unicode. It's only when redirecting the output to a file that the Unicode must be encoded to a byte stream.
You can set the environment variable PYTHONIOENCODING to change the encoding used for stdio. If you use UTF-8 it will be guaranteed to work with any Unicode character.
set PYTHONIOENCODING=utf-8
pdf2txt.py file.pdf > output.txt
回答2:
You seem to have somehow obtained unicode characters from the raw bytes but you need to encode it. I recommend you to use UTF-8 encoding for txt files.
Making the encoding parameter more explicit is probably what you want.
def gbk_to_utf8(source, target):
with open(source, "r", encoding="gbk") as src:
with open(target, "w", encoding="utf-8") as dst:
for line in src.readlines():
dst.write(line)
来源:https://stackoverflow.com/questions/59779618/unicodeencodeerror-in-python3-when-redirection-is-used