Question
I have a large project where, in various places, problematic implicit Unicode conversions (coercions) were used, e.g.:
someDynamicStr = "bar" # could come from various sources
# works
u"foo" + someDynamicStr
u"foo{}".format(someDynamicStr)
someDynamicStr = "\xff" # uh-oh
# raises UnicodeDecodeError
u"foo" + someDynamicStr
u"foo{}".format(someDynamicStr)
(Possibly other forms as well.)
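For reference, what Python 2 does implicitly in these cases can be sketched explicitly (this sketch also runs on Python 3; `py2_style_concat` is a hypothetical helper name, and `ascii` is Python 2's default encoding):

```python
def py2_style_concat(text, raw):
    # sketch of Python 2's implicit coercion: the byte string is decoded
    # with the default encoding (ascii, errors='strict') before joining
    return text + raw.decode("ascii", "strict")

print(py2_style_concat(u"foo", b"bar"))    # pure ASCII input: works
try:
    py2_style_concat(u"foo", b"\xff")      # non-ASCII byte: fails as in the question
except UnicodeDecodeError as e:
    print("UnicodeDecodeError:", e.reason)
```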
Now I would like to track down those usages, especially those in actively used code.
It would be great if I could easily replace the unicode constructor with a wrapper which checks whether the input is of type str and whether the encoding/errors parameters are set to their default values, and which then notifies me (prints a traceback or such).
/edit:
While not directly related to what I am looking for, I came across this gloriously horrible hack for making the decode exception go away altogether (the decode one only, i.e. str to unicode, but not the other way around; see https://mail.python.org/pipermail/python-list/2012-July/627506.html).
I don't plan on using it but it might be interesting for those battling problems with invalid Unicode input and looking for a quick fix (but please think about the side effects):
import codecs
codecs.register_error("strict", codecs.ignore_errors)
codecs.register_error("strict", lambda x: (u"", x.end)) # alternatively
(An internet search for codecs.register_error("strict") revealed that apparently it's used in some real projects.)
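The register_error mechanism can also be exercised without overriding "strict" globally: a minimal sketch that registers a custom handler under its own (hypothetical) name "warnlog", which reports the offending byte range and substitutes U+FFFD so decoding can continue:

```python
import codecs

def warn_and_replace(exc):
    # custom error handler: report the bad byte range, then substitute
    # U+FFFD and resume decoding after the offending bytes
    print("decode problem at bytes %d:%d" % (exc.start, exc.end))
    return (u"\ufffd", exc.end)

codecs.register_error("warnlog", warn_and_replace)

print(b"a\xffb".decode("ascii", "warnlog"))
```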
/edit #2:
For explicit conversions I made a snippet with the help of a SO post on monkeypatching:
class PatchedUnicode(unicode):
    def __init__(self, obj=None, encoding=None, *args, **kwargs):
        if encoding in (None, "ascii", "646", "us-ascii"):
            print("Problematic unicode() usage detected!")
        super(PatchedUnicode, self).__init__(obj, encoding, *args, **kwargs)

import __builtin__
__builtin__.unicode = PatchedUnicode
This only affects explicit conversions using the unicode() constructor directly, so it's not what I need.
/edit #3:
The thread "Extension method for python built-in types!" makes me think that it might actually not be easily possible (in CPython at least).
/edit #4:
It's nice to see many good answers here; too bad I can only give out the bounty once.
In the meantime I came across a somewhat similar question, at least in the sense of what the person tried to achieve: Can I turn off implicit Python unicode conversions to find my mixed-strings bugs? Please note, though, that throwing an exception would not have been OK in my case. Here I was looking for something that might point me to the different locations of problematic code (e.g. by printing something), but not something that might exit the program or change its behavior (because this way I can prioritize what to fix).
On another note, the people working on the Mypy project (which include Guido van Rossum) might also come up with something similarly helpful in the future; see the discussions at https://github.com/python/mypy/issues/1141 and, more recently, https://github.com/python/typing/issues/208.
/edit #5:
I also came across the following but haven't yet had time to test it: https://pypi.python.org/pypi/unicode-nazi
Answer 1:
You can register a custom encoding which prints a message whenever it's used:
Code in ourencoding.py:
import sys
import codecs
import traceback

# Define a function to print out a stack frame and a message:
def printWarning(s):
    sys.stderr.write(s)
    sys.stderr.write("\n")
    l = traceback.extract_stack()
    # cut off the frames pointing to printWarning and our_encode
    l = traceback.format_list(l[:-2])
    sys.stderr.write("".join(l))

# Define our encoding:
originalencoding = sys.getdefaultencoding()

def our_encode(s, errors='strict'):
    printWarning("Default encoding used")
    return (codecs.encode(s, originalencoding, errors), len(s))

def our_decode(s, errors='strict'):
    printWarning("Default encoding used")
    return (codecs.decode(s, originalencoding, errors), len(s))

def our_search(name):
    if name == 'our_encoding':
        return codecs.CodecInfo(
            name='our_encoding',
            encode=our_encode,
            decode=our_decode)
    return None

# register our search and set the default encoding:
codecs.register(our_search)
reload(sys)
sys.setdefaultencoding('our_encoding')
If you import this file at the start of your script, then you'll see warnings for implicit conversions:
#!python2
# coding: utf-8
import ourencoding
print("test 1")
a = "hello " + u"world"
print("test 2")
a = "hello ☺ " + u"world"
print("test 3")
b = u" ".join(["hello", u"☺"])
print("test 4")
c = unicode("hello ☺")
output:
test 1
test 2
Default encoding used
File "test.py", line 10, in <module>
a = "hello ☺ " + u"world"
test 3
Default encoding used
File "test.py", line 13, in <module>
b = u" ".join(["hello", u"☺"])
test 4
Default encoding used
File "test.py", line 16, in <module>
c = unicode("hello ☺")
It's not perfect, as test 1 shows: if the converted string only contains ASCII characters, sometimes you won't see a warning.
Answer 2:
What you can do is the following:
First create a custom encoding. I will call it "lascii" for "logging ASCII":
import codecs
import traceback

def lascii_encode(input, errors='strict'):
    print("ENCODED:")
    traceback.print_stack()
    return codecs.ascii_encode(input)

def lascii_decode(input, errors='strict'):
    print("DECODED:")
    traceback.print_stack()
    return codecs.ascii_decode(input)

class Codec(codecs.Codec):
    def encode(self, input, errors='strict'):
        return lascii_encode(input, errors)

    def decode(self, input, errors='strict'):
        return lascii_decode(input, errors)

class IncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        print("Incremental ENCODED:")
        traceback.print_stack()
        return codecs.ascii_encode(input)

class IncrementalDecoder(codecs.IncrementalDecoder):
    def decode(self, input, final=False):
        print("Incremental DECODED:")
        traceback.print_stack()
        return codecs.ascii_decode(input)

class StreamWriter(Codec, codecs.StreamWriter):
    pass

class StreamReader(Codec, codecs.StreamReader):
    pass

def getregentry():
    return codecs.CodecInfo(
        name='lascii',
        encode=lascii_encode,
        decode=lascii_decode,
        incrementalencoder=IncrementalEncoder,
        incrementaldecoder=IncrementalDecoder,
        streamwriter=StreamWriter,
        streamreader=StreamReader,
    )
What this does is basically the same as the ASCII codec, except that it prints a message and the current stack trace every time it encodes or decodes between unicode and lascii.
Now you need to make it available to the codecs module so that it can be found by the name "lascii". For this, you create a search function that returns the lascii codec when it is fed the string "lascii", and register that function with the codecs module:
def searchFunc(name):
    if name == "lascii":
        return getregentry()
    else:
        return None

codecs.register(searchFunc)
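The registration step can be exercised on its own, without changing the default encoding (which only Python 2 allows). A minimal sketch with a simplified codec and hypothetical log lines, showing that an explicit decode now reaches the custom functions:

```python
import codecs

def lascii_decode(input, errors='strict'):
    print("DECODED via lascii")        # hypothetical log line
    return codecs.ascii_decode(input, errors)

def lascii_encode(input, errors='strict'):
    print("ENCODED via lascii")
    return codecs.ascii_encode(input, errors)

def search_func(name):
    # the codecs module calls every registered search function with the
    # normalized codec name until one returns a CodecInfo
    if name == "lascii":
        return codecs.CodecInfo(name="lascii",
                                encode=lascii_encode,
                                decode=lascii_decode)
    return None

codecs.register(search_func)
print(b"hi".decode("lascii"))
```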
The last thing that is now left to do is to tell the sys module to use 'lascii' as default encoding:
import sys
reload(sys) # necessary, because sys.setdefaultencoding is deleted on start of Python
sys.setdefaultencoding('lascii')
Warning: This uses some deprecated or otherwise unrecommended features. It might not be efficient or bug-free. Do not use in production, only for testing and/or debugging.
Answer 3:
Just add:
from __future__ import unicode_literals
at the beginning of your source code files. It has to appear before any other statements (only the module docstring and comments may precede it), and it has to be in every affected source file; then the headache of using unicode in Python 2.7 goes away. If you didn't do anything super weird with strings, it should get rid of the problem out of the box.
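A minimal check of what the import changes (on Python 2 the literal below becomes unicode, so no implicit decode can occur; on Python 3 the import is a no-op):

```python
from __future__ import unicode_literals

someDynamicStr = "\xff"           # a text literal now, not a byte string
result = u"foo" + someDynamicStr  # both operands are text: no implicit decode
print(repr(result))
```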
Check out the following copy-and-paste from my console; I tried the sample from your question:
user@linux2:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> someDynamicStr = "bar" # could come from various sources
>>>
>>> # works
... u"foo" + someDynamicStr
u'foobar'
>>> u"foo{}".format(someDynamicStr)
u'foobar'
>>>
>>> someDynamicStr = "\xff" # uh-oh
>>>
>>> # raises UnicodeDecodeError
... u"foo" + someDynamicStr
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
>>> u"foo{}".format(someDynamicStr)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
>>>
And now with __future__ magic:
user@linux2:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from __future__ import unicode_literals
>>> someDynamicStr = "bar" # could come from various sources
>>>
>>> # works
... u"foo" + someDynamicStr
u'foobar'
>>> u"foo{}".format(someDynamicStr)
u'foobar'
>>>
>>> someDynamicStr = "\xff" # uh-oh
>>>
>>> # raises UnicodeDecodeError
... u"foo" + someDynamicStr
u'foo\xff'
>>> u"foo{}".format(someDynamicStr)
u'foo\xff'
>>>
Answer 4:
I see you have a lot of edits relating to solutions you may have encountered. I'm just going to address your original post which I believe to be: "I want to create a wrapper around the unicode constructor that checks input".
The unicode constructor is a Python builtin. You will decorate it to add checks to each call.
def add_checks(fxn):
    def resulting_fxn(*args, **kargs):
        # this is where you check whether the input is of type str
        if type(args[0]) is str:
            # do something
            pass
        # this is where the encoding/errors parameters are set to the default values
        encoding = 'utf-8'
        # Set default error behavior
        error = 'ignore'
        # Print any information (i.e. traceback)
        # print 'blah'
        # TODO: to print a traceback here, use the traceback module
        return fxn(args[0], encoding, error)
    return resulting_fxn
Using it looks like this:
unicode = add_checks(unicode)
We overwrite the existing name so that you don't have to change any of the calls in the large project. You want to do this very early in the runtime so that all subsequent calls get the new behavior.
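The same decorator pattern can be exercised with a stand-in conversion function (a hedged sketch: rebinding the real unicode builtin only works on Python 2, and to_text here is a hypothetical helper, not part of any library):

```python
def to_text(obj, encoding="ascii", errors="strict"):
    # stand-in for unicode(): decode bytes, pass text through unchanged
    return obj.decode(encoding, errors) if isinstance(obj, bytes) else obj

def add_checks(fxn):
    def resulting_fxn(obj, encoding="ascii", errors="strict"):
        # flag byte-string input converted with a default/ASCII encoding
        if isinstance(obj, bytes) and encoding in ("ascii", "646", "us-ascii"):
            print("Problematic conversion detected: %r" % (obj,))
        return fxn(obj, encoding, errors)
    return resulting_fxn

to_text = add_checks(to_text)   # rebind, as with unicode = add_checks(unicode)
print(to_text(b"bar"))
```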
Source: https://stackoverflow.com/questions/39662847/tracking-down-implicit-unicode-conversions-in-python-2