I'm writing a Python script to process some music data. It's supposed to merge two separate databases by comparing their entries and matching them up. It's almost working, bu
To find out whether YOU (not it) are storing your strings as str objects or unicode objects, print type(your_string).
You can use print repr(your_string) to show yourself (and us) unambiguously what is in your string.
By the way, exactly what version of Python are you using, and on what OS? If it's Python 3.x, use ascii() instead of repr().
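Here is a rough sketch of that check, written so it should run on both Python 2.7 and 3.x. The sample values and the helper name describe() are made up for illustration; substitute the two strings your script is actually comparing.

```python
# Hypothetical sample data: the same artist name stored two different ways.
byte_string = b'Beyonc\xe9'   # raw bytes (a str in Python 2, bytes in Python 3)
text_string = u'Beyonc\xe9'   # decoded text (unicode in Python 2, str in Python 3)


def describe(value):
    """Return the type name and an unambiguous repr of a string."""
    return '%s: %r' % (type(value).__name__, value)


for value in (byte_string, text_string):
    print(describe(value))
```

The exact output differs between Python 2 and 3, but in both cases the two values report different types, which is the thing to look for.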
Unicode vs Bytes
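First, some terminology. There are two types of strings, encoded and decoded:
Encoded: a string of bytes, which is what a Python 2 str object holds.
Decoded: a string of Unicode characters, which is what a Python 2 unicode object holds.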
What this means to you
So here's the thing. You said you were getting one ASCII variable and one Unicode variable. That's actually not true.
Before you compare the string of bytes to a Unicode string of characters, you have to make some assumptions. In your case, Python (and you) assumed that the string of bytes was ASCII encoded. That worked fine until you came across a character that wasn't ASCII -- a character with an accent mark.
So you need to find out what that string of bytes is encoded as. It might be latin1. If it is, you want to do this:
if unicode_variable == string_variable.decode('latin1'):
    print "same"
Latin1 is basically ASCII plus some extended characters like Ç and Â.
If your data is in Latin1, that's all you need to do. But if your string of bytes is encoded in something else, you'll need to figure out what encoding that is and pass it to decode().
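To see why the encoding matters, here is a small sketch (the sample name is invented) showing that the same text produces different bytes under latin1 and utf8, and that decoding with the wrong one gives garbage rather than an error:

```python
# Hypothetical sample text containing one accented character.
text = u'Beyonc\xe9'

latin1_bytes = text.encode('latin1')   # one byte for the accented letter
utf8_bytes = text.encode('utf8')       # two bytes for the accented letter

# Decoding with the encoding that produced the bytes round-trips cleanly.
assert latin1_bytes.decode('latin1') == text
assert utf8_bytes.decode('utf8') == text

# Decoding UTF-8 bytes as Latin-1 does not fail; it silently yields
# the wrong characters, so the comparison no longer matches.
wrong = utf8_bytes.decode('latin1')
print(wrong == text)   # False
print(len(latin1_bytes), len(utf8_bytes))
```

This is why you have to know, or guess and then eyeball, the encoding before the comparison can be trusted.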
The bottom line is, there's no easy answer, unless you know (or make some assumptions) about the encoding of your input data.
What I would do
Try running var.decode('latin1') on your string of bytes. That will give you a Unicode variable. If the data looks correct (i.e., characters with accent marks look like they belong), roll with it.
Oh, and note that decoding as latin1 never raises an error, because every byte value maps to some character, so the only test is whether the result looks right. If it doesn't, try utf8 -- another common encoding.
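If you want to automate the guessing, one possible sketch follows. The function name decode_guess and the candidate list are my own assumptions, not anything from your script. Note that it tries utf8 first and latin1 last, which is the reverse of the manual order: a latin1 decode accepts any bytes, so it only makes sense as the final fallback.

```python
def decode_guess(raw, encodings=('utf8', 'latin1')):
    """Decode a byte string with the first encoding that accepts it.

    Hypothetical helper: the candidate list is a guess and should be
    adjusted to whatever encodings your databases might contain.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
    raise ValueError('none of the candidate encodings matched')


# Both byte strings below are invented samples of the same name.
print(decode_guess(b'Beyonc\xc3\xa9') == decode_guess(b'Beyonc\xe9'))
```

A successful decode only means the bytes were valid in that encoding, not that the guess was right, so still check a few accented entries by eye.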
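Converting both to unicode should help. Note that unicode() with no encoding argument decodes a str as ASCII and raises UnicodeDecodeError on accented bytes, so pass the encoding for the byte string explicitly:
# assumes str1 is a Latin-1 byte string and str2 is already unicode
if unicode(str1, 'latin1') == unicode(str2):
    print "same"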
You might need to preprocess the databases and convert everything into UTF-8. My guess is that you've got Latin-1 accented characters in some entries. Try
string.decode('latin1').encode('utf8')
and see what happens.
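As a rough sketch of that preprocessing step, assuming the entries really are Latin-1 byte strings (the helper name latin1_to_utf8 and the sample entries are invented for illustration):

```python
def latin1_to_utf8(raw):
    """Re-encode a Latin-1 byte string as UTF-8 bytes.

    Assumes the input really is Latin-1; if it is not, the result
    will be valid UTF-8 but contain the wrong characters.
    """
    return raw.decode('latin1').encode('utf8')


# Hypothetical database entries stored as Latin-1 bytes.
entries = [b'Beyonc\xe9', b'Bjork']
converted = [latin1_to_utf8(entry) for entry in entries]

for before, after in zip(entries, converted):
    print(repr(before), '->', repr(after))
```

Pure ASCII entries come out unchanged, so running this over a whole database only alters the entries that contain accented characters.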