I am doing some scripts in python. I create a string that I save in a file. This string got lot of data, coming from the arborescence and filenames of a directory. According
You don't need to encode data that is already encoded. When you try to do that, Python will first try to decode it to unicode
before it can encode it back to UTF-8. That is what is failing here:
>>> data = u'\u00c3' # Unicode data
>>> data = data.encode('utf8') # encoded to UTF-8
>>> data
'\xc3\x83'
>>> data.encode('utf8') # Try to *re*-encode it
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Just write your data directly to the file, there is no need to encode already-encoded data.
If you instead build up unicode
values instead, you would indeed have to encode those to be writable to a file. You'd want to use codecs.open() instead, which returns a file object that will encode unicode values to UTF-8 for you.
You also really don't want to write out the UTF-8 BOM, unless you have to support Microsoft tools that cannot read UTF-8 otherwise (such as MS Notepad).
For your MySQL insert problem, you need to do two things:
Add charset='utf8'
to your MySQLdb.connect()
call.
Use unicode
objects, not str
objects when querying or inserting, but use sql parameters so the MySQL connector can do the right thing for you:
artiste = artiste.decode('utf8') # it is already UTF8, decode to unicode
c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))
# ...
c.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
It may actually work better if you used codecs.open()
to decode the contents automatically instead:
import codecs
sql = mdb.connect('localhost','admin','ugo&(-@F','music_vibration', charset='utf8')
with codecs.open('config/index/'+index, 'r', 'utf8') as findex:
for line in findex:
if u'#artiste' not in line:
continue
artiste=line.split(u'[:::]')[1].strip()
cursor = sql.cursor()
cursor.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))
if not cursor.fetchone()[0]:
cursor = sql.cursor()
cursor.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
artists_inserted += 1
You may want to brush up on Unicode and UTF-8 and encodings. I can recommend the following articles:
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Unfortunately, the string.encode() method is not always reliable. Check out this thread for more information: What is the fool proof way to convert some string (utf-8 or else) to a simple ASCII string in python