using pyodbc on linux to insert unicode or utf-8 chars in a nvarchar mssql field

前端 未结 4 1019
抹茶落季
抹茶落季 2020-12-05 04:47

I am using Ubuntu 9.04

I have installed the following package versions:

unixodbc and unixodbc-dev: 2.2.11-16build3
tdsodbc: 0.82-4
l         


        
相关标签:
4条回答
  • 2020-12-05 05:30

    Are you sure it's INSERT that's causing problem not reading? There's a bug open on pyodbc Problem fetching NTEXT and NVARCHAR data.

    0 讨论(0)
  • 2020-12-05 05:31

    I can remember having this kind of stupid problems using odbc drivers, even if that time it was a java+oracle combination.

    The core thing is that odbc driver apparently encodes the query string when sending it to the DB. Even if the field is Unicode, and if you provide Unicode, in some cases it does not seem to matter.

    You need to ensure that what is sent by the driver has the same encoding as your Database (not only server, but also database). Otherwise, of course you get funky characters because either the client or the server is mixing things up when encoding/or decoding. Do you have any idea of the charset (codepoint as MS like to say) that your server is using as a default for decoding data?

    Collation has nothing to do with this problem :)

    See that MS page for example. For Unicode fields, collation is used only to define the sort order in the column, not to specify how the data is stored.

    If you store your data as Unicode, there is an Unique way to represent it, that's the purpose of Unicode: no need to define a charset that is compatible with all the languages that you are going to use :)

    The question here is "what happens when I give data to the server that is not Unicode?". For example:

    • When I send an UTF-8 string to the server, how does it understand it?
    • When I send an UTF-16 string to the server, how does it understand it?
    • When I send a Latin1 string to the server, how does it understand it?

    From the server perspective, all these 3 strings are only a stream of bytes. The server cannot guess the encoding in which you encoded them. Which means that you will get troubles if your odbc client ends up sending bytestrings (an encoded string) to the server instead of sending unicode data: if you do so, the server will use a predefined encoding (that was my question: what encoding the server will use? Since it is not guessing, it must be a parameter value), and if the string had been encoded using a different encoding, dzing, data will get corrupted.

    It's exactly similar as doing in Python:

    uni = u'Hey my name is André'
    in_utf8 = uni.encode('utf-8')
    # send the utf-8 data to server
    # send(in_utf8)
    
    # on server side
    # server receives it. But server is Japanese.
    # So the server treats the data with the National charset, shift-jis:
    some_string = in_utf8 # some_string = receive()    
    decoded = some_string.decode('sjis')
    

    Just try it. It's fun. The decoded string is supposed to be "Hey my name is André", but is "Hey my name is Andrテゥ". é gets replaced by Japanese テゥ

    Hence my suggestion: you need to ensure that pyodbc is able to send directly the data as Unicode. If pyodbc fails to do this, you will get unexpected results.

    And I described the problem in the Client to Server way. But the same sort of issues can arise when communicating back from the Server to the Client. If the Client cannot understand Unicode data, you'll likely get into troubles.

    FreeTDS handles Unicode for you.

    Actually, FreeTDS takes care of things for you and translates all the data to UCS2 unicode. (Source).

    • Server <--> FreeTDS : UCS2 data
    • FreeTDS <--> pyodbc : encoded strings, encoded in UTF-8 (from /etc/freetds/freetds.conf)

    So I would expect your application to work correctly if you pass UTF-8 data to pyodbc. In fact, as this django-pyodbc ticket states, django-pyodbc communicates in UTF-8 with pyodbc, so you should be fine.

    FreeTDS 0.82

    However, cramm0 says that FreeTDS 0.82 is not completely bugfree, and that there are significant differences between 0.82 and the official patched 0.82 version that can be found here. You should probably try using the patched FreeTDS


    Edited: removed old data, which had nothing to do with FreeTDS but was only relevant to Easysoft commercial odbc driver. Sorry.

    0 讨论(0)
  • 2020-12-05 05:40

    I use UCS-2 to interact with SQL Server, not UTF-8.

    Correction: I changed the .freetds.conf entry so that the client uses UTF-8

        tds version = 8.0
        client charset = UTF-8
        text size = 32768
    

    Now, bind values work fine for UTF-8 encoded strings. The driver converts transparently between the UCS-2 used for storage on the dataserver side and the UTF-8 encoded strings given to/taken from the client.

    This is with pyodbc 2.0 on Solaris 10 running Python 2.5 and FreeTDS freetds-0.82.1.dev.20081111 and SQL Server 2008

    
    import pyodbc
    test_string = u"""Comment ça va ? Très bien ?"""
    
    print type(test_string),repr(test_string)
    utf8 = 'utf8:' + test_string.encode('UTF-8')
    print type(utf8), repr(utf8)
    
    c = pyodbc.connect('DSN=SA_SQL_SERVER_TEST;UID=XXX;PWD=XXX')
    
    cur = c.cursor()
    # This does not work as test_string is not UTF-encoded
    try: 
        cur.execute('INSERT unicode_test(t) VALUES(?)', test_string)
        c.commit()
    except pyodbc.Error,e:
        print e
    
    
    # This one does:
    try:
        cur.execute('INSERT unicode_test(t) VALUES(?)', utf8)
        c.commit()
    except pyodbc.Error,e:
        print e    
    
    
    

    Here is the output from the test table (I had manually put in a bunch of test data via Management Studio)

    In [41]: for i in cur.execute('SELECT t FROM unicode_test'):
       ....:     print i
       ....:
       ....:
    ('this is not a banana', )
    ('\xc3\x85kergatan 24', )
    ('\xc3\x85kergatan 24', )
    ('\xe6\xb0\xb4 this is code-point 63CF', )
    ('Mich\xc3\xa9l', )
    ('Comment a va ? Trs bien ?', )
    ('utf8:Comment \xc3\xa7a va ? Tr\xc3\xa8s bien ?', )
    

    I was able to put in some in unicode code points directly into the table from Management Studio by the 'Edit Top 200 rows' dialog and entering the hex digits for the unicode code point and then pressing Alt-X

    0 讨论(0)
  • 2020-12-05 05:46

    I had the same problem when trying to bind unicode parameter: '[HY004] [FreeTDS][SQL Server]Invalid data type (0) (SQLBindParameter)'

    I solved it by upgrading freetds to version 0.91.

    I use pyodbc 2.1.11. I had to apply this patch to make it work with unicode, otherwise I was getting memory corruption errors occasionally.

    0 讨论(0)
提交回复
热议问题