Encoding problems with dBase III .dbf files on different machines

一向 2020-12-20 10:09

I'm using C# and .NET 3.5, trying to import some data from old dbf files using ODBC with the Microsoft dBase Driver.

The .dbf files are in dBase III format and use the IBM850 (code page 850) character encoding.
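
For reference, a setup like the one described might look roughly like this (the folder path and table name are hypothetical):

    // Requires: using System.Data.Odbc;
    // Dbq points at the folder; every .dbf inside it is exposed as a table.
    var conn = new OdbcConnection(
        @"Driver={Microsoft dBase Driver (*.dbf)};DriverID=277;Dbq=C:\data\dbf;");
    conn.Open();

    var cmd = conn.CreateCommand();
    cmd.CommandText = "SELECT * FROM customers";   // hypothetical table (customers.dbf)
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // Strings come back already decoded with whatever code page the
            // driver assumed, which is where the IBM850 problems show up.
        }
    }
    conn.Close();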

4 Answers
  • 2020-12-20 10:36

    When you read a dbf file, you need to take three encodings into account:

    1. The encoding the database provider uses to read the file. It depends on the provider and the current operating system, and it is the encoding you must use to get the raw byte array back. For example, on my PC:

    • when I use the connection string "Data Source={0}; Provider=Microsoft.JET.OLEDB.4.0;Extended Properties=DBase IV;User ID=;Password=;", strings are read using code page 866 (Russian MS-DOS)

    • when I use the connection string "Data Source={0}; Provider=vfpoledb.1;Exclusive=No;Collating Sequence=Machine", strings are read using Encoding.Default (code page 1251 on my machine)

    2. The encoding in which strings were written to the dbf file. It can be read from byte 29 of the dbf file, but in practice it does not matter how the file is marked; you just have to know which encoding was actually used. This is the source encoding for the conversion (a sketch of reading that byte follows the code below).

    3. The encoding the strings should be converted to, which is usually UTF-8.

    So the string conversion should look like this:

    // codePage1 = the code page the provider used when it read the field (encoding 1)
    // codePage2 = the code page the data was actually written in (encoding 2)
    byte[] bytes = Encoding.GetEncoding(codePage1).GetBytes(reader.GetString(0));
    string result = Encoding.UTF8.GetString(Encoding.Convert(Encoding.GetEncoding(codePage2), Encoding.UTF8, bytes));
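
    And here is a minimal sketch of how you might inspect byte 29 (the language driver ID) yourself; the mapping below covers only a few common values, real LDID tables are much longer:

    // Reads the language driver ID (LDID) stored at offset 29 of the .dbf header
    // and maps a few well-known values to .NET code page numbers.
    static int GuessCodePageFromLdid(string dbfPath)
    {
        int ldid;
        using (var fs = File.OpenRead(dbfPath))   // requires System.IO
        {
            fs.Seek(29, SeekOrigin.Begin);
            ldid = fs.ReadByte();
        }

        switch (ldid)
        {
            case 0x01: return 437;   // US MS-DOS
            case 0x02: return 850;   // International MS-DOS
            case 0x03: return 1252;  // Windows ANSI
            default:   return 850;   // assumption: fall back to the code page you believe was used
        }
    }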
    
  • 2020-12-20 10:36

    Have you tried using the Visual FoxPro OLE DB provider ("VFPOleDb") instead?
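
    A rough sketch of what that could look like, assuming the Visual FoxPro OLE DB provider is installed (the folder path and table name are hypothetical):

    // Requires: using System.Data.OleDb; using System.Text;
    var conn = new OleDbConnection(
        @"Provider=VFPOLEDB.1;Data Source=C:\data\dbf;Collating Sequence=Machine;");
    conn.Open();

    var cmd = conn.CreateCommand();
    cmd.CommandText = "SELECT name FROM customers";   // hypothetical table (customers.dbf)
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // The provider decodes with the machine's default code page (see the
            // other answers), so recover the raw bytes and re-decode as IBM850.
            byte[] raw = Encoding.Default.GetBytes(reader.GetString(0));
            string value = Encoding.GetEncoding(850).GetString(raw);
        }
    }
    conn.Close();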

  • 2020-12-20 10:40

    If you are still having a problem with these files, I may be able to help you.

    What is in the "codepage byte" aka "language driver id" (LDID) at offset 29 (decimal) in the file?

    I have a Python-based DBF reader which can read just about any field data type and just about any codepage -- it has a long list of mappings from codepage byte to codepage number, compiled from various sources. The options are: (1) believe the LDID and deliver Unicode; (2) ignore the LDID and deliver undecoded bytes; (3) override the LDID and decode with a specific codepage into Unicode. The Unicode can of course then be encoded into UTF-8.

    The DBF reader also does a whole lot of reasonableness cross-checks which may help investigating why VFP thinks the file is corrupt.

    How do you know that it's using IBM850? Another piece of Python code that I have is a prototype encoding detector which, unlike detectors such as 'chardet' that are derived from Mozilla code, is not web-centric and can happily recognise most old DOS codepages -- this may help.

    An observation: the Greek lowercase sigma (σ) is 0xE5 in codepage 437, which was succeeded by codepage 850 -- "pc2" seems a little outdated ...
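
    A quick way to see that difference for yourself is a small sketch like this (the 437 result is the sigma mentioned above; the 850 result is simply whatever that codepage assigns to 0xE5):

    // Requires: using System; using System.Text;
    // Decode the single byte 0xE5 under codepage 437 and codepage 850
    // to see how the same stored byte turns into different characters.
    byte[] oneByte = { 0xE5 };
    string as437 = Encoding.GetEncoding(437).GetString(oneByte);  // "σ" in codepage 437
    string as850 = Encoding.GetEncoding(850).GetString(oneByte);  // a different character in codepage 850
    Console.WriteLine("437: {0}  850: {1}", as437, as850);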

    If you think I can be of any help, feel free to e-mail me at insert_punctuation("sjmachin", "lexicon", "net")

  • 2020-12-20 10:58

    Try this code. It assumes the strings were stored as code page 850 and that the driver decodes them with the machine's default code page.

    // Requires: using System.Data.Odbc; using System.Text;
    var oConn = new OdbcConnection();
    oConn.ConnectionString = "Driver={Microsoft Visual FoxPro Driver};SourceType=DBF;SourceDB=" + dbPath;
    oConn.Open();

    var oCmd = oConn.CreateCommand();
    oCmd.CommandText = "SELECT name FROM TABLE.DBF";   // TABLE.DBF lives in the SourceDB folder
    var reader = oCmd.ExecuteReader();
    reader.Read();

    // Recover the raw bytes that the driver decoded with the default code page,
    // then re-decode them as code page 850.
    byte[] A = Encoding.Default.GetBytes(reader.GetString(0));
    string p = Encoding.GetEncoding(850).GetString(A);
    