C#: Cycle through encodings

小蘑菇 2020-12-16 05:42

I am reading files in various formats and languages, and I am currently using a small encoding library to attempt to detect the proper encoding (http://www.codeproject.c

6 Answers
  • 2020-12-16 06:17

    Beware of the infamous 'Notepad bug'. It's going to bite you whatever you try, though... You can find some good discussions about encodings and their challenges on MSDN (and other places).

  • 2020-12-16 06:28

    You have to keep the original data as a byte array or MemoryStream, which you can then decode with the new encoding. Once you have converted the data to a string, you can't reliably return to the original representation.
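
    To illustrate why the string cannot stand in for the original bytes, here is a minimal sketch that decodes a byte array and re-encodes it with the same encoding. The class and method names (RoundTripDemo, SurvivesRoundTrip) are made up for this example.

    ```csharp
    using System;
    using System.Linq;
    using System.Text;

    static class RoundTripDemo
    {
        // Decodes the bytes with the given encoding, encodes the resulting
        // string again, and reports whether the original bytes came back.
        public static bool SurvivesRoundTrip(byte[] original, Encoding encoding)
        {
            string text = encoding.GetString(original);
            byte[] roundTripped = encoding.GetBytes(text);
            return original.SequenceEqual(roundTripped);
        }
    }
    ```

    For the bytes 63 61 66 E9 ("café" in ISO-8859-1), decoding as UTF-8 replaces the invalid E9 with a substitute character, so the round trip no longer yields the original bytes, while decoding as ISO-8859-1 does.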

  • 2020-12-16 06:35

    Could you let the user enter some words (with "special" characters) that are supposed to occur in the file?

    You can then decode the file with every available encoding yourself and check whether these words are present.
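
    A rough sketch of that idea, assuming the file is already in memory as a byte array. EncodingGuesser and FindCandidates are hypothetical names, and the result is only a list of candidates, not a definitive detection.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;

    static class EncodingGuesser
    {
        // Returns the names of all encodings under which the decoded text
        // contains every one of the expected words.
        public static List<string> FindCandidates(byte[] data, IEnumerable<string> expectedWords)
        {
            List<string> words = expectedWords.ToList();
            var candidates = new List<string>();

            foreach (EncodingInfo info in Encoding.GetEncodings())
            {
                string text;
                try
                {
                    text = info.GetEncoding().GetString(data);
                }
                catch (Exception)
                {
                    // Skip encodings that cannot be created or cannot decode the data.
                    continue;
                }

                if (words.All(w => text.IndexOf(w, StringComparison.Ordinal) >= 0))
                {
                    candidates.Add(info.Name);
                }
            }

            return candidates;
        }
    }
    ```

    Which encodings Encoding.GetEncodings() returns depends on the runtime. On .NET Core and later, legacy code pages generally need CodePagesEncodingProvider to be registered first. Several encodings may match the same words, so the user may still have to pick from the candidates.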

  • 2020-12-16 06:36

    Read the file as bytes and then use the Encoding.GetString method.

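    // Requires: using System; using System.Text;
    // Read the file once; every decoding below reuses the same bytes.
    byte[] data = System.IO.File.ReadAllBytes(path);

    Console.WriteLine(Encoding.UTF8.GetString(data));
    Console.WriteLine(Encoding.UTF7.GetString(data)); // Encoding.UTF7 is marked obsolete in .NET 5 and later
    Console.WriteLine(Encoding.ASCII.GetString(data));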
    

    This way you load the file only once and can try every encoding against the original bytes. The user can select the correct one, and you can use the result of Encoding.GetEncoding(...).GetString(data) for further processing.
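
    One possible refinement, which is not part of the answer above: before showing a rendering to the user, you could rule out encodings for which the bytes are outright invalid by asking for a decoder that throws instead of substituting replacement characters. The sketch below uses hypothetical names (StrictDecoding, TryDecode). Many single-byte encodings accept any byte sequence, so this only narrows the list.

    ```csharp
    using System;
    using System.Text;

    static class StrictDecoding
    {
        // Tries to decode the bytes with the named encoding. Returns false when
        // the encoding is unavailable or the bytes are not valid in it, instead
        // of silently producing replacement characters.
        public static bool TryDecode(byte[] data, string encodingName, out string text)
        {
            text = null;
            try
            {
                Encoding strict = Encoding.GetEncoding(
                    encodingName,
                    EncoderFallback.ExceptionFallback,
                    DecoderFallback.ExceptionFallback);
                text = strict.GetString(data);
                return true;
            }
            catch (DecoderFallbackException)
            {
                return false; // the bytes are invalid in this encoding
            }
            catch (ArgumentException)
            {
                return false; // unknown encoding name
            }
            catch (NotSupportedException)
            {
                return false; // encoding not supported on this platform
            }
        }
    }
    ```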

  • 2020-12-16 06:39

    (removed original answer following question update)

    For instance, if there was a method that would read a string and return it using a different encoding, something like "render(string, encoding)".

    I don't think you can reuse the string data. If the encoding was wrong, the string must be considered corrupt: it may very easily contain gibberish among the plausible-looking characters. In particular, many encodings forgive the presence or absence of a BOM/preamble, but would you re-encode with it or without it?

    If you are happy to risk it (I wouldn't be), you could just re-encode your local string with the last encoding:

    // I DON'T RECOMMEND THIS!!!!
    byte[] preamble = lastEncoding.GetPreamble(),
        content = lastEncoding.GetBytes(text);
    byte[] raw = new byte[preamble.Length + content.Length];
    Buffer.BlockCopy(preamble, 0, raw, 0, preamble.Length);
    Buffer.BlockCopy(content, 0, raw, preamble.Length, content.Length);
    text = nextEncoding.GetString(raw);
    

    In reality, I believe the best you can do is to keep the original byte[] and keep offering different renderings (via different encodings) until the user likes one. Something like:

    using System;
    using System.IO;
    using System.Text;
    using System.Windows.Forms;
    class MyForm : Form {
        [STAThread]
        static void Main() {
            Application.EnableVisualStyles();
            Application.Run(new MyForm());
        }
        ComboBox encodings;
        TextBox view;
        Button load, next;
        byte[] data = null;
    
        void ShowData() {
            if (data != null && encodings.SelectedIndex >= 0) {
                try {
                    Encoding enc = Encoding.GetEncoding(
                        (string)encodings.SelectedValue);
                    view.Text = enc.GetString(data);
                } catch (Exception ex) {
                    view.Text = ex.ToString();
                }
            }
        }
        public MyForm() {
            load = new Button();
            load.Text = "Open...";
            load.Dock = DockStyle.Bottom;
            Controls.Add(load);
    
            next = new Button();
            next.Text = "Next...";
            next.Dock = DockStyle.Bottom;
            Controls.Add(next);
    
            view = new TextBox();
            view.ReadOnly = true;
            view.Dock = DockStyle.Fill;
            view.Multiline = true;
            Controls.Add(view);
    
            encodings = new ComboBox();
            encodings.Dock = DockStyle.Bottom;
            encodings.DropDownStyle = ComboBoxStyle.DropDown;
            encodings.DataSource = Encoding.GetEncodings();
            encodings.DisplayMember = "DisplayName";
            encodings.ValueMember = "Name";
            Controls.Add(encodings);
    
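            // Wrap around at the end of the list instead of throwing.
            next.Click += delegate {
                encodings.SelectedIndex =
                    (encodings.SelectedIndex + 1) % encodings.Items.Count;
            };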
    
            encodings.SelectedValueChanged += delegate { ShowData(); };
    
            load.Click += delegate {
                using (OpenFileDialog dlg = new OpenFileDialog()) {
                    if (dlg.ShowDialog(this)==DialogResult.OK) {
                        data = File.ReadAllBytes(dlg.FileName);
                        Text = dlg.FileName;
                        ShowData();
                    }
                }
            };
        }
    }
    
  • 2020-12-16 06:39

    How about something like this:

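    // Requires: using System.IO; using System.Text;
    public string LoadFile(string path)
    {
        stream = GetMemoryStream(path);
        return TryEncoding(Encoding.UTF8);
    }

    public string TryEncoding(Encoding e)
    {
        // Rewind so that every attempt decodes the file from the first byte.
        stream.Seek(0, SeekOrigin.Begin);
        // The reader is deliberately not disposed: disposing it would close the stream.
        // Note: by default StreamReader looks for a byte order mark and may override e.
        StreamReader reader = new StreamReader(stream, e);
        return reader.ReadToEnd();
    }

    // Holds the raw file contents so they can be decoded repeatedly.
    private MemoryStream stream = null;

    private MemoryStream GetMemoryStream(string path)
    {
        byte[] buffer = System.IO.File.ReadAllBytes(path);
        return new MemoryStream(buffer);
    }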
    

    Use LoadFile on your first try; then use TryEncoding subsequently.
