Storing a string as UTF8 in C#

忘了有多久 2021-02-01 14:30

I'm doing a lot of string manipulation in C#, and really need the strings to be stored one byte per character. This is because I need gigabytes of text simultaneously in memory.

4 answers
  • 2021-02-01 14:49

    As you've found, the CLR uses UTF-16 for character encoding. Your best bet may be to use the Encoding classes and BitConverter to handle the text yourself as bytes. This question has some good examples for converting between the two encodings:

    Convert String (UTF-16) to UTF-8 in C#
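
    To put numbers on the difference, here is a minimal sketch comparing the character data of a System.String (UTF-16) with the same text held as UTF-8 bytes. The class and method names are hypothetical, and the UTF-16 figure ignores the object header and length field.

    ```csharp
    using System;
    using System.Text;

    // Hypothetical helper, for illustration only.
    static class EncodingSizes
    {
        // Character data of a System.String: 2 bytes per UTF-16 code unit.
        public static long Utf16Bytes(string s) => (long)s.Length * sizeof(char);

        // Bytes needed to hold the same text as UTF-8 (1 byte per ASCII character,
        // 2 to 4 bytes for anything else).
        public static long Utf8Bytes(string s) => Encoding.UTF8.GetByteCount(s);
    }
    ```

    For ASCII-only text such as "hello" this gives 10 bytes against 5, which is the 50% saving the question is after; text with many non-ASCII characters saves less.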

  • 2021-02-01 15:05

    Not really. System.String is designed for storing strings. Your requirement is for a very particular subset of strings with particular memory benefits.

    Now, "very particular subset of strings with particular memory benefits" comes up a lot, but not always the same very particular subset. Code that is ASCII-only isn't for reading by human beings, so it tends to be either short codes, or something that can be handled in a stream-processing manner, or else chunks of text merged in with bytes doing other jobs (e.g. quite a few binary formats will have small bits that translate directly to ASCII).

    As such, you've a pretty strange requirement.

    All the more so when you come to the gigabytes part. If I'm dealing with gigs, I'm immediately thinking about how I can stop having to deal with gigs, and/or get much more serious savings than just 50%. I'd be thinking about mapping chunks I'm not currently interested in to a file, or about ropes, or about a bunch of other things. Of course, those are going to work for some cases and not for all, so yet again, we're not talking about something where .NET should stick in something as a one-size-fits-all, because one size will not fit all.

    Beyond that, just the UTF-8 bit isn't that hard. It's all the other methods that become work. Again, what you need there won't be the same as what someone else needs.
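
    To give a feel for where the work goes, here is a minimal sketch of a one-byte-per-character type. AsciiText and all of its members are hypothetical names, and it assumes ASCII-only input. Storing the bytes takes a few lines; every string operation you rely on (searching, slicing, comparison and so on) has to be written again on top of them.

    ```csharp
    using System;
    using System.Text;

    // Hypothetical ASCII-only string type: one byte per character.
    public sealed class AsciiText
    {
        private readonly byte[] _bytes;

        private AsciiText(byte[] bytes) { _bytes = bytes; }

        public static AsciiText FromString(string s)
        {
            // Reject anything outside ASCII rather than silently replacing it with '?'.
            foreach (char c in s)
            {
                if (c > 127)
                    throw new ArgumentException("Only ASCII characters are supported.", nameof(s));
            }
            return new AsciiText(Encoding.ASCII.GetBytes(s));
        }

        public int Length => _bytes.Length;

        public char this[int index] => (char)_bytes[index];

        // Naive ordinal search; returns -1 when the value is not found.
        public int IndexOf(AsciiText value)
        {
            if (value.Length == 0) return 0;
            for (int i = 0; i <= _bytes.Length - value.Length; i++)
            {
                int j = 0;
                while (j < value.Length && _bytes[i + j] == value._bytes[j]) j++;
                if (j == value.Length) return i;
            }
            return -1;
        }

        // Copies the requested range into a new instance.
        public AsciiText Substring(int start, int length)
        {
            var copy = new byte[length];
            Array.Copy(_bytes, start, copy, 0, length);
            return new AsciiText(copy);
        }

        public override string ToString() => Encoding.ASCII.GetString(_bytes);
    }
    ```

    Whether Substring should copy or share the underlying array is exactly the kind of decision that depends on your workload, which is why a single built-in type would not suit everyone.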

  • 2021-02-01 15:08

    Well, you could create a wrapper that retrieves the data as UTF-8 bytes and converts pieces as needed to System.String, then vice-versa to push the string back out to memory. The Encoding class will help you out here:

    // Requires: using System.Text;
    var utf8 = Encoding.UTF8;
    byte[] utfBytes = utf8.GetBytes(myString);   // myString is an existing System.String
    
    var myReturnedString = utf8.GetString(utfBytes);
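
    One possible shape for such a wrapper, as a rough sketch: keep all the text in a single UTF-8 buffer and decode only the entry that is asked for. Utf8Store, Add, Get and ByteCount are hypothetical names. A single MemoryStream is limited to about 2 GB, so real multi-gigabyte data would need several buffers.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    // Hypothetical store: text is kept as UTF-8 and decoded to System.String on demand.
    public sealed class Utf8Store
    {
        private readonly MemoryStream _data = new MemoryStream();
        private readonly List<(int Offset, int Count)> _entries = new List<(int, int)>();

        // Encodes the string, appends it to the buffer and returns its entry index.
        public int Add(string s)
        {
            byte[] bytes = Encoding.UTF8.GetBytes(s);
            _entries.Add(((int)_data.Length, bytes.Length));
            _data.Write(bytes, 0, bytes.Length);
            return _entries.Count - 1;
        }

        // Decodes a single entry; nothing else in the buffer is touched.
        public string Get(int index)
        {
            var (offset, count) = _entries[index];
            return Encoding.UTF8.GetString(_data.GetBuffer(), offset, count);
        }

        // Total number of UTF-8 bytes stored.
        public long ByteCount => _data.Length;
    }
    ```

    Entries are stored whole, so the offsets always fall on character boundaries; slicing at arbitrary byte positions could otherwise cut a multi-byte UTF-8 sequence in half.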
    
  • 2021-02-01 15:10

    As I see it, your problem is that a char in C# occupies 2 bytes instead of one.

    One way to read a text file as raw bytes is to open it like this:

        byte[] buffer;
        using (var fs = new System.IO.FileStream(file, System.IO.FileMode.Open, System.IO.FileAccess.Read))
        using (var br = new System.IO.BinaryReader(fs))
        {
            // Size the read to the file length (assumes the file is under 2 GB);
            // a fixed 1024-byte buffer would throw for any longer file.
            buffer = br.ReadBytes((int)fs.Length);
        }
    

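    This way you are reading the raw bytes from the file. I tried it with *.txt files encoded in UTF-8 and in ANSI. Note that UTF-8 is variable-width: ASCII characters take 1 byte and other characters take 2 to 4 bytes, while a single-byte ANSI code page always uses 1 byte per character.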
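
    For files too large to read in one go, the same idea can be done in fixed-size chunks. This is a minimal sketch; ChunkedReader, ReadInChunks and the callback shape are hypothetical, not an existing API.

    ```csharp
    using System;
    using System.IO;

    // Hypothetical helper, for illustration only.
    static class ChunkedReader
    {
        // Reads the file in chunks of at most chunkSize bytes and passes each chunk
        // (the buffer plus the number of valid bytes in it) to the callback.
        // Returns the total number of bytes read.
        public static long ReadInChunks(string path, int chunkSize, Action<byte[], int> onChunk)
        {
            var buffer = new byte[chunkSize];
            long total = 0;
            using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
            {
                int read;
                // Read returns 0 at end of stream and may return fewer bytes than requested.
                while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                {
                    onChunk(buffer, read);
                    total += read;
                }
            }
            return total;
        }
    }
    ```

    The buffer is reused between calls, so the callback should copy anything it wants to keep. With UTF-8 data a chunk can end in the middle of a multi-byte character, so decoding chunk by chunk needs a stateful decoder rather than a plain GetString call per chunk.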
