Can you change the contents of an (immutable) string via an unsafe method?

醉酒成梦 2020-12-07 00:55

I know that strings are immutable and any change to a string simply creates a new string in memory (and marks the old one as free). However, I'm wondering if my logic belo

2 Answers
  • 2020-12-07 01:25

    As others have pointed out, mutating String objects is useful in some rare cases. I give an example with a useful code snippet below.

    Use-case/background

    Although everyone should be a huge fan of the really excellent character Encoding support that .NET has always offered, sometimes it might be preferable to cut down that overhead, especially if doing a lot of roundtripping between 8-bit (legacy) characters and managed strings (i.e. typically interop scenarios).

    As I hinted, .NET is particularly emphatic that you must explicitly specify a text Encoding for any/all conversions of non-Unicode character data to/from managed String objects. This rigorous control at the periphery is really commendable, since it ensures that once you have the string inside the managed runtime you never have to worry; everything is just wide Unicode. Even UTF-8 is largely banished in this pristine realm.

    (For contrast, recall a certain other popular scripting language that famously botched this whole area, eventually resulting in several years of parallel 2.x and 3.x versions, all due to extensive Unicode changes in the latter.)

    So .NET pushes all that mess to the interop boundary, enforcing Unicode (UTF-16) once you're inside, but this philosophy entails that the Encoding/Decoding work done ("once-and-for-all") be exhaustively rigorous, and because of this the .NET Encoding/Encoder classes can be a performance bottleneck. If you're moving lots of text from wide (Unicode) to simple fixed 7- or 8-bit narrow ANSI, ASCII, etc. (note I'm not talking about MBCS or UTF-8, where you'll want to use the Encoders!), the .NET encoding paradigm might seem like overkill.

    Furthermore, it could be the case that you don't know, or don't care to specify, an Encoding. Maybe all you care about is fast and accurate round-tripping for the low byte of a 16-bit Char. If you look at the .NET source code, even the System.Text.ASCIIEncoding might be too bulky in some situations.


    The code snippet...

    Thin String: 8-bit characters directly stored in a managed String, one 'thin char' per wide Unicode character, without bothering with character encoding/decoding during round-tripping.

    All of these methods just ignore/strip the upper byte of each 16-bit Unicode character, transmitting only each low byte exactly as-is. Obviously, successful recovery of the Unicode text after a round-trip will only be possible if those upper bits aren't relevant.

    /// <summary> Convert byte array to "thin string" </summary>
    public static unsafe String ToThinString(this byte[] src)
    {
        int c;
        var ret = String.Empty;
        if ((c = src.Length) > 0)
            fixed (char* dst = (ret = new String('\0', c)))
                do
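                    // C# evaluates operands left to right: --c runs first, so src[c] reads the already-decremented index.
                    dst[--c] = (char)src[c];  // fill new String by in-situ mutation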
                while (c > 0);
    
        return ret;
    }
    

    In the direction just shown, which is typically bringing native data into managed code, you often don't have a managed byte array. Rather than allocate a temporary one just for the purpose of calling this function, you can process the raw native bytes directly into a managed string. As before, this bypasses all character encoding.

    The (obvious) range checks that would be needed in this unsafe function are elided for clarity:

    public static unsafe String ToThinString(byte* pSrc, int c)
    {
        var ret = String.Empty;
        if (c > 0)
            fixed (char* dst = (ret = new String('\0', c)))
                do
                    dst[--c] = (char)pSrc[c];  // fill new String by in-situ mutation
                while (c > 0);
    
        return ret;
    }
    

    The advantage of String mutation here is that you avoid temporary allocations by writing directly to the final allocation. Even if you were to avoid the extra allocation by using stackalloc, there would be an unnecessary re-copying of the whole thing when you eventually call the String(Char*, int, int) constructor: clearly there's no way to associate data you just laboriously prepared with a String object that didn't exist until you were finished!


    For completeness...

    Here's the mirror code, which reverses the operation to get back a byte array (even though this direction doesn't happen to illustrate the string-mutation technique). This is the direction you'd typically use to send Unicode text out of the managed .NET runtime, for use by a legacy app.

    /// <summary> Convert "thin string" to byte array </summary>
    public static unsafe byte[] ToByteArr(this String src)
    {
        int c;
        byte[] ret = null;
        if ((c = src.Length) > 0)
            fixed (byte* dst = (ret = new byte[c]))
                do
                    dst[--c] = (byte)src[c];
                while (c > 0);
    
        return ret ?? new byte[0];
    }
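
    To make the round-trip caveat concrete, here is a minimal sketch in safe code. Widen and Narrow are hypothetical stand-ins for ToThinString and ToByteArr: they perform the same byte-for-char mapping, but without pointers and therefore with the extra copy that the unsafe versions avoid.

```csharp
using System;

static class ThinRoundTripDemo
{
    // Hypothetical safe equivalent of ToThinString: zero-extend each byte to a 16-bit char.
    static string Widen(byte[] src)
    {
        var chars = new char[src.Length];
        for (int i = 0; i < src.Length; i++)
            chars[i] = (char)src[i];
        return new string(chars);   // copies the buffer; the unsafe version skips this copy
    }

    // Hypothetical safe equivalent of ToByteArr: keep the low byte, discard the high byte.
    static byte[] Narrow(string src)
    {
        var bytes = new byte[src.Length];
        for (int i = 0; i < src.Length; i++)
            bytes[i] = unchecked((byte)src[i]);
        return bytes;
    }

    static void Main()
    {
        string latin = "caf\u00E9";   // every char is <= U+00FF, so nothing is lost
        Console.WriteLine(Widen(Narrow(latin)) == latin);            // True

        string euro = "\u20AC";       // U+20AC: the high byte 0x20 is stripped
        Console.WriteLine(Widen(Narrow(euro))[0] == '\u00AC');       // True: comes back as U+00AC
    }
}
```

    The second case shows why recovery only works when the upper byte carries no information: the euro sign silently turns into a different character.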
    
  • 2020-12-07 01:32

    Your example works just fine, thanks to several elements:

    • candidateString lives in the managed heap, so it's safe to modify. Compare this with baseString, which is interned. If you try to modify the interned string, unexpected things may happen. There's no guarantee that string won't live in write-protected memory at some point, although it seems to work today. That would be pretty similar to assigning a constant string to a char* variable in C and then modifying it. In C, that's undefined behavior.

    • You preallocate enough space in candidateString - so you're not overflowing the buffer.

    • Character data is not stored at offset 0 of the String class. It's stored at an offset equal to RuntimeHelpers.OffsetToStringData.

      public static int OffsetToStringData
      {
          // This offset is baked in by string indexer intrinsic, so there is no harm
          // in getting it baked in here as well.
          [System.Runtime.Versioning.NonVersionable] 
          get {
              // Number of bytes from the address pointed to by a reference to
              // a String to the first 16-bit character in the String.  Skip 
              // over the MethodTable pointer, & String 
              // length.  Of course, the String reference points to the memory 
              // after the sync block, so don't count that.  
              // This property allows C#'s fixed statement to work on Strings.
              // On 64 bit platforms, this should be 12 (8+4) and on 32 bit 8 (4+4).
      #if WIN32
              return 8;
      #else
              return 12;
      #endif // WIN32
          }
      }
      

      Except...

    • GCHandle.AddrOfPinnedObject is special-cased for two kinds of types: string and array types. Instead of returning the address of the object itself, it returns the address of the first character (or first array element), so the offset is already applied for you. See the source code in CoreCLR.

      // Get the address of a pinned object referenced by the supplied pinned
      // handle.  This routine assumes the handle is pinned and does not check.
      FCIMPL1(LPVOID, MarshalNative::GCHandleInternalAddrOfPinnedObject, OBJECTHANDLE handle)
      {
          FCALL_CONTRACT;
      
          LPVOID p;
          OBJECTREF objRef = ObjectFromHandle(handle);
      
          if (objRef == NULL)
          {
              p = NULL;
          }
          else
          {
              // Get the interior pointer for the supported pinned types.
              if (objRef->GetMethodTable() == g_pStringClass)
                  p = ((*(StringObject **)&objRef))->GetBuffer();
              else if (objRef->GetMethodTable()->IsArray())
                  p = (*((ArrayBase**)&objRef))->GetDataPtr();
              else
                  p = objRef->GetData();
          }
      
          return p;
      }
      FCIMPLEND
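
    The interning point in the first bullet can be seen without any unsafe code. The sketch below (class and variable names are illustrative only) shows that identical literals in one assembly share a single object, while a string built at run time is a separate allocation. That sharing is why mutating an interned string would affect every use of the literal.

```csharp
using System;

static class InternDemo
{
    static void Main()
    {
        string a = "fox";
        string b = "fox";                          // same literal in the same assembly
        string c = new string(a.ToCharArray());    // fresh heap allocation with equal contents

        Console.WriteLine(object.ReferenceEquals(a, b));  // True: both refer to the one interned object
        Console.WriteLine(object.ReferenceEquals(a, c));  // False: c is a distinct object
        Console.WriteLine(a == c);                        // True: the contents are equal
    }
}
```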
      

    In summary, the runtime lets you play with its data and doesn't complain. You're using unsafe code after all. I've seen worse runtime messing than that, including creating reference types on the stack ;-)

    Just remember to add one additional \0 after all the characters (at offset Length) if your final string is shorter than what's allocated. This won't overflow: each string has an implicit null character at the end to ease interop scenarios.


    Now take a look at how StringBuilder creates a string. Here's StringBuilder.ToString:

    [System.Security.SecuritySafeCritical]  // auto-generated
    public override String ToString() {
        Contract.Ensures(Contract.Result<String>() != null);
    
        VerifyClassInvariant();
    
        if (Length == 0)
            return String.Empty;
    
        string ret = string.FastAllocateString(Length);
        StringBuilder chunk = this;
        unsafe {
            fixed (char* destinationPtr = ret)
            {
                do
                {
                    if (chunk.m_ChunkLength > 0)
                    {
                        // Copy these into local variables so that they are stable even in the presence of race conditions
                        char[] sourceArray = chunk.m_ChunkChars;
                        int chunkOffset = chunk.m_ChunkOffset;
                        int chunkLength = chunk.m_ChunkLength;
    
                        // Check that we will not overrun our boundaries. 
                        if ((uint)(chunkLength + chunkOffset) <= ret.Length && (uint)chunkLength <= (uint)sourceArray.Length)
                        {
                            fixed (char* sourcePtr = sourceArray)
                                string.wstrcpy(destinationPtr + chunkOffset, sourcePtr, chunkLength);
                        }
                        else
                        {
                            throw new ArgumentOutOfRangeException("chunkLength", Environment.GetResourceString("ArgumentOutOfRange_Index"));
                        }
                    }
                    chunk = chunk.m_ChunkPrevious;
                } while (chunk != null);
            }
        }
        return ret;
    }
    

    Yes, it uses unsafe code, and yes, you can optimize yours by using fixed, as this type of pinning is much more lightweight than allocating a GC handle:

    const string baseString = "The quick brown fox jumps over the lazy dog!";
    
    //initialize a new string
    string candidateString = new string('\0', baseString.Length);
    
    //Copy the contents of the base string to the candidate string
    unsafe
    {
        fixed (char* cCandidateString = candidateString)
        {
            for (int i = 0; i < baseString.Length; i++)
                cCandidateString[i] = baseString[i];
        }
    }
    

    When you use fixed, the GC only discovers that an object needs to be pinned when it stumbles upon it during a collection. If there's no collection going on, the GC isn't even involved. When you use GCHandle, a handle is registered with the GC each time.
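
    For comparison, here is a sketch of the GCHandle-based pinning that the fixed version replaces. The question's own code is truncated above, so this is an assumed reconstruction rather than the asker's code. It uses only GCHandle.Alloc, AddrOfPinnedObject, Marshal.WriteInt16 and Free.

```csharp
using System;
using System.Runtime.InteropServices;

static class PinnedHandleDemo
{
    static void Main()
    {
        const string baseString = "The quick brown fox jumps over the lazy dog!";
        string candidateString = new string('\0', baseString.Length);

        // Pinning through a GCHandle registers a handle with the GC on every call.
        GCHandle handle = GCHandle.Alloc(candidateString, GCHandleType.Pinned);
        try
        {
            // For a string this is the address of the first character, not of the object header.
            IntPtr firstChar = handle.AddrOfPinnedObject();
            for (int i = 0; i < baseString.Length; i++)
                Marshal.WriteInt16(firstChar, i * sizeof(char), baseString[i]);
        }
        finally
        {
            handle.Free();   // release the handle, otherwise the string stays pinned
        }

        Console.WriteLine(candidateString == baseString);
    }
}
```

    The try/finally is the bookkeeping that fixed does for you: the pin ends automatically when the fixed block exits.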
