.NET Regular expressions on bytes instead of chars

栀梦 2020-12-29 07:46

I'm trying to do some parsing that will be easier using regular expressions.

The input is an array (or enumeration) of bytes.

I don't want to convert the b

4 Answers
  • 2020-12-29 07:59

    If I faced this problem, I would write the C++/CLI wrapper, but I would start with specialized code for exactly what I want to achieve. The wrapper could grow over time to cover more general cases, but that is just an option.

    The first step is to wrap the Boost::Regex input and output only. Create specialized functions in C++ that do all the stuff you want and use CLI just to pass the input data to the C++ code and then fetch the result back with the CLI. This doesn't look to me like too much work to do.

    Update:

    Let me try to clarify my point. I may be wrong, but I believe you won't find any .NET binary regex implementation that you could use. That is why, whether you like it or not, you will be forced to choose between a CLI wrapper and a bytes-to-chars conversion that lets you use .NET's Regex. In my opinion the wrapper is the better choice because it should run faster. I did not do any benchmarking; this is just an assumption based on:

    1. Using wrapper you just have to cast the pointer type (bytes <-> chars).
    2. Using .NET's Regex you have to convert each byte of the input.
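
    For comparison, the bytes-to-chars route can be sketched as follows. This is a minimal sketch, not a benchmarked solution: it assumes the input is decoded with ISO-8859-1, which maps each byte value to the char with the same code, so byte offsets and match indexes coincide. The class name `Latin1Regex`, the method `FindHeader`, and the byte pattern are made up for illustration.

    ```csharp
    using System;
    using System.Text;
    using System.Text.RegularExpressions;

    static class Latin1Regex
    {
        // ISO-8859-1 maps byte values 0x00-0xFF to the chars U+0000-U+00FF one-to-one.
        static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

        // Hypothetical pattern: bytes 0x07 0xE8 followed by one byte in 0x40-0x7F.
        static readonly Regex Header = new Regex(@"\x07\xE8(?<mode>[\x40-\x7F])");

        // Returns the byte offset of the first header, or -1 if there is none.
        public static int FindHeader(byte[] input, out byte mode)
        {
            string text = Latin1.GetString(input);   // allocates one char per input byte
            Match match = Header.Match(text);
            mode = match.Success ? (byte)match.Groups["mode"].Value[0] : (byte)0;
            return match.Success ? match.Index : -1;
        }
    }
    ```

    The cost is the per-call string allocation (two bytes per input byte), which is the conversion overhead described in point 2.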
  • 2020-12-29 08:01

    As an alternative to using unsafe, just consider writing a simple, recursive comparer like:

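    // Returns true if sequence occurs anywhere in data at or after dataIndex.
    static bool Evaluate(byte[] data, byte[] sequence, int dataIndex = 0, int sequenceIndex = 0)
    {
        // An empty sequence trivially matches.
        if (sequence.Length == 0)
            return true;

        // Ran out of data before completing a match.
        if (dataIndex >= data.Length)
            return false;

        if (sequence[sequenceIndex] == data[dataIndex])
        {
            if (sequenceIndex == sequence.Length - 1)
                return true;
            return Evaluate(data, sequence, dataIndex + 1, sequenceIndex + 1);
        }

        // Mismatch: restart the sequence one byte after where this attempt began,
        // so that overlapping candidates (e.g. "ab" in "aab") are not skipped.
        return Evaluate(data, sequence, dataIndex - sequenceIndex + 1, 0);
    }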
    

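    You could improve efficiency in a number of ways (e.g. seeking to the first matching byte instead of stepping one byte at a time), and because the recursion depth grows with the length of the data, a loop is safer for large inputs. Still, this could get you started. Hope it helps.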
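
    The "seek to the first byte" idea could look roughly like the loop below. It is a sketch under assumptions: the names `ByteSearch` and `IndexOf` are invented here, and it returns the match position rather than a bool. It relies only on `Array.IndexOf` to jump to the next candidate start.

    ```csharp
    using System;

    static class ByteSearch
    {
        // Returns the index of the first occurrence of sequence in data, or -1.
        public static int IndexOf(byte[] data, byte[] sequence)
        {
            if (sequence.Length == 0) return 0;

            // Last position at which the whole sequence still fits.
            int lastStart = data.Length - sequence.Length;
            int start = 0;
            while (start <= lastStart)
            {
                // Jump straight to the next byte equal to the first sequence byte.
                start = Array.IndexOf(data, sequence[0], start, lastStart - start + 1);
                if (start < 0) return -1;

                // Compare the rest of the sequence at this candidate position.
                int i = 1;
                while (i < sequence.Length && data[start + i] == sequence[i]) i++;
                if (i == sequence.Length) return start;

                start++; // no match here, try the next candidate
            }
            return -1;
        }
    }
    ```

    For example, `ByteSearch.IndexOf(new byte[] { 66, 76, 73, 78, 71 }, new byte[] { 73, 78, 71 })` returns 2.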

  • 2020-12-29 08:12

    There is a bit of an impedance mismatch here. Regular expressions in .NET operate on strings, whose chars are 16-bit UTF-16 code units, but you want to work with single-byte values. With the standard Regex API you can't have both at the same time.

    However, to break this mismatch down, you could deal with a string in a byte oriented fashion and mutate it. The mutated string can then act as a re-usable buffer. In this way you will not have to convert bytes to chars, or convert your input buffer to a string (as per your question).

    An example:

    // ASCII bytes for "BLING"
    byte[] inputBuffer = { 66, 76, 73, 78, 71 };
    
    string stringBuffer = new string('\0', 1000);
    
    Regex regex = new Regex("ING", RegexOptions.Compiled);
    
    unsafe
    {
        fixed (char* charArray = stringBuffer)
        {
            byte* buffer = (byte*)(charArray);
    
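            // Hard-coded example of string mutation. In practice you would
            // loop over your input buffers and regex/match so that the string
            // buffer is re-used.
            // Each char occupies two bytes; on a little-endian machine the low
            // byte comes first, so input byte i is written to offset 2 * i.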
    
            buffer[0] = inputBuffer[0];
            buffer[2] = inputBuffer[1];
            buffer[4] = inputBuffer[2];
            buffer[6] = inputBuffer[3];
            buffer[8] = inputBuffer[4];
    
            Console.WriteLine("Mutated string:'{0}'.",
                 stringBuffer.Substring(0, inputBuffer.Length));
    
            Match match = regex.Match(stringBuffer, 0, inputBuffer.Length);
    
            Console.WriteLine("Position:{0} Length:{1}.", match.Index, match.Length);
        }
    }
    

    Using this technique you can allocate a string "buffer" once and re-use it as the input to Regex, overwriting it with your bytes before each match. This avoids the overhead of converting/encoding your byte array into a new .NET string every time you want to match. The saving can be significant: I have seen many .NET algorithms that should have been fast brought to their knees by string generation, the resulting heap churn, and the time spent in GC.

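    Obviously this is unsafe code, but it is still .NET. Be aware that it mutates a string, which the runtime treats as immutable, and that it assumes a little-endian memory layout.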

    The results of the Regex will generate strings though, so you have an issue here. I'm not sure if there is a way of using Regex that will not generate new strings. You can certainly get at the match index and length information but the string generation violates your requirements for memory efficiency.

    Update

    Actually, after disassembling Regex/Match/Group/Capture, it looks like the captured string is only generated when you access the Value property, so you may at least avoid generating strings if you only read the Index and Length properties. However, you will still be allocating all the supporting Regex objects.
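
    Building on that observation, one way to consume a match without touching Value is to use Index and Length to slice the original byte array. This is only a sketch: `MatchSlices` and `FirstMatch` are hypothetical names, and it assumes `text` holds one char per byte of `source` at the same offsets (for instance the re-used string buffer above).

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    static class MatchSlices
    {
        // Returns a view over the matched bytes, or an empty segment if nothing matched.
        // Assumes text.Length >= source.Length and text[i] corresponds to source[i].
        public static ArraySegment<byte> FirstMatch(Regex regex, string text, byte[] source)
        {
            Match match = regex.Match(text, 0, source.Length);

            // Only Index and Length are read; Value is never accessed.
            return match.Success
                ? new ArraySegment<byte>(source, match.Index, match.Length)
                : new ArraySegment<byte>(source, 0, 0);
        }
    }
    ```

    The returned `ArraySegment<byte>` references the original array, so no bytes are copied; the Match object itself is still allocated.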

  • 2020-12-29 08:16

    I personally took a different approach and wrote a small state machine that can be extended. For parsing protocol data, I believe this is much more readable than a regex.

    bool ParseUDSResponse(PassThruMsg rxMsg, UDScmd.Mode txMode, byte txSubFunction, out UDScmd.Response functionResponse, out byte[] payload)
    {
        payload = new byte[0];
        functionResponse = UDScmd.Response.UNKNOWN;
        bool positiveReponse = false;
        var rxMsgBytes = rxMsg.GetBytes();
    
        //Iterate the reply bytes to find the echoed ECU index, response code, function response and payload data if there is any
        //If we could use some kind of HEX regex this would be a bit neater
        //Iterate until we get past any and all null padding
        int stateMachine = 0;
        for (int i = 0; i < rxMsgBytes.Length; i++)
        {
            switch (stateMachine)
            {
                case 0:
                    if (rxMsgBytes[i] == 0x07) stateMachine = 1;
                    break;
                case 1:
                    if (rxMsgBytes[i] == 0xE8) stateMachine = 2;
                    else return false;
                    break; //C# does not allow falling through into the next case
                case 2:
                    if (rxMsgBytes[i] == (byte)txMode + (byte)OBDcmd.Reponse.SUCCESS)
                    {
                        //Positive response to the requested mode
                        positiveReponse = true;
                    }
                    else if(rxMsgBytes[i] != (byte)OBDcmd.Reponse.NEGATIVE_RESPONSE)
                    {
                        //This is an invalid response, give up now
                        return false;
                    }
                    stateMachine = 3;
                    break;
                case 3:
                    functionResponse = (UDScmd.Response)rxMsgBytes[i];
                    if (positiveReponse && rxMsgBytes[i] == txSubFunction)
                    {
                        //We have a positive response and a positive subfunction code (subfunction is reflected)
                        int payloadLength = rxMsgBytes.Length - i;
                        if(payloadLength > 0)
                        {
                            payload = new byte[payloadLength];
                            Array.Copy(rxMsgBytes, i, payload, 0, payloadLength);
                        }
                        return true;
                    } else
                    {
                        //We had a positive response but a negative subfunction error
                        //we return the function error code so it can be relayed
                        return false;
                    }
                default:
                    return false;
            }
        }
        return false;
    }
    