TryParse without Actual Parsing or any other Alternative for Checking Text Format with Performance Benefit

感情败类 2021-01-14 01:37

I am currently writing my own library, called TextCheckerExtension, which checks text format before further processing (short code snippet shown

1 Answer
  • 2021-01-14 01:57

    It does make a difference.

    To my surprise, as I continued this project out of curiosity, I found that actually parsing a string takes significantly longer than simply checking whether it has a certain format.

    In the experiment below, a checker that does no parsing took 33.77% to 58.26% less time than the built-in TryParse. I also compared my extension with VB.NET IsNumeric, from the Information class in the Microsoft.VisualBasic assembly.

    Here are (1) the tested code, (2) the testing scenarios, (3) the testing code, and (4) the testing results, with notes added where necessary:


    Tested Code:

    Here is the tested code, my extension named Extension.Checker.Text. So far I have only tested generic integer and float/double text (with or without a dot; perhaps better termed fractional numbers). By generic integer I mean that the value range (such as -128 to 127 for an 8-bit signed integer) is not checked. The code only determines whether a text is an integer as a human would read it, regardless of range. The same goes for float/double.

    Judging by this post, whose answer had 400+ upvotes at the time of writing, I believe it is safe to assume that int.TryParse is generally the first thing people reach for to test whether a text is an integer (even though its range is limited to roughly -2.1e9 to 2.1e9). Other posts show the same trend. Another approach seen in those posts is Visual Basic's IsNumeric, so I included it in the benchmark too.

    public static bool IsFloatOrDoubleByDot(string str) { // another criterion for float: allow a trailing "f"?
      if (string.IsNullOrWhiteSpace(str))
        return false;
      int dotCounter = 0;
      int digitCounter = 0;
      for (int i = str[0] == '-' ? 1 : 0; i < str.Length; i++) {
        if (str[i] == '.') {
          if (++dotCounter > 1) // more than one dot is never valid
            return false;
        } else if (char.IsDigit(str, i))
          ++digitCounter;
        else
          return false; // neither a digit nor a dot
      }
      return digitCounter > 0; // reject "-", "." and "-.", which contain no digit
    }

    public static bool IsDigitsOnly(string str) { // expects a non-null string
      foreach (char c in str)
        if (c < '0' || c > '9')
          return false;
      return str.Length >= 1; // an empty string is not a number
    }

    public static bool IsInt(string str) { // null, empty, and whitespace-only input return false
      if (string.IsNullOrWhiteSpace(str))
        return false;
      // Skip a single leading minus sign, then require digits only
      return str[0] == '-' && str.Length > 1 ? IsDigitsOnly(str.Substring(1)) : IsDigitsOnly(str);
    }
    




    Testing Scenario:

    So far, I have tested four different scenarios:

    • integer (in the parse-able range by int.TryParse)
    • float text containing dot (max of 7-digit precision, in the accurate parse-able range by float.TryParse)
    • double text containing dot (max of 11-digit precision, in the accurate parse-able range by double.TryParse)
    • integer text read as float/double text (in the parse-able range by double.TryParse)

    And for each scenario, I have four cases to test:

    • Valid positive-valued text
    • Valid negative-valued text
    • Invalid positive-valued text
    • Invalid negative-valued text

    And for each case I tested the time needed to do the checking by:

    • Suitable TryParse
    • Suitable Extension.Checker.Text
    • Visual Basic IsNumeric
    • Other type-specific tricks like string.All(char.IsDigit) for integer




    Testing Code:

    To test the above scenarios, I use the following data:

    string intpos = "1342517340";
    string intneg = "-1342517340";
    string intfalsepos = "134251734u";
    string intfalseneg = "-134251734u";
    string floatpos = "56.34251";
    string floatneg = "-56.34251";
    string floatfalsepos = "56.3425h";
    string floatfalseneg = "-56.3425h";
    string doublepos = "56.342515312";
    string doubleneg = "-56.342515312";
    string doublefalsepos = "56.34251531y";
    string doublefalseneg = "-56.34251531y";
    List<string> liststr = new List<string>() {
        intpos, intneg, intfalsepos, intfalseneg,
        floatpos, floatneg, floatfalsepos, floatfalseneg,
        doublepos, doubleneg, doublefalsepos, doublefalseneg
    };
    List<string> liststrcode = new List<string>() {
        "i+", "i-", "if+", "if-",
        "f+", "f-", "ff+", "ff-",
        "d+", "d-", "df+", "df-"
    };
    bool parsed = false; //to store checking result
    int intval; //for int.TryParse result
    float fval; //for float.TryParse result
    double dval; //for double.TryParse result
    

    Each text code combines a type letter (i, f, or d), an optional f for false (invalid) text, and a sign. Examples:

    • if+ = integer false positive
    • f- = float negative

    And I use the following testing loop to get the time performance of each method per case:

    //time snap
    for (int i = 0; i < 10000000; ++i) //for integer case
        parsed = int.TryParse(str, out intval); //built-in TryParse
    //time snap
    //Print the result
    //time snap
    for (int i = 0; i < 10000000; ++i)
        parsed = Extension.Checker.Text.IsInt(str); //extension Text checker
    //time snap
    //Print the result
    //time snap
    for (int i = 0; i < 10000000; ++i)
        parsed = Information.IsNumeric(str); //Microsoft.VisualBasic
    //time snap
    //Print the result
    //time snap
    for (int i = 0; i < 10000000; ++i)
        parsed = str[0] == '-' ? str.Substring(1).All(char.IsDigit) : str.All(char.IsDigit); //misc methods
    //time snap
    //Print the result
    //Print the result difference
    

    Each method ran 10 million iterations per test case on my laptop.
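
    The "time snap" comments in the loop above could be realized with System.Diagnostics.Stopwatch. The following is only a sketch of one way to do it: the Measure helper and the TimingSketch class are hypothetical names, not part of the original test code, and calling through a delegate adds a small overhead that the plain loops above do not have.

```csharp
using System;
using System.Diagnostics;

static class TimingSketch
{
    // Hypothetical helper: runs `check` on `text` `iterations` times
    // and returns the elapsed wall-clock time in milliseconds.
    public static long Measure(Func<string, bool> check, string text, int iterations)
    {
        bool parsed = false;                 // stores the checking result, as in the original loop
        Stopwatch sw = Stopwatch.StartNew(); // first time snap
        for (int i = 0; i < iterations; ++i)
            parsed = check(text);
        sw.Stop();                           // second time snap
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        const int iterations = 10000000;     // same iteration count as the test above
        string intpos = "1342517340";
        long ms = Measure(s => { int v; return int.TryParse(s, out v); }, intpos, iterations);
        Console.WriteLine("TryParse i+: " + ms + " ms");
    }
}
```

    Running each method once before timing it would keep JIT compilation out of the measurement.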

    Note: Extension.Checker.Text is not completely equivalent to the built-in TryParse. It does not check the numeric range of the string, and it rejects some formats that TryParse accepts. This is deliberate: unlike TryParse, its purpose is not to convert the text into a C# data type. The comparison here only measures, in terms of running time, (1) the popular way of checking a text format against (2) an extension method we could write when we do not need the parsed value, only whether the text has a certain format. The same applies to the comparison with VB IsNumeric.
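
    A few inputs where the two approaches disagree make this concrete. The sketch below uses a simplified stand-in for the integer checker (IsIntText and DifferenceSketch are hypothetical names) and pins int.TryParse to the invariant culture so that the results do not depend on the machine's regional settings.

```csharp
using System;
using System.Globalization;

static class DifferenceSketch
{
    // Simplified stand-in for the extension's integer check:
    // an optional leading '-', followed by one or more ASCII digits.
    public static bool IsIntText(string str)
    {
        if (string.IsNullOrEmpty(str)) return false;
        int start = str[0] == '-' ? 1 : 0;
        if (start == str.Length) return false; // "-" alone is not a number
        for (int i = start; i < str.Length; i++)
            if (str[i] < '0' || str[i] > '9') return false;
        return true;
    }

    // int.TryParse with the default integer style, fixed to the invariant culture.
    public static bool TryParseInt(string str)
    {
        int value;
        return int.TryParse(str, NumberStyles.Integer, CultureInfo.InvariantCulture, out value);
    }

    static void Main()
    {
        // Beyond the Int32 range: integer text to a human, rejected by TryParse.
        Console.WriteLine(IsIntText("99999999999"));   // True
        Console.WriteLine(TryParseInt("99999999999")); // False
        // Surrounding whitespace: allowed by NumberStyles.Integer, not by the checker.
        Console.WriteLine(IsIntText(" 42 "));          // False
        Console.WriteLine(TryParseInt(" 42 "));        // True
        // Leading plus sign: allowed by NumberStyles.Integer, not by the checker.
        Console.WriteLine(IsIntText("+5"));            // False
        Console.WriteLine(TryParseInt("+5"));          // True
    }
}
```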




    Testing Result:

    I printed the parse/check result to confirm that my extension agrees with the built-in TryParse, VB.NET IsNumeric, and the other tricks for the given cases, along with the original text for easy checking. From the time snaps between tests I obtained the running time of each method and the time difference for each case, which I also printed. The time gain is computed against TryParse only. Here is the complete result:

    [2016-01-05 06:04:25.466 UTC] Integer:
    [2016-01-05 06:04:26.999 UTC] TryParse i+:  1531 ms Result: True    Text: 1342517340
    [2016-01-05 06:04:27.639 UTC] Extension i+:     639 ms  Result: True    Text: 1342517340
    [2016-01-05 06:04:30.345 UTC] VB.IsNumeric i+:  2705 ms Result: True    Text: 1342517340
    [2016-01-05 06:04:31.468 UTC] All is digit i+:  1124 ms Result: True    Text: 1342517340
    [2016-01-05 06:04:31.469 UTC] Gain on TryParse i+:  892 ms  Percent: -58.26%
    [2016-01-05 06:04:31.469 UTC] 
    [2016-01-05 06:04:32.996 UTC] TryParse i-:  1527 ms Result: True    Text: -1342517340
    [2016-01-05 06:04:33.846 UTC] Extension i-:     849 ms  Result: True    Text: -1342517340
    [2016-01-05 06:04:36.413 UTC] VB.IsNumeric i-:  2566 ms Result: True    Text: -1342517340
    [2016-01-05 06:04:37.693 UTC] All is digit i-:  1280 ms Result: True    Text: -1342517340
    [2016-01-05 06:04:37.694 UTC] Gain on TryParse i-:  678 ms  Percent: -44.40%
    [2016-01-05 06:04:37.694 UTC] 
    [2016-01-05 06:04:39.058 UTC] TryParse if+:     1364 ms Result: False   Text: 134251734u
    [2016-01-05 06:04:39.845 UTC] Extension if+:    786 ms  Result: False   Text: 134251734u
    [2016-01-05 06:04:42.436 UTC] VB.IsNumeric if+:     2590 ms Result: False   Text: 134251734u
    [2016-01-05 06:04:43.540 UTC] All is digit if+:     1103 ms Result: False   Text: 134251734u
    [2016-01-05 06:04:43.540 UTC] Gain on TryParse if+:     578 ms  Percent: -42.38%
    [2016-01-05 06:04:43.540 UTC] 
    [2016-01-05 06:04:44.937 UTC] TryParse if-:     1397 ms Result: False   Text: -134251734u
    [2016-01-05 06:04:45.745 UTC] Extension if-:    807 ms  Result: False   Text: -134251734u
    [2016-01-05 06:04:48.275 UTC] VB.IsNumeric if-:     2530 ms Result: False   Text: -134251734u
    [2016-01-05 06:04:49.541 UTC] All is digit if-:     1267 ms Result: False   Text: -134251734u
    [2016-01-05 06:04:49.542 UTC] Gain on TryParse if-:     590 ms  Percent: -42.23%
    [2016-01-05 06:04:49.542 UTC] 
    [2016-01-05 06:04:49.542 UTC] Float by Dot:
    [2016-01-05 06:04:51.136 UTC] TryParse f+:  1594 ms Result: True    Text: 56.34251
    [2016-01-05 06:04:51.967 UTC] Extension f+:     830 ms  Result: True    Text: 56.34251
    [2016-01-05 06:04:54.328 UTC] VB.IsNumeric f+:  2360 ms Result: True    Text: 56.34251
    [2016-01-05 06:04:54.329 UTC] Time Gain f+:     764 ms  Percent: -47.93%
    [2016-01-05 06:04:54.329 UTC] 
    [2016-01-05 06:04:55.962 UTC] TryParse f-:  1634 ms Result: True    Text: -56.34251
    [2016-01-05 06:04:56.790 UTC] Extension f-:     827 ms  Result: True    Text: -56.34251
    [2016-01-05 06:04:59.102 UTC] VB.IsNumeric f-:  2313 ms Result: True    Text: -56.34251
    [2016-01-05 06:04:59.103 UTC] Time Gain f-:     807 ms  Percent: -49.39%
    [2016-01-05 06:04:59.103 UTC] 
    [2016-01-05 06:05:00.623 UTC] TryParse ff+:     1519 ms Result: False   Text: 56.3425h
    [2016-01-05 06:05:01.429 UTC] Extension ff+:    802 ms  Result: False   Text: 56.3425h
    [2016-01-05 06:05:03.730 UTC] VB.IsNumeric ff+:     2301 ms Result: False   Text: 56.3425h
    [2016-01-05 06:05:03.730 UTC] Time Gain ff+:    717 ms  Percent: -47.20%
    [2016-01-05 06:05:03.731 UTC] 
    [2016-01-05 06:05:05.312 UTC] TryParse ff-:     1581 ms Result: False   Text: -56.3425h
    [2016-01-05 06:05:06.147 UTC] Extension ff-:    835 ms  Result: False   Text: -56.3425h
    [2016-01-05 06:05:08.485 UTC] VB.IsNumeric ff-:     2337 ms Result: False   Text: -56.3425h
    [2016-01-05 06:05:08.486 UTC] Time Gain ff-:    746 ms  Percent: -47.19%
    [2016-01-05 06:05:08.486 UTC] 
    [2016-01-05 06:05:08.487 UTC] Double by Dot:
    [2016-01-05 06:05:10.341 UTC] TryParse d+:  1854 ms Result: True    Text: 56.342515312
    [2016-01-05 06:05:11.492 UTC] Extension d+:     1151 ms Result: True    Text: 56.342515312
    [2016-01-05 06:05:14.035 UTC] VB.IsNumeric d+:  2541 ms Result: True    Text: 56.342515312
    [2016-01-05 06:05:14.035 UTC] Time Gain d+:     703 ms  Percent: -37.92%
    [2016-01-05 06:05:14.036 UTC] 
    [2016-01-05 06:05:15.916 UTC] TryParse d-:  1879 ms Result: True    Text: -56.342515312
    [2016-01-05 06:05:17.051 UTC] Extension d-:     1133 ms Result: True    Text: -56.342515312
    [2016-01-05 06:05:19.542 UTC] VB.IsNumeric d-:  2492 ms Result: True    Text: -56.342515312
    [2016-01-05 06:05:19.543 UTC] Time Gain d-:     746 ms  Percent: -39.70%
    [2016-01-05 06:05:19.543 UTC] 
    [2016-01-05 06:05:21.210 UTC] TryParse df+:     1667 ms Result: False   Text: 56.34251531y
    [2016-01-05 06:05:22.315 UTC] Extension df+:    1104 ms Result: False   Text: 56.34251531y
    [2016-01-05 06:05:24.797 UTC] VB.IsNumeric df+:     2481 ms Result: False   Text: 56.34251531y
    [2016-01-05 06:05:24.798 UTC] Time Gain df+:    563 ms  Percent: -33.77%
    [2016-01-05 06:05:24.798 UTC] 
    [2016-01-05 06:05:26.509 UTC] TryParse df-:     1711 ms Result: False   Text: -56.34251531y
    [2016-01-05 06:05:27.596 UTC] Extension df-:    1086 ms Result: False   Text: -56.34251531y
    [2016-01-05 06:05:30.039 UTC] VB.IsNumeric df-:     2442 ms Result: False   Text: -56.34251531y
    [2016-01-05 06:05:30.040 UTC] Time Gain df-:    625 ms  Percent: -36.53%
    [2016-01-05 06:05:30.041 UTC] 
    [2016-01-05 06:05:30.041 UTC] Integer as Double by Dot:
    [2016-01-05 06:05:31.794 UTC] TryParse (doubled) i+:    1752 ms Result: True    Text: 1342517340
    [2016-01-05 06:05:32.904 UTC] Extension (doubled) i+:   1109 ms Result: True    Text: 1342517340
    [2016-01-05 06:05:35.590 UTC] VB.IsNumeric (doubled) d+:    2684 ms Result: True    Text: 1342517340
    [2016-01-05 06:05:35.590 UTC] Time Gain d+:     643 ms  Percent: -36.70%
    [2016-01-05 06:05:35.591 UTC] 
    [2016-01-05 06:05:37.390 UTC] TryParse (doubled) i-:    1799 ms Result: True    Text: -1342517340
    [2016-01-05 06:05:38.515 UTC] Extension (doubled) i-:   1125 ms Result: True    Text: -1342517340
    [2016-01-05 06:05:41.139 UTC] VB.IsNumeric (doubled) d-:    2623 ms Result: True    Text: -1342517340
    [2016-01-05 06:05:41.139 UTC] Time Gain d-:     674 ms  Percent: -37.47%
    [2016-01-05 06:05:41.140 UTC] 
    [2016-01-05 06:05:42.840 UTC] TryParse (doubled) if+:   1700 ms Result: False   Text: 134251734u
    [2016-01-05 06:05:43.933 UTC] Extension (doubled) if+:  1092 ms Result: False   Text: 134251734u
    [2016-01-05 06:05:46.575 UTC] VB.IsNumeric (doubled) df+:   2642 ms Result: False   Text: 134251734u
    [2016-01-05 06:05:46.576 UTC] Time Gain df+:    608 ms  Percent: -35.76%
    [2016-01-05 06:05:46.577 UTC] 
    [2016-01-05 06:05:48.328 UTC] TryParse (doubled) if-:   1750 ms Result: False   Text: -134251734u
    [2016-01-05 06:05:49.434 UTC] Extension (doubled) if-:  1106 ms Result: False   Text: -134251734u
    [2016-01-05 06:05:52.042 UTC] VB.IsNumeric (doubled) df-:   2607 ms Result: False   Text: -134251734u
    [2016-01-05 06:05:52.042 UTC] Time Gain df-:    644 ms  Percent: -36.80%
    [2016-01-05 06:05:52.043 UTC] 
    

    My conclusions from the results so far:

    • The largest gain from an extension method like the one above comes with valid positive integer text: 58.26% less time than TryParse for the given case. This is perhaps due to the simplicity of valid positive integer text.
    • The smallest gain comes with invalid positive double text: only 33.77% less time than TryParse for the given case.
    • For integer and float/double text (with or without a dot), checking whether a text has one of those formats without actually parsing it can be made faster by writing our own text checker instead of using the built-in TryParse. VB IsNumeric is slower than the rest in all cases (this also surprised me, because according to the benchmark in this post, VB seems to be pretty fast, though not the fastest).
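
    The percentages in the log are the time saved relative to TryParse; the log prints them with a minus sign. As a quick check, this sketch recomputes the best and worst cases from the logged milliseconds (GainPercent and GainSketch are hypothetical names):

```csharp
using System;
using System.Globalization;

static class GainSketch
{
    // Percentage of time saved by the extension relative to TryParse.
    public static double GainPercent(long tryParseMs, long extensionMs)
    {
        return 100.0 * (tryParseMs - extensionMs) / tryParseMs;
    }

    static void Main()
    {
        // i+ : TryParse 1531 ms, Extension 639 ms
        Console.WriteLine(GainPercent(1531, 639).ToString("F2", CultureInfo.InvariantCulture));  // 58.26
        // df+: TryParse 1667 ms, Extension 1104 ms
        Console.WriteLine(GainPercent(1667, 1104).ToString("F2", CultureInfo.InvariantCulture)); // 33.77
    }
}
```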


    Possible uses:

    One possible use is when you receive a string that may be in more than one format (say, integer or double) and you want to determine the actual text type first, without parsing it at the time of checking. In such a case, an extension method may speed up the process.

    Another use is in computational linguistics, where you often want to know the type of a text without actually parsing it for use in computation.
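
    For the first use, a single pass over the string can decide between integer text, fractional text, and anything else. This is a minimal sketch under the same rules as the checkers above (optional leading minus, ASCII digits, at most one dot); the TextKind enum and the Classify method are hypothetical names, not part of the extension.

```csharp
using System;

static class ClassifySketch
{
    public enum TextKind { Integer, Fraction, Other }

    // Classifies text in one pass without converting it to a number.
    public static TextKind Classify(string str)
    {
        if (string.IsNullOrEmpty(str)) return TextKind.Other;
        int dots = 0, digits = 0;
        for (int i = str[0] == '-' ? 1 : 0; i < str.Length; i++)
        {
            char c = str[i];
            if (c >= '0' && c <= '9')
                ++digits;
            else if (c == '.')
            {
                if (++dots > 1) return TextKind.Other; // a second dot is invalid
            }
            else
                return TextKind.Other;                 // any other character is invalid
        }
        if (digits == 0) return TextKind.Other;        // "-", "." and "-." contain no digit
        return dots == 0 ? TextKind.Integer : TextKind.Fraction;
    }

    static void Main()
    {
        Console.WriteLine(Classify("1342517340")); // Integer
        Console.WriteLine(Classify("-56.34251"));  // Fraction
        Console.WriteLine(Classify("56.3425h"));   // Other
    }
}
```

    The caller can then run the matching TryParse only once, and only when the numeric value is actually needed.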
