Regular Expression For Parsing Data

一向 2021-02-01 10:48

I am writing an application that reads some data from a simple text file. The data files I am interested in have lines of the following form:

Mem(100) = 1         


        
1 Answer
  • 2021-02-01 11:04

    First of all, remember to #include <regex>.

    C++ std::regex uses a modified ECMAScript (JavaScript-style) syntax by default, so patterns work much like regular expressions in other languages.

    Let's start with a simple example:

    std::string str = "Mem(100)=120";
    std::regex regex("^Mem\\([0-9]+\\)=[0-9]+$");
    std::cout << std::regex_match(str, regex) << std::endl;
    

    In this case, our regex is ^Mem\([0-9]+\)=[0-9]+$. Let's take a look at what it does:

    • The ^ at the beginning anchors the match to the start of the line, so AMem(1)=2 does not match.
    • The $ at the end anchors the match to the end of the line, so Mem(1)=2x does not match. (std::regex_match already requires the whole string to match, so both anchors are redundant here; they matter once you switch to std::regex_search.)
    • \\( is a literal ( character. ( has a special meaning in regular expressions, so we escape it as \(. However, the \ character also has a special meaning in C++ string literals, so we write \\( to make C++ pass \( to the regular expression engine.
    • [0-9] matches a digit. \\d also works, since the default ECMAScript grammar supports the \d class.
    • [0-9]+ means at least one digit. If Mem() is acceptable, then use [0-9]* instead.
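The points above can be checked with a small helper. This is only a sketch: is_mem_line is a hypothetical name, not something from the question, and it wraps the exact pattern discussed so far.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (name is an assumption): returns true if `line`
// has the exact form Mem(<digits>)=<digits>, with no whitespace allowed.
bool is_mem_line(const std::string& line) {
    // Built once and reused; constructing a std::regex is comparatively slow.
    static const std::regex pattern("^Mem\\([0-9]+\\)=[0-9]+$");
    return std::regex_match(line, pattern);
}
```

Calling is_mem_line("Mem(100)=120") returns true, while "AMem(1)=2", "Mem(1)=2x" and "Mem()=2" are all rejected.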

    As you can see, this is just like the regular expressions you'd find in other languages (such as Java or C#).

    Now, to allow optional whitespace, use std::regex regex("^\\s*Mem\\([0-9]+\\)\\s*=\\s*[0-9]+\\s*$");

    Note that \s includes \t, so no need to specify both. If it didn't, you'd use (\s|\t) or [\s\t], not (\s,\t).
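As a rough check of where whitespace is and isn't accepted, here is a sketch that wraps the whitespace-tolerant pattern. The function name is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (name is an assumption): accepts optional whitespace
// at the start, around '=', and at the end of the line.
bool is_mem_line_ws(const std::string& line) {
    static const std::regex pattern(
        "^\\s*Mem\\([0-9]+\\)\\s*=\\s*[0-9]+\\s*$");
    return std::regex_match(line, pattern);
}
```

Spaces and tabs are both accepted wherever \s* appears, so "  Mem(100) = 1  " and "\tMem(100)\t=\t1" match. The pattern has no \s* between Mem and (, so "Mem (100)=1" does not.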

    Finally, to allow floating-point numbers, we first need to decide whether Mem(1) = 1. (that is, a dot with no digits after it) is acceptable.

    If it isn't, then the fractional part (the .23 in 1.23) is optional as a unit. In regexes, we use ? to indicate that.

    std::regex regex("^[\\s]*Mem\\([0-9]+\\)\\s*=\\s*[0-9]+(\\.[0-9]+)?\\s*$");
    

    Note that we use \. instead of a plain dot. An unescaped . has a special meaning in regular expressions (it matches any character), so we need to escape it.
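To see why the escape matters, the sketch below compares the pattern above with a deliberately wrong variant in which the dot is left unescaped. Both function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: the pattern from the answer, with the dot escaped.
bool matches_escaped_dot(const std::string& line) {
    static const std::regex pattern(
        "^[\\s]*Mem\\([0-9]+\\)\\s*=\\s*[0-9]+(\\.[0-9]+)?\\s*$");
    return std::regex_match(line, pattern);
}

// Deliberately wrong variant for comparison: the dot is NOT escaped,
// so it matches any single character, not just '.'.
bool matches_unescaped_dot(const std::string& line) {
    static const std::regex pattern(
        "^[\\s]*Mem\\([0-9]+\\)\\s*=\\s*[0-9]+(.[0-9]+)?\\s*$");
    return std::regex_match(line, pattern);
}
```

With the escaped dot, "Mem(1) = 1.23" and "Mem(1) = 1" match, while "Mem(1) = 1." and "Mem(1) = 1x23" do not. The unescaped variant wrongly accepts "Mem(1) = 1x23".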

    If you have a compiler that supports raw strings (e.g. Visual Studio 2013, GCC 4.5, Clang 3.0), you can simplify the regex string:

    std::regex regex(R"(^[\s]*Mem\([0-9]+\)\s*=\s*[0-9]+(\.[0-9]+)?\s*$)");
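A quick way to convince yourself that the two spellings are the same pattern is to compare the string literals directly. The constant and function names in this sketch are assumptions made for illustration.

```cpp
#include <regex>
#include <string>

// The same pattern written twice: once with escaped backslashes,
// once as a raw string literal. Names are hypothetical.
const std::string kEscapedPattern =
    "^[\\s]*Mem\\([0-9]+\\)\\s*=\\s*[0-9]+(\\.[0-9]+)?\\s*$";
const std::string kRawPattern =
    R"(^[\s]*Mem\([0-9]+\)\s*=\s*[0-9]+(\.[0-9]+)?\s*$)";

// Matches a line against the raw-string version of the pattern.
bool matches_raw_pattern(const std::string& line) {
    static const std::regex pattern(kRawPattern);
    return std::regex_match(line, pattern);
}
```

Here kEscapedPattern == kRawPattern holds, so both build the same regex. If a pattern ever contains the sequence )" itself, a custom delimiter such as R"re(...)re" avoids ending the literal early.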
    

    To extract information about the matched string, you can use std::smatch and groups.

    Let's start with a small change:

    std::string str = " Mem(100)=120";
    std::regex regex("^[\\s]*Mem\\(([0-9]+)\\)\\s*=\\s*([0-9]+(\\.[0-9]+)?)\\s*$");
    std::smatch m;
    
    std::cout << std::regex_match(str, m, regex) << std::endl;
    

    Note three things:

    1. We added a std::smatch. This class stores the results of the match.
    2. We added parentheses around the first [0-9]+ (the number inside Mem()). This defines a group. Groups tell the regex engine to keep track of whatever they match.
    3. We added more parentheses around the whole floating-point number. This defines a second group. (The parentheses around the optional fractional part, already present in the earlier pattern, form a third group.)

    Importantly, the parentheses that define groups are NOT escaped, since we don't want them to match literal parenthesis characters. Here we want their special regex meaning.

    Now that we have the groups, we can use them:

    for (auto result : m) {
        std::cout << result << std::endl;
    }
    

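    This first prints the whole match, then the number in Mem(), then the final number. Because the optional fractional part (\.[0-9]+)? is also wrapped in capturing parentheses, a fourth, empty line follows for this input.

    In other words, m[0] gives us the whole match, m[1] gives us the first group, m[2] gives us the second group, and m[3] gives us the third group (the fractional part, empty here because 120 has none). If you don't want that third group, write it as a non-capturing group, (?:\.[0-9]+)?.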
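Putting the groups to use, here is a minimal sketch that lists the sub-matches and converts the two numbers. MemEntry, capture_list and parse_mem_line are hypothetical names, not part of the question, and the pattern is the one from the smatch example. It has three pairs of capturing parentheses, so a successful match yields four entries.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical record type for one parsed line (an assumption).
struct MemEntry {
    int address;   // the number inside Mem(...)
    double value;  // the number after '='
};

// Returns every sub-match as a string: index 0 is the whole match,
// 1..3 are the capture groups. Returns an empty vector on no match.
std::vector<std::string> capture_list(const std::string& line) {
    static const std::regex pattern(
        "^[\\s]*Mem\\(([0-9]+)\\)\\s*=\\s*([0-9]+(\\.[0-9]+)?)\\s*$");
    std::smatch m;
    std::vector<std::string> out;
    if (std::regex_match(line, m, pattern)) {
        for (const auto& sub : m) out.push_back(sub.str());
    }
    return out;
}

// Fills `entry` and returns true if `line` matches; otherwise returns
// false and leaves `entry` untouched. std::stoi/std::stod can throw
// std::out_of_range for very long digit strings; that is not handled here.
bool parse_mem_line(const std::string& line, MemEntry& entry) {
    const std::vector<std::string> groups = capture_list(line);
    if (groups.empty()) return false;
    entry.address = std::stoi(groups[1]);
    entry.value = std::stod(groups[2]);
    return true;
}
```

For " Mem(100)=120", capture_list returns four strings: the whole match, "100", "120", and an empty string for the unmatched fractional group. For "Mem(7) = 1.5", parse_mem_line sets address to 7 and value to 1.5. Reading a file then comes down to calling parse_mem_line on each line obtained from std::getline.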
