While doing url encoding, the std::regex_replace doesn't work properly for character “+”


庸人自扰 2021-01-21 10:20

Following is the code snippet; regex_replace doesn't work properly for the character "+". I should not need special handling for individual characters; it should just work.

1 answer

伪装坚强ぢ (OP)

2021-01-21 10:47

You're interpreting the original string as a regex. + is special in regex¹.
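
To see the effect in isolation, here is a minimal sketch using std::regex. The helper names `replaceAsRegex` and `escapePlus` and the shortened URL are made up for this illustration, and it assumes the failing call passed the raw path as the pattern.

```cpp
#include <regex>
#include <string>

// Treats `pattern` as an ECMAScript regex, the way std::regex_replace does.
std::string replaceAsRegex(const std::string& input,
                           const std::string& pattern,
                           const std::string& replacement) {
    return std::regex_replace(input, std::regex(pattern), replacement);
}

// Illustrative helper: escapes only '+' so that it matches literally.
// A real escaper would have to handle every regex metacharacter.
std::string escapePlus(const std::string& text) {
    std::string out;
    for (char ch : text) {
        if (ch == '+') out += '\\';
        out += ch;
    }
    return out;
}
```

With the input `http://h/rbkt10/+` and the pattern `/rbkt10/+`, the `/+` part matches "one or more slashes", so only `/rbkt10/` is replaced and the result is `http://h/rbkt10/%2B+`. Passing `escapePlus("/rbkt10/+")` as the pattern gives `http://h/rbkt10/%2B`.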

You should simply use std::string::replace because you don't need regex replace functionality:

boost::smatch what;
if (regex_search(url.cbegin(), url.cend(), what, expression)) {
    boost::ssub_match query = what[6];
    url.replace(query.first, query.second, urlEncode(query.str(), false));
}
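
The same idea with the standard library only, as a rough sketch: std::regex is swapped in for boost::regex here, and `replaceGroup` is a hypothetical helper name, not something from the original code. The regex is used only to locate the range; the replacement text is inserted literally.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Replaces capture group `group` of the first match of `re` in `text`
// with `replacement`, taken as literal text (no `$1`-style expansion).
std::string replaceGroup(std::string text, const std::regex& re,
                         std::size_t group, const std::string& replacement) {
    std::smatch what;
    if (std::regex_search(text, what, re) &&
        group < what.size() && what[group].matched) {
        // The sub_match iterators point into `text`, so they can be
        // handed straight to std::string::replace.
        text.replace(what[group].first, what[group].second, replacement);
    }
    return text;
}
```

Because the replacement never goes through the regex engine, characters such as `+` or `$` in it need no escaping.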

Complicated, scattered code like this:
could simply be:
```
std::string bktObjKey = what[6].str();
```

Complicated loop

for (std::string::size_type i = 0; i < toEncode.length(); ++i) {
     char ch = toEncode.at(i);

Could just be

for (char ch : toEncode) {

charToHex creates a new 2-char string every time, using another stringstream every time, then copies the result out of that stringstream, and so on. Instead, just write to the stringstream you already have and avoid all that inefficiency:
```
void writeHex(std::ostream& os, unsigned char c, bool uppercase) {
    os << std::setfill('0') << std::hex;
    if (uppercase) 
        os << std::uppercase;
    os << std::setw(2) << static_cast<int>(c);
}
```
Note that this also fixes the fact that you forgot to use bUppercase.
Look at <cctype> for help classifying characters.
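
As a rough sketch of character classification with `<cctype>`, the following encoder keeps only the RFC 3986 "unreserved" characters. The names `isUnreserved` and `percentEncode` are made up here, and a lookup table is used for the hex digits instead of the stream manipulators shown above, so this is an alternative technique rather than the reviewed code.

```cpp
#include <cctype>
#include <string>

// RFC 3986 "unreserved" characters: ALPHA / DIGIT / "-" / "." / "_" / "~".
// std::isalnum must not receive a negative char value, hence the cast.
// Assumes the default "C" locale, where isalnum accepts only ASCII
// letters and digits.
bool isUnreserved(char ch) {
    const unsigned char uc = static_cast<unsigned char>(ch);
    return std::isalnum(uc) != 0 ||
           ch == '-' || ch == '.' || ch == '_' || ch == '~';
}

// Percent-encodes every byte that is not unreserved
// (and not '/' when keepSlash is true).
std::string percentEncode(const std::string& input, bool keepSlash) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(input.size());
    for (char ch : input) {
        if (isUnreserved(ch) || (keepSlash && ch == '/')) {
            out += ch;
        } else {
            const unsigned char uc = static_cast<unsigned char>(ch);
            out += '%';
            out += digits[uc >> 4];    // high nibble
            out += digits[uc & 0x0F];  // low nibble
        }
    }
    return out;
}
```

For example, `percentEncode("/rbkt10/+", true)` yields `/rbkt10/%2B`.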

Use raw string literals. Instead of writing

boost::regex expression("^(([^:/?#]+):)?(//([^/?#:]*)(:\\d+)?)?([^?#]*)((\\?[^#]*))?(#(.*))?");

write:

boost::regex expression(R"(^(([^:/?#]+):)?(//([^/?#:]*)(:\d+)?)?([^?#]*)((\?[^#]*))?(#(.*))?)");

(no need to doubly escape \d and \?)
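
A small check that both spellings produce the same string, using a shortened fragment of the pattern (the variable and function names are only for this illustration):

```cpp
#include <string>

// The same pattern fragment written twice: once with doubled backslashes,
// once as a raw string literal.
const std::string escapedLiteral = "(:\\d+)?(\\?[^#]*)?";
const std::string rawLiteral = R"((:\d+)?(\?[^#]*)?)";

// True when both literals denote exactly the same characters.
bool literalsAreIdentical() {
    return escapedLiteral == rawLiteral;
}
```

The raw form is what the regex engine actually sees, which makes the pattern easier to compare against documentation.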

Either drop all the redundant sub-groups

boost::regex expression(R"(^([^:/?#]+:)?(//[^/?#:]*(:\d+)?)?[^?#]*(\?[^#]*)?(#.*)?)");

OR make them maintainable and useful²:

boost::regex uri_regex(
    R"(^((?<scheme>[^:/?#]+):)?)"
    R"((?<authority>//(\?<host>[^/?#:]*)(:(?<port>\d+))?)?)"
    R"((?<path>[^?#]*))"
    R"((\?(?<query>([^#]*)))?)"
    R"((#(?<fragment>.*))?)");
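
If only the standard library is available, note that std::regex (ECMAScript grammar) has no named groups. A possible approximation, sketched below, is to name the group indices instead. It uses the generic pattern from RFC 3986, Appendix B rather than the regex above, and `UriParts` and `splitUri` are hypothetical names.

```cpp
#include <regex>
#include <string>

struct UriParts {
    std::string scheme, authority, path, query, fragment;
    bool hasQuery = false;
    bool hasFragment = false;
};

UriParts splitUri(const std::string& uri) {
    // Pattern from RFC 3986, Appendix B. Every component is optional.
    static const std::regex re(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");
    // Named indices stand in for named capture groups.
    enum { Scheme = 2, Authority = 4, Path = 5, Query = 7, Fragment = 9 };

    UriParts parts;
    std::smatch what;
    if (std::regex_match(uri, what, re)) {
        parts.scheme = what[Scheme].str();
        parts.authority = what[Authority].str();
        parts.path = what[Path].str();
        parts.hasQuery = what[Query].matched;
        parts.query = what[Query].str();
        parts.hasFragment = what[Fragment].matched;
        parts.fragment = what[Fragment].str();
    }
    return parts;
}
```

Here the scheme and authority are captured without their delimiters (`:` and `//`), so code that reassembles the URI has to write those back itself.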

Now that you have access to logical components of the URI, apply it to know better when and where to encode:

    std::string escaped = 
       what["scheme"].str() + 
       what["authority"].str() +
       urlEncode(what["path"].str(), false);

    if (query.matched) {
        escaped += '?';
        escaped.append(urlEncode(query, true));
    }

    if (fragment.matched) {
        escaped += '#';
        escaped.append(urlEncode(fragment, true));
    }

Make an overload of urlEncode that takes an existing ostream reference instead of always creating your own:

std::ostringstream out;
out << what["scheme"] << what["authority"];
urlEncode(out, what["path"], false);

if (query.matched)
    urlEncode(out << '?', query, true);

if (fragment.matched)
    urlEncode(out << '#', fragment, true);

Code After Review


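#include <boost/regex.hpp>
#include <cctype>
#include <cstdint>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <sstream>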

void writeHex(std::ostream& os, unsigned char c, bool uppercase) {
    os << std::setfill('0') << std::hex;
    if (uppercase) 
        os << std::uppercase;
    os << '%' << std::setw(2) << static_cast<int>(c);
}

void urlEncode(std::ostream& os, const std::string &toEncode, bool bEncodeForwardSlash) {
    auto is_safe = [=](uint8_t ch) {
        return std::isalnum(ch) ||
            (ch == '/' && !bEncodeForwardSlash) ||
            std::strchr("_-~.", ch);
    };

    for (char ch : toEncode) {
        if (is_safe(ch))
            os << ch;
        else
            writeHex(os, ch, true);
    }
}

std::string urlEncode(const std::string &toEncode, bool bEncodeForwardSlash) {
    std::ostringstream out;
    urlEncode(out, toEncode, bEncodeForwardSlash);
    return out.str();
}

std::string getEncodedUrl(std::string url) {

    boost::regex uri_regex(
        R"(^((?<scheme>[^:/?#]+):)?)"
        R"((?<authority>//(\?<host>[^/?#:]*)(:(?<port>\d+))?)?)"
        R"((?<path>[^?#]*))"
        R"((\?(?<query>([^#]*)))?)"
        R"((#(?<fragment>.*))?)");

    boost::match_results<std::string::iterator> what;
    //boost::smatch what;
    if (regex_search(url.begin(), url.end(), what, uri_regex)) {
        auto& full     = what[0];
        auto& query    = what["query"];
        auto& fragment = what["fragment"];

        std::ostringstream out;
        out << what["scheme"] << what["authority"];
        urlEncode(out, what["path"], false);

        if (query.matched)
            urlEncode(out << '?', query, true);

        if (fragment.matched)
            urlEncode(out << '#', fragment, true);

        url.replace(full.begin(), full.end(), out.str());
    }
    return url;
}

int main() {
    for (std::string url : { 
            "http://10.130.0.36/rbkt10/+",
            "//10.130.0.36/rbkt10/+",
            "//localhost:443/rbkt10/+",
            "https:/rbkt10/+",
            "https:/rbkt10/+?in_params='please do escape / (forward slash)'&more#also=in/fragment",
            "match inside text http://10.130.0.36/rbkt10/+ is a bit fuzzy",
          }) {
        std::cout << "Encoded URL: " << getEncodedUrl(url) << std::endl;
    }
}

Prints

Encoded URL: http//10.130.0.36/rbkt10/%2B
Encoded URL: //10.130.0.36/rbkt10/%2B
Encoded URL: //localhost%3A443/rbkt10/%2B
Encoded URL: https/rbkt10/%2B
Encoded URL: https/rbkt10/%2B?in_params%3D%27please%20do%20escape%20%2F%20%28forward%20slash%29%27%26more#also%3Din%2Ffragment
Encoded URL: match inside text http//10.130.0.36/rbkt10/%2B%20is%20a%20bit%20fuzzy

CAUTION

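Notice that the code STILL doesn't adhere to the specs: in the output above, the ':' after the scheme is dropped (http//...) and the ':' before the port is percent-encoded (localhost%3A443).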

This is why you use a library instead.

¹ This is why the + is left over from the input: it is not "repeated", it is just never replaced, because /+ as a regex means "one or more /".

² See https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Generic_syntax
