While doing url encoding, the std::regex_replace doesn't work properly for character “+”

庸人自扰 2021-01-21 10:20

Following is the code snippet. regex_replace doesn't work properly for the character "+". I shouldn't need special handling for individual characters; it should just work.

1 answer
  • 2021-01-21 10:47
    1. You're interpreting the original string as a regex. + is special in regex¹.

      You should simply use std::string::replace because you don't need regex replace functionality:

      boost::smatch what;
      if (regex_search(url.cbegin(), url.cend(), what, expression)) {
          boost::ssub_match query = what[6];
          url.replace(query.first, query.second, urlEncode(query.str(), false));
      }
      
    2. Complicated, scattered code like this:
      could simply be:

      std::string bktObjKey = what[6].str();
      
    3. Complicated loop

      for (std::string::size_type i = 0; i < toEncode.length(); ++i) {
           char ch = toEncode.at(i);
      

      Could just be

      for (char ch : toEncode) {
      
    4. charToHex creates a new 2-char string every time, using another stringstream each time, copying the result out of the stringstream, and so on. Instead, just write to the stringstream you already have and avoid all that inefficiency:

      void writeHex(std::ostream& os, unsigned char c, bool uppercase) {
          os << std::setfill('0') << std::hex;
          if (uppercase) 
              os << std::uppercase;
          os << std::setw(2) << static_cast<int>(c);
      }
      

      Note this also fixes the fact that you forgot to use bUppercase

    5. Look at <cctype> for help classifying characters.

    6. Use raw literals to write

      boost::regex expression("^(([^:/?#]+):)?(//([^/?#:]*)(:\\d+)?)?([^?#]*)((\\?[^#]*))?(#(.*))?");
      

      instead as:

      boost::regex expression(R"(^(([^:/?#]+):)?(//([^/?#:]*)(:\d+)?)?([^?#]*)((\?[^#]*))?(#(.*))?)");
      

      (no need to doubly escape \d and \?)

    7. Either drop all the redundant sub-groups

      boost::regex expression(R"(^([^:/?#]+:)?(//[^/?#:]*(:\d+)?)?[^?#]*(\?[^#]*)?(#.*)?)");
      

      OR make them maintainable and useful²:

      boost::regex uri_regex(
          R"(^((?<scheme>[^:/?#]+):)?)"
          R"((?<authority>//(?<host>[^/?#:]*)(:(?<port>\d+))?)?)"
          R"((?<path>[^?#]*))"
          R"((\?(?<query>([^#]*)))?)"
          R"((#(?<fragment>.*))?)");
      
    8. Now that you have access to logical components of the URI, apply it to know better when and where to encode:

          std::string escaped = 
             what["scheme"].str() + 
             what["authority"].str() +
             urlEncode(what["path"].str(), false);
      
          if (query.matched) {
              escaped += '?';
              escaped.append(urlEncode(query, true));
          }
      
          if (fragment.matched) {
              escaped += '#';
              escaped.append(urlEncode(fragment, true));
          }
      
    9. Make an overload of urlEncode that takes an existing ostream reference instead of always creating your own:

      std::ostringstream out;
      out << what["scheme"] << what["authority"];
      urlEncode(out, what["path"], false);
      
      if (query.matched)
          urlEncode(out << '?', query, true);
      
      if (fragment.matched)
          urlEncode(out << '#', fragment, true);
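
To make point 1 concrete, here is a minimal sketch using only the standard library (std::regex instead of Boost, purely to keep it dependency-free). The helper names replaceAsRegex and replaceLiteral are hypothetical and do not come from the original code. The first reproduces the pitfall of compiling the path as a pattern; the second replaces by position, so no character is special.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: reproduces the pitfall by compiling `needle` as a regex.
// In "/rbkt10/+" the trailing "+" becomes a quantifier on the preceding "/".
std::string replaceAsRegex(const std::string& text, const std::string& needle,
                           const std::string& replacement) {
    return std::regex_replace(text, std::regex(needle), replacement);
}

// Hypothetical helper: literal, position-based replacement of the first
// occurrence of `needle`; nothing in `needle` is interpreted.
std::string replaceLiteral(std::string text, const std::string& needle,
                           const std::string& replacement) {
    const auto pos = text.find(needle);
    if (pos != std::string::npos)
        text.replace(pos, needle.size(), replacement);
    return text;
}
```

With the text http://10.130.0.36/rbkt10/+, the needle /rbkt10/+ and the replacement /rbkt10/%2B, the regex version returns http://10.130.0.36/rbkt10/%2B+ (the pattern only consumes the slash, so the + survives), while the literal version returns http://10.130.0.36/rbkt10/%2B.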
      

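One detail to keep in mind when writing straight to a shared stream as in point 4: std::hex, std::uppercase and the fill character are sticky, so they stay set on the stream after the call (only std::setw resets itself). Below is a sketch of a variant that restores the previous state. The name writeHexScoped is hypothetical and not part of the original answer.

```cpp
#include <iomanip>
#include <ios>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical variant of writeHex: emits "%XX" and then restores the
// stream's formatting flags and fill character to what they were before.
void writeHexScoped(std::ostream& os, unsigned char c, bool uppercase) {
    const std::ios_base::fmtflags savedFlags = os.flags();
    const char savedFill = os.fill('0');  // fill(ch) returns the previous fill
    os << std::hex;
    if (uppercase)
        os << std::uppercase;
    os << '%' << std::setw(2) << static_cast<int>(c);
    os.flags(savedFlags);  // undo std::hex / std::uppercase
    os.fill(savedFill);    // undo the '0' fill
}
```

After a call to this variant, a later os << 255 on the same stream prints 255 in decimal again, which matters if the stream is reused for anything other than escapes.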
    Code After Review

    Live On Coliru

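    #include <boost/regex.hpp>
    #include <cctype>   // std::isalnum
    #include <cstdint>  // uint8_t
    #include <cstring>  // std::strchr
    #include <iomanip>
    #include <iostream>
    #include <sstream>  // std::ostringstream
    #include <string>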
    
    void writeHex(std::ostream& os, unsigned char c, bool uppercase) {
        os << std::setfill('0') << std::hex;
        if (uppercase) 
            os << std::uppercase;
        os << '%' << std::setw(2) << static_cast<int>(c);
    }
    
    void urlEncode(std::ostream& os, const std::string &toEncode, bool bEncodeForwardSlash) {
        auto is_safe = [=](uint8_t ch) {
            return std::isalnum(ch) ||
                (ch == '/' && !bEncodeForwardSlash) ||
                std::strchr("_-~.", ch);
        };
    
        for (char ch : toEncode) {
            if (is_safe(ch))
                os << ch;
            else
                writeHex(os, ch, true);
        }
    }
    
    std::string urlEncode(const std::string &toEncode, bool bEncodeForwardSlash) {
        std::ostringstream out;
        urlEncode(out, toEncode, bEncodeForwardSlash);
        return out.str();
    }
    
    std::string getEncodedUrl(std::string url) {
    
        boost::regex uri_regex(
            R"(^((?<scheme>[^:/?#]+):)?)"
            R"((?<authority>//(\?<host>[^/?#:]*)(:(?<port>\d+))?)?)"
            R"((?<path>[^?#]*))"
            R"((\?(?<query>([^#]*)))?)"
            R"((#(?<fragment>.*))?)");
    
        boost::match_results<std::string::iterator> what;
        //boost::smatch what;
        if (regex_search(url.begin(), url.end(), what, uri_regex)) {
            auto& full     = what[0];
            auto& query    = what["query"];
            auto& fragment = what["fragment"];
    
            std::ostringstream out;
            out << what["scheme"] << what["authority"];
            urlEncode(out, what["path"], false);
    
            if (query.matched)
                urlEncode(out << '?', query, true);
    
            if (fragment.matched)
                urlEncode(out << '#', fragment, true);
    
            url.replace(full.begin(), full.end(), out.str());
        }
        return url;
    }
    
    int main() {
        for (std::string url : { 
                "http://10.130.0.36/rbkt10/+",
                "//10.130.0.36/rbkt10/+",
                "//localhost:443/rbkt10/+",
                "https:/rbkt10/+",
                "https:/rbkt10/+?in_params='please do escape / (forward slash)'&more#also=in/fragment",
                "match inside text http://10.130.0.36/rbkt10/+ is a bit fuzzy",
              }) {
            std::cout << "Encoded URL: " << getEncodedUrl(url) << std::endl;
        }
    }
    

    Prints

    Encoded URL: http//10.130.0.36/rbkt10/%2B
    Encoded URL: //10.130.0.36/rbkt10/%2B
    Encoded URL: //localhost%3A443/rbkt10/%2B
    Encoded URL: https/rbkt10/%2B
    Encoded URL: https/rbkt10/%2B?in_params%3D%27please%20do%20escape%20%2F%20%28forward%20slash%29%27%26more#also%3Din%2Ffragment
    Encoded URL: match inside text http//10.130.0.36/rbkt10/%2B%20is%20a%20bit%20fuzzy
    

    CAUTION

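    Notice that the code STILL doesn't adhere to the specs. Two symptoms are visible in the output above. The ':' after the scheme is dropped (http//... instead of http://...) because it sits outside the named scheme group. The ':' before the port is percent-encoded (localhost%3A443) because (\?<host> in the regex of getEncodedUrl matches a literal '?' rather than opening a named group, which would be written (?<host>. As a result the authority group never matches, and the host and port are encoded as part of the path.

    This is why you use a library instead.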
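
As one small illustration of the gap, std::isalnum depends on the current C locale, and std::strchr("_-~.", ch) also returns a non-null pointer for ch == '\0' (it finds the terminator), so is_safe treats a NUL byte as safe. A locale-independent check for the "unreserved" set of RFC 3986 (ALPHA / DIGIT / "-" / "." / "_" / "~") could be sketched as follows. The name isUnreserved is hypothetical, and this covers only one of the rules in the spec.

```cpp
// Hypothetical helper: true only for RFC 3986 "unreserved" characters,
// using plain ASCII range checks so the result does not depend on the locale.
bool isUnreserved(unsigned char ch) {
    return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z') ||
           (ch >= '0' && ch <= '9') ||
           ch == '-' || ch == '.' || ch == '_' || ch == '~';
}
```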


    ¹ This is why the + is left over from the input: it's not "repeated", it's just not replaced, because /+ means "one or more /".

    ² See https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Generic_syntax
