The following code snippet uses regex_replace, which doesn't work properly for the character "+". I shouldn't need special handling for individual characters; it should just work.
You're interpreting the original string as a regex, and + is special in regex¹.
You should simply use std::string::replace, because you don't need regex replace functionality:
boost::smatch what;
if (regex_search(url.cbegin(), url.cend(), what, expression)) {
    boost::ssub_match query = what[6];
    url.replace(query.first, query.second, urlEncode(query.str(), false));
}
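To see why the pattern misbehaves, here is a minimal sketch. It uses std::regex from the standard library as a stand-in for boost::regex (an assumption, to keep the example dependency-free; both treat + as a quantifier). The helper names matchesAsRegex and replaceFirstLiteral are made up for illustration:

```cpp
#include <regex>
#include <string>

// Returns true if `pattern`, interpreted as a regex, matches all of `text`.
bool matchesAsRegex(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// Replaces the first literal occurrence of `from` in `s` with `to`.
// No regex is involved, so characters like '+' have no special meaning.
std::string replaceFirstLiteral(std::string s, const std::string& from,
                                const std::string& to) {
    if (from.empty())
        return s;
    const std::string::size_type pos = s.find(from);
    if (pos != std::string::npos)
        s.replace(pos, from.length(), to);
    return s;
}
```

With these, matchesAsRegex("/rbkt10/+", "/rbkt10/+") is false, because the pattern asks for one or more slashes at the end, whereas replaceFirstLiteral("/rbkt10/+", "+", "%2B") yields /rbkt10/%2B.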
Complicated, scattered code like this:
could simply be:
std::string bktObjKey = what[6].str();
Complicated loop
for (std::string::size_type i = 0; i < toEncode.length(); ++i) {
char ch = toEncode.at(i);
Could just be
for (char ch : toEncode) {
charToHex creates a new 2-char string every time, using another stringstream every time, copying the result out of the stringstream, and so on. Instead, just write to the stringstream you already have and avoid all that inefficiency:
void writeHex(std::ostream& os, unsigned char c, bool uppercase) {
    os << std::setfill('0') << std::hex;
    if (uppercase)
        os << std::uppercase;
    os << std::setw(2) << static_cast<int>(c);
}
Note this also fixes the fact that you forgot to use bUppercase.
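One caveat: std::hex, std::uppercase and the fill character stay set on the stream after writeHex returns. If the stream is shared with other output, a variant that restores the formatting state may be preferable. A sketch, with the hypothetical names writePercentHex and percentHex (this variant also emits the '%' prefix itself):

```cpp
#include <iomanip>
#include <ios>
#include <ostream>
#include <sstream>
#include <string>

// Writes `c` as "%XX" and puts the stream's flags and fill character back
// the way they were, so later output is not accidentally printed in hex.
void writePercentHex(std::ostream& os, unsigned char c, bool uppercase) {
    const std::ios_base::fmtflags savedFlags = os.flags();
    const char savedFill = os.fill('0');  // fill() returns the previous fill

    os << std::hex;
    if (uppercase)
        os << std::uppercase;
    else
        os << std::nouppercase;
    os << '%' << std::setw(2) << static_cast<int>(c);

    os.flags(savedFlags);
    os.fill(savedFill);
}

// Convenience wrapper for callers that just want the string.
std::string percentHex(unsigned char c, bool uppercase) {
    std::ostringstream out;
    writePercentHex(out, c, uppercase);
    return out.str();
}
```

For example, percentHex('+', true) gives %2B and percentHex('+', false) gives %2b.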
Look at <cctype> for help classifying characters.
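For instance, a predicate for the characters that RFC 3986 calls unreserved (letters, digits, -, ., _ and ~) could be sketched as follows. The name isUnreserved is hypothetical, and std::isalnum is locale-dependent, so this assumes the default "C" locale:

```cpp
#include <cctype>

// True for characters that never need percent-encoding in a URI
// (RFC 3986 "unreserved": ALPHA / DIGIT / "-" / "." / "_" / "~").
bool isUnreserved(char ch) {
    // <cctype> functions require a value representable as unsigned char;
    // passing a negative plain char is undefined behaviour.
    const unsigned char uc = static_cast<unsigned char>(ch);
    return std::isalnum(uc) != 0 ||
           ch == '-' || ch == '.' || ch == '_' || ch == '~';
}
```

The cast to unsigned char is the part that is easy to forget when calling the <cctype> functions on bytes above 127.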
Use raw string literals. Instead of
boost::regex expression("^(([^:/?#]+):)?(//([^/?#:]*)(:\\d+)?)?([^?#]*)((\\?[^#]*))?(#(.*))?");
write:
boost::regex expression(R"(^(([^:/?#]+):)?(//([^/?#:]*)(:\d+)?)?([^?#]*)((\?[^#]*))?(#(.*))?)");
(no need to doubly escape \d and \?)
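A quick way to convince yourself that both spellings produce the same pattern text is to compare the strings directly. A small sketch using just the port and query parts of the expression; the constant names are made up:

```cpp
#include <string>

// Ordinary literal: every backslash meant for the regex engine is doubled.
const std::string kEscapedPattern = "(:\\d+)?(\\?[^#]*)?";

// Raw literal: everything between R"( and )" is taken verbatim.
const std::string kRawPattern = R"((:\d+)?(\?[^#]*)?)";

// True when the two literals spell the same characters.
bool samePattern() {
    return kEscapedPattern == kRawPattern;
}
```

samePattern() returns true; the raw form is simply easier to read and to compare with regexes copied from a spec.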
Either drop all the redundant sub-groups:
boost::regex expression(R"(^([^:/?#]+:)?(//[^/?#:]*(:\d+)?)?[^?#]*(\?[^#]*)?(#.*)?)");
OR make them maintainable and useful²:
boost::regex uri_regex(
    R"(^((?<scheme>[^:/?#]+):)?)"
    R"((?<authority>//(?<host>[^/?#:]*)(:(?<port>\d+))?)?)"
    R"((?<path>[^?#]*))"
    R"((\?(?<query>([^#]*)))?)"
    R"((#(?<fragment>.*))?)");
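If you are limited to std::regex, which has no named groups, named constants for the group indices give a similar effect. A sketch under that assumption, using the generic regex from RFC 3986 Appendix B; UriParts, UriGroup and splitUri are illustrative names, not part of any library:

```cpp
#include <regex>
#include <string>

// Logical components of a URI. The flags distinguish "absent" from "empty",
// e.g. "http://x/?" has a query that is present but empty.
struct UriParts {
    bool matched = false;
    std::string scheme, authority, path, query, fragment;
    bool hasQuery = false;
    bool hasFragment = false;
};

// Capture-group indices of the regex below, named so call sites stay readable.
enum UriGroup {
    Scheme = 2, Authority = 4, Path = 5,
    QueryWithMark = 6, Query = 7,
    FragmentWithHash = 8, Fragment = 9
};

// Splits `uri` into its components using the RFC 3986 Appendix B regex.
UriParts splitUri(const std::string& uri) {
    static const std::regex uriRegex(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");

    UriParts parts;
    std::smatch what;
    if (std::regex_match(uri, what, uriRegex)) {
        parts.matched     = true;
        parts.scheme      = what[Scheme].str();
        parts.authority   = what[Authority].str();
        parts.path        = what[Path].str();
        parts.query       = what[Query].str();
        parts.fragment    = what[Fragment].str();
        parts.hasQuery    = what[QueryWithMark].matched;
        parts.hasFragment = what[FragmentWithHash].matched;
    }
    return parts;
}
```

For "https:/rbkt10/+?a=b#frag" this yields scheme https, an empty authority, path /rbkt10/+, query a=b and fragment frag. Note that, unlike the Boost expression above, the authority group here excludes the leading // and does not split host from port.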
Now that you have access to the logical components of the URI, use them to decide when and where to encode:
std::string escaped =
    what["scheme"].str() +
    what["authority"].str() +
    urlEncode(what["path"].str(), false);
if (query.matched) {
    escaped += '?';
    escaped.append(urlEncode(query, true));
}
if (fragment.matched) {
    escaped += '#';
    escaped.append(urlEncode(fragment, true));
}
Make an overload of urlEncode that takes an existing ostream reference instead of always creating your own:
std::ostringstream out;
out << what["scheme"] << what["authority"];
urlEncode(out, what["path"], false);
if (query.matched)
    urlEncode(out << '?', query, true);
if (fragment.matched)
    urlEncode(out << '#', fragment, true);
Live On Coliru
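#include <boost/regex.hpp>
#include <cctype>   // std::isalnum
#include <cstdint>  // uint8_t
#include <cstring>  // std::strchr
#include <iomanip>
#include <iostream>
#include <sstream>  // std::ostringstream
#include <string>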
// Writes c to the stream as a percent-encoded byte ("%XX").
void writeHex(std::ostream& os, unsigned char c, bool uppercase) {
    os << std::setfill('0') << std::hex;
    if (uppercase)
        os << std::uppercase;
    os << '%' << std::setw(2) << static_cast<int>(c);
}

void urlEncode(std::ostream& os, const std::string &toEncode, bool bEncodeForwardSlash) {
    // Alphanumerics and "_-~." pass through; '/' only when the caller allows it.
    auto is_safe = [=](unsigned char ch) {
        return std::isalnum(ch) ||
            (ch == '/' && !bEncodeForwardSlash) ||
            (ch != '\0' && std::strchr("_-~.", ch)); // strchr would also "find" NUL
    };
    for (char ch : toEncode) {
        if (is_safe(ch))
            os << ch;
        else
            writeHex(os, ch, true);
    }
}

std::string urlEncode(const std::string &toEncode, bool bEncodeForwardSlash) {
    std::ostringstream out;
    urlEncode(out, toEncode, bEncodeForwardSlash);
    return out.str();
}

std::string getEncodedUrl(std::string url) {
    boost::regex uri_regex(
        R"(^((?<scheme>[^:/?#]+):)?)"
        R"((?<authority>//(?<host>[^/?#:]*)(:(?<port>\d+))?)?)"
        R"((?<path>[^?#]*))"
        R"((\?(?<query>([^#]*)))?)"
        R"((#(?<fragment>.*))?)");
    boost::match_results<std::string::iterator> what;
    //boost::smatch what;
    if (regex_search(url.begin(), url.end(), what, uri_regex)) {
        auto& full = what[0];
        auto& query = what["query"];
        auto& fragment = what["fragment"];
        std::ostringstream out;
        out << what["scheme"] << what["authority"];
        urlEncode(out, what["path"], false);
        if (query.matched)
            urlEncode(out << '?', query, true);
        if (fragment.matched)
            urlEncode(out << '#', fragment, true);
        // Replace the matched range with the re-assembled, encoded URL.
        url.replace(full.begin(), full.end(), out.str());
    }
    return url;
}

int main() {
    for (std::string url : {
            "http://10.130.0.36/rbkt10/+",
            "//10.130.0.36/rbkt10/+",
            "//localhost:443/rbkt10/+",
            "https:/rbkt10/+",
            "https:/rbkt10/+?in_params='please do escape / (forward slash)'&more#also=in/fragment",
            "match inside text http://10.130.0.36/rbkt10/+ is a bit fuzzy",
        }) {
        std::cout << "Encoded URL: " << getEncodedUrl(url) << std::endl;
    }
}
Prints
Encoded URL: http//10.130.0.36/rbkt10/%2B
Encoded URL: //10.130.0.36/rbkt10/%2B
Encoded URL: //localhost:443/rbkt10/%2B
Encoded URL: https/rbkt10/%2B
Encoded URL: https/rbkt10/%2B?in_params%3D%27please%20do%20escape%20%2F%20%28forward%20slash%29%27%26more#also%3Din%2Ffragment
Encoded URL: match inside text http//10.130.0.36/rbkt10/%2B%20is%20a%20bit%20fuzzy
Notice that the code STILL doesn't adhere to the specs:
This is why you use a library instead.
¹ This causes the + from the input to be left in place. It's not "repeated", it's just not replaced, because /+ means "1 or more /".
² See https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Generic_syntax