We have HTML escape functions that work in an idiomatic C++ manner; how do we unescape?

Question

I've found a great way to HTML-encode/escape special characters. Now I wonder how to unescape HTML-encoded text in C++.

The code is:

#include <algorithm>
#include <cstddef>
#include <iterator>

namespace xml {

    // Helper for null-terminated ASCII strings (no end of string iterator).
    template<typename InIter, typename OutIter>
    OutIter copy_asciiz ( InIter begin, OutIter out )
    {
        while ( *begin != '\0' ) {
            *out++ = *begin++;
        }
        return (out);
    }

    // XML escaping in its general form.  Note that 'out' is expected
    // to be an "infinite" sequence.
    template<typename InIter, typename OutIter>
    OutIter escape ( InIter begin, InIter end, OutIter out )
    {
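        static const char bad[] = "&<>";
        static const char* rep[] = {"&amp;", "&lt;", "&gt;"};
        // Exclude the terminating '\0' of "bad" so that a NUL character in
        // the input is never matched (rep has no entry for it).
        static const std::size_t n = sizeof(bad)/sizeof(bad[0]) - 1;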

        for ( ; (begin != end); ++begin )
        {
            // Find which replacement to use.
            const std::size_t i =
                std::distance(bad, std::find(bad, bad+n, *begin));

            // No need for escaping.
            if ( i == n ) {
                *out++ = *begin;
            }
            // Escape the character.
            else {
                out = copy_asciiz(rep[i], out);
            }
        }
        return (out);
    }

}

and

#include <istream>
#include <iterator>
#include <ostream>
#include <string>

namespace xml {

    // Get escaped version of "content".
    std::string escape ( const std::string& content )
    {
        std::string result;
        result.reserve(content.size());
        escape(content.begin(), content.end(), std::back_inserter(result));
        return (result);
    }

    // Escape data on the fly, using "constant" memory.
    void escape ( std::istream& in, std::ostream& out )
    {
        escape(std::istreambuf_iterator<char>(in),
            std::istreambuf_iterator<char>(),
            std::ostreambuf_iterator<char>(out));
    }

}

It's a great piece of code, and it works for:

#include <iostream>

int main ( int, char ** )
{
    std::cout << xml::escape("<foo>bar & qux</foo>") << std::endl;
}

So I wonder: how can HTML unescaping be done in the same manner?

Answer 1:

Take a look at how I've solved a similar problem for '&#(\d+);' strings, i.e., numeric character references (NCRs), using boost::spirit, boost::regex_token_iterator, Flex, and Perl.

In your case the regex is &(amp|lt|gt); if you don't need to convert all HTML entities.
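
As a rough illustration of that regex, here is a minimal sketch that uses std::regex from the C++11 standard library rather than the Boost tools mentioned above. The function name xml::unescape is hypothetical (it is not part of the question's code), and the sketch only handles the three entities that the question's escape() produces. Numeric character references and other named entities are left untouched.

```cpp
#include <regex>
#include <string>

namespace xml {

    // Hypothetical counterpart to escape(): replaces "&amp;", "&lt;" and
    // "&gt;" with the characters they stand for. Anything else is copied
    // through unchanged.
    std::string unescape ( const std::string& content )
    {
        static const std::regex entity("&(amp|lt|gt);");

        std::string result;
        result.reserve(content.size());

        std::string::const_iterator last = content.begin();
        const std::sregex_iterator end;
        for ( std::sregex_iterator it(content.begin(), content.end(), entity);
              it != end; ++it )
        {
            const std::smatch& m = *it;

            // Copy the text between the previous match and this one.
            result.append(last, m[0].first);

            // Translate the entity name captured by group 1.
            const std::string name = m[1].str();
            if ( name == "amp" ) {
                result += '&';
            }
            else if ( name == "lt" ) {
                result += '<';
            }
            else {
                result += '>';
            }
            last = m[0].second;
        }

        // Copy whatever follows the last match.
        result.append(last, content.end());
        return (result);
    }

}
```

Calling xml::unescape("&lt;foo&gt;bar &amp; qux&lt;/foo&gt;") should give back the string passed to escape() in the question. Because the input is scanned in a single pass, a sequence such as "&amp;lt;" becomes "&lt;" and is not unescaped a second time, which keeps unescape an inverse of escape.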

Source: https://stackoverflow.com/questions/7976445/so-weve-got-our-html-escape-functions-that-really-work-in-a-c-manner-how-to

Tags

c++

stl

escaping

html-encode