Bash script to convert from HTML entities to characters

悲哀的现实 2020-12-02 09:59

I'm looking for a way to turn this:

hello &lt; world

to this:

hello < world

I could use sed, but

10 answers
  • 2020-12-02 10:12

    Using xmlstarlet:

    echo 'hello &lt; world' | xmlstarlet unesc
    
  • 2020-12-02 10:13

    Try recode (archived page; GitHub mirror; Debian page):

    $ echo '&lt;' | recode html..ascii
    <
    

    Install on Debian, Ubuntu, and other apt-based systems:

    $ sudo apt-get install recode
    

    Install on Mac OS using:

    $ brew install recode
    
  • 2020-12-02 10:13

    A Python 3.4+ version (html.unescape was added in Python 3.4):

    cat foo.html | python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'
    
  • 2020-12-02 10:16

    I like the Perl answer given in https://stackoverflow.com/a/13161719/1506477.

    cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'
    

    However, it produced a different number of lines than the input on plain text files (and I don't know Perl well enough to debug it).

    I like the python answer given in https://stackoverflow.com/a/42672936/1506477 --

    python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'
    

    but it builds a list [ ... for l in sys.stdin] in memory, with one element per input line, which is wasteful for large files.
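
    A minimal sketch of one way to avoid that list, written as a small function rather than a one-liner: a plain for loop writes each line as soon as it is decoded. The name unescape_stream is only illustrative and does not come from either linked answer.

```python
import html


def unescape_stream(lines, out):
    """Decode HTML entities line by line, writing each line immediately."""
    for line in lines:
        # Only the current line is held in memory; no list is accumulated.
        out.write(html.unescape(line))


# To use it as a filter, pass the standard streams:
#   import sys
#   unescape_stream(sys.stdin, sys.stdout)
```

    Iterating over sys.stdin reads the input lazily, so memory use stays roughly constant regardless of file size.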

    Here is another easy Pythonic way that avoids buffering in memory, using awkg:

    $ echo 'hello &lt; &#x3a; &quot; world' | \
       awkg -b 'from html import unescape' 'print(unescape(R0))'
    hello < : " world
    

    awkg is a Python-based, awk-like line processor. You can install it with pip (https://pypi.org/project/awkg/):

    pip install awkg
    

    -b is the equivalent of awk's BEGIN{} block, which runs once at the beginning.
    Here it is used only for from html import unescape.

    Each line record is available in the R0 variable, which is passed to print(unescape(R0)).

    Disclaimer:
    I am the maintainer of awkg

  • 2020-12-02 10:19

    I have created a sed script based on the list of entities, so it should handle most of them.

    sed -f htmlentities.sed < file.html
    
  • 2020-12-02 10:20

    Unescaping all HTML entities with sed substitutions alone would require an impractically long list of commands, because every Unicode code point has at least two corresponding HTML escapes (one decimal and one hexadecimal).

    But it can be done using only sed, grep, the Bourne shell and basic UNIX utilities (the GNU coreutils or equivalent):

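    #!/bin/sh
    
    # Convert decimal escapes (&#D;) to hexadecimal ones (&#xH;).
    # Reads the file given as $1, or standard input if $1 is not a readable file.
    htmlEscDec2Hex() {
        file=$1
        [ ! -r "$file" ] && file=$(mktemp) && cat >"$file"
    
        # sed turns the text into a printf format with one %x per decimal escape;
        # grep extracts the decimal values (leading zeros are stripped so that
        # printf does not read them as octal) and passes them as the arguments.
        printf -- \
            "$(sed 's/\\/\\\\/g;s/%/%%/g;s/&#[0-9]\{1,10\};/\&#x%x;/g' "$file")\n" \
            $(grep -o '&#[0-9]\{1,10\};' "$file" | tr -d '&#;' | sed 's/^0*\([0-9]\)/\1/')
    
        [ x"$1" != x"$file" ] && rm -f -- "$file"
    }
    
    # Convert hexadecimal escapes (&#xH;) to printf's \uHHHH or \UHHHHHHHH
    # sequences and let printf print the corresponding characters.
    htmlHexUnescape() {
        printf -- "$(
            sed 's/\\/\\\\/g;s/%/%%/g
                ;s/&#x\([0-9a-fA-F]\{1,8\}\);/\&#x0000000\1;/g
                ;s/&#x0*\([0-9a-fA-F]\{4\}\);/\\u\1/g
                ;s/&#x0*\([0-9a-fA-F]\{8\}\);/\\U\1/g' )\n"
    }
    
    htmlEscDec2Hex "$1" | htmlHexUnescape \
        | sed -f named_entities.sed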
    

    Note, however, that a printf implementation supporting \uHHHH and \UHHHHHHHH sequences is required, such as the GNU utility’s. To test, check for example that printf "\u00A7\n" prints §. To call the utility instead of the shell built-in, replace the occurrences of printf with env printf.

    This script uses an additional file, named_entities.sed, in order to support the named entities. It can be generated from the specification using the following HTML page:

    <!DOCTYPE html>
    <html>
    <head><meta charset="utf-8" /></head>
    <body>
    <p id="sed-script"></p>
    <script type="text/javascript">
      const referenceURL = 'https://html.spec.whatwg.org/entities.json';
    
      function writeln(element, text) {
        element.appendChild( document.createTextNode(text) );
        element.appendChild( document.createElement("br") );
      }
    
      (async function(container) {
        const json = await (await fetch(referenceURL)).json();
        container.innerHTML = "";
        writeln(container, "#!/usr/bin/sed -f");
        const addLast = [];  // legacy names without a trailing ";" go last
        for (const name in json) {
          // Escape the characters that are special in a sed replacement:
          // "\", the "/" delimiter, and "&" (which would otherwise stand for
          // the whole match and leave &amp; unchanged).
          const characters = json[name].characters
            .replace("\\", "\\\\")
            .replace("/", "\\/")
            .replace("&", "\\&");
          const command = "s/" + name + "/" + characters + "/g";
          if ( name.endsWith(";") ) {
            writeln(container, command);
          } else {
            addLast.push(command);
          }
        }
        for (const command of addLast) { writeln(container, command); }
      })( document.getElementById("sed-script") );
    </script>
    </body></html>
    

    Simply open it in a modern browser, and save the resulting page as text as named_entities.sed. This sed script can also be used alone if only named entities are required; in this case it is convenient to give it executable permission so that it can be called directly.

    Now the above shell script can be used as ./html_unescape.sh foo.html, or inside a pipeline reading from standard input.

    For example, if the data needs to be processed in chunks (which might be the case if printf is not a shell built-in and the data is large), it could be used as:

    nLines=20
    seq 1 $nLines $(grep -c $ "$inputFile") | while read n
        do sed -n "$n,$((n+nLines-1))p" "$inputFile" | ./html_unescape.sh
    done
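
    For comparison, the same chunking idea can be sketched in Python, assuming html.unescape is an acceptable stand-in for the shell script. The function name unescape_in_chunks and the sample lines are hypothetical.

```python
import html
from itertools import islice


def unescape_in_chunks(lines, n_lines=20):
    """Yield decoded text for successive groups of at most n_lines lines."""
    iterator = iter(lines)
    while True:
        # islice pulls the next n_lines lines without reading the rest.
        chunk = list(islice(iterator, n_lines))
        if not chunk:
            return
        yield html.unescape("".join(chunk))


sample = ["a &lt; b\n", "c &gt; d\n", "e &amp; f\n"]
for piece in unescape_in_chunks(sample, n_lines=2):
    print(repr(piece))
```

    As in the shell loop, each chunk is decoded independently, so an escape split across two chunks would not be recognized; splitting on line boundaries makes that unlikely.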
    

    Explanation of the script follows.

    There are three types of escape sequences that need to be supported:

    1. &#D; where D is the decimal value of the escaped character’s Unicode code point;

    2. &#xH; where H is the hexadecimal value of the escaped character’s Unicode code point;

    3. &N; where N is the name of one of the named entities for the escaped character.
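
    As a quick illustration of the three forms, here is the same character (U+00A7, the section sign also used in the printf test above) written each way and decoded with Python's html.unescape. This only demonstrates the notation; it is not part of the shell script.

```python
import html

decimal_form = "&#167;"  # &#D;  with D = 167
hex_form = "&#xA7;"      # &#xH; with H = A7
named_form = "&sect;"    # &N;   with N = sect

decoded = [html.unescape(s) for s in (decimal_form, hex_form, named_form)]

# All three forms decode to the same single character, U+00A7.
print(all(ch == "\u00a7" for ch in decoded))
```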

    The &N; escapes are supported by the generated named_entities.sed script which simply performs the list of substitutions.
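
    The table-lookup idea behind the named escapes can be sketched in Python with the standard html.entities.html5 dictionary, which maps names such as "lt;" to their characters. Note that this sketch does a single-pass dictionary lookup instead of running a sequence of s/// commands, and it only handles semicolon-terminated names; unescape_named is a hypothetical helper name.

```python
import re
from html.entities import html5  # maps e.g. "lt;" to "<"

# Candidate escapes such as "&lt;"; names without a trailing ";" are ignored here.
_NAMED = re.compile(r"&([A-Za-z][A-Za-z0-9]*;)")


def unescape_named(text):
    """Replace each known &N; escape with its character(s); leave the rest alone."""
    return _NAMED.sub(lambda m: html5.get(m.group(1), m.group(0)), text)


print(unescape_named("hello &lt; world &amp; &bogus;"))
```

    Because each escape is looked up once, input such as &amp;lt; becomes &lt; and is not decoded a second time.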

    The central piece of this method for supporting the code point escapes is the printf utility, which is able to:

    1. print numbers in hexadecimal format, and

    2. print characters from their code point’s hexadecimal value (using the escapes \uHHHH or \UHHHHHHHH).

    The first feature, with some help from sed and grep, is used to reduce the &#D; escapes into &#xH; escapes. The shell function htmlEscDec2Hex does that.
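
    The decimal-to-hexadecimal step can be illustrated in Python as follows. This is a sketch of the same transformation, not a translation of htmlEscDec2Hex; the name dec_to_hex_escapes is hypothetical.

```python
import re

# Same pattern as in the shell script: "&#", 1 to 10 decimal digits, ";".
_DECIMAL = re.compile(r"&#([0-9]{1,10});")


def dec_to_hex_escapes(text):
    """Rewrite &#D; escapes as &#xH; escapes, leaving everything else untouched."""
    # int() parses the digits as base 10, so leading zeros are harmless here.
    return _DECIMAL.sub(lambda m: "&#x%x;" % int(m.group(1)), text)


print(dec_to_hex_escapes("&#167; and &#060;"))
```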

    The function htmlHexUnescape uses sed to transform the &#xH; escapes into printf’s \u/\U escapes, then uses the second feature to print the unescaped characters.
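
    The second step has a direct Python counterpart: chr() plays the role of printf's \u and \U escapes. Again this is only a sketch under that assumption, and hex_unescape is a hypothetical name.

```python
import re

# Same pattern as in the shell script: "&#x", 1 to 8 hexadecimal digits, ";".
_HEX = re.compile(r"&#x([0-9a-fA-F]{1,8});")


def hex_unescape(text):
    """Replace &#xH; escapes with the character at code point H."""
    def decode(match):
        code_point = int(match.group(1), 16)
        # chr() only accepts values up to 0x10FFFF; leave anything larger as-is.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)
    return _HEX.sub(decode, text)


print(hex_unescape("hello &#x3c; &#x3A; world"))
```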
