Bash script to convert from HTML entities to characters

悲哀的现实 2020-12-02 09:59

I'm looking for a way to turn this:

hello &lt; world

to this:

hello < world

I could use sed, but

10 answers
  • 2020-12-02 10:23

    With Xidel:

    echo 'hello &lt; &#x3a; &quot; world' | xidel -s - -e 'parse-html($raw)'
    hello < : " world
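
    Where Xidel is not installed, numeric references such as &#x3a; can also be decoded in bash itself. The following is only a minimal sketch: decode_numeric_entities is a hypothetical helper name, it assumes bash 4.2 or later for the \U escape of printf, it does no range validation, and output for non-ASCII code points depends on the locale.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode numeric character references (&#33; or &#x3a;).
# Assumes bash 4.2+ for printf's \U escape; non-ASCII output depends on locale.
decode_numeric_entities() {
    local rest=$1 out='' match num char
    local re='&#([xX][0-9A-Fa-f]+|[0-9]+);'
    while [[ $rest =~ $re ]]; do
        match=${BASH_REMATCH[0]}
        num=${BASH_REMATCH[1]}
        out+=${rest%%"$match"*}      # keep the text before the reference
        rest=${rest#*"$match"}       # continue after the reference
        case $num in
            [xX]*) num=$((16#${num:1})) ;;   # hexadecimal form
            *)     num=$((10#$num)) ;;       # decimal form (force base 10)
        esac
        # Build a \UXXXXXXXX escape and let printf turn it into the character.
        printf -v char "\\U$(printf '%08x' "$num")"
        out+=$char
    done
    printf '%s\n' "$out$rest"
}

decode_numeric_entities 'hello &#33; &#x3a; world'   # prints: hello ! : world
```

    Named entities such as &lt; are left untouched by this sketch, so it would need to be combined with one of the other approaches in this thread.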
    
  • 2020-12-02 10:33

    With perl:

    cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'
    

    With php from the command line:

    cat foo.html | php -r 'while(($line=fgets(STDIN)) !== FALSE) echo html_entity_decode($line, ENT_QUOTES|ENT_HTML401);'
    
  • 2020-12-02 10:33

    An alternative is to pipe the text through a text-mode web browser, such as w3m:

    echo '&#33;' | w3m -dump -T text/html

    This worked great for me in Cygwin, where downloading and installing distributions is difficult.

    This answer was found here

  • 2020-12-02 10:33

    This answer is based on "Short way to escape HTML in Bash?". It works fine for grabbing answers (using wget) on Stack Exchange and converting HTML entities to regular ASCII characters:

    sed 's/&nbsp;/ /g; s/&lt;/</g; s/&gt;/>/g; s/&quot;/"/g; s/&#39;/'"'"'/g; s/&ldquo;/"/g; s/&rdquo;/"/g; s/&amp;/\&/g'
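
    The order of the substitutions matters in a chain like this one. If &amp; is decoded before the other entities, text that was escaped twice, such as &amp;lt;, is decoded twice and ends up as < instead of &lt;. A minimal sketch of the difference, using two hypothetical helper functions (decode_amp_first and decode_amp_last are illustrative names, not part of the original answer):

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Decodes &amp; first: "&amp;lt;" becomes "&lt;" and is then decoded again.
decode_amp_first() {
    sed 's/&amp;/\&/g; s/&lt;/</g; s/&gt;/>/g'
}

# Decodes &amp; last: each entity is decoded exactly once.
decode_amp_last() {
    sed 's/&lt;/</g; s/&gt;/>/g; s/&amp;/\&/g'
}

printf '%s\n' 'x &amp;lt; y' | decode_amp_first   # prints: x < y
printf '%s\n' 'x &amp;lt; y' | decode_amp_last    # prints: x &lt; y
```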
    

    Edit 1: April 7, 2017 - Added left double quote and right double quote conversion. This is part of a bash script that web-scrapes SE answers and compares them to local code files here: Ask Ubuntu - Code Version Control between local files and Ask Ubuntu answers


    Edit June 26, 2017

    Using sed was taking ~3 seconds to convert HTML to ASCII on a 1K-line file from Ask Ubuntu / Stack Exchange, so I switched to Bash's built-in search and replace, which brought the response time down to ~1 second.

    Here's the function:

    #-------------------------------------------------------------------------------
    LineOut=""      # Global: holds the converted line after each call
    HTMLtoText () {
        LineOut=$1  # Parm 1 = input line
        # Replaces the external command Line=$(sed 's/&amp;/\&/g; ...' <<< "$Line")
        # with faster builtin search and replace.
        # Replacements are quoted so that bash 5.2+ (patsub_replacement) keeps a
        # literal &. &amp; is decoded last so &amp;lt; is not decoded twice.
        LineOut=${LineOut//&nbsp;/' '}
        LineOut=${LineOut//&lt;/'<'}
        LineOut=${LineOut//&gt;/'>'}
        LineOut=${LineOut//&quot;/'"'}
        LineOut=${LineOut//&#39;/"'"}
        LineOut=${LineOut//&ldquo;/'"'} # TODO: ASCII/ISO for opening quote
        LineOut=${LineOut//&rdquo;/'"'} # TODO: ASCII/ISO for closing quote
        LineOut=${LineOut//&amp;/'&'}
    } # HTMLtoText ()
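
    The speed-up presumably comes from not starting an external process for every line. A minimal sketch of how a builtin-only decoder can be applied to a whole stream follows; decode_basic_entities, decode_stream and the DECODED variable are hypothetical names chosen for this example, and only a handful of entities are covered.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the builtin search-and-replace approach.
# The result is returned in the global DECODED to avoid a subshell per line.
decode_basic_entities() {
    local s=$1
    s=${s//&lt;/'<'}
    s=${s//&gt;/'>'}
    s=${s//&quot;/'"'}
    s=${s//&#39;/"'"}
    s=${s//&amp;/'&'}     # decoded last; quoted so bash 5.2+ keeps it literal
    DECODED=$s
}

# Decode stdin line by line using only shell builtins.
decode_stream() {
    local line
    while IFS= read -r line || [[ -n $line ]]; do
        decode_basic_entities "$line"
        printf '%s\n' "$DECODED"
    done
}

printf '%s\n' 'hello &lt; world' 'a &amp; b' | decode_stream
```

    The `|| [[ -n $line ]]` part keeps a final line that lacks a trailing newline from being dropped.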
    