Removing all HTML tags from a webpage

温柔的废话 2021-02-10 18:13

I am doing some BASH shell scripting with curl. If my curl command returns any text, I know I have an error. This text returned by curl is usually in HTML, and I want to strip out all of the HTML tags so that only the plain text of the message is left.
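A minimal sketch of the check being described, for context (the URL and variable names are placeholders, not from the original post):

    # capture whatever the server sends back; an empty body means success in this setup
    body=$(curl -s "https://example.com/upload")   # placeholder URL
    if [ -n "$body" ]; then
        # the body is an HTML error page, so the tags should be stripped
        # before the message is logged or displayed
        echo "request failed: $body" >&2
    fi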

4 Answers
  • 2021-02-10 18:48

    If you want to remove all HTML tags and also all script tags (and their contents), you can use the following:

    # 1) remove <script> blocks and their contents  2) strip the remaining tags  3) drop the blank lines left behind
    sed 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g' "$file" -i &&
    sed '/</ {:k s/<[^>]*>//g; /</ {N; bk}}' "$file" -i &&
    sed -r '/^\s*$/d' "$file" -i
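    For illustration, here is the effect on a made-up sample file (sample.html is a placeholder name; GNU sed is assumed, since the options follow the filename):

    printf '<p>keep me</p>\n<script>\nvar x = 1;\n</script>\n<b>and me</b>\n' > sample.html
    file=sample.html
    sed 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g' "$file" -i &&
    sed '/</ {:k s/<[^>]*>//g; /</ {N; bk}}' "$file" -i &&
    sed -r '/^\s*$/d' "$file" -i
    cat "$file"    # -> keep me
                   #    and me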
    
  • 2021-02-10 18:52

    sed doesn't support non-greedy matching.

    Try:

    sed 's/<[^>]*>//g'
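    To see the difference on a sample line (illustrative input, not from the answer):

    # greedy .* runs from the first < to the last >, taking the text with it
    echo '<b>bold</b> and <i>italic</i>' | sed 's/<.*>//g'       # -> (empty line)
    # [^>]* cannot cross a closing bracket, so only the tags are removed
    echo '<b>bold</b> and <i>italic</i>' | sed 's/<[^>]*>//g'    # -> bold and italic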
    
  • 2021-02-10 19:07

    Code for GNU sed:

    sed '/</ {:k s/<[^>]*>//g; /</ {N; bk}}' file
    

    This might fail, though; you would be better off using an HTML-parsing tool.
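    For what it's worth, here is how the N/branch loop copes with a tag that is split across lines (illustrative input, not from the answer):

    printf '<a\nhref="#">link text</a>\n' | sed '/</ {:k s/<[^>]*>//g; /</ {N; bk}}'
    # -> link text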

  • 2021-02-10 19:11

    Maybe a parser-based perl solution?

    perl -0777 -MHTML::Strip -nlE 'say HTML::Strip->new->parse($_)' file.html
    

    You must install the HTML::Strip module first, e.g. with the cpan HTML::Strip command.

    Alternatively,

    you can use a standard OS X utility called textutil (see its man page):

    textutil -convert txt file.html
    

    This will produce file.txt with the HTML tags stripped, or:

    textutil -convert txt -stdin -stdout < file.html | some_command
    

    Another alternative

    Some systems have the lynx text-only browser installed. You can use:

    lynx -dump file.html #or
    lynx -stdin -dump < file.html
    

    But in your case, you may only be able to rely on pure sed or awk solutions... IMHO.

    But if you have perl (and just don't have the HTML::Strip module), the following is still better than sed:

    perl -0777 -pe 's/<.*?>//sg'
    

    because it will also remove multi-line tags like the following (which are common):

    <a
     href="#"
     class="some"
    >link text</a>
    
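    For example (an illustrative run on the snippet above):

    printf '<a\n href="#"\n class="some"\n>link text</a>\n' | perl -0777 -pe 's/<.*?>//sg'
    # -> link text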