Removing all HTML tags from a webpage


I am doing some BASH shell scripting with curl. If my curl command returns any text, I know I have an error. This text returned by curl is usually in H

4 answers
  •  逝去的感伤
    2021-02-10 19:04

    Maybe a parser-based Perl solution?

    perl -0777 -MHTML::Strip -nlE 'say HTML::Strip->new->parse($_)' file.html
    

    You must first install the HTML::Strip module with the command cpan HTML::Strip.

    Alternatively

    You can use a standard OS X utility called textutil (see its man page):

    textutil -convert txt file.html
    

    This will produce file.txt with the HTML tags stripped, or

    textutil -convert txt -stdin -stdout < file.html | some_command
    

    Another alternative

    Some systems come with the lynx text-only browser installed. You can use:

    lynx -dump file.html #or
    lynx -stdin -dump < file.html
    

    But in your case, you can rely only on pure sed or awk solutions... IMHO.
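
Because sed works line by line, a pure sed approach has to pull the whole input into one buffer before removing tags. A minimal sketch, assuming no attribute value contains a literal > character; the function name strip_tags_sed is hypothetical and not part of the original answer:

```shell
#!/bin/sh
# strip_tags_sed: hypothetical helper name, used only for illustration.
# Reads HTML on stdin and writes the text with tags removed to stdout.
strip_tags_sed() {
    # :a ... ba   loop: while not on the last line, append the next line (N)
    #             to the pattern space, so the whole input becomes one buffer.
    # s/<[^>]*>//g then removes every tag, including one spanning several lines.
    sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/<[^>]*>//g'
}

# Example input: the <a ...> tag is split across two lines.
printf '<p>Error:\n<a\n href="#">not found</a></p>\n' | strip_tags_sed
```

This is still a regular-expression hack rather than a parser, so comments, scripts, and entities such as &amp; are not handled.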

    But if you have perl (and only lack the HTML::Strip module), the following is still better than sed:

    perl -0777 -pe 's/<.*?>//sg'
    

    because it will also remove the following (multiline and common) kind of tag:

    <a
     href="#"
    >link text</a>
    
