I am doing some Bash shell scripting with curl. If my curl command returns any text, I know I have an error. This text returned by curl is usually in HTML. How can I strip the HTML tags so I'm left with just the plain error text?
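A minimal sketch of the error check described above (the curl call is stubbed with a literal string so the snippet runs anywhere; in a real script it would be `response=$(curl -s "$url")`):

```shell
#!/bin/sh
# Hedged sketch: any non-empty body returned by curl is treated as an error.
# The curl call is replaced with a literal so the example is self-contained.
response='<html><body>Error: not found</body></html>'   # stand-in for: $(curl -s "$url")
if [ -n "$response" ]; then
    status=error
else
    status=ok
fi
echo "$status"
```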
If you want to remove all HTML tags and also all script tags (and their contents), you can use the following (GNU sed; `-i` placed before the expression and the file name quoted):
sed -i 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g' "$file" && sed -i '/</ {:k s/<[^>]*>//g; /</ {N; bk}}' "$file" && sed -i -r '/^\s*$/d' "$file"
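A throwaway run of that pipeline on a small sample file (the file name comes from mktemp and is purely illustrative):

```shell
# Demo of the three chained sed commands on a sample file.
file=$(mktemp)
cat > "$file" <<'EOF'
<html><body>
<script>
alert(1)
</script>
<p>hello</p>
</body></html>
EOF
# 1) drop <script> blocks, 2) strip remaining tags (joining multiline tags), 3) delete blank lines
sed -i 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g' "$file"
sed -i '/</ {:k s/<[^>]*>//g; /</ {N; bk}}' "$file"
sed -i -r '/^\s*$/d' "$file"
result=$(cat "$file")
rm -f "$file"
echo "$result"
```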
Code for GNU sed:
sed '/</ {:k s/<[^>]*>//g; /</ {N; bk}}' file
This might still fail on unusual input; an HTML-parsing tool is the safer choice.
sed doesn't support non-greedy matching; as a workaround, match anything except > inside the tag:
sed 's/<[^>]*>//g'
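A quick illustration of why the character class matters: with a greedy .* the match spans from the first < to the last >, swallowing the text between the tags.

```shell
# Greedy .* spans across tags; [^>]* stops at each tag's closing >.
greedy=$(echo '<b>bold</b> text' | sed 's/<.*>//g')      # removes too much
pertag=$(echo '<b>bold</b> text' | sed 's/<[^>]*>//g')   # keeps the text
echo "$greedy"   # " text"
echo "$pertag"   # "bold text"
```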
Maybe a parser-based Perl solution?
perl -0777 -MHTML::Strip -nlE 'say HTML::Strip->new->parse($_)' file.html
You must first install the HTML::Strip module, with the command: cpan HTML::Strip
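Since HTML::Strip is not a core module, a script can probe for it before relying on it (a hedged sketch; the fallback action is up to you):

```shell
# Probe for the HTML::Strip module; fall back (e.g. to sed) if it's missing.
if perl -MHTML::Strip -e 1 2>/dev/null; then
    have_strip=yes
else
    have_strip=no   # could fall back to: sed 's/<[^>]*>//g'
fi
echo "have_strip=$have_strip"
```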
Alternatively, you can use a standard OS X utility called textutil (see its man page):
textutil -convert txt file.html
will produce file.txt with the HTML tags stripped, or
textutil -convert txt -stdin -stdout < file.html | some_command
Another alternative: some systems come with the lynx text-only browser installed. You can use:
lynx -dump file.html          # or
lynx -stdin -dump < file.html
But in your case, you can rely only on pure sed or awk solutions, IMHO.
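If awk is the tool at hand, a one-line tag stripper could look like this (simplistic: it assumes each tag opens and closes on the same line):

```shell
# awk variant: gsub removes every <...> occurrence on each line.
out=$(echo '<p>hello <b>world</b></p>' | awk '{gsub(/<[^>]*>/, ""); print}')
echo "$out"   # "hello world"
```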
But if you have perl (and are only missing the HTML::Strip module), the following is still better than sed:
perl -0777 -pe 's/<.*?>//sg'
because it will also remove multiline (and common) tags like the following:
<a
href="#"
class="some"
>link text</a>