Removing all HTML tags from a webpage


I am doing some bash shell scripting with curl. If my curl command returns any text, I know I have an error. The text returned by curl is usually in HTML.
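The check the question describes might look like the following sketch (the `check` function name, the URL handling, and the sed-based tag stripping are my assumptions, not taken from the question):

```shell
# hypothetical sketch of the check described in the question:
# fetch a page; any non-empty text left after stripping tags is treated as an error
check() {
  body=$(curl -s "$1")
  # strip single-line HTML tags before testing for leftover text
  stripped=$(printf '%s' "$body" | sed 's/<[^>]*>//g')
  if [ -n "$stripped" ]; then
    echo "error: $stripped"
    return 1
  fi
}
```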

4 Answers
  •  逝去的感伤
    2021-02-10 19:04

    Maybe a parser-based perl solution?

    perl -0777 -MHTML::Strip -nlE 'say HTML::Strip->new->parse($_)' file.html
    

    You must install the HTML::Strip module with the cpan HTML::Strip command.

    Alternatively, you can use a standard OS X utility called textutil (see its man page):

    textutil -convert txt file.html
    

    will produce file.txt with the HTML tags stripped, or:

    textutil -convert txt -stdin -stdout < file.html | some_command
    

    Another alternative: some systems have the lynx text-only browser installed. You can use:

    lynx -dump file.html #or
    lynx -stdin -dump < file.html
    

    But in your case, you may only be able to rely on pure sed or awk solutions... IMHO.
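A minimal pure-sed sketch (a common pattern, not shown in the answer itself): it strips tags that open and close on the same line, but tags split across lines slip through, which is the limitation the perl one-liner addresses.

```shell
# strip tags that open and close on the same line
printf '<p>hello <b>world</b></p>\n' | sed 's/<[^>]*>//g'
# prints: hello world
```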

    But if you have perl (and just lack the HTML::Strip module), the following is still better than sed:

    perl -0777 -pe 's/<.*?>//sg'
    

    because it will also remove a multiline (and common) tag like the following:

    <a href="...">
    link text
    </a>
    
