I need to get the HTML contents between a pair of given tags using a bash script. As an example, given the HTML code below:

<html>
<head>
</head>
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
</html>

how would I extract everything between <body> and </body>?
Bash is probably the wrong tool for this. Try a Python script using the powerful Beautiful Soup library instead. It will be more work up front, but in the long run (here: after about an hour) the time savings will make up for the additional effort.
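As a minimal sketch, assuming the markup is saved as in.html (a hypothetical name) and the beautifulsoup4 package is installed:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

with open("in.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# Join the children of <body> to get everything between the tags
print("".join(str(child) for child in soup.body.children))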
Another option is the multi-platform xidel utility (home page on SourceForge, GitHub repository), which can handle both XML and HTML:
xidel in.html -e '/html/body/node()' --printed-node-format=html
Leaving Bash aside due to its limitations, you can use nokogiri as a command-line utility, as explained here.
Example, fetching a page and printing every <a> element:
curl -s http://example.com/ | nokogiri -e 'puts $_.search('\''a'\'')'
Plain-text processing is not a good fit for HTML/XML parsing. I hope this gives you some ideas:
kent$ xmllint --xpath "//body" f.html
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
Using sed in shell/bash means you needn't install anything else.
tag=body
sed -n "/<$tag>/,/<\/$tag>/p" file
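For example, with the question's markup saved as in.html (a hypothetical name), this prints:

<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>

Note that this line-oriented approach assumes the opening and closing tags each sit on their own line.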
Personally I find it very useful to use the hxselect command (often with the help of hxclean) from the html-xml-utils package. The latter fixes (sometimes broken) HTML files into correct XML, and the former lets you use CSS selectors to get the node(s) you need. With the -c option, it strips the surrounding tags. All these commands work on stdin and stdout. So in your case you would execute:
$ hxselect -c body <<HTML
<html>
<head>
</head>
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
</html>
HTML
to get what you need: the contents of <body>, without the surrounding tags. Plain and simple.
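If the input is messy real-world HTML, a sketch of combining the two tools (again assuming the markup is in in.html, a hypothetical name):

$ hxclean in.html | hxselect -c body

hxclean first normalizes the markup into well-formed XML so that hxselect can parse it reliably.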