Ideally, what I would like to be able to do is:
cat xhtmlfile.xhtml |
getElementViaXPath --path='/html/head/title' |
sed -e 's%(^|
While there are quite a few ready-made console utilities that might do what you want, it will probably take less time to write a couple of lines of code in a general-purpose programming language such as Python which you can easily extend and adapt to your needs.
Here is a Python script that uses lxml for parsing. It takes the name of a file or a URL as the first parameter and an XPath expression as the second parameter, and prints the strings or nodes matching the given expression.
#!/usr/bin/env python
import sys
from lxml import etree

tree = etree.parse(sys.argv[1])
xpath_expression = sys.argv[2]

# a hack allowing access to the
# default namespace (if defined) via the 'p:' prefix
# E.g. given a default namespace such as 'xmlns="http://maven.apache.org/POM/4.0.0"'
# an XPath of '//p:module' will return all the 'module' nodes
ns = tree.getroot().nsmap
if None in ns:
    ns['p'] = ns.pop(None)
# end of hack

for e in tree.xpath(xpath_expression, namespaces=ns):
    if isinstance(e, str):
        print(e)
    else:
        # encoding="unicode" makes tostring return str rather than bytes
        print(e.text and e.text.strip() or etree.tostring(e, pretty_print=True, encoding="unicode"))
lxml can be installed with pip install lxml. On Ubuntu you can use sudo apt install python-lxml (or python3-lxml for Python 3).
python xpath.py myfile.xml "//mynode"
lxml also accepts a URL as input:
python xpath.py http://www.feedforall.com/sample.xml "//link"
Note: If your XML has a default namespace with no prefix (e.g. xmlns=http://abc...) then you have to use the p prefix (provided by the 'hack') in your expressions, e.g. //p:module to get the modules from a pom.xml file. If the p prefix is already mapped in your XML, you'll need to modify the script to use another prefix.
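If the p prefix is already taken, the remapping step could pick a free prefix instead of hard-coding one. The following is only a sketch: it assumes the namespace map is a plain dict shaped like lxml's nsmap (prefix to URI, with None as the key for the default namespace), and the helper name remap_default_namespace is hypothetical, not part of lxml.

```python
def remap_default_namespace(nsmap, preferred="p"):
    """Return (new_map, prefix) with the default namespace bound to a free prefix.

    nsmap maps prefix -> URI; the default namespace uses the key None.
    If there is no default namespace, the returned prefix is None.
    """
    ns = dict(nsmap)  # work on a copy so the caller's dict is untouched
    if None not in ns:
        return ns, None
    prefix = preferred
    counter = 0
    # Try p, p1, p2, ... until a prefix is found that is not already mapped.
    while prefix in ns:
        counter += 1
        prefix = f"{preferred}{counter}"
    ns[prefix] = ns.pop(None)
    return ns, prefix


if __name__ == "__main__":
    # Hypothetical document that already maps 'p' to some other namespace.
    nsmap = {
        None: "http://maven.apache.org/POM/4.0.0",
        "p": "http://example.com/other",
    }
    ns, prefix = remap_default_namespace(nsmap)
    print(prefix, ns[prefix])
```

The returned prefix would then be used to build the expression, e.g. '//' + prefix + ':module', and the returned dict passed as the namespaces argument.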
A one-off script which serves the narrow purpose of extracting module names from an Apache Maven file. Note how the node name (module) is prefixed with the default namespace {http://maven.apache.org/POM/4.0.0}:
pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modules>
    <module>cherries</module>
    <module>bananas</module>
    <module>pears</module>
  </modules>
</project>
module_extractor.py:
from lxml import etree
for _, e in etree.iterparse("pom.xml", tag="{http://maven.apache.org/POM/4.0.0}module"):
    print(e.text)
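If installing lxml is not an option, roughly the same extraction can be sketched with the standard library's xml.etree.ElementTree. This is a swapped-in alternative, not the script above: it loads the whole document instead of iterating over it, and the function name module_names and the inline sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Inline copy of the sample pom.xml so the sketch is self-contained.
POM = b"""<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modules>
    <module>cherries</module>
    <module>bananas</module>
    <module>pears</module>
  </modules>
</project>
"""

# Bind the default namespace to an explicit prefix for the path expression.
NS = {"p": "http://maven.apache.org/POM/4.0.0"}


def module_names(xml_bytes):
    """Return the text of every <module> element in the POM namespace."""
    root = ET.fromstring(xml_bytes)
    return [m.text for m in root.findall(".//p:module", NS)]


if __name__ == "__main__":
    for name in module_names(POM):
        print(name)
```

ElementTree only supports a limited subset of XPath, so this fits simple lookups like this one rather than arbitrary expressions.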
Another command-line tool is my new Xidel. Unlike the already mentioned xpath/xmlstarlet, it also supports XPath 2 and XQuery.
The title can be read like this:
xidel xhtmlfile.xhtml -e /html/head/title > titleOfXHTMLPage.txt
It also has a cool feature to export multiple variables to bash. For example:
eval $(xidel xhtmlfile.xhtml -e 'title := //title, imgcount := count(//img)' --output-format bash )
This sets $title to the title and $imgcount to the number of images in the file, which should be as flexible as parsing it directly in bash.
Yuzem's method can be improved by inverting the order of the < and > signs in the rdom function and the order of the variable assignments, so that:
rdom () { local IFS=\> ; read -d \< E C ;}
becomes:
rdom () { local IFS=\< ; read -d \> C E ;}
If the parsing is not done like this, the last tag in the XML file is never reached. This can be problematic if you intend to output another XML file at the end of the while loop.
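To see why the delimiter choice matters, here is a small Python emulation of the two rdom variants. It is a simplified sketch, not a reimplementation of bash: it assumes that a while loop around read -d only runs its body for records that actually end with the delimiter (the final, unterminated chunk makes read return non-zero), and the function names are made up for illustration.

```python
def read_records(text, delimiter):
    """Return the delimiter-terminated records, as a `while read -d X` loop sees them.

    The last piece after the final delimiter is dropped, because read
    fails on it and the loop body is not executed.
    """
    return text.split(delimiter)[:-1]


def tags_original(text):
    """Emulate: IFS='>' ; read -d '<' E C  (each record looks like 'tag>content')."""
    tags = []
    for record in read_records(text, "<"):
        element, _, _content = record.partition(">")
        if element:
            tags.append(element)
    return tags


def tags_inverted(text):
    """Emulate: IFS='<' ; read -d '>' C E  (each record looks like 'content<tag')."""
    tags = []
    for record in read_records(text, ">"):
        _content, _, element = record.partition("<")
        if element:
            tags.append(element)
    return tags


if __name__ == "__main__":
    xml = "<root><item>one</item></root>\n"
    print(tags_original(xml))   # the closing /root tag is missing
    print(tags_inverted(xml))   # every tag is seen
```

With '<' as the delimiter, the text after the last '<' (the final closing tag) is never followed by another delimiter, so the loop ends before processing it. With '>' as the delimiter, only the trailing newline is left over.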
You can do that very easily using only bash. You only have to add this function:
rdom () { local IFS=\> ; read -d \< E C ;}
Now you can use rdom like read, but for HTML documents. When called, rdom assigns the element to the variable E and the content to the variable C.
For example, to do what you wanted to do:
while rdom; do
    if [[ $E = title ]]; then
        echo "$C"
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
Check out XML2 from http://www.ofb.net/~egnor/xml2/ which converts XML to a line-oriented format.
This works if you want XML attributes:
$ cat alfa.xml
<video server="asdf.com" stream="H264_400.mp4" cdn="limelight"/>
$ sed 's.[^ ]*..;s./>..' alfa.xml > alfa.sh
$ . ./alfa.sh
$ echo "$stream"
H264_400.mp4
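The sed approach relies on the element being on a single line with simple, double-quoted attribute values. As a hedged alternative, the same attribute-to-assignment conversion can be sketched in Python with the standard library; the function name attributes_as_shell is hypothetical, and the sketch assumes the attribute names are valid shell variable names.

```python
import shlex
import xml.etree.ElementTree as ET


def attributes_as_shell(xml_text):
    """Return one NAME=VALUE shell assignment per attribute of the root element.

    Values are passed through shlex.quote so that spaces or quotes in an
    attribute value cannot break the assignment when the output is sourced.
    """
    element = ET.fromstring(xml_text)
    return [f"{name}={shlex.quote(value)}" for name, value in element.attrib.items()]


if __name__ == "__main__":
    line = '<video server="asdf.com" stream="H264_400.mp4" cdn="limelight"/>'
    for assignment in attributes_as_shell(line):
        print(assignment)
```

The printed lines can be written to a file and sourced in the same way as alfa.sh above.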