How do I print the contents of an XML element - from the starting tag to the closing tag - using AWK?
For example, consider the following XML:
$ awk -v tag='city' '$0~"^<"tag"\\>"{inTag=1} inTag; $0~"^</"tag">"{inTag=0}' file
<city id="AT">
<cityname>Athens</cityname>
<state>GA</state>
<description> Home of the University of Georgia</description>
<population>100,000</population>
<location>Located about 60 miles Northeast of Atlanta</location>
<latitude>33 57' 39" N</latitude>
<longitude>83 22' 42" W</longitude>
</city>
The above uses GNU awk for the \> word-boundary operator. With other awks, use [^[:alnum:]_] or similar.
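As a rough sketch of that portable variant (the file name sample.xml and its contents are made up for illustration, not taken from the question), the word boundary can be replaced by "a non-word character or end of line":

```shell
# Hypothetical sample input; <citystate> is included to show that
# a longer tag name starting with "city" does not start a match.
cat > sample.xml <<'EOF'
<data>
<citystate>ignored</citystate>
<city id="AT">
<cityname>Athens</cityname>
</city>
</data>
EOF

out=$(awk -v tag='city' '
  # opening tag at line start, followed by a non-word character or end of line
  $0 ~ "^<" tag "([^[:alnum:]_]|$)" { inTag = 1 }
  # print every line while inside the element
  inTag
  # closing tag at line start ends the range
  $0 ~ "^</" tag ">" { inTag = 0 }
' sample.xml)

printf '%s\n' "$out"
```

This should print only the three lines from <city id="AT"> to </city>. Like \>, the bracket expression treats - and . as boundaries, so a tag such as <city-name> would still match.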
To only print the first occurrence:
$ awk -v tag='city' '$0~"^<"tag"\\>"{inTag=1} inTag{print; if ($0~"^</"tag">") exit}' file
<city id="AT">
<cityname>Athens</cityname>
<state>GA</state>
<description> Home of the University of Georgia</description>
<population>100,000</population>
<location>Located about 60 miles Northeast of Atlanta</location>
<latitude>33 57' 39" N</latitude>
<longitude>83 22' 42" W</longitude>
</city>
Solutions that parse XML with tools like awk and sed are imperfect. You cannot rely on XML always having a human-readable layout. For example, some web services omit newlines, so the entire XML document appears on one line.
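To illustrate the failure mode, here is a small sketch that runs a line-based "print the first city" awk command against a file whose first city element sits on a single line. The file name oneline.xml and its trimmed contents are assumptions for the demo, and the pattern uses a portable boundary test rather than GNU awk's \>:

```shell
# Hypothetical input: the first <city> element is all on one line.
cat > oneline.xml <<'EOF'
<data>
<city id="AT"> <cityname>Athens</cityname> <state>GA</state> </city>
<city id="DUB">
<cityname>Dublin</cityname>
</city>
</data>
EOF

# Line-based attempt to print only the first <city> element.
out=$(awk -v tag='city' '
  $0 ~ "^<" tag "([^[:alnum:]_]|$)" { inTag = 1 }
  inTag { print; if ($0 ~ "^</" tag ">") exit }
' oneline.xml)

printf '%s\n' "$out"

# Count how many <city ...> opening lines were printed; one was intended.
opened=$(printf '%s\n' "$out" | grep -c '^<city ')
echo "city elements printed: $opened"
```

Because the closing </city> of the first element is not at the start of a line, awk keeps printing until it reaches the second element's closing tag, so both cities come out instead of one.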
I would recommend using xmllint, which can select nodes using XPath, a query language designed for XML.
The following command will select the city tags:
xmllint --xpath "//city" data.xml
XPath is extremely useful. It makes every part of the XML document addressable:
xmllint --xpath "string(//city[1]/@id)" data.xml
Returns the string "AT".
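A few more queries in the same spirit, as a hedged sketch: it assumes an xmllint build that supports --xpath, and it writes a trimmed, made-up copy of the data to a hypothetical file named cities.xml so that it can be run on its own. The first city is deliberately kept on one line to show that layout does not matter here:

```shell
# Hypothetical trimmed sample; the first <city> is all on one line.
cat > cities.xml <<'EOF'
<data>
<city id="AT"> <cityname>Athens</cityname> <state>GA</state> </city>
<city id="DUB">
<cityname>Dublin</cityname>
<state>Dub</state>
</city>
</data>
EOF

# Stop quietly if xmllint (shipped with libxml2) is not installed.
command -v xmllint >/dev/null 2>&1 || { echo "xmllint not found" >&2; exit 0; }

# Number of city elements in the document.
city_count=$(xmllint --xpath 'count(//city)' cities.xml)

# Text of the second city's name.
second_name=$(xmllint --xpath 'string(//city[2]/cityname)' cities.xml)

# id attribute of the city whose state is GA.
ga_id=$(xmllint --xpath 'string(//city[state="GA"]/@id)' cities.xml)

printf '%s\n' "$city_count" "$second_name" "$ga_id"
```

With this sample the three values should be 2, Dublin and AT, regardless of where the line breaks fall.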
To return only the first occurrence of the "city" tag, use //city[1]. xmllint can also pretty-print the result:
$ xmllint --xpath "//city[1]" data.xml | xmllint --format -
<?xml version="1.0"?>
<city id="AT">
<cityname>Athens</cityname>
<state>GA</state>
<description> Home of the University of Georgia</description>
<population>100,000</population>
<location>Located about 60 miles Northeast of Atlanta</location>
<latitude>33 57' 39" N</latitude>
<longitude>83 22' 42" W</longitude>
</city>
In this sample data the first "city" element appears all on one line. This is still valid XML.
<data>
<flight>
<airline>Delta</airline>
<flightno>22</flightno>
<origin>Atlanta</origin>
<destination>Paris</destination>
<departure>5:40pm</departure>
<arrival>8:10am</arrival>
</flight>
<city id="AT"> <cityname>Athens</cityname> <state>GA</state> <description> Home of the University of Georgia</description> <population>100,000</population> <location>Located about 60 miles Northeast of Atlanta</location> <latitude>33 57' 39" N</latitude> <longitude>83 22' 42" W</longitude> </city>
<city id="DUB">
<cityname>Dublin</cityname>
<state>Dub</state>
<description> Dublin</description>
<population>1,500,000</population>
<location>Ireland</location>
<latitude>NA</latitude>
<longitude>NA</longitude>
</city>
</data>