Ideally, what I would like to be able to do is:
cat xhtmlfile.xhtml |
getElementViaXPath --path='/html/head/title' |
sed -e 's%(^<title>|</title>$)%%g' > titleOfXHTMLPage.txt
While it seems like "never parse XML, JSON... from bash without a proper tool" is sound advice, I disagree. If this is a side job, it is wasteful to look for the proper tool and then learn it... Awk can do it in minutes. My programs have to work on all of the above-mentioned kinds of data and more. Hell, I do not want to test 30 tools to parse the 5-7-10 different formats I need if I can awk the problem in minutes. I do not care about XML, JSON or whatever! I need a single solution for all of them.
As an example: my SmartHome program runs our homes. While doing that, it reads a plethora of data in too many different formats that I cannot control. I never use dedicated, proper tools, since I do not want to spend more than minutes on reading the data I need. With FS and RS adjustments, this awk solution works for any textual format. But it may not be the proper answer when your primary task is to work with loads of data in that format!
I faced the problem of parsing XML from bash yesterday. Here is how I do it for any hierarchical data format. As a bonus, I assign the data directly to variables in a bash script.
To make things easier to read, I will present the solution in stages. From the OP test data, I created a file: test.xml
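The content of test.xml is not reproduced in this answer, so here is a plausible stand-in with one element per line (all names and values below are made up for illustration):
<!-- hypothetical stand-in for the OP test data; all values are made up -->
<host>test.host.com</host>
<username>test_user</username>
<password>test_pass</password>
<dbname>test_db</dbname>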
Parsing said XML in bash and extracting the data in 90 chars:
awk 'BEGIN { FS="<|>"; RS="\n" }; /host|username|password|dbname/ { print $2, $3 }' test.xml
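With the stand-in file above, this prints each element name followed by its value:
host test.host.com
username test_user
password test_pass
dbname test_db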
I normally use a more readable version, since it is easier to modify in real life and I often need to test differently:
awk 'BEGIN { FS="<|>"; RS="\n" }; { if ($0 ~ /host|username|password|dbname/) print $2, $3 }' test.xml
I do not care what the format is called. I seek only the simplest solution. In this particular case, I can see from the data that newline is the record separator (RS) and <> delimit fields (FS). In my original case, I had complicated indexing of 6 values within two records, had to relate them and find when the data exists, plus the fields (records) may or may not exist. It took 4 lines of awk to solve the problem perfectly. So, adapt the idea to each need before using it!
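For instance, a minimal sketch of the same adaptation on a flat key: value config, where only FS changes (settings.conf and its contents are hypothetical):
# hypothetical settings.conf contains lines like "host: test.host.com"
awk 'BEGIN { FS=": *"; RS="\n" }; /host|username|password|dbname/ { print $1, $2 }' settings.conf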
The second part simply checks whether a wanted string is present in a line (RS) and, if so, prints out the needed fields (FS). The above took me about 30 seconds to copy and adapt from the last command I used this way (4 times longer). And that is it! Done in 90 chars.
But, I always need to get the data neatly into variables in my script. I first test the constructs like so:
awk 'BEGIN { FS="<|>"; RS="\n" }; { if ($0 ~ /host|username|password|dbname/) print $2"=\""$3"\"" }' test.xml
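With the stand-in data, that prints ready-to-eval assignments:
host="test.host.com"
username="test_user"
password="test_pass"
dbname="test_db"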
In some cases I use printf instead of print. When I see that everything looks well, I simply finish by assigning the values to variables. I know many think "eval" is "evil", no need to comment :) The trick has worked perfectly on all four of my networks for years. But keep learning if you do not understand why this may be bad practice! Including the bash variable assignments and ample spacing, my solution needs 120 chars to do everything.
eval $( awk 'BEGIN { FS="<|>"; RS="\n" }; { if ($0 ~ /host|username|password|dbname/) print $2"=\""$3"\"" }' test.xml ); echo "host: $host, username: $username, password: $password, dbname: $dbname"
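With the stand-in data, the final line prints:
host: test.host.com, username: test_user, password: test_pass, dbname: test_db
If you do want to avoid eval, a minimal eval-free sketch under the same assumptions (trusted, well-formed input; bash with process substitution) is to read the name/value pairs and declare them:
# eval-free variant: read name=value pairs from awk and declare them in the current shell
while IFS='=' read -r name value; do
    declare "$name=$value"
done < <(awk 'BEGIN { FS="<|>"; RS="\n" }; /host|username|password|dbname/ { print $2 "=" $3 }' test.xml)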