I have the following HTML code:
Hello from the year 2020!
As of v2.9.9 of libxml, this behavior has been fixed in xmllint itself.
However, if you're using anything older than that, and don't want to build libxml from source just to get the fixed xmllint, you'll need one of the other workarounds here. As of this writing, the latest CentOS 8, for example, is still using a version of libxml (2.9.7) that behaves the way the OP describes.
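To decide whether you need a workaround at all, you can compare the installed libxml version against 2.9.9. The following is only a sketch: the function name libxml_has_newline_fix is made up for illustration, and it assumes the version is reported in the numeric form 20909 for 2.9.9 (the same form the wrapper script further down parses).

```shell
# Hypothetical helper: succeeds if the numeric libxml version string
# (e.g. 20907 for 2.9.7) is at least 20909, i.e. 2.9.9.
libxml_has_newline_fix() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge 20909 ]
}

# Usage (needs xmllint on PATH; 2>&1 in case the version goes to stderr):
#   v=$(xmllint --version 2>&1 | awk 'NR==1{print $NF; exit}')
#   libxml_has_newline_fix "$v" && echo "fixed" || echo "needs a workaround"
```

The check is kept separate from the xmllint call so it can be tried with any version number by hand.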
As I gather from this SO answer, it's theoretically possible to feed a command into the --shell option of older (<2.9.9) versions of xmllint, and this will produce each node on a separate line. However, you end up having to post-process it with sed or grep to remove the visual detritus of shell mode's (human-oriented) output. It's not ideal.
XMLStarlet, if available, offers another solution, but you do need to use xmlstarlet fo to format your HTML fragment into valid XML before using xmlstarlet sel to extract nodes:
echo '
<textarea name="command" class="setting-input fixed-width"
rows="9">1</textarea>
<textarea name="command" class="setting-input fixed-width"
rows="5">2</textarea>' \
| xmlstarlet fo -H -R \
| xmlstarlet sel -T -t -v '//textarea[@name="command"]' -n
If the Attempt to load network entity message from the second xmlstarlet invocation annoys you, just add 2>/dev/null at the very end to suppress it (at the risk of suppressing other messages printed to standard error).
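If you'd rather not throw away all of standard error, one option is to filter out only that one message. This is a rough sketch, not something XMLStarlet provides: the function name quiet_entity_warning is hypothetical, and it assumes the message text matches the wording quoted above.

```shell
# Hypothetical helper: run a command, dropping only the stderr lines that
# contain "Attempt to load network entity"; stdout passes through untouched.
# Note: in this sketch the command's exit status is not preserved.
quiet_entity_warning() {
    # fd 3 holds the original stdout; stderr is routed through grep instead
    { "$@" 2>&1 1>&3 | grep -v 'Attempt to load network entity' >&2; } 3>&1 || true
}

# Usage (assuming xmlstarlet is installed):
#   ... | xmlstarlet fo -H -R \
#       | quiet_entity_warning xmlstarlet sel -T -t -v '//textarea[@name="command"]' -n
```

Any other messages on standard error still reach the terminal.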
The XMLStarlet options explained (see also the user's guide):
fo -H -R: format the output, expecting HTML input, and recovering as much bad input as possible
  - this adds an <html> root node, making the fragment in the OP's example valid XML
sel -T -t -v //xpath -n: select nodes based on XPath //xpath
  - outputs plain text (-T) instead of XML
  - uses a template (-t) that returns the value (-v) of the node rather than the node itself (allowing you to forgo using text() in the XPath expression)
  - prints a newline (-n)
Edit(s): Removed half-implemented xmllint --shell solution because it was just bad. Added an XMLStarlet example that actually works with the OP's data.
I used the following ugly trick; please feel free to provide a better solution.
I changed the HTML code by replacing </textarea> with \n</textarea> using the following command:
sed 's/<\/textarea/\'$'\n''<\/textarea/g' f
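For what it's worth, here is a sketch of the same trick written so it does not depend on the shell's $'\n' quoting: the newline is placed literally in the sed replacement, after a backslash. The function name break_before_close is invented for this example, and the xmllint line in the usage comment assumes your xmllint accepts - for standard input.

```shell
# Hypothetical helper: read HTML on stdin and insert a newline before
# every "</textarea", so each text node ends with its own line break.
break_before_close() {
    sed 's|</textarea|\
</textarea|g'
}

# Usage (assuming xmllint is installed and f is the HTML file):
#   break_before_close < f | xmllint --html --xpath '//textarea[@name="command"]/text()' -
```

Using | as the sed delimiter avoids having to escape the slash in </textarea.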
This wrapper script is intended precisely for producing newline-delimited output (for old releases of xmllint):
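#!/bin/bash
# Wrapper script that
# - produces newline-delimited output for XPath queries
# - implements --xpath on very old releases

# Exit status 0 here is taken to mean this xmllint understands --xpath
/usr/bin/xmllint --xpath &>/dev/null
implements_xpath=$?
# First libxml version (2.9.9) whose xmllint separates nodes with newlines
newlines_delimited_xmllint_version=20909
# The version number is the last field of the first line of --version output
current_version=$(xmllint --version |& awk 'NR==1{print $NF;exit}')
args=( "$@" )

if [[ $* == *--xpath* ]]; then
    # Find --xpath and its argument, and remove both from args
    for ((i=0; i<${#args[@]}; i++)); do
        if [[ ${args[i]} == --xpath ]]; then
            xpath="${args[i+1]}"
            unset 'args[i+1]'
            unset 'args[i]'
            break
        fi
    done
    # Note: $file is never assigned in this excerpt, so the two $file tests are always false
    if [[ ( $implements_xpath -eq 0 && $current_version -ge $newlines_delimited_xmllint_version ) || $file == - || $file == /dev/stdin || $xpath == / || $xpath == string\(* ]]
    then
        exec /usr/bin/xmllint "$@"
    else
        # Run the query in shell mode, then drop the first and last lines, separators, and blank lines
        exec /usr/bin/xmllint "${args[@]}" --shell <<< "cat $xpath" | sed '1d;$d;s/^ ------- *$//;/^$/d'
    fi
else
    exec /usr/bin/xmllint "$@"
fi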
Check latest revision: https://github.com/sputnick-dev/xmllint
As of June 29, 2020, Debian Buster has version 2.9.4, which is 4 years old.
Debian testing/experimental has 2.9.10, which includes the fix.
Another way to install 2.9.10 on the latest Debian stable: https://serverfault.com/a/1022826/120473 (without taking the risk of breaking the apt system)
Try this patch, which provides 2 options:
--xpath: same as old --xpath, with nodes separated by \n.
--xpath0: same as old --xpath, with nodes separated by \0.
Test input (a.html):
<textarea name="command" class="setting-input fixed-width" rows="9">1</textarea><textarea name="command" class="setting-input fixed-width" rows="5">2</textarea>
Test command 1:
# xmllint --xpath '//textarea[@name="command"]/text()' --html a.html
Test output 1:
1
2
Test command 2:
# xmllint --xpath0 '//textarea[@name="command"]/text()' --html a.html | xargs -0 -n1
Test output 2:
1
2
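The point of a \0 separator is that node values containing newlines stay intact. As a sketch of how such output could be consumed in bash without xargs, the loop below reads NUL-separated records into an array. The function name read_nul_records and the array name records are made up for this example, and --xpath0 exists only with the patch above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read NUL-separated records from stdin into the
# global array "records". Requires bash (read -d '' is not POSIX).
read_nul_records() {
    records=()
    local rec
    while IFS= read -r -d '' rec; do
        records+=("$rec")
    done
}

# Usage (assuming a patched xmllint that supports --xpath0):
#   read_nul_records < <(xmllint --xpath0 '//textarea[@name="command"]/text()' --html a.html)
#   printf 'node: %s\n' "${records[@]}"
```

The function has to be fed through a redirection rather than a pipe, because a pipe would run it in a subshell and the array would be lost.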