A solution based on xidel, a multi-platform web-scraping CLI, and XQuery:
xidel -s --xquery 'for $tr in //tr[position()>1] return join($tr/td, " ")' file
With the sample input, this yields:
SAVE_DOCUMENT OK 0.406 s
GET_DOCUMENT OK 0.332 s
DVK_SEND OK 0.001 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 0.143 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.001 s
SUMMARY_STATUS OK 0.888 s
Explanation:
The XQuery query for $tr in //tr[position()>1] return join($tr/td, " ") processes the tr elements starting with the 2nd one (position()>1, to skip the header row) in a loop, and joins the values of the child td elements ($tr/td) with a single space as the separator.
-s makes xidel silent (suppresses output of status information).
While html2text is convenient for displaying the extracted data, getting machine-parseable output from it is unfortunately non-trivial:
html2text file | awk -F' *\\|' 'NR>2 {gsub(/^\||.\b/, ""); $1=$1; print}'
The Awk command skips the first two lines of output (NR>2), removes the hidden \b-based (backspace-based) sequences that html2text outputs by default, splits each line into fields by |, and then outputs the fields with a space as the separator (a space is Awk's default output field separator; to change it to a tab, for instance, use -v OFS='\t').
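To see these steps in isolation, here is a minimal sketch that feeds the same Awk program a hand-made sample instead of real html2text output. The sample lines, the function names sample and clean_table, and the exact placement of the underscore-plus-backspace padding are assumptions made for illustration; actual html2text output may differ in its details.

```shell
# Hypothetical stand-in for html2text output: two leading lines to skip,
# then data rows whose padding is written as "_" followed by a backspace.
sample() {
  printf '|Name_\b |Status|Time\n'
  printf '|====|====|====\n'
  printf '|SAVE_DOCUMENT_\b _\b |OK_\b |0.406 s\n'
  printf '|GET_DOCUMENT_\b |OK|0.332 s\n'
}

# Same Awk program as above; the optional first argument sets the
# output field separator (defaults to a single space).
clean_table() {
  awk -F' *\\|' -v OFS="${1:- }" 'NR>2 {gsub(/^\||.\b/, ""); $1=$1; print}'
}

sample | clean_table        # space-separated fields
sample | clean_table '\t'   # tab-separated fields
```

Note that the real underscores in SAVE_DOCUMENT and GET_DOCUMENT survive, because only a character that is immediately followed by a backspace is removed.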
Note: Using -nobs to suppress backspace sequences at the source is not an option, because you then won't be able to distinguish between the hidden-by-default _ instances used for padding and actual _ characters in the data.
Note: Given that html2text seemingly invariably uses | as the column separator, the above will only work robustly if there are no | instances in the data being extracted.
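The effect of a stray | can be checked with a single made-up row. The row content and the helper name count_fields are hypothetical; only the field separator is taken from the Awk command above.

```shell
# Hypothetical row: the first cell is meant to be "A|B", the second "OK".
row='A|B |OK'

# Same field separator as in the Awk command above:
# optional spaces followed by a literal |.
count_fields() {
  awk -F' *\\|' '{ print NF }'
}

printf '%s\n' "$row" | count_fields   # prints 3, although the row has only 2 cells
```

Because the | inside the first cell is indistinguishable from a column separator, the cell is split in two and all following fields shift by one.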