I'm trying to make a Bash script to extract results from an HTML page. I managed to get the content of the page with curl, but the next step is parsing the output, which is pr
Just use awk:
awk -F '<[^>]+>' '
found { sub(/^[[:space:]]*/,";"); print title $0; found=0 }
/<div class="item_title">/ { title=$2 }
/<div class="item_desc">/ { found=1 }
' file
ITEM 1;ITEM DESCRIPTION 1
ITEM 2;ITEM DESCRIPTION 2
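If you want to try this without the real page, here is a self-contained sketch. The contents of items.html are an assumption, reconstructed from the class names the awk script matches (title on one line, description on the line after its opening <div>); the actual markup may differ.

```shell
# Hypothetical sample input; the layout is assumed, not taken from the real page.
cat > items.html <<'EOF'
<div class="result">
  <div class="item">
    <div class="item_title">ITEM 1</div>
    <div class="item_desc">
      ITEM DESCRIPTION 1
    </div>
  </div>
  <div class="item">
    <div class="item_title">ITEM 2</div>
    <div class="item_desc">
      ITEM DESCRIPTION 2
    </div>
  </div>
</div>
EOF

awk -F '<[^>]+>' '
# Line after an item_desc opening tag: swap leading blanks for ";" and print.
found { sub(/^[[:space:]]*/, ";"); print title $0; found = 0 }
# With tags as field separators, the title text is field 2.
/<div class="item_title">/ { title = $2 }
# Remember that the next line holds the description.
/<div class="item_desc">/ { found = 1 }
' items.html
```

With that input it should print the same two lines as shown above. It relies on the description sitting on exactly one line of its own.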
A bare-minimum sed program that handles the HTML loosely, with no validation, and is easily confused by variations in the HTML, is:
/ *<div class="item_title">\(.*\)<\/div>/ { s//\1/; h; }
/ *<div class="item_desc">/,/<\/div>/ {
    /<div class="item_desc">/d
    /<\/div>/d
    s/^ *//
    G
    s/\(.*\)\n\(.*\)/\2;\1/p
}
The first line matches item title lines. The s/// command captures just the part between the <div …> and </div>; the h copies that into the hold space (memory).
The rest of the script matches lines between the item description <div> and its </div>. The first two lines delete (ignore) the <div> and </div> lines. The s/// removes leading spaces; the G appends the hold space to the pattern space after a newline; the s///p captures the part before the newline (the description) and the part after the newline (the title from the hold space), replaces them with the title and description separated by a semicolon, and prints the result.
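To see the G and s///p steps in isolation, here is a tiny sketch on two made-up lines (TITLE and DESC are placeholders, not from the page). The l command is added only to display the pattern space right after G; it is not part of the script above.

```shell
# Line 1 stands in for a title (saved with h); line 2 stands in for a description.
# l shows the pattern space unambiguously, so the embedded newline appears as \n.
printf 'TITLE\n   DESC\n' |
sed -n -e '1h' \
       -e '2{' \
       -e 's/^ *//' \
       -e 'G' \
       -e 'l' \
       -e 's/\(.*\)\n\(.*\)/\2;\1/p' \
       -e '}'
```

This should print DESC\nTITLE$ (the pattern space after G, as rendered by l) followed by TITLE;DESC.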
$ sed -n -f sed.script items.html
ITEM 1;ITEM DESCRIPTION 1
ITEM 2;ITEM DESCRIPTION 2
$
Note the -n; that means "don't print unless told to do so".
You can do it without a script file, but there's less to worry about if you use one. You can probably even squeeze it all onto one line if you're careful. Beware that the ; after the h is necessary with BSD sed and harmless but not crucial with GNU sed.
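For the no-script-file variant, one approach is to pass each line of the script as its own -e argument, since sed joins the -e fragments with newlines. This is a sketch rather than a tested-everywhere recipe, and the sample items.html is an assumption about the page layout.

```shell
# Hypothetical sample input; the real page layout may differ.
cat > items.html <<'EOF'
<div class="item_title">ITEM 1</div>
<div class="item_desc">
    ITEM DESCRIPTION 1
</div>
<div class="item_title">ITEM 2</div>
<div class="item_desc">
    ITEM DESCRIPTION 2
</div>
EOF

# Same commands as the script file, one -e per script line.
sed -n \
    -e '/ *<div class="item_title">\(.*\)<\/div>/ { s//\1/; h; }' \
    -e '/ *<div class="item_desc">/,/<\/div>/ {' \
    -e '/<div class="item_desc">/d' \
    -e '/<\/div>/d' \
    -e 's/^ *//' \
    -e 'G' \
    -e 's/\(.*\)\n\(.*\)/\2;\1/p' \
    -e '}' \
    items.html
```

Keeping one command per -e avoids most of the quoting and semicolon pitfalls of a true single-line script.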
There are all sorts of ways to make it more nearly bullet-proof (but it is debatable whether they're worthwhile). For example:
/ *<div class="item_title">\(.*\)<\/div>/
could be revised to:
/^[[:space:]]*<div class="item_title">[[:space:]]*\(.*\)[[:space:]]*<\/div>[[:space:]]*$/
to deal with arbitrary sequences of white space before, in the middle, and after the <div> components. (Because \(.*\) is greedy, it still swallows any white space just before the </div>, so trailing blanks in the title need a separate s/[[:space:]]*$// to remove them.) Repeat ad nauseam for the other regexes. You could arrange to have single spaces between words. You could arrange for a multi-line description to be printed just once as a single line, rather than each line segment being printed separately as it would be now.
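One possible way to combine those ideas is sketched below; it is my own variation, not the only way to do it. It accumulates the description lines in the hold space with H and prints once at the closing </div>. The file names items-strict.sed and messy.html are made up for the example, and it still assumes the description's opening and closing tags sit on lines of their own.

```shell
cat > items-strict.sed <<'EOF'
# Title line: keep the text, drop trailing blanks, start a new hold space.
/^[[:space:]]*<div class="item_title">[[:space:]]*\(.*\)<\/div>[[:space:]]*$/ {
    s//\1/
    s/[[:space:]]*$//
    h
}
# Description block: append each cleaned line to the hold space.
/^[[:space:]]*<div class="item_desc">[[:space:]]*$/,/^[[:space:]]*<\/div>[[:space:]]*$/ {
    /<div class="item_desc">/d
    /<\/div>/ {
        # Closing tag: fetch "title\nline1\nline2...", then join and print once.
        x
        s/\n/;/
        s/\n/ /g
        p
        d
    }
    s/^[[:space:]]*//
    s/[[:space:]]*$//
    s/[[:space:]][[:space:]]*/ /g
    H
}
EOF

# Hypothetical input with irregular spacing and a two-line description.
cat > messy.html <<'EOF'
<div class="item_title">   ITEM 1  </div>
  <div class="item_desc">
      ITEM   DESCRIPTION 1
      continues here
  </div>
<div class="item_title">ITEM 2</div>
<div class="item_desc">
ITEM DESCRIPTION 2
</div>
EOF

sed -n -f items-strict.sed messy.html
```

For that input it should print ITEM 1;ITEM DESCRIPTION 1 continues here and ITEM 2;ITEM DESCRIPTION 2.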
You could also wrap the whole construct in the file inside:
/^<div class="result">$/,/^<\/div>$/ {
…script as before…
}
And you could repeat that idea so that the item title is only picked inside <div class="item"> and </div>, etc.
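As a rough illustration of the wrapping idea, here is the original script nested inside the result range. The file names result.sed and page.html and the sidebar block are invented for the demo. It only works if the outer <div class="result"> and its </div> start in column one while the inner tags are indented, which is an assumption about the page.

```shell
cat > result.sed <<'EOF'
# Only lines between the un-indented result <div> and its </div> are processed.
/^<div class="result">$/,/^<\/div>$/ {
    / *<div class="item_title">\(.*\)<\/div>/ { s//\1/; h; }
    / *<div class="item_desc">/,/<\/div>/ {
        /<div class="item_desc">/d
        /<\/div>/d
        s/^ *//
        G
        s/\(.*\)\n\(.*\)/\2;\1/p
    }
}
EOF

# Hypothetical page: the sidebar uses the same classes but should be ignored.
cat > page.html <<'EOF'
<div class="sidebar">
  <div class="item_title">AD</div>
  <div class="item_desc">
    NOT A RESULT
  </div>
</div>
<div class="result">
  <div class="item_title">ITEM 1</div>
  <div class="item_desc">
    ITEM DESCRIPTION 1
  </div>
</div>
EOF

sed -n -f result.sed page.html
```

Only ITEM 1;ITEM DESCRIPTION 1 should come out; the sidebar entry is skipped because it lies outside the outer range.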