Using AWK/Grep/Bash to extract data from HTML


误落风尘 2021-01-22 16:02

I'm trying to make a Bash script to extract results from an HTML page. I managed to get the content of the page with Curl, but the next step is parsing the output, which is pr…

2 answers

梦毁少年i (original poster)

2021-01-22 16:16
A bare-minimum program that handles the HTML loosely, with no validation, and that is easily confused by variations in the HTML, is:

sed.script
```
/ *<div …>\(.*\)<\/div>/ { s//\1/; h; }
/ *<div …>/,/<\/div>/ {
    /<div …>/d
    /<\/div>/d
    s/^  *//
    G
    s/\(.*\)\n\(.*\)/\2;\1/p
}
```
The first line matches item title lines. The s/// command captures just the part between the opening `<div>` tag and the closing `</div>`; the h copies that into the hold space (memory).

The rest of the script matches lines between the item description `<div>` and its `</div>`. The first two lines inside the braces delete (ignore) the `<div>` and `</div>` lines. The s/// removes leading spaces; the G appends the hold space to the pattern space after a newline; the s///p captures the part before the newline (the description) and the part after the newline (the title from the hold space), replaces them with the title and description separated by a semicolon, and prints the result.

Example
```
$ sed -n -f sed.script items.html
ITEM 1;ITEM DESCRIPTION 1
ITEM 2;ITEM DESCRIPTION 2
$
```
Note the -n; that means "don't print unless told to do so".
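The run above can be reproduced end to end with a small sketch like the one below. The input file is hypothetical: the class names `item_title` and `item_desc`, and the layout of `items.html`, are assumptions made for illustration, not the markup of the page in the question.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the current directory is overwritten.
workdir=$(mktemp -d)

# Hypothetical input; the class names are assumptions, not taken from the real page.
cat > "$workdir/items.html" <<'EOF'
<div class="item">
  <div class="item_title">ITEM 1</div>
  <div class="item_desc">
    ITEM DESCRIPTION 1
  </div>
</div>
<div class="item">
  <div class="item_title">ITEM 2</div>
  <div class="item_desc">
    ITEM DESCRIPTION 2
  </div>
</div>
EOF

# The script from above, with the assumed tags filled in.
cat > "$workdir/sed.script" <<'EOF'
/ *<div class="item_title">\(.*\)<\/div>/ { s//\1/; h; }
/<div class="item_desc">/,/<\/div>/ {
    /<div class="item_desc">/d
    /<\/div>/d
    s/^  *//
    G
    s/\(.*\)\n\(.*\)/\2;\1/p
}
EOF

result=$(sed -n -f "$workdir/sed.script" "$workdir/items.html")
printf '%s\n' "$result"
rm -rf "$workdir"
```

With this input, the script prints one `title;description` line per item. Any change to the tag layout (for example, a title and its closing tag on different lines) will need a matching change to the regexes.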

You can do it without a script file, but there's less to worry about if you use one. You can probably even squeeze it all onto one line if you're careful. Beware that the ; after the h is necessary with BSD sed and harmless but not crucial with GNU sed.
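As a sketch of the no-script-file variant, the whole program can be passed to sed as one quoted, multi-line argument. The function name `extract_items` and the class names `item_title` and `item_desc` are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: same logic as the script file, passed to sed as one quoted argument.
# The class names item_title and item_desc are assumptions.
extract_items() {
    sed -n '
/ *<div class="item_title">\(.*\)<\/div>/ { s//\1/; h; }
/<div class="item_desc">/,/<\/div>/ {
    /<div class="item_desc">/d
    /<\/div>/d
    s/^  *//
    G
    s/\(.*\)\n\(.*\)/\2;\1/p
}
'
}

# Usage: pipe any HTML source into the function.
extract_items <<'EOF'
<div class="item_title">ITEM 1</div>
<div class="item_desc">
  ITEM DESCRIPTION 1
</div>
EOF
```

Because the function reads standard input, it can sit at the end of a pipeline, for instance `curl -s "$url" | extract_items`, where `$url` stands for whatever page you are fetching.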

Modification

There are all sorts of ways to make it more nearly bullet-proof (but it is debatable whether they're worthwhile). For example:
```
/ *<div …>\(.*\)<\/div>/
```
could be revised to:
```
/^[[:space:]]*<div …>[[:space:]]*\(.*\)[[:space:]]*<\/div>[[:space:]]*$/
```
to deal with arbitrary sequences of white space before, in the middle, and after the `<div>` components. Repeat ad nauseam for the other regexes. You could arrange to have single spaces between words. You could arrange for a multi-line description to be printed just once as a single line, rather than each line segment being printed separately as it would be now.
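One possible way to combine the whitespace-tolerant title regex with a single-line description is sketched below. It is not the original script: it collects description lines in the hold space with H and prints once at the closing tag. The function name `join_descriptions` and the class names are assumptions, and it still expects the opening and closing description tags on lines of their own.

```shell
#!/bin/sh
# Sketch only: the class names item_title and item_desc are assumptions.
join_descriptions() {
    sed -n '
# Title line: tolerate any whitespace around the tag, keep the text in hold space.
/^[[:space:]]*<div class="item_title">[[:space:]]*\(.*\)<\/div>[[:space:]]*$/ {
    s//\1/
    h
}
/<div class="item_desc">/,/<\/div>/ {
    /<div class="item_desc">/d
    /<\/div>/ {
        # End of description: hold space is the title, then one line per segment.
        g
        s/\n/;/
        s/\n/ /g
        p
        d
    }
    # Description line: trim both ends and append it to the hold space.
    s/^[[:space:]]*//
    s/[[:space:]]*$//
    H
}
'
}

# Usage with tab-indented input and a two-line description.
printf '\t<div class="item_title">  ITEM 1</div>\n\t<div class="item_desc">\n\t\tfirst part  \n\t\tsecond part\n\t</div>\n' |
    join_descriptions
# prints: ITEM 1;first part second part
```

The first `s/\n/;/` turns only the newline after the title into the separator; the following `s/\n/ /g` joins the remaining description segments with single spaces.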

You could also wrap the whole construct in the file inside:
```
/^<div …>$/,/^<\/div>$/ {
    …script as before…
}
```
And you could repeat that idea so that the item title is only picked up inside a particular pair of opening and closing tags, etc.
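A minimal sketch of the wrapping idea follows, assuming a hypothetical outer container `<div id="results">` whose opening and closing tags start in column one while everything inside it is indented. The container id, the function name `extract_results`, and the class names are all assumptions.

```shell
#!/bin/sh
# Sketch: only look for items between an assumed outer container and its
# unindented closing tag. All tag names and attributes here are hypothetical.
extract_results() {
    sed -n '
/^<div id="results">$/,/^<\/div>$/ {
    / *<div class="item_title">\(.*\)<\/div>/ { s//\1/; h; }
    /<div class="item_desc">/,/<\/div>/ {
        /<div class="item_desc">/d
        /<\/div>/d
        s/^  *//
        G
        s/\(.*\)\n\(.*\)/\2;\1/p
    }
}
'
}

# Usage: the sidebar block has the same inner markup but is skipped.
extract_results <<'EOF'
<div id="sidebar">
  <div class="item_title">AD</div>
  <div class="item_desc">
    IGNORE ME
  </div>
</div>
<div id="results">
  <div class="item_title">ITEM 1</div>
  <div class="item_desc">
    ITEM DESCRIPTION 1
  </div>
</div>
EOF
```

The outer range only ends on a `</div>` with no leading spaces, so this depends on the inner closing tags being indented; with flat, unindented HTML the outer range would end at the first inner `</div>`.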