Using AWK/Grep/Bash to extract data from HTML

I'm trying to write a Bash script to extract results from an HTML page. I managed to fetch the content of the page with curl, but the next step is parsing the output, which is pr

2 answers

暗喜

2021-01-22 16:10

Just use awk:

```
awk -F '<[^>]+>' '
    found { sub(/^[[:space:]]*/,";"); print title $0; found=0 }
    /<div class="item_title">/ { title=$2 }
    /<div class="item_desc">/  { found=1 }
' file
```
Output:
```
ITEM 1;ITEM DESCRIPTION 1
ITEM 2;ITEM DESCRIPTION 2
```
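
The question's HTML is cut off above, so the input below is only a guess at its shape, inferred from the class names used in the command. Under that assumption, a minimal end-to-end sketch of the awk approach might look like this. The temporary directory, the file name `items.html` and the `result` variable are illustrative, and `[ \t]*` stands in for `[[:space:]]*` so the sketch does not depend on POSIX character-class support in the local awk:

```shell
# Hypothetical sample input; the real page layout is an assumption.
tmp=$(mktemp -d)
cat > "$tmp/items.html" <<'EOF'
<div class="result">
  <div class="item">
    <div class="item_title">ITEM 1</div>
    <div class="item_desc">
      ITEM DESCRIPTION 1
    </div>
  </div>
  <div class="item">
    <div class="item_title">ITEM 2</div>
    <div class="item_desc">
      ITEM DESCRIPTION 2
    </div>
  </div>
</div>
EOF

# -F makes every HTML tag a field separator, so on a title line
# $1 is the indentation and $2 is the text between the tags.
result=$(awk -F '<[^>]+>' '
    # Line right after an item_desc tag: swap the indentation for ";" and print.
    found { sub(/^[ \t]*/, ";"); print title $0; found = 0 }
    # Remember the most recent title.
    /<div class="item_title">/ { title = $2 }
    # Flag that the next line holds the description.
    /<div class="item_desc">/  { found = 1 }
' "$tmp/items.html")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The `found` rule comes first on purpose: it must fire on the line after the `item_desc` tag, not on the tag line itself.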

梦毁少年i

2021-01-22 16:16
A bare-minimum program that handles the HTML loosely, with no validation, and that is easily confused by variations in the HTML, is:

sed.script
```
/ *<div class="item_title">\(.*\)<\/div>/ { s//\1/; h; }
/ *<div class="item_desc">/,/<\/div>/ {
    /<div class="item_desc">/d
    /<\/div>/d
    s/^  *//
    G
    s/\(.*\)\n\(.*\)/\2;\1/p
}
```
The first line matches item title lines. The s/// command captures just the part between the <div …> and </div>; the h copies that into the hold space (memory).

The rest of the script matches lines between the item description <div> and its </div>. The first two lines delete (ignore) the <div> and </div> lines. The s/// removes leading spaces; the G appends the hold space to the pattern space after a newline; the s///p captures the part before the newline (the description) and the part after the newline (the title from the hold space), and replaces them with the title and description, separated by a semi-colon, and prints the result.

Example
```
$ sed -n -f sed.script items.html
ITEM 1;ITEM DESCRIPTION 1
ITEM 2;ITEM DESCRIPTION 2
$
```
Note the -n; that means "don't print unless told to do so".

You can do it without a script file, but there's less to worry about if you use one. You can probably even squeeze it all onto one line if you're careful. Beware that the ; after the h is necessary with BSD sed and harmless but not crucial with GNU sed.
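
As one way to drop the script file, the same commands can be passed as a series of `-e` options, each contributing one line of the script. This is only a sketch: the sample input is an assumption about the page layout (the question's HTML is not shown here), and the temporary directory and the `result` variable are illustrative.

```shell
# Hypothetical sample input; the real page layout is an assumption.
tmp=$(mktemp -d)
cat > "$tmp/items.html" <<'EOF'
<div class="result">
  <div class="item">
    <div class="item_title">ITEM 1</div>
    <div class="item_desc">
      ITEM DESCRIPTION 1
    </div>
  </div>
  <div class="item">
    <div class="item_title">ITEM 2</div>
    <div class="item_desc">
      ITEM DESCRIPTION 2
    </div>
  </div>
</div>
EOF

# Each -e adds one script line, so no separate sed.script file is needed.
# The ";" before the closing "}" is kept for BSD sed.
result=$(sed -n \
    -e '/ *<div class="item_title">\(.*\)<\/div>/ { s//\1/; h; }' \
    -e '/ *<div class="item_desc">/,/<\/div>/ {' \
    -e '/<div class="item_desc">/d' \
    -e '/<\/div>/d' \
    -e 's/^  *//' \
    -e 'G' \
    -e 's/\(.*\)\n\(.*\)/\2;\1/p' \
    -e '}' "$tmp/items.html")

printf '%s\n' "$result"
rm -rf "$tmp"
```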

Modification

There are all sorts of ways to make it more nearly bullet-proof (but it is debatable whether they're worthwhile). For example:
```
/ *<div class="item_title">\(.*\)<\/div>/
```
could be revised to:
```
/^[[:space:]]*<div class="item_title">[[:space:]]*\(.*\)[[:space:]]*<\/div>[[:space:]]*$/
```
to deal with arbitrary sequences of white space before, in the middle, and after the <div> components. Repeat ad nauseam for the other regexes. You could arrange to have single spaces between words. You could arrange for a multi-line description to be printed just once as a single line, rather than each line segment being printed separately as it would be now.
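
A possible sketch of those last two ideas, a multi-line description printed once with single spaces between words, is shown here. It is a variant and not the original script: instead of printing each description line as it arrives, it collects the lines in the hold space after the title and prints when the closing `</div>` is reached. The sample input, the temporary directory, the script name `join.sed` and the `result` variable are all assumptions for illustration.

```shell
# Hypothetical sample input with a two-line description for the first item.
tmp=$(mktemp -d)
cat > "$tmp/items.html" <<'EOF'
<div class="result">
  <div class="item">
    <div class="item_title">ITEM 1</div>
    <div class="item_desc">
      ITEM   DESCRIPTION 1,
      continued on a second line
    </div>
  </div>
  <div class="item">
    <div class="item_title">ITEM 2</div>
    <div class="item_desc">
      ITEM DESCRIPTION 2
    </div>
  </div>
</div>
EOF

cat > "$tmp/join.sed" <<'EOF'
# Title line: keep the text between the tags and store it in the hold space.
/^[[:space:]]*<div class="item_title">[[:space:]]*\(.*\)[[:space:]]*<\/div>[[:space:]]*$/ {
    s//\1/
    h
}
/^[[:space:]]*<div class="item_desc">[[:space:]]*$/,/^[[:space:]]*<\/div>[[:space:]]*$/ {
    # Skip the opening tag line.
    /<div class="item_desc">/d
    # Description line: append it to the hold space and wait for more.
    /<\/div>/!{ H; d; }
    # Closing tag: fetch "title\nline\nline..." and flatten it.
    g
    s/\n[[:space:]]*/;/
    s/[[:space:]]*\n[[:space:]]*/ /g
    s/[[:space:]][[:space:]]*/ /g
    s/ $//
    p
}
EOF

result=$(sed -n -f "$tmp/join.sed" "$tmp/items.html")
printf '%s\n' "$result"
rm -rf "$tmp"
```

The first substitution after `g` turns the newline that follows the title into the `;` separator; the next two join the remaining lines and squeeze runs of white space.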

You could also wrap the whole construct in the file inside:
```
/^<div class="result">$/,/^<\/div>$/ {
    …script as before…
}
```
And you could repeat that idea so that the item title is only picked inside <div class="item"> and </div>, etc.
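
To make the wrapping concrete, here is a sketch with the earlier commands nested inside the outer range. It assumes that `<div class="result">` and its closing `</div>` sit unindented on lines of their own while everything inside is indented; that layout, the sample input, the script name `wrapped.sed` and the `result` variable are assumptions, not something the question confirms.

```shell
# Hypothetical input: one title/description pair sits outside the result block.
tmp=$(mktemp -d)
cat > "$tmp/items.html" <<'EOF'
<div class="item_title">SIDEBAR</div>
<div class="item_desc">
  NOT A RESULT
</div>
<div class="result">
  <div class="item">
    <div class="item_title">ITEM 1</div>
    <div class="item_desc">
      ITEM DESCRIPTION 1
    </div>
  </div>
  <div class="item">
    <div class="item_title">ITEM 2</div>
    <div class="item_desc">
      ITEM DESCRIPTION 2
    </div>
  </div>
</div>
EOF

cat > "$tmp/wrapped.sed" <<'EOF'
# Only lines from <div class="result"> to the unindented </div> are examined.
/^<div class="result">$/,/^<\/div>$/ {
    / *<div class="item_title">\(.*\)<\/div>/ { s//\1/; h; }
    / *<div class="item_desc">/,/<\/div>/ {
        /<div class="item_desc">/d
        /<\/div>/d
        s/^  *//
        G
        s/\(.*\)\n\(.*\)/\2;\1/p
    }
}
EOF

result=$(sed -n -f "$tmp/wrapped.sed" "$tmp/items.html")
printf '%s\n' "$result"
rm -rf "$tmp"
```

With this input the unwrapped script would also print `SIDEBAR;NOT A RESULT`; the outer range is what filters it out.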