Using AWK/Grep/Bash to extract data from HTML

误落风尘 2021-01-22 16:02

I'm trying to write a Bash script to extract results from an HTML page. I managed to get the content of the page with curl, but the next step is parsing the output, which is pr

2 answers
  • 2021-01-22 16:10

    Just use awk:

    $ awk -F '<[^>]+>' '
        found { sub(/^[[:space:]]*/,";"); print title $0; found=0 }
        /<div class="item_title">/ { title=$2 }
        /<div class="item_desc">/  { found=1 }
    ' file
    ITEM 1;ITEM DESCRIPTION 1
    ITEM 2;ITEM DESCRIPTION 2
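
    The question's HTML is cut off, so the input in this sketch is an assumption: a hypothetical sample file whose layout is inferred from the class names the answers match on. Under that assumption, it shows how -F '<[^>]+>' turns every tag into a field separator, so that $2 on a title line is the title text.

```shell
#!/bin/sh
# Hypothetical sample input; the real page layout is not shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div class="result">
    <div class="item">
        <div class="item_title">ITEM 1</div>
        <div class="item_desc">
            ITEM DESCRIPTION 1
        </div>
    </div>
    <div class="item">
        <div class="item_title">ITEM 2</div>
        <div class="item_desc">
            ITEM DESCRIPTION 2
        </div>
    </div>
</div>
EOF

# Same awk program as in the answer: remember the title, then print it
# together with the line that follows the opening item_desc tag.
result=$(awk -F '<[^>]+>' '
    found { sub(/^[[:space:]]*/, ";"); print title $0; found = 0 }
    /<div class="item_title">/ { title = $2 }
    /<div class="item_desc">/  { found = 1 }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

    Because found is reset after one line, only the first line of each description is printed; a multi-line description would lose the rest.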
    
  • 2021-01-22 16:16

    A bare-minimum program that handles the HTML loosely, with no validation, and that is easily confused by variations in the HTML, is:

    sed.script

    / *<div class="item_title">\(.*\)<\/div>/ { s//\1/; h; }
    / *<div class="item_desc">/,/<\/div>/ {
        /<div class="item_desc">/d
        /<\/div>/d
        s/^  *//
        G
        s/\(.*\)\n\(.*\)/\2;\1/p
    }
    

    The first line matches item title lines. The s/// command captures just the part between the <div …> and </div>; the h copies that into the hold space (memory).

    The rest of the script matches lines between the item description <div> and its </div>. The first two lines delete (ignore) the <div> and </div> lines. The s/// removes leading spaces, and the G appends the hold space to the pattern space after a newline. The s///p then captures the part before the newline (the description) and the part after it (the title from the hold space), swaps them into title and description separated by a semicolon, and prints the result.
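
    The hold-space steps can be tried in isolation. This is a minimal sketch, not part of the script above: two made-up lines stand in for a title and a description, and the line numbers 1 and 2 replace the <div> patterns.

```shell
#!/bin/sh
# Made-up two-line input: line 1 plays the title, line 2 the description.
result=$(printf 'TITLE\n    DESCRIPTION\n' | sed -n '
    1 { h; }
    2 {
        s/^  *//
        G
        s/\(.*\)\n\(.*\)/\2;\1/p
    }
')

printf '%s\n' "$result"
```

    After G the pattern space holds the description, a newline, and the saved title, which is why the final substitution swaps \2 and \1.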

    Example

    $ sed -n -f sed.script items.html
    ITEM 1;ITEM DESCRIPTION 1
    ITEM 2;ITEM DESCRIPTION 2
    $
    

    Note the -n; that means "don't print unless told to do so".

    You can do it without a script file, but there's less to worry about if you use one. You can probably even squeeze it all onto one line if you're careful. Beware that the ; after the h is necessary with BSD sed and harmless but not crucial with GNU sed.
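
    One way to drop the script file is to pass each top-level block as its own -e argument. This is a sketch under assumptions: the sample input is hypothetical (the question's HTML is cut off), and every closing brace is preceded by a semicolon to follow the BSD sed caveat.

```shell
#!/bin/sh
# Hypothetical sample input; the layout is inferred from the class names.
html=$(mktemp)
cat > "$html" <<'EOF'
<div class="result">
    <div class="item">
        <div class="item_title">ITEM 1</div>
        <div class="item_desc">
            ITEM DESCRIPTION 1
        </div>
    </div>
    <div class="item">
        <div class="item_title">ITEM 2</div>
        <div class="item_desc">
            ITEM DESCRIPTION 2
        </div>
    </div>
</div>
EOF

# First -e saves the title in the hold space; second -e joins it
# with each description line and prints the pair.
result=$(sed -n \
    -e '/ *<div class="item_title">\(.*\)<\/div>/ { s//\1/; h; }' \
    -e '/ *<div class="item_desc">/,/<\/div>/ { /<div class="item_desc">/d; /<\/div>/d; s/^  *//; G; s/\(.*\)\n\(.*\)/\2;\1/p; }' \
    "$html")

printf '%s\n' "$result"
rm -f "$html"
```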

    Modification

    There are all sorts of ways to make it more nearly bullet-proof (but it is debatable whether they're worthwhile). For example:

    / *<div class="item_title">\(.*\)<\/div>/
    

    could be revised to:

    /^[[:space:]]*<div class="item_title">[[:space:]]*\(.*\)[[:space:]]*<\/div>[[:space:]]*$/
    

    to deal with arbitrary sequences of white space before, between, and after the <div> components. Repeat ad nauseam for the other regexes. You could arrange to have single spaces between words. You could arrange for a multi-line description to be printed just once as a single line, rather than each line segment being printed separately as it would be now.
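
    As a rough illustration of the white-space handling, the sketch below applies the revised title regex to a single made-up line that mixes a tab and runs of spaces. The greedy \(.*\) may keep blanks that sit just before </div>, so the sketch collapses runs of white space and trims the tail in separate steps.

```shell
#!/bin/sh
# Made-up title line: leading tab, extra spaces inside and after the tags.
result=$(printf '\t  <div class="item_title">   ITEM    1  </div>  \n' |
    sed -n '/^[[:space:]]*<div class="item_title">[[:space:]]*\(.*\)[[:space:]]*<\/div>[[:space:]]*$/ {
        s//\1/
        s/[[:space:]][[:space:]]*/ /g
        s/ $//
        p
    }')

printf '%s\n' "$result"
```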

    You could also wrap the whole construct in the file inside:

    /^<div class="result">$/,/^<\/div>$/ {
        …script as before…
    }
    

    And you could repeat that idea so that the item title is only picked up inside <div class="item"> and </div>, etc.
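
    A sketch of the wrapped script, with assumptions stated: the input is hypothetical, the outer <div class="result"> and its </div> start in column one, and the nested tags are indented so that /^<\/div>$/ only matches the outer close. A second block outside the result <div> is there to show that it is skipped.

```shell
#!/bin/sh
work=$(mktemp -d)

# Hypothetical input: one item inside the result block, one stray block outside.
cat > "$work/items.html" <<'EOF'
<div class="result">
    <div class="item">
        <div class="item_title">ITEM 1</div>
        <div class="item_desc">
            ITEM DESCRIPTION 1
        </div>
    </div>
</div>
<div class="other">
    <div class="item_title">IGNORED</div>
    <div class="item_desc">
        IGNORED DESCRIPTION
    </div>
</div>
EOF

# The original script, nested inside the outer range.
cat > "$work/sed.script" <<'EOF'
/^<div class="result">$/,/^<\/div>$/ {
    / *<div class="item_title">\(.*\)<\/div>/ { s//\1/; h; }
    / *<div class="item_desc">/,/<\/div>/ {
        /<div class="item_desc">/d
        /<\/div>/d
        s/^  *//
        G
        s/\(.*\)\n\(.*\)/\2;\1/p
    }
}
EOF

result=$(sed -n -f "$work/sed.script" "$work/items.html")
printf '%s\n' "$result"
rm -rf "$work"
```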
