Windows Batch / parse data from an HTML web page

佛祖请我去吃肉 2021-01-07 06:28

Is it possible to parse data from an HTML web page using Windows batch?

Let's say I have a web page: www.domain.com/data/page/1. Page source HTML:

...
2 Answers
  • 2021-01-07 06:56

    It's better to parse structured markup as a hierarchical object, rather than scraping as flat text. That way you aren't so dependent upon the formatting of the data you're parsing (whether it's minified, spacing has changed, whatever).

    The batch language isn't well suited to parsing markup languages like HTML, XML, JSON, etc. In such cases, it can be extremely helpful to use a hybrid script and borrow JScript or PowerShell methods to scrape the data you need. Here's an example demonstrating a batch + JScript hybrid script. Save it with a .bat extension and give it a run.

    @if (@CodeSection == @Batch) @then
    @echo off & setlocal
    
    set "url=http://www.domain.com/data/page/1"
    
    for /f "delims=" %%I in ('cscript /nologo /e:JScript "%~f0" "%url%"') do (
        rem // do something useful with %%I
        echo Link found: %%I
    )
    
    goto :EOF
    @end // end batch / begin JScript hybrid code
    
    // returns a DOM root object
    function fetch(url) {
        var XHR = WSH.CreateObject("Microsoft.XMLHTTP"),
            DOM = WSH.CreateObject('htmlfile');
    
        XHR.open("GET",url,true);
        XHR.setRequestHeader('User-Agent','XMLHTTP/1.0');
        XHR.send('');
        while (XHR.readyState!=4) {WSH.Sleep(25)};
        DOM.write('<meta http-equiv="x-ua-compatible" content="IE=9" />');
        DOM.write(XHR.responseText);
        return DOM;
    }
    
    var DOM = fetch(WSH.Arguments(0)),
        links = DOM.getElementsByTagName('a');
    
    // Use an indexed loop so that only the anchor elements themselves are visited
    for (var i = 0; i < links.length; i++)
        if (links[i].href && /\/post\/view\//i.test(links[i].href))
            WSH.Echo(links[i].href);
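
    The link-filtering step at the end of the script can also be tried on its own, without a network request. Below is a minimal sketch in plain JavaScript, limited to ES3 syntax on the assumption that this keeps it compatible with the JScript engine. `filterPostLinks` and the sample URLs are hypothetical, made up for illustration; only the `/post/view/` pattern comes from the script above.

    ```javascript
    // Hypothetical helper: keep only the hrefs that point at /post/view/ pages.
    // An indexed loop works on arrays and on array-like collections alike.
    function filterPostLinks(hrefs) {
        var matches = [];
        for (var i = 0; i < hrefs.length; i++) {
            var href = hrefs[i];
            // Skip empty hrefs, then apply the same case-insensitive test as the script
            if (href && /\/post\/view\//i.test(href)) {
                matches.push(href);
            }
        }
        return matches;
    }

    // Made-up sample data standing in for the hrefs of the page's <a> elements
    var sample = [
        'http://www.domain.com/post/view/664654',
        'http://www.domain.com/about',
        '',
        'http://www.domain.com/POST/VIEW/12'
    ];
    var found = filterPostLinks(sample);
    ```

    Here `found` keeps the first and last entries. The empty string and the `/about` link are dropped.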
    
  • 2021-01-07 07:00

    If you just need to get /post/view/664654, you can use the grep command, e.g.:

    grep -o '/post/view/[^"]\+' *.html
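
    grep does not ship with Windows, so on a machine without it the same pattern can be applied as a regular expression in script code. A minimal sketch in plain JavaScript (ES3 syntax only, on the assumption that this keeps it usable from a JScript hybrid script) follows. `extractPostPaths` and the sample markup are hypothetical. Like the grep pattern, it stops at the first double quote, so it assumes double-quoted href attributes.

    ```javascript
    // Hypothetical helper: return every /post/view/... path found in an HTML string.
    function extractPostPaths(html) {
        // Same idea as the grep pattern: the prefix, then one or more non-quote characters
        var re = /\/post\/view\/[^"]+/g;
        // With the g flag, match() returns all matches, or null when there are none
        return html.match(re) || [];
    }

    // Made-up sample markup for illustration
    var sampleHtml = '<a href="/post/view/664654">first</a> <a href="/about">about</a>';
    var paths = extractPostPaths(sampleHtml);
    ```

    This is still flat-text scraping, so it breaks if the markup switches to single quotes or unquoted attributes.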
    

    For parsing more complex HTML, you can use HTML-XML-utils or pup.
