What's the best method to EXTRACT product names given a list of SKU numbers from a website?

Question

I have a problem.

I have a list of SKU numbers (hundreds) that I'm trying to match with the title of the product each one belongs to. I have thought of a few ways to accomplish this, but I feel like I'm missing something... I'm hoping someone here has a quick and efficient idea to help me get this done.

The products come from Aidan Gray.

Attempt #1 (Batch Program Method) - FAIL:

After searching for a SKU in Aidan Gray, the website returns a URL that looks like below:

http://www.aidangrayhome.com/catalogsearch/result/?q=SKUNUMBER

... with "SKUNUMBER" obviously being a SKU.

The first result of the webpage is almost always the product.

To click the first result, the following can be entered in the address bar (if JavaScript is enabled there):

javascript:{document.getElementsByClassName("product-image")[0].click();}

I wanted to create a .bat file through Command Prompt and execute the following command:

firefox http://www.aidangrayhome.com/catalogsearch/result/?q=SKUNUMBER javascript:{document.getElementsByClassName("product-image")[0].click();}

... but Firefox doesn't seem to allow these two commands to execute in the same tab.

If that worked, I was going to go to http://tools.buzzstream.com/meta-tag-extractor, paste the resulting links to get the titles of the pages, export the data to CSV format, and copy over the data I wanted.

Unfortunately, I am unable to open both the webpage and the Javascript in the same tab through a batch program.

Attempt #2 (I'm Feeling Lucky Method):

I was going to use Google's &btnI URL suffix to automatically redirect to the first result.

http://www.google.com/search?btnI&q=site:aidangrayhome.com+SKUNUMBER

After opening all the links in tabs, I was going to use a Firefox add-on called "Send Tab URLs" to copy the names of the tabs (which contain the product names) to the clipboard.

The problem is that most of the results were simply not lucky enough...

If anybody has an idea or tip to get this accomplished, I'd be very grateful.

Answer 1:

I recommend using JScript for this. It's easy to include as hybrid code in a batch script, its structure and syntax are familiar to anyone comfortable with JavaScript, and you can use it to fetch web pages via XMLHttpRequest (a.k.a. Ajax by the less-informed) and build a DOM object from the .responseText using an htmlfile COM object.

Anyway, challenge: accepted. Save this with a .bat extension. It'll look for a text file containing SKUs, one per line, and fetch and scrape the search page for each, writing info from the first anchor element with a .className of "product-image" to a CSV file.

@if (@CodeSection == @Batch) @then

@echo off
setlocal

set "skufile=sku.txt"
set "outfile=output.csv"
set "URL=http://www.aidangrayhome.com/catalogsearch/result/?q="

rem // invoke JScript portion
cscript /nologo /e:jscript "%~f0" "%skufile%" "%outfile%" "%URL%"

echo Done.

rem // end main runtime
goto :EOF

@end // end batch / begin JScript chimera

var fso = WSH.CreateObject('scripting.filesystemobject'),
    skufile = fso.OpenTextFile(WSH.Arguments(0), 1),
    skus = skufile.ReadAll().split(/\r?\n/),
    outfile = fso.CreateTextFile(WSH.Arguments(1), true),
    URL = WSH.Arguments(2);

skufile.Close();

String.prototype.trim = function() { return this.replace(/^\s+|\s+$/g, ''); }

// returns a DOM root object
function fetch(url) {
    var XHR = WSH.CreateObject("Microsoft.XMLHTTP"),
        DOM = WSH.CreateObject('htmlfile');

    WSH.StdErr.Write('fetching ' + url);

    XHR.open("GET",url,true);
    XHR.setRequestHeader('User-Agent','XMLHTTP/1.0');
    XHR.send('');
    while (XHR.readyState!=4) {WSH.Sleep(25)};
    DOM.write(XHR.responseText);
    return DOM;
}

function out(what) {
    WSH.StdErr.Write(new Array(79).join(String.fromCharCode(8)));
    WSH.Echo(what);
    outfile.WriteLine(what);
}

WSH.Echo('Writing to ' + WSH.Arguments(1) + '...')
out('sku,product,URL');

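for (var i=0; i<skus.length; i++) {
    // trim stray whitespace and skip blank lines
    var sku = skus[i].trim();
    if (!sku) continue;

    // URL-encode the SKU in case it contains spaces or reserved characters
    var DOM = fetch(URL + encodeURIComponent(sku)),
        anchors = DOM.getElementsByTagName('a');

    // the first anchor with the "product-image" class is the top search result
    for (var j=0; j<anchors.length; j++) {
        if (/\bproduct-image\b/i.test(anchors[j].className)) {
            out(sku+',"' + anchors[j].title.trim() + '","' + anchors[j].href + '"');
            break;
        }
    }
}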

outfile.Close();
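
One caveat with the row written by out() above: the product title is wrapped in double quotes but not escaped, so a title that itself contains a double quote (an inch mark, say) would break that CSV row. Here is a minimal sketch of a quoting helper, assuming the usual convention of doubling embedded quotes. The names csvField and csvRow are my own and not part of the script; the code sticks to old-style syntax so it should run under JScript as well as any modern JavaScript engine.

```javascript
// Quote one CSV field: wrap it in double quotes and double any embedded quotes.
function csvField(value) {
    return '"' + String(value).replace(/"/g, '""') + '"';
}

// Join an array of values into a single CSV row, quoting every field.
function csvRow(fields) {
    var cells = [];
    for (var i = 0; i < fields.length; i++) {
        cells.push(csvField(fields[i]));
    }
    return cells.join(',');
}
```

If you want it, the call inside the loop could become out(csvRow([skus[i], anchors[j].title.trim(), anchors[j].href])). Note that this quotes the SKU column as well, which the script above does not.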

Too bad the htmlfile COM object doesn't support getElementsByClassName. :/ But this seems to work well enough in my testing.
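
If you'd rather not repeat the tag loop and regex test every time, a small stand-in for getElementsByClassName is easy enough to write. This is only a sketch: getByClassName is a name I made up, it relies on nothing but getElementsByTagName and .className, and I'm assuming (without having tested it against htmlfile) that passing '*' as the tag returns every element. Passing 'a' explicitly, as the script does, avoids that assumption.

```javascript
// Fallback for hosts that lack getElementsByClassName.
// root: any object exposing getElementsByTagName
// name: a single class name to look for
// tag:  optional tag name to narrow the search (defaults to '*')
function getByClassName(root, name, tag) {
    var nodes = root.getElementsByTagName(tag || '*'),
        matches = [];
    for (var i = 0; i < nodes.length; i++) {
        // compare whole class tokens, so "product-image" does not
        // also match something like "product-image-large"
        var classes = String(nodes[i].className || '').split(/\s+/);
        for (var j = 0; j < classes.length; j++) {
            if (classes[j] === name) {
                matches.push(nodes[i]);
                break;
            }
        }
    }
    return matches;
}
```

With that in place, the first result would be getByClassName(DOM, 'product-image', 'a')[0], which is undefined when the search page has no matching anchor.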

Source: https://stackoverflow.com/questions/29267846/whats-the-best-method-to-extract-product-names-given-a-list-of-sku-numbers-from

Tags: javascript, batch-file, extract, text-extraction, skus