Scrape ASIN from Amazon URL using JavaScript

旧巷少年郎 2021-01-30 11:42

Assuming I have an Amazon product URL like so

http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2?pf_rd_m=ATVP         


        
16 answers
  • 2021-01-30 12:14

    @Gumbo: Your code works great!

    // JS test: run it in Firebug.

    var url = window.location.href;
    var m = url.match("/([a-zA-Z0-9]{10})(?:[/?]|$)"); // m[1] is the ASIN when m is not null
    

    I added a PHP function that does the same thing.

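    function amazon_get_asin_code($url) {
        global $debug;

        $result = "";

        // Same pattern as the JavaScript version, wrapped in "~" delimiters,
        // which preg_match() requires and which do not clash with the "/" inside.
        $pattern = '~/([a-zA-Z0-9]{10})(?:[/?]|$)~';

        preg_match($pattern, $url, $matches);

        if ($debug) {
            var_dump($matches);
        }

        // $matches[1] holds the captured 10-character ASIN.
        if ($matches && isset($matches[1])) {
            $result = $matches[1];
        }

        return $result;
    }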
    
  • 2021-01-30 12:15

    This may be a simplistic approach, but I have yet to find an error in it using any of the URLs in this thread that people say are an issue.

    Simply put, I take the URL and split it on "/" to get the discrete parts, then loop through the resulting array and test each part against the regex. In my case, the variable i is an object with a property called RawURL, which contains the raw URL I am working with, and a property called VendorSKU, which I am populating.

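    try
    {
        // Split the URL into path segments.
        string[] urlParts = i.RawURL.Split('/');
        // Matches a segment that starts with 10 uppercase letters or digits.
        Regex regex = new Regex(@"^[A-Z0-9]{10}");

        foreach (string part in urlParts)
        {
            Match m = regex.Match(part);
            if (m.Success)
            {
                // If several segments match, the last one wins.
                i.VendorSKU = m.Value;
            }
        }
    }
    catch (Exception) { } // errors are deliberately ignored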
    

    So far, this has worked perfectly.
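For comparison with the JavaScript answers in this thread, the same split-and-test idea might look like the sketch below. The function name findAsinBySegments is made up for illustration. Unlike the C# above, this version strips the query string first and requires the whole segment to be exactly 10 characters, which is an assumption about what counts as a valid match.

```javascript
// Hypothetical helper: split the URL on "/" and test each segment.
function findAsinBySegments(rawUrl) {
  // Drop the query string and fragment so the last segment is clean.
  const path = rawUrl.split(/[?#]/)[0];
  // Assumption: an ASIN segment is exactly 10 uppercase letters or digits.
  const asinPattern = /^[A-Z0-9]{10}$/;
  let found = null;
  for (const part of path.split("/")) {
    if (asinPattern.test(part)) {
      found = part; // as in the C# loop, the last matching segment wins
    }
  }
  return found;
}

console.log(findAsinBySegments(
  "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2?pf_rd_m=ATVP"
)); // B0015T963C
```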

  • 2021-01-30 12:15

    You can get the ASIN by fetching the page content and then reading the value of the element with id="ASIN". It should work in all cases, and you do not need to rely on a regex.
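A minimal sketch of that lookup, assuming the element is a form field that exposes the code through its value property (taken from the answer, not checked against a live page). The helper name readAsinFromPage is hypothetical, and doc stands for any object with a getElementById method, such as the browser's document.

```javascript
// Hypothetical helper: read the ASIN from an element with id="ASIN".
function readAsinFromPage(doc) {
  const element = doc.getElementById("ASIN");
  // Assumption: the element carries the code in its `value` property.
  if (!element || !element.value) {
    return null; // no such element on this page, or it is empty
  }
  return element.value;
}

// In a browser console on a product page one could try:
//   readAsinFromPage(document)
```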

  • 2021-01-30 12:18

    You can scrape ASIN codes from the data-asin attribute in the search results using XPath.

    For example, $x('//@data-asin').map(function(v,i){return v.nodeValue}) can be run in Chrome's console.
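Since $x is a console-only utility, a script running in the page would need a standard DOM call instead. The sketch below uses querySelectorAll with an attribute selector to the same effect. The helper name collectDataAsins is hypothetical, and dropping empty or repeated values is an added assumption, not something the one-liner above does.

```javascript
// Hypothetical helper: collect every non-empty data-asin value under `root`.
function collectDataAsins(root) {
  const seen = new Set();
  for (const node of root.querySelectorAll("[data-asin]")) {
    const asin = node.getAttribute("data-asin");
    if (asin) {
      seen.add(asin); // skip empty attributes, keep each code once
    }
  }
  return [...seen];
}

// In a browser on a search results page one could try:
//   collectDataAsins(document)
```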

  • 2021-01-30 12:23

    Since the ASIN is always a sequence of 10 letters and/or numbers immediately after a slash, try this:

    url.match("/([a-zA-Z0-9]{10})(?:[/?]|$)")
    

    The additional (?:[/?]|$) after the ASIN is to ensure that only a full path segment is taken.
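Wrapped in a small function, the pattern above could be used as in the sketch below. The name extractAsin is made up for illustration, and the regex is written as a literal instead of a string, which only changes how the slashes are escaped.

```javascript
// Hypothetical helper built around the pattern from this answer.
function extractAsin(url) {
  // A slash, exactly 10 alphanumerics, then "/", "?" or the end of the string.
  const match = url.match(/\/([a-zA-Z0-9]{10})(?:[\/?]|$)/);
  return match ? match[1] : null;
}

console.log(extractAsin(
  "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2"
)); // B0015T963C
console.log(extractAsin("http://www.amazon.com/dp/B0015T963C")); // B0015T963C
console.log(extractAsin("http://www.amazon.com/gp/help")); // null
```

Note that any other path segment of exactly 10 alphanumeric characters would also match, so the result is only as reliable as the URL shape.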

  • 2021-01-30 12:26

    With a small change to the regex from the first answer, it works on all the URLs I have tested.

    var url = "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C";
    var m = url.match("/([a-zA-Z0-9]{10})(?:[/?]|$)");
    console.log(m);
    if (m) {
        console.log("ASIN=" + m[1]);
    }
