Scrape ASIN from Amazon URL using JavaScript

旧巷少年郎 2021-01-30 11:42

Assuming I have an Amazon product URL like so:

http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2?pf_rd_m=ATVP


        
16 answers
  • 2021-01-30 12:27

    None of the above works in all cases. I tested the earlier examples against the following URLs:

    http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C
    http://www.amazon.com/dp/B0015T963C
    http://www.amazon.com/gp/product/B0015T963C
    http://www.amazon.com/gp/product/glance/B0015T963C
    
    https://www.amazon.de/gp/product/B00LGAQ7NW/ref=s9u_simh_gw_i1?ie=UTF8&pd_rd_i=B00LGAQ7NW&pd_rd_r=5GP2JGPPBAXXP8935Q61&pd_rd_w=gzhaa&pd_rd_wg=HBg7f&pf_rd_m=A3JWKAKR8XB7XF&pf_rd_s=&pf_rd_r=GA7GB6X6K6WMJC6WQ9RB&pf_rd_t=36701&pf_rd_p=c210947d-c955-4398-98aa-d1dc27e614f1&pf_rd_i=desktop
    
    https://www.amazon.de/Sawyer-Wasserfilter-Wasseraufbereitung-Outdoor-Filter/dp/B00FA2RLX2/ref=pd_sim_200_3?_encoding=UTF8&psc=1&refRID=NMR7SMXJAKC4B3MH0HTN
    
    https://www.amazon.de/Notverpflegung-Kg-Marine-wasserdicht-verpackt/dp/B01DFJTYSQ/ref=pd_sim_200_5?_encoding=UTF8&psc=1&refRID=7QM8MPC16XYBAZMJNMA4
    
    https://www.amazon.de/dp/B01N32MQOA?psc=1
    

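    This is the best I could come up with: (?:[/dp/]|$)([A-Z0-9]{10}). The full match also includes the preceding character (the / in all of the cases above), which can be removed afterwards; capture group 1 holds the ASIN alone. Note that [/dp/] is a character class matching a single /, d, or p, not the literal string /dp/.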

    You can test it on: http://regexr.com/3gk2s
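A minimal sketch of how the pattern above could be used from JavaScript, reading capture group 1 so the leading character never has to be stripped. The names ASIN_PATTERN and extractAsin are illustrative and not part of the answer; the sample URLs come from the list above.

```javascript
// Pattern from the answer above. [\/dp\/] is a character class (one of "/", "d", "p").
var ASIN_PATTERN = /(?:[\/dp\/]|$)([A-Z0-9]{10})/;

// Hypothetical helper: returns the ASIN from capture group 1, or null if nothing matches.
function extractAsin(url) {
  var match = ASIN_PATTERN.exec(url);
  return match ? match[1] : null;
}

var sampleUrls = [
  "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C",
  "http://www.amazon.com/gp/product/glance/B0015T963C",
  "https://www.amazon.de/dp/B01N32MQOA?psc=1"
];

// Print each extracted ASIN next to the URL it came from.
sampleUrls.forEach(function (url) {
  console.log(extractAsin(url) + " <- " + url);
});
```

Because the pattern does not check what follows the ten characters, a longer run of capitals and digits after a /, d, or p would also match, so the result is probably best treated as a candidate rather than a verified ASIN.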

  • 2021-01-30 12:28

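    Something like this should work (not tested):

    // Captures whatever sits between "/dp/" and "/ref=amb_link" in the question's URL.
    var amazon_url = "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2?pf_rd_m=ATVP";
    var match = /\/dp\/(.*?)\/ref=amb_link/.exec(amazon_url);
    var asin = match ? match[1] : '';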
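The pattern above only matches URLs that contain /ref=amb_link after the ASIN. As a hedged variation, the sketch below stops at the next /, ? or # instead, so it should also cope with bare /dp/ URLs. The helper name asinAfterDp is an assumption made for illustration, and the sketch still ignores the /gp/product forms.

```javascript
// Hypothetical helper: capture the path segment that follows "/dp/".
function asinAfterDp(amazonUrl) {
  // [^\/?#]+ runs up to the next "/", "?" or "#", or to the end of the string.
  var match = /\/dp\/([^\/?#]+)/.exec(amazonUrl);
  return match ? match[1] : '';
}

console.log(asinAfterDp("http://www.amazon.com/dp/B0015T963C"));
console.log(asinAfterDp("https://www.amazon.de/dp/B01N32MQOA?psc=1"));
```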
    
  • 2021-01-30 12:31

    Amazon's detail pages can have several forms, so to be thorough you should check for them all. These are all equivalent:

    http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C
    http://www.amazon.com/dp/B0015T963C
    http://www.amazon.com/gp/product/B0015T963C
    http://www.amazon.com/gp/product/glance/B0015T963C

    They always look like either this or this:

    http://www.amazon.com/<SEO STRING>/dp/<VIEW>/ASIN
    http://www.amazon.com/gp/product/<VIEW>/ASIN
    

    This should do it:

    var url = "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C";
    // Groups: 1 = optional SEO string, 2 = dp or gp/product, 3 = optional view, 4 = ASIN.
    var regex = new RegExp("http://www\\.amazon\\.com/([\\w-]+/)?(dp|gp/product)/(\\w+/)?(\\w{10})");
    var m = url.match(regex);
    if (m) {
        alert("ASIN=" + m[4]);
    }
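The regex above is tied to http://www.amazon.com, so https links and other storefronts such as amazon.de would not match. One possible way around that, sketched here under assumptions, is to parse the URL with the standard URL constructor (available in browsers and Node.js) and match only the path. The helper name asinFromPath is hypothetical, the sketch does not check that the host is actually an Amazon domain, and it assumes the ASIN segment is followed by a / or the end of the path.

```javascript
// Hypothetical helper: parse the URL first, then match only its path.
function asinFromPath(rawUrl) {
  var pathname;
  try {
    pathname = new URL(rawUrl).pathname; // drops scheme, host, and query string
  } catch (e) {
    return null; // not a parseable absolute URL
  }
  // /dp/ or /gp/product/, an optional view segment, then a 10-character ASIN.
  var m = pathname.match(/\/(?:dp|gp\/product)\/(?:\w+\/)?(\w{10})(?:\/|$)/);
  return m ? m[1] : null;
}

console.log(asinFromPath("https://www.amazon.de/dp/B01N32MQOA?psc=1"));
console.log(asinFromPath("http://www.amazon.com/gp/product/glance/B0015T963C"));
```

Matching the path alone also keeps query parameters, which sometimes repeat the ASIN, out of the picture.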
    
  • 2021-01-30 12:33

    This is my universal Amazon ASIN regexp (PCRE syntax, with ~ as the delimiter and the i flag for case-insensitive matching):

    ~(?:\b)((?=[0-9a-z]*\d)[0-9a-z]{10})(?:\b)~i
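Since the question asks about JavaScript, here is a rough translation of the pattern above into a JavaScript regex literal, where the ~...~i delimiters become /.../i. The names UNIVERSAL_ASIN and findAsinToken are illustrative only and do not come from the answer.

```javascript
// JavaScript form of the pattern above: ~...~i becomes /.../i.
// The lookahead requires at least one digit among the ten characters.
var UNIVERSAL_ASIN = /\b((?=[0-9a-z]*\d)[0-9a-z]{10})\b/i;

// Hypothetical helper: returns the first matching token, or null.
function findAsinToken(text) {
  var m = UNIVERSAL_ASIN.exec(text);
  return m ? m[1] : null;
}

console.log(findAsinToken("http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C"));
```

The pattern is not anchored to the URL path, so any ten-character alphanumeric token containing a digit will match, including one that happens to appear in a query string. A ten-letter word without digits, such as "Generation" in the URL above, is skipped by the lookahead.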
    
  • 2021-01-30 12:33

    Inspired by many of the answers here, I found that the following regex works really well for extracting the ASIN from anywhere in the URL:

    (?:[/])([A-Z0-9]{10})(?:[/?&\s]|$)

    let url="https://www.amazon.com/Why-We-Sleep-Science-Dreams-ebook/dp/B06Y649387/ref=pd_sim_351_4/131-0417603-5732106?_encoding=UTF8&pd_rd_i=B06Y649387&pd_rd_r=5ebbfdd5-a2f6-4ee3-ad13-5036b5e20827&pd_rd_w=LBo2H&pd_rd_wg=OBomS&pf_rd_p=3c412f72-0ba4-4e48-ac1a-8867997981bd&pf_rd_r=TN0WDV3AC7ED4Y7EKNVP&psc=1&refRID=TN0WDV3AC7ED4Y7EKNVP"
    // Use a regex literal: in a string pattern, "\s" collapses to a plain "s".
    url.match(/(?:\/)([A-Z0-9]{10})(?:[\/?&\s]|$)/)
    
    >> Array [ "/B06Y649387/", "B06Y649387" ]
    

    You can try it out here: https://regexr.com/56jm7

    Edit: added end-of-string as one of the stopping checks. This is needed when the regex is used in Python. The $ sits outside the character class because inside [...] it would match a literal dollar sign, just as | would match a literal pipe.

  • 2021-01-30 12:33

    Try using this regex:

    (?:[/dp/]|$)([A-Z0-9]{10})
    

    Check out the demo: https://regexr.com/3gk2s
