Question:
Suppose I have an Amazon product URL like this:
http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-1&pf_rd_r=0AY9N5GXRYHCADJP5P0V&pf_rd_t=101&pf_rd_p=500528151&pf_rd_i=507846
How can I extract just the ASIN using JavaScript? Thanks!
Answer 1:
Amazon's detail pages can have several forms, so to be thorough you should check for them all. These are all equivalent:
http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C
http://www.amazon.com/dp/B0015T963C
http://www.amazon.com/gp/product/B0015T963C
http://www.amazon.com/gp/product/glance/B0015T963C
They always look like one of these two forms:
http://www.amazon.com/<SEO STRING>/dp/<VIEW>/ASIN
http://www.amazon.com/gp/product/<VIEW>/ASIN
This should do it:
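var url = "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C";
// Optional SEO string, then dp or gp/product, an optional view segment, then the 10-character ASIN.
var regex = RegExp("http://www\\.amazon\\.com/([\\w-]+/)?(dp|gp/product)/(\\w+/)?(\\w{10})");
var m = url.match(regex);
if (m) {
  alert("ASIN=" + m[4]);
}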
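As a rough sketch of the same idea wrapped in a function: the name extractAsinFromDetailUrl, the https support, and the looser host pattern (to cover sites such as amazon.de) are my assumptions, not anything Amazon documents.

```javascript
// Sketch only: the host pattern and the function name are assumptions.
function extractAsinFromDetailUrl(url) {
  // Optional SEO string, then dp or gp/product, an optional view segment,
  // then exactly 10 word characters that end the path segment.
  const pattern = /^https?:\/\/www\.amazon\.[a-z.]+\/(?:[\w-]+\/)?(?:dp|gp\/product)\/(?:\w+\/)?(\w{10})(?=[/?#]|$)/;
  const m = url.match(pattern);
  return m ? m[1] : null;
}

console.log(extractAsinFromDetailUrl("http://www.amazon.com/gp/product/glance/B0015T963C"));
```

It returns null instead of throwing when the URL is not one of the recognised forms, so the caller can decide how to handle that.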
Answer 2:
Since the ASIN is always a sequence of 10 letters and/or numbers immediately after a slash, try this:
url.match("/([a-zA-Z0-9]{10})(?:[/?]|$)")
The additional (?:[/?]|$) after the ASIN is to ensure that only a full path segment is taken.
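To see what the trailing group buys you, here is a small comparison. The sample URL is made up for illustration; its first path segment is 11 characters long, so a pattern without the boundary check grabs the first 10 of them.

```javascript
// Hypothetical URL: "Electronics" is 11 characters long.
const sampleUrl = "http://www.amazon.com/Electronics/dp/B0015T963C";

const loose = /\/([a-zA-Z0-9]{10})/;             // no boundary check
const strict = /\/([a-zA-Z0-9]{10})(?:[/?]|$)/;  // must end the path segment

console.log(sampleUrl.match(loose)[1]);  // "Electronic"
console.log(sampleUrl.match(strict)[1]); // "B0015T963C"
```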
Answer 3:
Actually, the top answer doesn't work if it's something like amazon.com/BlackBerry... (since BlackBerry is also 10 characters).
One workaround (assuming the ASIN is always capitalized, as it always is when taken from Amazon) is (in Ruby):
url.match("/([A-Z0-9]{10})")
I've found it to work on thousands of URLs.
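A JavaScript version of the same workaround might look like the sketch below. The URL is a made-up example of the BlackBerry case, and I have kept the end-of-segment check from the earlier pattern, which the Ruby one-liner does not have.

```javascript
// Hypothetical URL whose first path segment is exactly 10 letters long.
const blackberryUrl = "http://www.amazon.com/BlackBerry/dp/B0015T963C";

const anyCase = /\/([a-zA-Z0-9]{10})(?:[/?]|$)/;
const upperOnly = /\/([A-Z0-9]{10})(?:[/?]|$)/;

console.log(blackberryUrl.match(anyCase)[1]);   // "BlackBerry"
console.log(blackberryUrl.match(upperOnly)[1]); // "B0015T963C"
```

This still relies on the assumption stated above, that the ASIN in the URL is always uppercase.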
Answer 4:
None of the above work in all cases. I tried the following URLs against the patterns above:
http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C
http://www.amazon.com/dp/B0015T963C
http://www.amazon.com/gp/product/B0015T963C
http://www.amazon.com/gp/product/glance/B0015T963C
https://www.amazon.de/gp/product/B00LGAQ7NW/ref=s9u_simh_gw_i1?ie=UTF8&pd_rd_i=B00LGAQ7NW&pd_rd_r=5GP2JGPPBAXXP8935Q61&pd_rd_w=gzhaa&pd_rd_wg=HBg7f&pf_rd_m=A3JWKAKR8XB7XF&pf_rd_s=&pf_rd_r=GA7GB6X6K6WMJC6WQ9RB&pf_rd_t=36701&pf_rd_p=c210947d-c955-4398-98aa-d1dc27e614f1&pf_rd_i=desktop
https://www.amazon.de/Sawyer-Wasserfilter-Wasseraufbereitung-Outdoor-Filter/dp/B00FA2RLX2/ref=pd_sim_200_3?_encoding=UTF8&psc=1&refRID=NMR7SMXJAKC4B3MH0HTN
https://www.amazon.de/Notverpflegung-Kg-Marine-wasserdicht-verpackt/dp/B01DFJTYSQ/ref=pd_sim_200_5?_encoding=UTF8&psc=1&refRID=7QM8MPC16XYBAZMJNMA4
https://www.amazon.de/dp/B01N32MQOA?psc=1
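This is the best I could come up with:
(?:[/dp/]|$)([A-Z0-9]{10})
The full match also includes the character before the ASIN (the / in all of the cases above), which can be removed afterwards. Note that [/dp/] is a character class, so it matches a single /, d, or p rather than the literal text /dp/. Capture group 1 holds the ASIN on its own.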
You can test it on: http://regexr.com/3gk2s
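If you would rather check a pattern locally than on a regex site, a small harness along these lines can help. The helper name checkPattern and the table of expected ASINs (read off the URLs by eye) are mine; the pattern is the one from this answer.

```javascript
// Pattern copied from this answer; capture group 1 is the ASIN.
const candidate = new RegExp("(?:[/dp/]|$)([A-Z0-9]{10})");

// URL -> ASIN expected from reading the URL by eye.
const cases = {
  "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C": "B0015T963C",
  "http://www.amazon.com/dp/B0015T963C": "B0015T963C",
  "http://www.amazon.com/gp/product/B0015T963C": "B0015T963C",
  "http://www.amazon.com/gp/product/glance/B0015T963C": "B0015T963C",
  "https://www.amazon.de/Sawyer-Wasserfilter-Wasseraufbereitung-Outdoor-Filter/dp/B00FA2RLX2/ref=pd_sim_200_3?_encoding=UTF8&psc=1&refRID=NMR7SMXJAKC4B3MH0HTN": "B00FA2RLX2",
  "https://www.amazon.de/dp/B01N32MQOA?psc=1": "B01N32MQOA",
};

// Returns one entry per URL where group 1 differs from the expected ASIN.
function checkPattern(pattern, expectedByUrl) {
  const failures = [];
  for (const [url, expected] of Object.entries(expectedByUrl)) {
    const m = url.match(pattern);
    const got = m ? m[1] : null;
    if (got !== expected) failures.push({ url, expected, got });
  }
  return failures;
}

console.log(checkPattern(candidate, cases));
```

An empty array means every URL in the table produced the expected ASIN; swapping in another pattern from this thread shows which URLs it misses.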
Answer 5:
@Gumbo: Your code works great!
// JS test: run it in Firebug.
var url = window.location.href;
url.match("/([a-zA-Z0-9]{10})(?:[/?]|$)");
I've added a PHP function that does the same thing.
function amazon_get_asin_code($url) {
    global $debug;
    $result = "";
    // Same pattern as the JS version, including the leading slash,
    // wrapped in ~ delimiters as preg_match() requires.
    $pattern = '~/([a-zA-Z0-9]{10})(?:[/?]|$)~';
    preg_match($pattern, $url, $matches);
    if ($debug) {
        var_dump($matches);
    }
    if ($matches && isset($matches[1])) {
        $result = $matches[1];
    }
    return $result;
}
Answer 6:
This is my universal Amazon ASIN regexp:
~(?:\b)((?=[0-9a-z]*\d)[0-9a-z]{10})(?:\b)~i
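The pattern above uses PHP-style ~ delimiters. A possible JavaScript equivalent is sketched below (the function name is hypothetical). The lookahead requires at least one digit among the 10 characters, so a 10-letter word such as "Generation" is skipped.

```javascript
// Hypothetical helper: returns the first word-bounded run of exactly
// 10 letters/digits that contains at least one digit, or null.
function findAsinWithDigit(text) {
  const m = text.match(/\b((?=[0-9a-z]*\d)[0-9a-z]{10})\b/i);
  return m ? m[1] : null;
}

console.log(findAsinWithDigit(
  "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C"
)); // "B0015T963C"
```

Because it is not tied to any URL structure, it will also pick up any other 10-character token containing a digit that appears earlier in the string.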
Answer 7:
Something like this should work (not tested):
var match = /\/dp\/(.*?)\/ref=amb_link/.exec(amazon_url);
var asin = match ? match[1] : '';
Answer 8:
The Wikipedia article on ASIN (which I've linkified in your question) gives the various forms of Amazon URLs. You can fairly easily create a regular expression (or series of them) to fetch this data using the match() method.
Answer 9:
This may be a simplistic approach, but I have yet to find an error in it using any of the URLs in this thread that people say are a problem.
I simply take the URL and split it on "/" to get the discrete parts, then loop through the resulting array and test each part against the regex. In my case the variable i is an object with a property called RawURL that contains the raw URL I am working with and a property called VendorSKU that I am populating.
try
{
    string[] urlParts = i.RawURL.Split('/');
    // Matches a part that starts with 10 uppercase letters or digits.
    Regex regex = new Regex(@"^[A-Z0-9]{10}");
    foreach (string part in urlParts)
    {
        Match m = regex.Match(part);
        if (m.Success)
        {
            i.VendorSKU = m.Value;
        }
    }
}
catch (Exception) { }
So far, this has worked perfectly.
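Since the question asked for JavaScript, here is a rough port of the same split-and-test idea. The function name is hypothetical, and unlike the C# loop above it stops at the first matching part rather than keeping the last one.

```javascript
// Hypothetical port: split on "/" and test each part in turn.
function asinFromPathSegments(rawUrl) {
  for (const part of rawUrl.split("/")) {
    // Part must start with 10 uppercase letters or digits.
    const m = part.match(/^[A-Z0-9]{10}/);
    if (m) return m[0];
  }
  return null;
}

console.log(asinFromPathSegments("https://www.amazon.de/dp/B01N32MQOA?psc=1")); // "B01N32MQOA"
```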
Answer 10:
A small change to the regex from the first answer works on all the URLs I have tested.
var url = "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C";
var m = url.match("/([a-zA-Z0-9]{10})(?:[/?]|$)");
print(m);
if (m) {
  print("ASIN=" + m[1]);
}
Answer 11:
If the ASIN is always in that position in the URL:
var asin= decodeURIComponent(url.split('/')[5]);
though there's probably little chance of an ASIN getting %-escaped.
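The fixed index only holds for the long form of the URL; for http://www.amazon.com/dp/B0015T963C there is no element at index 5. A slightly less position-dependent sketch, assuming the standard URL class is available (browsers and Node.js) and handling only the /dp/ forms, could look for the segment after "dp" instead. The function name is hypothetical.

```javascript
// Hypothetical helper: returns the path segment right after "dp", or null.
function asinAfterDp(url) {
  // pathname excludes the query string, so "?psc=1" never leaks in.
  const segments = new URL(url).pathname.split("/");
  const i = segments.indexOf("dp");
  if (i === -1 || i + 1 >= segments.length) return null;
  return decodeURIComponent(segments[i + 1]);
}

console.log(asinAfterDp("https://www.amazon.de/dp/B01N32MQOA?psc=1")); // "B01N32MQOA"
```

URLs of the gp/product form return null here and would need their own branch.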