Question
I am new to web scraping, and I use the following tools and method to scrape:
- I use R (with packages such as Curl and XML) to read a web page from its URL, and the htmlTreeParse function to parse the HTML.
- Then, to locate the data I want, I first use the developer tools in Chrome to inspect the code.
- Once I know which node holds the data, I use xpathApply to extract it.
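The workflow above can be sketched as follows. This is a minimal illustration, not the asker's actual code: the HTML snippet, the class names `product` and `name`, and the XPath expression are all hypothetical, and the page is inlined as a string so the example runs without network access. It uses `xpathSApply`, the simplifying variant of `xpathApply`, to get a character vector back.

```r
library(XML)

# Hypothetical product listing, inlined so the example needs no network access.
html <- '<html><body>
  <div class="product"><span class="name">Perfume A</span></div>
  <div class="product"><span class="name">Perfume B</span></div>
</body></html>'

# useInternalNodes = TRUE is needed so the parsed tree can be queried with XPath.
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# Extract the text of every matching node (hypothetical XPath).
product_names <- xpathSApply(
  doc,
  "//div[@class='product']/span[@class='name']",
  xmlValue
)
print(product_names)
```

For a live page, the string would be replaced by the downloaded HTML, and the XPath by whatever the browser's developer tools show for the real nodes.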
Usually this works well, but I ran into an issue with this site: http://www.sephora.fr/Parfum/Parfum-Femme/C309/2
- When you click the link, the page loads, but it is actually page 1 of the products.
- You have to load the URL again (by entering it a second time) to get page 2.
- When I use my usual process to read the data, the htmlTreeParse function always gives me page 1.
I tried to understand this website better:
- It seems to be built with Oracle Commerce (ATG Commerce).
- The "real" URL is hidden: when you click a filter (for instance, to select a brand), you get a URL with a _requestid parameter: http://www.sephora.fr/Parfum/Parfum-Femme/C309?_requestid=285099
This doesn't tell me which selection I made.
Could you please help:
- How can I access more products?
Thank you
Answer 1:
I found the solution: Selenium! I think it is the ultimate tool for web scraping. I have posted several questions about web scraping, and now, with RSelenium, almost everything is possible.
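The answer gives no code, so here is a rough sketch of how RSelenium might be combined with the XML package for this kind of page. Everything here is an assumption for illustration: the helper name `scrape_rendered`, its `loads` and `wait` parameters, the XPath, and the choice of Firefox are hypothetical, and loading the URL twice simply mirrors the behavior described in the question rather than a confirmed fix.

```r
library(XML)

# Hypothetical helper: load `url` in an open browser session and return the
# text of all nodes matching `xpath`. `remDr` is expected to be an RSelenium
# remoteDriver client (anything exposing navigate() and getPageSource()).
scrape_rendered <- function(remDr, url, xpath, loads = 2, wait = 2) {
  for (i in seq_len(loads)) {
    remDr$navigate(url)  # the question reports page 2 only shows on a second load
    Sys.sleep(wait)      # crude pause to let the page finish rendering
  }
  page_source <- remDr$getPageSource()[[1]]  # rendered HTML as one string
  doc <- htmlParse(page_source, asText = TRUE)
  xpathSApply(doc, xpath, xmlValue)
}

# Example usage; only runs in an interactive session with RSelenium,
# Firefox and its driver installed.
if (interactive()) {
  library(RSelenium)
  driver <- rsDriver(browser = "firefox", verbose = FALSE)
  remDr <- driver$client
  product_names <- scrape_rendered(
    remDr,
    "http://www.sephora.fr/Parfum/Parfum-Femme/C309/2",
    "//div[@class='product']//span[@class='name']"  # hypothetical XPath
  )
  print(product_names)
  remDr$close()
  driver$server$stop()
}
```

Because the browser keeps its session state between loads, the second `navigate` call behaves like re-entering the URL by hand, which a fresh htmlTreeParse request on the URL does not.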
Source: https://stackoverflow.com/questions/37184509/web-scraping-oracle-atg-commerce