Web scraping Oracle (ATG) Commerce

Submitted by 回眸只為那壹抹淺笑 on 2020-01-06 21:45:44

Question


I am new to web scraping, and I use the following tools and method to scrape:

  • I use R (with packages Curl, XML, etc.) to read a web page from its URL, and the htmlTreeParse function to parse the HTML.
  • Then, to find the data I want, I first use the developer tools in Chrome to inspect the code.
  • Once I know which node holds the data, I use xpathApply to extract it.
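
The workflow above can be sketched as follows. This is a minimal illustration, not the asker's actual script: the HTML string and the XPath expression are hypothetical stand-ins for a downloaded page and for whatever node the Chrome inspector reveals.

```r
library(XML)

# Hypothetical markup standing in for a downloaded product page.
html <- '<html><body>
  <div class="product"><span class="name">Perfume A</span></div>
  <div class="product"><span class="name">Perfume B</span></div>
</body></html>'

# asText = TRUE parses the string itself instead of treating it as a file or URL;
# useInternalNodes = TRUE returns a document that the XPath functions can query.
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# xpathApply returns a list with one element per matching node;
# xmlValue extracts the text content of each node.
product_names <- unlist(
  xpathApply(doc, "//div[@class='product']/span[@class='name']", xmlValue)
)
print(product_names)
```

For a live page, the first argument to htmlTreeParse would be the downloaded page content (or a URL) rather than an inline string.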

Usually, it works well. But I had an issue with this site: http://www.sephora.fr/Parfum/Parfum-Femme/C309/2

  • When you click on the link, the page loads, but it is actually page 1 of the products.
  • You have to load the URL again (by entering it a second time) to get page 2.
  • When I use my usual process to read the data, the htmlTreeParse function always gives me page 1.

I tried to understand this website better:

  • It seems that it is built with Oracle Commerce (ATG Commerce).
  • The "real" URL is hidden, and when you click on a filter (for instance, you select a brand), you get a URL with a _requestid parameter: http://www.sephora.fr/Parfum/Parfum-Femme/C309?_requestid=285099

This URL does not tell me which selection I made.

Could you please help?

  • How can I access more products?

Thank you


Answer 1:


I found the solution: Selenium! I think it is the ultimate tool for web scraping. I have posted several questions about web scraping, and now, with RSelenium, almost everything is possible.
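
A rough sketch of what this could look like with RSelenium, combined with the XML parsing from the question. The answer does not show its code, so everything here is an assumption: the helper names fetch_with_selenium and extract_text are hypothetical, the choice of Firefox and port 4567 is arbitrary, and loading the URL twice in one browser session simply mirrors the behaviour described in the question.

```r
library(XML)

# Hypothetical helper: parse rendered HTML and return the text of matching nodes.
extract_text <- function(page_source, xpath) {
  doc <- htmlTreeParse(page_source, asText = TRUE, useInternalNodes = TRUE)
  unlist(xpathApply(doc, xpath, xmlValue))
}

# Hypothetical helper: load the same URL several times in ONE browser session
# and return the page source after the last load.
fetch_with_selenium <- function(url, loads = 2L, port = 4567L) {
  driver <- RSelenium::rsDriver(browser = "firefox", port = port, verbose = FALSE)
  # Always shut down the browser and the Selenium server, even on error.
  on.exit({
    driver$client$close()
    driver$server$stop()
  })
  for (i in seq_len(loads)) {
    driver$client$navigate(url)
  }
  # getPageSource() returns a list; the HTML string is its first element.
  driver$client$getPageSource()[[1]]
}

# Usage (requires RSelenium, a local browser, and network access):
# src <- fetch_with_selenium("http://www.sephora.fr/Parfum/Parfum-Femme/C309/2")
# extract_text(src, "//h3")  # "//h3" is a placeholder XPath, not the site's real markup
```

Because a real browser drives the requests, the session state carries over between loads, which a fresh one-off request from R may not reproduce.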



Source: https://stackoverflow.com/questions/37184509/web-scraping-oracle-atg-commerce
