Convert the XPath copied from a browser into a usable XPath for Scrapy

暖寄归人 2020-12-20 04:45

This is a problem I always run into when I get a specific XPath from my browser.

Assume that I want to extract all the images from some websites like Google Image Sear

2 Answers
  • 2020-12-20 05:26

    It is usually neither safe nor reliable to blindly follow the browser's suggestion about how to locate an element.

    First of all, the XPath expressions that developer tools generate are usually absolute, starting from the parent of all parents, the html tag, which makes them more dependent on the page structure (though Firebug can also generate expressions based on id attributes).

    Also, the HTML code you see in the browser can differ substantially from what Scrapy receives, because the page loads asynchronously and JavaScript is executed dynamically in the browser. Scrapy is not a browser and "sees" only the initial HTML code of a page, before the "dynamic" part.
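
To make the point about the initial HTML concrete, here is a minimal sketch that uses only the standard library's `html.parser` as a stand-in for a non-browser client. The markup in `RAW_HTML`, the `IdCollector` class, and the `ids_in_source` helper are all hypothetical and only for illustration; they are not part of Scrapy. The sketch lists the `id` attributes that are actually present in the raw source, and an element that a script would create later does not show up:

```python
from html.parser import HTMLParser

# Hypothetical page source: the "rg_s" container only exists after the
# script runs in a browser; it is not an element in the raw HTML.
RAW_HTML = """
<html><body>
  <div id="app"></div>
  <script>
    document.getElementById("app").innerHTML = '<div id="rg_s"></div>';
  </script>
</body></html>
"""


class IdCollector(HTMLParser):
    """Collect every id attribute found on real start tags."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def ids_in_source(html):
    # Script bodies are treated as plain text, so markup that JavaScript
    # would inject is never reported as an element.
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids


print(ids_in_source(RAW_HTML))
```

An XPath such as `//*[@id="rg_s"]` copied from the browser would match nothing against this source, even though the element is visible in the browser's inspector.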

    Instead, inspect what Scrapy really has in the response: open up the Scrapy Shell, inspect the response and debug your XPath expressions and CSS selectors:

    $ scrapy shell https://google.com
    >>> response.xpath('//div[@id="myid"]')
    ...
    

    Here is what I got for Google Image Search:

    $ scrapy shell "https://www.google.com/search?q=test&tbm=isch&qscrl=1"
    In [1]: response.xpath('//*[@id="ires"]//img/@src').extract()
    Out[1]: 
    [u'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRO9ZkSuDqt0-CRhLrWhHAyeyt41Z5I8WhOhTkGCvjiHmRiTSvDBfHKYjx_',
     u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQpwyzbW_qsRenDw3d4wwpwwm8n99ukMtLCVaPiTJxyviyQVBQeRCglVaY',
     u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSrxtoY3-3QHwhjc5Ofx8090uDYI8VOUbi3gUrd9USxZ-Vb1D5pAbOzJLMS',
     u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTQO1A3dDJ07tIaFMHlXNOsOnpiY_srvHKJE1xOpsMZscjL3aKGxaGLOgru',
     u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQ71ukeTGCPLuClWd6MetTtQ0-0mwzo3rn1ug0MUnbpXmKnwNuuBnSWXHU',
     u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRZmWrYR9A4W97jpjhtIbyUM5Lj3vRL0vgCKG_xfylc5wKFAk6UB8jiiKA',
     ...
     u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRj08jK8sBjX90Tu1RO4BfZkKe5A59U0g1TpMWPFZlNnA70SQ5i5DMJkvV0']
    
  • 2020-12-20 05:27

    The XPath generated from an insertion point in a browser is bound to be brittle because there are many different possible XPath expressions to reach any given node, JavaScript can modify the HTML, and the browser doesn't know your intentions.

    For the example you gave,

    //*[@id="rg_s"]/div[13]/a/img
    

    the 13th div is particularly prone to breakage.

    Try instead to find a uniquely identifying characteristic closer to your target. A unique @id attribute would be ideal, or a @class that uniquely identifies your target or a close ancestor of your target can work well too.

    For example, for Google Image Search, something like the following XPath

    //div[@id='rg_s']//img[@class='rg_i']
    

    will select all images of class rg_i within the div containing the search results.
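
The difference between the positional and the class-based expression can be sketched with the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. This is an assumption made so the example runs without Scrapy installed; the `BEFORE` and `AFTER` snippets are hypothetical, heavily simplified markup, and the `srcs` helper is only for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified result container.
BEFORE = """
<html><body>
  <div id="rg_s">
    <div><a href="#"><img class="rg_i" src="a.jpg"/></a></div>
    <div><a href="#"><img class="rg_i" src="b.jpg"/></a></div>
  </div>
</body></html>
"""

# The same images after the site adds one extra div in front of them.
AFTER = """
<html><body>
  <div id="rg_s">
    <div class="banner"/>
    <div><a href="#"><img class="rg_i" src="a.jpg"/></a></div>
    <div><a href="#"><img class="rg_i" src="b.jpg"/></a></div>
  </div>
</body></html>
"""

# Browser-style expression that depends on the position of a div.
POSITIONAL = ".//*[@id='rg_s']/div[1]/a/img"
# Expression anchored on attributes close to the target.
BY_CLASS = ".//div[@id='rg_s']//img[@class='rg_i']"


def srcs(markup, path):
    """Return the src attribute of every img matched by path."""
    root = ET.fromstring(markup.strip())
    return [img.get("src") for img in root.findall(path)]


for label, markup in (("before", BEFORE), ("after", AFTER)):
    print(label, srcs(markup, POSITIONAL), srcs(markup, BY_CLASS))
```

The positional expression stops matching as soon as the extra div appears, while the class-based one keeps returning both images. In a Scrapy spider the same idea would be written with `response.xpath(...)`, as in the shell session in the other answer.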

    If you're willing to abandon the copy-and-paste approach and learn enough XPath to generalize your selections, you'll get much better results. Of course, standard disclaimers apply about changes to presentation necessitating updating of scraping techniques too. Using a direct API call would be much more robust (and proper as well).
