Scrapy, hash tag on URLs

半阙折子戏 2020-12-21 05:10

I'm in the middle of a scraping project using Scrapy.

I realized that Scrapy strips everything from the hash sign (#) to the end of the URL.

Here's the output from the shell:

3 answers
  • 2020-12-21 06:00

    This isn't something Scrapy itself can change: the portion following the hash in the URL is the fragment identifier, which is used by the client (Scrapy here, usually a browser) rather than the server.
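To see which part of a URL actually reaches the server, here is a minimal sketch using only the Python standard library. The URL is a made-up example, not the one from the question.

```python
from urllib.parse import urldefrag

# Hypothetical URL, used only for illustration.
url = "http://example.com/products#page=2"

# urldefrag separates the part that is sent in the HTTP request
# from the fragment, which stays on the client.
request_url, fragment = urldefrag(url)

print(request_url)  # http://example.com/products
print(fragment)     # page=2
```

The fragment is still available to your own code before the request is made, so you can keep it and use it to decide which follow-up request to send.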

    What probably happens when you fetch the page in a browser is that the page includes some JavaScript that looks at the fragment identifier and loads some additional data via AJAX and updates the page. You'll need to look at what the browser does and see if you can emulate it--developer tools like Firebug or the Chrome or Safari inspector make this easy.

    For example, if you navigate to http://twitter.com/also, you are redirected to http://twitter.com/#!/also. The actual URL loaded by the browser here is just http://twitter.com/, but that page then loads data (http://twitter.com/users/show_for_profile.json?screen_name=also) which is used to generate the page, and is, in this case, just JSON data you could parse yourself. You can see this happen using the Network Inspector in Chrome.
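As a rough sketch of emulating what the browser does in the Twitter example, the fragment can be turned into the data URL directly. The helper name `profile_json_url` is hypothetical, and the endpoint path is simply the one quoted in this answer; it may no longer exist, so check the network inspector for the real request on the site you are scraping.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def profile_json_url(hashbang_url):
    """Map a '#!/name' URL to the JSON endpoint quoted in the answer (assumed)."""
    parts = urlsplit(hashbang_url)
    # For http://twitter.com/#!/also the fragment is "!/also";
    # drop the leading "!" and "/" characters to get the screen name.
    screen_name = parts.fragment.lstrip("!/")
    query = urlencode({"screen_name": screen_name})
    return urlunsplit(
        (parts.scheme, parts.netloc, "/users/show_for_profile.json", query, "")
    )


print(profile_json_url("http://twitter.com/#!/also"))
# http://twitter.com/users/show_for_profile.json?screen_name=also
```

In a Scrapy spider you would then request the resulting URL instead of the hash URL and parse the response body as JSON in the callback.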

  • 2020-12-21 06:06

    Looks like it's not possible. The problem is not in the response, it's in the request, which chops the URL.

    The fragment is retrievable from JavaScript as window.location.hash. From there you could send it to the server with Ajax, for example, or encode it and put it into URLs that are then passed through to the server side.
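The "encode it and put it into the URL" idea can be sketched in Python as follows. The function name `fragment_to_query` and the parameter name `fragment` are assumptions made for illustration, and this only helps if the server is written to read that parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fragment_to_query(url, param="fragment"):
    """Move the fragment into a query parameter so it is sent to the server."""
    parts = urlsplit(url)
    if not parts.fragment:
        return url  # nothing to move
    # Keep the existing query parameters and append the encoded fragment.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    pairs.append((param, parts.fragment))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(pairs), "")
    )


print(fragment_to_query("http://example.com/list?sort=asc#page=2"))
# http://example.com/list?sort=asc&fragment=page%3D2
```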

    See also: Can I read the hash portion of the URL on my server-side application (PHP, Ruby, Python, etc.)?

    Why do you need the stripped part if the server doesn't receive it from the browser anyway? If you are working with Amazon, I haven't seen any problems with such URLs.

  • 2020-12-21 06:08

    Actually, when you enter that URL in a web browser, it also sends only the part before the hash sign to the web server. If the content is different, it's probably because some JavaScript on the page changes the content after it has loaded, based on the fragment (most likely an XMLHttpRequest is made that loads additional content).
