Scrapy, hash tag on URLs


半阙折子戏

I'm in the middle of a scraping project using Scrapy.

I realized that Scrapy strips everything from the hash (#) to the end of the URL.

Here's the output from the shell:


3 Answers

情书的邮戳

2020-12-21 06:00

This isn't something Scrapy itself can change: the portion following the hash in the URL is the fragment identifier, which is used by the client (Scrapy here, usually a browser) rather than the server.

What probably happens when you fetch the page in a browser is that the page includes some JavaScript that looks at the fragment identifier, loads some additional data via AJAX, and updates the page. You'll need to look at what the browser does and see if you can emulate it; developer tools like Firebug or the Chrome or Safari inspector make this easy.
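To see the split concretely, here is a minimal sketch using Python's standard urllib.parse. Nothing in it is Scrapy-specific, and the helper name split_fragment is made up for illustration; the URL is the Twitter one from this thread. Only the first value returned is what would be requested from the server.

```python
from urllib.parse import urldefrag

def split_fragment(url):
    """Return (part sent to the server, fragment kept by the client)."""
    base, fragment = urldefrag(url)
    return base, fragment

base, fragment = split_fragment("http://twitter.com/#!/also")
print(base)      # http://twitter.com/
print(fragment)  # !/also
```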

For example, if you navigate to http://twitter.com/also, you are redirected to http://twitter.com/#!/also. The actual URL loaded by the browser here is just http://twitter.com/, but that page then loads data (http://twitter.com/users/show_for_profile.json?screen_name=also) which is used to generate the page, and is, in this case, just JSON data you could parse yourself. You can see this happen using the Network Inspector in Chrome.
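One way to emulate the browser here is to translate the hash-bang URL into the data URL yourself and request that instead. The sketch below assumes the endpoint and the "#!/<screen name>" layout are exactly as in the example above (the endpoint may well no longer exist), and the function name profile_data_url is hypothetical.

```python
from urllib.parse import urlsplit, urlencode

# Endpoint copied from the example above; treat it as an assumption.
PROFILE_ENDPOINT = "http://twitter.com/users/show_for_profile.json"

def profile_data_url(hashbang_url):
    """Map http://twitter.com/#!/<name> to the JSON URL the page's script loads."""
    fragment = urlsplit(hashbang_url).fragment  # e.g. "!/also"
    if not fragment.startswith("!/"):
        raise ValueError("not a hash-bang URL: %r" % hashbang_url)
    screen_name = fragment[2:].strip("/")
    return PROFILE_ENDPOINT + "?" + urlencode({"screen_name": screen_name})

print(profile_data_url("http://twitter.com/#!/also"))
```

In a spider you would then request the URL this returns and parse the response body as JSON rather than as HTML.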

隐瞒了意图╮

2020-12-21 06:06

Looks like it's not possible. The problem is not in the response but in the request, which chops the URL.

It is retrievable from JavaScript as window.location.hash. From there you could send it to the server with Ajax, for example, or encode it and put it into URLs which can then be passed through to the server side.

Can I read the hash portion of the URL on my server-side application (PHP, Ruby, Python, etc.)?

Why do you need the stripped part if the server doesn't receive it from the browser anyway? If you are working with Amazon, I haven't seen any problems with such URLs.
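The "encode it and put it into URLs" idea quoted above could look roughly like this in Python. It is only a sketch: the parameter name "fragment" and the function name fragment_to_query are made up for illustration, and it helps only if the server you are talking to is one that actually reads such a parameter.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def fragment_to_query(url, param="fragment"):
    """Re-encode the #fragment as a query parameter so a server can see it."""
    parts = urlsplit(url)
    if not parts.fragment:
        return url  # nothing to move
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, parts.fragment))
    # Rebuild the URL with the extended query string and no fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))

print(fragment_to_query("http://example.com/items?page=2#section-3"))
# http://example.com/items?page=2&fragment=section-3
```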

無奈伤痛

2020-12-21 06:08

Actually, when you enter that URL in a web browser, it also sends only the part before the hash to the web server. If the content is different, it's probably because some JavaScript on the page changes the content after it has loaded, based on the part after the hash (most likely an XMLHttpRequest is made that loads additional content).
