Question
I'm using Python 3.3 and the requests module to scrape links from an arbitrary webpage. My program works as follows: I keep a list of URLs which initially contains only the starting URL. The program loops over that list and passes each URL to a procedure GetLinks, where I use requests.get and BeautifulSoup to extract all links. Before that procedure appends links to my URL list, it passes them to another procedure testLinks to check whether each one is an internal, external or broken link. In testLinks I use requests.get as well, so that I can handle redirects etc.
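Roughly, the structure looks like this (a simplified sketch, not my actual code; the starting URL is just a placeholder):

    import requests
    from bs4 import BeautifulSoup

    urllist = ["http://example.com/"]   # placeholder starting URL

    def testLinks(url):
        # requests.get follows redirects, so I can see where a link ends up
        # and use the status code to classify it as internal/external/broken
        r = requests.get(url)
        return r.status_code, r.url

    def GetLinks(url):
        # fetch the page and extract every href with BeautifulSoup
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "html.parser")
        return [a["href"] for a in soup.find_all("a", href=True)]

    for url in urllist:
        for link in GetLinks(url):
            testLinks(link)   # classify first, then append new links to urllist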
The program has worked really well so far; I tested it on quite a few websites and was able to collect all the links of sites with something like 2000 pages. But yesterday I ran into a problem on one page, which I noticed by looking at the Kaspersky Network Monitor. On this page some TCP connections just don't get reset; it seems to me that in this case the initial request for my first URL is never reset, and the connection stays open for as long as my program runs.
OK so far. My first attempt was to use requests.head instead of .get in my testLinks procedure, and then everything works fine! The connections are released as intended. But the problem is that the information I get from requests.head is not sufficient: I cannot see the redirected URL or how many redirects took place. Then I tried requests.head with
allow_redirects=True
But unfortunately this is then not a real .head request, it is a usual .get request, so I got the same problem. I also tried to set the parameter
keep_alive=False
but it didn't work either. I even tried urllib.request.urlopen(url).geturl() in my testLinks for the redirect issues, but the same problem occurs there: the TCP connections don't get reset. I have tried a lot to avoid this problem. I used request sessions, but they had the same problem too. I also tried a requests.post with the header Connection: close, but it didn't work either. A condensed sketch of these attempts is below.
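For reference, these are roughly the variants I tried inside testLinks (simplified, with a placeholder URL):

    import requests
    import urllib.request

    url = "http://example.com/"   # placeholder

    # 1) plain HEAD: connections are released, but no redirect information
    r = requests.head(url)

    # 2) HEAD with redirects followed: same hanging connections as with .get
    r = requests.head(url, allow_redirects=True)

    # 3) urllib instead of requests: same behaviour
    final = urllib.request.urlopen(url).geturl()

    # 4) a session: no change
    s = requests.Session()
    r = s.get(url)
    s.close()

    # 5) POST with an explicit Connection: close header: no change either
    r = requests.post(url, headers={"Connection": "close"})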
I analyzed some of the links where I think it gets stuck, and so far I believe it has something to do with redirect chains like 301 -> 302. But I'm really not sure, because on all the other websites I tested there must have been such redirects too; they are quite common.
I hope someone can help me. For information: I'm using a VPN connection to be able to see all websites, because the country I'm in right now blocks some pages that are interesting for me. But of course I tested it without the VPN as well and had the same problem.
Maybe there is a workaround, because requests.head in testLinks would be sufficient if, in the case of redirects, I could just see the final URL and maybe the number of redirects.
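The information itself would be easy to read off a response that followed the redirects, for example via r.url and r.history; the problem is just that following redirects is exactly what seems to leave the connections open for me:

    import requests

    r = requests.head("http://example.com/", allow_redirects=True)  # placeholder URL
    final_url = r.url               # the URL the redirects ended up at
    num_redirects = len(r.history)  # one intermediate response per redirect hop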
If the text is not easy to read, I will provide a scheme of my code.
Thanks a lot!
Source: https://stackoverflow.com/questions/18024285/big-requests-issue-get-doesnt-release-reset-tcp-connections-loop-crashes