Python Requests - Is it possible to receive a partial response after an HTTP POST?

两盒软妹~` 提交于 2021-02-07 13:19:41

问题


I am using the Python Requests Module to datamine a website. As part of the datamining, I have to HTTP POST a form and check if it succeeded by checking the resulting URL. My question is, after the POST, is it possible to request the server to not send the entire page? I only need to check the URL, yet my program downloads the entire page and consumes unnecessary bandwidth. The code is very simple

import requests
r = requests.post(URL, payload)
if 'keyword' in r.url:
   success
fail

回答1:


An easy solution, if it's implementable for you. Is to go low-level. Use socket library. For example you need to send a POST with some data in its body. I used this in my Crawler for one site.

import socket
from urllib import quote # POST body is escaped. use quote

req_header = "POST /{0} HTTP/1.1\r\nHost: www.yourtarget.com\r\nUser-Agent: For the lulz..\r\nContent-Type: application/x-www-form-urlencoded; charset=UTF-8\r\nContent-Length: {1}"
req_body = quote("data1=yourtestdata&data2=foo&data3=bar=")
req_url = "test.php"
header = req_header.format(req_url,str(len(req_body))) #plug in req_url as {0} 
                                                       #and length of req_body as Content-length
s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)   #create a socket
s.connect(("www.yourtarget.com",80))                   #connect it
s.send(header+"\r\n\r\n"+body+"\r\n\r\n")              # send header+ two times CR_LF + body + 2 times CR_LF to complete the request

page = ""
while True:
    buf = s.recv(1024) #receive first 1024 bytes(in UTF-8 chars), this should be enought to receive the header in one try
    if not buf:
        break
    if "\r\n\r\n" in page: # if we received the whole header(ending with 2x CRLF) break
        break
    page+=buf
s.close()       # close the socket here. which should close the TCP connection even if data is still flowing in
                # this should leave you with a header where you should find a 302 redirected and then your target URL in "Location:" header statement.



回答2:


There's a chance the site uses Post/Redirect/Get (PRG) pattern. If so then it's enough to not follow redirect and read Location header from response.

Example

>>> import requests
>>> response = requests.get('http://httpbin.org/redirect/1', allow_redirects=False)
>>> response.status_code
302
>>> response.headers['location']
'http://httpbin.org/get'

If you need more information on what would you get if you had followed redirection then you can use HEAD on the url given in Location header.

Example

>>> import requests
>>> response = requests.get('http://httpbin.org/redirect/1', allow_redirects=False)
>>> response.status_code
302
>>> response.headers['location']
'http://httpbin.org/get'
>>> response2 = requests.head(response.headers['location'])
>>> response2.status_code
200
>>> response2.headers
{'date': 'Wed, 07 Nov 2012 20:04:16 GMT', 'content-length': '352', 'content-type':
'application/json', 'connection': 'keep-alive', 'server': 'gunicorn/0.13.4'}



回答3:


It would help if you gave some more data, for example, a sample URL that you're trying to request. That being said, it seems to me that generally you're checking if you had the correct URL after your POST request using the following algorithm relying on redirection or HTTP 404 errors:

if original_url == returned request url:
    correct url to a correctly made request
else:
    wrong url and a wrongly made request

If this is the case, what you can do here is use the HTTP HEAD request (another type of HTTP request like GET, POST, etc.) in Python's requests library to get only the header and not also the page body. Then, you'd check the response code and redirection url (if present) to see if you made a request to a valid URL.

For example:

def attempt_url(url):
    '''Checks the url to see if it is valid, or returns a redirect or error.
    Returns True if valid, False otherwise.'''

    r = requests.head(url)
    if r.status_code == 200:
        return True
    elif r.status_code in (301, 302):
        if r.headers['location'] == url:
            return True
        else:
            return False
    elif r.status_code == 404:
        return False
    else:
        raise Exception, "A status code we haven't prepared for has arisen!"

If this isn't quite what you're looking for, additional detail on your requirements would help. At the very least, this gets you the status code and headers without pulling all of the page data.



来源:https://stackoverflow.com/questions/13170258/python-requests-is-it-possible-to-receive-a-partial-response-after-an-http-pos

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!