How to write a web proxy in Python

Question


I'm trying to write a web proxy in Python. The goal is to visit a URL like http://proxyurl/http://anothersite.com/ and see the contents of http://anothersite.com just as you normally would. I've gotten decently far by abusing the requests library, but this isn't really the intended use of the requests framework. I've written proxies with Twisted before, but I'm not sure how to connect that to what I'm trying to do. Here's where I'm at so far...

import os
import urlparse

import requests

import tornado.ioloop
import tornado.web
from tornado import template

ROOT = os.path.dirname(os.path.abspath(__file__))
path = lambda *a: os.path.join(ROOT, *a)

loader = template.Loader(path('templates'))  #path() already joins against ROOT


class ProxyHandler(tornado.web.RequestHandler):
    def get(self, slug):
        if slug.startswith("http://") or slug.startswith("https://"):
            if self.get_argument("start", None) == "true":
                parsed = urlparse.urlparse(slug)
                self.set_cookie("scheme", value=parsed.scheme)
                self.set_cookie("netloc", value=parsed.netloc)
                self.set_cookie("urlpath", value=parsed.path)
            #external resource
            else:
                response = requests.get(slug)
                headers = response.headers
                if 'content-type' in headers:
                    self.set_header('Content-Type', headers['content-type'])
                if 'content-length' in headers:
                    self.set_header('Content-Length', headers['content-length'])
                for block in response.iter_content(1024):
                    self.write(block)
                self.finish()
                return
        else:
            #absolute
            if slug.startswith('/'):
                slug = "{scheme}://{netloc}{original_slug}".format(
                    scheme=self.get_cookie('scheme'),
                    netloc=self.get_cookie('netloc'),
                    original_slug=slug,
                )
            #relative
            else:
                slug = "{scheme}://{netloc}{path}{original_slug}".format(
                    scheme=self.get_cookie('scheme'),
                    netloc=self.get_cookie('netloc'),
                    path=self.get_cookie('urlpath'),
                    original_slug=slug,
                )
        response = requests.get(slug)
        #get the headers
        headers = response.headers
        #capture the doctype if present (note: doctype is currently unused)
        doctype = None
        if response.content.lower().startswith('<!doctype'):
            doctype = response.content[:response.content.find('>') + 1]
        if 'content-type' in headers:
            self.set_header('Content-Type', headers['content-type'])
        if 'content-length' in headers:
            self.set_header('Content-Length', headers['content-length'])
        self.write(response.content)


application = tornado.web.Application([
    (r"/(.+)", ProxyHandler),
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()

Just a note: I set cookies to preserve the scheme, netloc, and urlpath if there's start=true in the query string. That way, any relative or absolute link that subsequently hits the proxy uses those cookies to resolve the full URL.

With this code, if you go to http://localhost:8888/http://espn.com/?start=true you'll see the contents of ESPN. However, on the following site it doesn't work at all: http://www.bottegaveneta.com/us/shop/. My question is: what's the best way to do this? Is my current implementation robust, or are there some terrible pitfalls to doing it this way? If it is correct, why do certain sites, like the one I pointed out, not work at all?

Thank you for any help.


Answer 1:


I recently wrote a similar web application. Note that this is just how I did it; I'm not saying you should do it the same way. These are some of the pitfalls I came across:

Changing attribute values from relative to absolute

There is much more involved than just fetching a page and presenting it to the client. Often you won't be able to proxy a webpage without errors.

Why are certain sites like the one I pointed out not working at all?

Many webpages rely on relative paths to resources in order to display the webpage in a well formatted manner. For example, this image tag:

<img src="/header.png" />

will result in the client making a request to:

http://proxyurl/header.png

which fails. The src value should instead be converted to:

http://anothersite.com/header.png

So, you need to parse the HTML document with something like BeautifulSoup, loop over all the tags and check for attributes such as:

'src', 'lowsrc', 'href'

And change their values accordingly so that the tag becomes:

<img src="http://anothersite.com/header.png" />

This method applies to more tags than just img. a, script, link, li, and frame are a few you should change as well; a sketch of this rewriting pass follows below.
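
For illustration, here is a minimal sketch of that rewriting pass, assuming BeautifulSoup 4 and the standard library's urljoin; the attribute list mirrors the one above and is not exhaustive:

import urlparse  # urllib.parse on Python 3

from bs4 import BeautifulSoup

def absolutize(html, base_url):
    #resolve every src/lowsrc/href against the page's real base URL
    soup = BeautifulSoup(html, 'html.parser')
    for attr in ('src', 'lowsrc', 'href'):
        for tag in soup.find_all(attrs={attr: True}):
            #urljoin leaves absolute URLs alone and resolves relative ones
            tag[attr] = urlparse.urljoin(base_url, tag[attr])
    return str(soup)

You would run the proxied body through something like absolutize(response.content, slug) before writing it back to the client.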

HTML shenanigans

The prior method should get you far, but you're not done yet.

Both

<style type="text/css" media="all">@import "/stylesheet.css?version=120215094129002";</style>

And

<div style="position:absolute;right:8px;background-image:url('/Portals/_default/Skins/BE/images/top_img.gif');height:200px;width:427px;background-repeat:no-repeat;background-position:right top;" >

are examples of code that's difficult to reach and modify using BeautifulSoup.

In the first example there is a CSS @import of a relative URI. The second one uses the url() function inside an inline CSS declaration.

In my situation, I ended up writing horrible code to modify these values manually. You could try a regex for this, but I'm not sure how robust that would be; a rough sketch follows below.
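
Something along these lines could serve as a starting point (a hedged sketch only; a regex cannot handle every edge case in real-world CSS, and the patterns below only cover the common quoting styles):

import re
import urlparse  # urllib.parse on Python 3

URL_RE = re.compile(r"""url\(\s*['"]?([^'")]+)['"]?\s*\)""")
IMPORT_RE = re.compile(r"""@import\s+['"]([^'"]+)['"]""")

def rewrite_css(css, base_url):
    #rewrite url(...) targets, then bare @import "..." targets
    css = URL_RE.sub(
        lambda m: "url('%s')" % urlparse.urljoin(base_url, m.group(1)), css)
    return IMPORT_RE.sub(
        lambda m: '@import "%s"' % urlparse.urljoin(base_url, m.group(1)), css)

The same function can be applied to external stylesheets, <style> blocks, and the contents of style attributes.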

Redirects

With python-requests or urllib2 you can follow redirects automatically. Just remember to save what the new (base) URI is; you'll need it for the 'changing attribute values from relative to absolute' step. For example:
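
With requests this is straightforward, since redirects are followed by default and response.url holds the final URL; a small sketch, reusing the absolutize() helper from above:

response = requests.get(slug)
base_url = response.url  #may differ from slug if a redirect occurred
html = absolutize(response.content, base_url)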

You also need to deal with 'hardcoded' redirects, such as this one:

<meta http-equiv="refresh" content="0;url=http://new-website.com/">

Needs to be changed to:

<meta http-equiv="refresh" content="0;url=http://proxyurl/http://new-website.com/">
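
A hedged sketch of that rewrite, assuming a soup parsed from the page as in the earlier sketch; the http://proxyurl/ prefix is just the mount point from the question:

PROXY_PREFIX = 'http://proxyurl/'

for meta in soup.find_all('meta', attrs={'http-equiv': 'refresh'}):
    content = meta.get('content', '')
    if 'url=' in content.lower():
        #split "0;url=http://..." at the first '=' into "0;url" and the target
        prefix, _, target = content.partition('=')
        meta['content'] = '%s=%s%s' % (prefix, PROXY_PREFIX, target)

Note that attribute matching in find_all is case-sensitive, so a page using http-equiv="Refresh" would need extra handling.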

Base tag

The base tag specifies the base URL/target for all relative URLs in a document. You probably want to change or remove that value, for example:
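
One possible approach (a sketch, not the only way): fold an existing <base> href into the base URL used for rewriting, then drop the tag so the browser resolves everything through the proxy instead.

base_tag = soup.find('base')
if base_tag and base_tag.get('href'):
    #honour the declared base when rewriting relative URLs
    base_url = urlparse.urljoin(base_url, base_tag['href'])
    base_tag.decompose()  #remove it so the browser can't bypass the proxy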

Finally done?

Nope. Some websites rely heavily on JavaScript to draw their content on screen. These sites are the hardest to proxy. I've been thinking about using something like PhantomJS or Ghost to fetch and evaluate webpages and present the result to the client.

Maybe my source code can help you. You can use it in any way you want.




Answer 2:


If you want to make a real proxy, you can use:

tornado-proxy

or

simple proxy based on Twisted

Either way, I don't think it will be hard to adapt them to your case.




Answer 3:


I don't think you need your last if block. This seems to work for me:

class ProxyHandler(tornado.web.RequestHandler):
    def get(self, slug):
        print 'get: ' + str(slug)

        if slug.startswith("http://") or slug.startswith("https://"):
            if self.get_argument("start", None) == "true":
                parsed = urlparse.urlparse(slug)
                self.set_cookie("scheme", value=parsed.scheme)
                self.set_cookie("netloc", value=parsed.netloc)
                self.set_cookie("urlpath", value=parsed.path)
            #external resource
            else:
                response = requests.get(slug)
                headers = response.headers
                if 'content-type' in headers:
                    self.set_header('Content-Type', headers['content-type'])
                if 'content-length' in headers:
                    self.set_header('Content-Length', headers['content-length'])
                for block in response.iter_content(1024):
                    self.write(block)
                self.finish()
                return
        else:

            slug = "{scheme}://{netloc}/{original_slug}".format(
                scheme=self.get_cookie('scheme'),
                netloc=self.get_cookie('netloc'),
                original_slug=slug,
            )
            print self.get_cookie('scheme')
            print self.get_cookie('netloc')
            print self.get_cookie('urlpath')
            print slug
        response = requests.get(slug)
        #get the headers
        headers = response.headers
        #capture the doctype if present (note: doctype is currently unused)
        doctype = None
        if response.content.lower().startswith('<!doctype'):
            doctype = response.content[:response.content.find('>') + 1]
        if 'content-type' in headers:
            self.set_header('Content-Type', headers['content-type'])
        if 'content-length' in headers:
            self.set_header('Content-Length', headers['content-length'])
        self.write(response.content)



Answer 4:


You can use the socket module from the standard library and, if you are running on Linux, epoll as well.

You can see example code of a simple async server here: https://github.com/aychedee/octopus/blob/master/octopus/server.py
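
As a very rough illustration of the raw-socket approach (a blocking sketch only; a real server like the one linked above multiplexes many clients with epoll):

import socket

def fetch(host, path='/', port=80):
    #open a TCP connection to the upstream server and relay one request
    s = socket.create_connection((host, port))
    try:
        s.sendall('GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' % (path, host))
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:  #the server closed the connection
                break
            chunks.append(data)
        return ''.join(chunks)  #raw response: status line, headers, body
    finally:
        s.close()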




Answer 5:


Apparently I am quite late in answering this, but I stumbled upon it a while back. I have been writing something similar to your requirements myself.

It's more of an HTTP repeater, but the first of its tasks is the proxy itself. It is not totally complete yet and there is no README for now, but those are on my todo list.

I used mitmproxy to achieve this. It may not be the most elegant piece of code out there, and I used a lot of hacks here and there to get the repeater functionality working. I know mitmproxy has built-in ways to do the repeater thing easily, but there were certain requirements in my case that kept me from using those features.
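
For readers unfamiliar with mitmproxy, a minimal addon that intercepts and records requests looks roughly like this (a generic sketch assuming a recent mitmproxy, not the author's repeater code):

class Recorder(object):
    def __init__(self):
        self.flows = []  #keep intercepted flows around for later replay

    def request(self, flow):
        #mitmproxy calls this once for every client request it proxies
        self.flows.append(flow.copy())

addons = [Recorder()]

Save it as recorder.py and run mitmproxy -s recorder.py to load it.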

You can find the project at https://github.com/c0n71nu3/python_repeater/. I'm still updating the repo as and when there are developments.

Hopefully it will be of some help to you.



Source: https://stackoverflow.com/questions/16524545/how-to-write-a-web-proxy-in-python
