Question
Sorry if this comes off as confusing.
I have written a script using the Node.js request module that performs an action on a website and then returns the data. The script works perfectly fine when I do not use a proxy, i.e. when the proxy option is set to false. This is a task that is NOT allowed to be done with Selenium/Puppeteer.
proxy: false
However, when I set a (working) proxy, it fails to perform the same task and is detected by the website's firewall/anti-bot software.
proxy: http://xx.xxx.xx.xx:3128
Some things to note:
- I have tried many (20+) different proxy providers (Residential and Datacenter) and they all have this issue
- The issue does not occur if that proxy is set globally on my system
- The issue does not occur if that proxy is set in a chrome extension
- The SSL cipher suites do not match Chrome but they still don't match when not using a proxy so I assume that isn't the issue
- It is very important to keep consistency in the header order
The question basically is: does the request module change anything, such as the header order, when using a proxy?
Here is an image of what happens when it passes/fails.
The only difference between the two requests is the proxy setting: one request is made with a proxy and one without, and only the proxied one fails.
url : url,
simple : false,
forever: true,
resolveWithFullResponse: true,
gzip: true,
headers: {
'Host' : 'www.sitename.com',
'Connection' : 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36',
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-encoding' : 'gzip, deflate, br',
'Accept-Language' : 'en-GB,en-US;q=0.9,en;q=0.8',
},
method : 'GET',
jar: globalJar,
followRedirect: false,
followAllRedirects: false,
Answer 1:
According to the proxies documentation of the request module:
By default, when proxying http traffic, request will simply make a standard proxied http request. This is done by making the url section of the initial line of the request a fully qualified url to the endpoint.
Instead, you can use an HTTP tunnel by setting:
tunnel : true
in the request module proxy settings.
It could be that in your case you are making a standard proxied HTTP request, whereas an HTTP tunnel is created when the proxy is set globally on your system or in a Chrome extension.
From the documentation:
Note that, when using a tunneling proxy, the proxy-authorization header and any headers from custom proxyHeaderExclusiveList are never sent to the endpoint server, but only to the proxy server.
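To make the difference between the two modes concrete, here is a minimal sketch of the first bytes a client sends to a forward proxy in each case. The helper names `plainProxyRequestHead` and `tunnelRequestHead` are hypothetical and not part of the request module; they only illustrate the wire format described above (a fully qualified URL in the request line versus a CONNECT request).

```javascript
// Hypothetical helpers (not part of the request module) that build the
// first lines a client sends to a forward proxy in each mode.

// Standard proxied request: the request line carries the absolute URL.
function plainProxyRequestHead(targetUrl, method = 'GET') {
  const url = new URL(targetUrl);
  return [
    `${method} ${url.href} HTTP/1.1`,
    `Host: ${url.host}`,
    '',
    '',
  ].join('\r\n');
}

// Tunneling: the client first asks the proxy to open a raw TCP tunnel
// with CONNECT, then talks to the endpoint through that tunnel.
function tunnelRequestHead(targetUrl) {
  const url = new URL(targetUrl);
  const defaultPort = url.protocol === 'https:' ? 443 : 80;
  const authority = `${url.hostname}:${url.port || defaultPort}`;
  return [
    `CONNECT ${authority} HTTP/1.1`,
    `Host: ${authority}`,
    '',
    '',
  ].join('\r\n');
}

console.log(plainProxyRequestHead('http://www.sitename.com/'));
console.log(tunnelRequestHead('http://www.sitename.com/'));
```

In the first form the proxy receives and parses the whole request, so it is in a position to add or rewrite headers. In the second form it is only asked to relay bytes to the endpoint, which is the behaviour `tunnel : true` requests from the request module.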
Answer 2:
There are some scenarios that I can think of:
- Proxy is actually adding some headers to the final request (in order to identify you to the server)
- The website you're trying to reach has your proxy IPs blacklisted (public/paid ones?)
It really depends on why you need to use that proxy:
- Is it because of network restrictions?
- Is it because you want to hide the original request address?
Also, if you have control over the proxy server, can you log the requests being made to the final server?
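If you cannot log on the proxy itself, one way to see what actually reaches the final server is to point the script at a small echo server that you control, once directly and once through the proxy, and compare the two captures. The following is only a sketch using Node's built-in `http` module; the function names are hypothetical. It relies on `req.rawHeaders`, which keeps header names and values in the order and casing in which they arrived.

```javascript
const http = require('http');

// Turn Node's flat rawHeaders array ([name, value, name, value, ...])
// into ordered [name, value] pairs, preserving order and casing.
function pairRawHeaders(rawHeaders) {
  const pairs = [];
  for (let i = 0; i + 1 < rawHeaders.length; i += 2) {
    pairs.push([rawHeaders[i], rawHeaders[i + 1]]);
  }
  return pairs;
}

// Compare two captures: which header names were added or removed,
// and whether the relative order of the shared names changed.
function compareHeaders(direct, proxied) {
  const directNames = direct.map(([name]) => name.toLowerCase());
  const proxiedNames = proxied.map(([name]) => name.toLowerCase());
  const added = proxiedNames.filter((n) => !directNames.includes(n));
  const removed = directNames.filter((n) => !proxiedNames.includes(n));
  const sharedDirect = directNames.filter((n) => proxiedNames.includes(n));
  const sharedProxied = proxiedNames.filter((n) => directNames.includes(n));
  const orderChanged = sharedDirect.join('\n') !== sharedProxied.join('\n');
  return { added, removed, orderChanged };
}

// Echo server: replies with the request headers exactly as they arrived.
function createEchoServer() {
  return http.createServer((req, res) => {
    const body = JSON.stringify(
      { url: req.url, headers: pairRawHeaders(req.rawHeaders) },
      null,
      2
    );
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(body);
  });
}
```

For example, `createEchoServer().listen(8080)` (the port is an arbitrary choice) on a host the proxy can reach would let you save both responses and pass their `headers` arrays to `compareHeaders`. A non-empty `added` list or `orderChanged: true` would suggest the proxy is modifying the request.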
My suggestion
Try writing your own proxy (a reverse one) and host it somewhere. Instead of requesting https://target.com, make the request to your http[s]://proxy.com/ and let the reverse proxy do the work. Also, remember to disable the X-Forwarded-* headers in the implementation, as they would change the request headers.
Reference for node.js implementation:
https://github.com/nodejitsu/node-http-proxy
Note: please answer the questions I asked above in the comments.
Answer 3:
You're using the http scheme for your request, but if the web server redirects http to https and the proxy server is not configured to accept redirects (to https), then the problem might only be about the scheme, or rather the URL you enter.
So either the proxy has to be configured to accept redirects, or the URL has to be checked manually when a request fails and then adjusted if there was a redirect.
Here you can read about redirects on one proxy-server (Apache Traffic Server), the scenario there includes more redirects than I described above:
https://docs.trafficserver.apache.org/en/4.2.x/admin/reverse-proxy-http-redirects.en.html#handling-origin-server-redirect-responses
If you still encounter problems the server-logs of the proxy-server would be helpful.
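Checking the URL manually could look roughly like the sketch below. The options in the question already set followRedirect to false, so a 3xx response is handed back to the script instead of being followed. The helper names `redirectTarget` and `isHttpsUpgrade` are hypothetical; the sketch assumes a response object whose header names are lower-cased, as Node's HTTP responses provide them.

```javascript
// Hypothetical helper: given the URL that was requested plus the status
// code and headers of the response, return the absolute redirect target,
// or null when the response is not a redirect.
function redirectTarget(requestUrl, statusCode, headers) {
  const isRedirect = statusCode >= 300 && statusCode < 400;
  const location = headers.location;
  if (!isRedirect || !location) return null;
  // Location may be relative, so resolve it against the requested URL.
  return new URL(location, requestUrl).href;
}

// True when the redirect only switches the same host from http to https.
function isHttpsUpgrade(requestUrl, targetUrl) {
  const from = new URL(requestUrl);
  const to = new URL(targetUrl);
  return (
    from.protocol === 'http:' &&
    to.protocol === 'https:' &&
    from.hostname === to.hostname
  );
}

const target = redirectTarget('http://www.sitename.com/page', 301, {
  location: 'https://www.sitename.com/page',
});
if (target && isHttpsUpgrade('http://www.sitename.com/page', target)) {
  console.log('Server wants https, retry with:', target);
}
```

If the proxied request turns out to receive such a redirect, requesting the https URL directly would take the redirect out of the picture.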
EDIT:
According to the page @Jannes Botis linked, there are still more settings that might support or disrupt the desired functionality, so the whole issue is perhaps about configuring things correctly. Here are a few settings from that page that are directly related to redirects:
- followRedirect - follow HTTP 3xx responses as redirects (default: true). This property can also be implemented as a function which gets the response object as a single argument and should return true if redirects should continue or false otherwise.
- followAllRedirects - follow non-GET HTTP 3xx responses as redirects (default: false)
- followOriginalHttpMethod - by default we redirect to HTTP method GET. You can enable this property to redirect to the original HTTP method (default: false)
- maxRedirects - the maximum number of redirects to follow (default: 10)
- removeRefererHeader - removes the referer header when a redirect happens (default: false). Note: if true, referer header set in the initial request is preserved during redirect chain.
It's quite possible that other settings of the proxy server also affect whether your scenario succeeds or fails.
Source: https://stackoverflow.com/questions/55243887/how-to-stop-nodejs-request-module-changes-request-when-using-proxy