Question
I am scraping a website that uses an Oracle ADF loopback script, which keeps redirecting me back to the same page. How can I bypass it?
Here is my PHP code:
<?php
$url = 'https://www.mywebsite.com/faces/index.jspx';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_COOKIEJAR, dirname(__FILE__) . '/cookie.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, dirname(__FILE__) . '/cookie.txt');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// Use the same array name that is passed to CURLOPT_HTTPHEADER on the next line.
$headers[] = 'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36';
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$data = curl_exec($ch);
if (curl_errno($ch)) { // check for execution errors before closing the handle
echo 'Scraper error: ' . curl_error($ch);
curl_close($ch);
exit;
}
curl_close($ch);
echo $data;
?>
When I run the code above, I am redirected back to the same page, and query string parameters are appended to the URL, such as ?_afrLoop=39478247795404&_afrWindowMode=0&_afrWindowId=null
On the actual site, _afrWindowId contains a random alphanumeric string, but I get null.
After stopping the redirection manually, I found that the page contains the following Oracle loopback script, which causes the redirection. How can I get past it?
Loopback script:
<html lang="el-GR"><head><script>
/*
** Copyright (c) 2008, Oracle and/or its affiliates. All rights reserved.
*/
/**
* This is the loopback script to process the url before the real page loads. It introduces
* a separate round trip. During this first roundtrip, we currently do two things:
* - check the url hash portion, this is for the PPR Navigation.
* - do the new window detection
* the above two are both controled by parameters in web.xml
*
* Since it's very lightweight, so the network latency is the only impact.
*
* here are the list of will-pass-in parameters (these will replace the param in this whole
* pattern:
* viewIdLength view Id length (characters),
* loopbackIdParam loopback Id param name,
* loopbackId loopback Id,
* loopbackIdParamMatchExpr loopback Id match expression,
* windowModeIdParam window mode param name,
* windowModeParamMatchExpr window mode match expression,
* clientWindowIdParam client window Id param name,
* clientWindowIdParamMatchExpr client window Id match expression,
* windowId window Id,
* initPageLaunch initPageLaunch,
* enableNewWindowDetect whether we want to enable new window detection
* jsessionId session Id that needs to be appended to the redirect URL
* enablePPRNav whether we want to enable PPR Navigation
*
*/
var id = null;
var query = null;
var href = document.location.href;
var hashIndex = href.indexOf("#");
var hash = null;
/* process the hash part of the url, split the url */
if (hashIndex > 0)
{
hash = href.substring(hashIndex + 1);
/* only analyze hash when pprNav is on (bug 8832771) */
if (false && hash && hash.length > 0)
{
hash = decodeURIComponent(hash);
if (hash.charAt(0) == "@")
{
query = hash.substring(1);
}
else
{
var state = hash.split("@");
id = state[0];
query = state[1];
}
}
href = href.substring(0, hashIndex);
}
/* process the query part */
var queryIndex = href.indexOf("?");
if (queryIndex > 0)
{
/* only when pprNav is on, we take in the query from the hash portion */
query = (query || (id && id.length>0))? query: href.substring(queryIndex);
href = href.substring(0, queryIndex);
}
var jsessionIndex = href.indexOf(';');
if (jsessionIndex > 0)
{
href = href.substring(0, jsessionIndex);
}
/* we will replace the viewId only when pprNav is turned on (bug 8832771) */
if (false)
{
if (id != null && id.length > 0)
{
href = href.substring(0, href.length - 11) + id;
}
}
var isSet = false;
if (query == null || query.length == 0)
{
query = "?";
}
else if (query.indexOf("_afrLoop=") >= 0)
{
isSet = true;
query = query.replace(/_afrLoop=[^&]*/, "_afrLoop=39279593944826");
}
else
{
query += "&";
}
if (!isSet)
{
query = query += "_afrLoop=39279593944826";
}
/* below is the new window detection logic */
var initWindowName = "_afr_init_"; // temporary window name set to a new window
var windowName = window.name;
// if the window name is "_afr_init_", treat it as redirect case of a new window
if ((true) && (!windowName || windowName==initWindowName ||
windowName!="null"))
{
/* append the _afrWindowMode param */
var windowMode;
if (true)
{
/* this is the initial page launch case,
also this could be that we couldn't detect the real windowId from the server side */
windowMode=0;
}
else if ((href.indexOf("/__ADFvDlg__") > 0) || (query.indexOf("__ADFvDlg__") >= 0))
{
/* this is the dialog case */
windowMode=1;
}
else
{
/* this is the ctrl-N case */
windowMode=2;
}
if (query.indexOf("_afrWindowMode=") >= 0)
{
query = query.replace(/_afrWindowMode=[^&]*/, "_afrWindowMode="+windowMode);
}
else
{
query = query += "&_afrWindowMode="+windowMode;
}
/* append the _afrWindowId param */
var clientWindowId;
/* in case we couldn't detect the windowId from the server side */
if (!windowName || windowName == initWindowName)
{
clientWindowId = "null";
// set window name to an initial name so we can figure out whether a page is loaded from
// cache when doing Ctrl+N with IE
window.name = initWindowName;
}
else
{
clientWindowId = windowName;
}
if (query.indexOf("_afrWindowId=") >= 0)
{
query = query.replace(/_afrWindowId=\w*/, "_afrWindowId="+clientWindowId);
}
else
{
query = query += "&_afrWindowId="+clientWindowId;
}
}
var sess = "";
if (sess.length > 0)
href += sess;
/* if pprNav is on, then the hash portion should have already been processed */
if ((false) || (hash == null))
document.location.replace(href + query);
else
document.location.replace(href + query + "#" + hash);
</script>
</head>
</html>
Answer 1:
The right way to crawl ADF pages is to add the following parameter to the URL of every GET request the script makes:
domain.com?org.apache.myfaces.trinidad.outputMode=webcrawler
Keep in mind that in crawler mode the pages look different, since that output is not meant for human consumption, but it should still contain all the raw details you would want to crawl.
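A minimal sketch of how the parameter could be added in PHP, assuming the site honours crawler mode. The helper names withCrawlerMode and fetchPage are hypothetical and only serve this illustration; the cookie file path is taken from the question's code.

```php
<?php
// Hypothetical helper: append the crawler-mode parameter to a URL,
// keeping any existing query string and fragment intact.
function withCrawlerMode(string $url): string
{
    $param = 'org.apache.myfaces.trinidad.outputMode=webcrawler';

    // Set the fragment aside so the parameter lands in the query part.
    $fragment = '';
    $hashPos = strpos($url, '#');
    if ($hashPos !== false) {
        $fragment = substr($url, $hashPos);
        $url = substr($url, 0, $hashPos);
    }

    // Leave the URL alone if an output mode is already present.
    if (strpos($url, 'org.apache.myfaces.trinidad.outputMode=') !== false) {
        return $url . $fragment;
    }

    $separator = (strpos($url, '?') === false) ? '?' : '&';
    return $url . $separator . $param . $fragment;
}

// Hypothetical helper: GET a page in crawler mode and return its body.
function fetchPage(string $url): string
{
    $ch = curl_init(withCrawlerMode($url));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, __DIR__ . '/cookie.txt');
    curl_setopt($ch, CURLOPT_COOKIEFILE, __DIR__ . '/cookie.txt');

    $body = curl_exec($ch);
    if ($body === false) {
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException('Scraper error: ' . $error);
    }
    curl_close($ch);
    return $body;
}
```

With these in place, a call such as fetchPage('https://www.mywebsite.com/faces/index.jspx') would request the page with the parameter appended. Whether the response then skips the loopback script depends on the server's configuration.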
Although this is an old question and the OP may have long since moved on, I am answering it here to help anybody else who hits the same problem.
Source: https://stackoverflow.com/questions/53995849/how-to-bypass-oracle-adf-loopback-script-for-scripting-website-using-php-curl-li
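As an alternative to crawler mode, the URL rewriting that the loopback script in the question performs can be mirrored in PHP. cURL does not execute JavaScript, so the scraper has to build the second URL itself. The sketch below covers only the initial page launch branch of that script (_afrWindowMode=0, _afrWindowId=null). The function names extractLoopId and buildLoopbackUrl are hypothetical, and it is an assumption that the server returns the real page once this URL is requested with the same session cookies.

```php
<?php
// Hypothetical helper: pull the numeric _afrLoop value out of the loopback page.
function extractLoopId(string $html): ?string
{
    if (preg_match('/_afrLoop=(\d+)/', $html, $matches)) {
        return $matches[1];
    }
    return null;
}

// Hypothetical helper: rebuild the URL the loopback script would redirect to
// in the initial-launch case.
function buildLoopbackUrl(string $href, string $loopId): string
{
    // Drop the hash part, as the script does when pprNav is off.
    $hashPos = strpos($href, '#');
    if ($hashPos !== false) {
        $href = substr($href, 0, $hashPos);
    }

    // Separate the query string from the path.
    $query = '';
    $queryPos = strpos($href, '?');
    if ($queryPos !== false) {
        $query = substr($href, $queryPos);
        $href = substr($href, 0, $queryPos);
    }

    // Strip a ";jsessionid=..." style suffix from the path.
    $semiPos = strpos($href, ';');
    if ($semiPos !== false) {
        $href = substr($href, 0, $semiPos);
    }

    // Set or replace the three parameters in the order the script uses.
    // Note: the script matches _afrWindowId with \w*; [^&]* is used for all here.
    $params = [
        '_afrLoop'       => $loopId,
        '_afrWindowMode' => '0',
        '_afrWindowId'   => 'null',
    ];
    foreach ($params as $name => $value) {
        $pattern = '/' . $name . '=[^&]*/';
        if (preg_match($pattern, $query)) {
            $query = preg_replace($pattern, $name . '=' . $value, $query, 1);
        } elseif ($query === '') {
            $query = '?' . $name . '=' . $value;
        } else {
            $query .= '&' . $name . '=' . $value;
        }
    }

    return $href . $query;
}
```

A scraper would fetch the page once, pass the response to extractLoopId, and then request the result of buildLoopbackUrl with the same cookie jar. Printing the raw first response with echo in a browser lets the browser run the script itself, which may be why the redirect appears to loop on the scraper's own page.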