I am trying to extract the content of a webpage A. Using Groovy, I've tried the following:
......
String urlStr = "url-of-webpage-A"
String pageText = urlSt
In Java you can use URL.openConnection() to get an HttpURLConnection (you'll need to cast). On this you can call setInstanceFollowRedirects(false).
Then you can call getResponseCode() and check whether it is HTTP_MOVED_PERM (301), HTTP_MOVED_TEMP (302) or HTTP_SEE_OTHER (303). All three indicate a redirect.
If you need to know where you're being redirected to, then you can use getHeaderField("Location") to get the location header.
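The steps above could be put together roughly as follows. This is only a minimal sketch: the class name RedirectCheck and the helpers isRedirect and redirectTarget are made up for illustration, and it checks only the three status codes named above (307 and 308 are also redirects, but HttpURLConnection has no constants for them).

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper class, not part of any library.
class RedirectCheck {

    // True for the three redirect codes mentioned above (301, 302, 303).
    static boolean isRedirect(int code) {
        return code == HttpURLConnection.HTTP_MOVED_PERM
            || code == HttpURLConnection.HTTP_MOVED_TEMP
            || code == HttpURLConnection.HTTP_SEE_OTHER;
    }

    // Returns the Location header if the server answers with a redirect, else null.
    static String redirectTarget(String address) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(address).openConnection();
        // Stop the connection from following the redirect on its own.
        con.setInstanceFollowRedirects(false);
        try {
            return isRedirect(con.getResponseCode()) ? con.getHeaderField("Location") : null;
        } finally {
            con.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java RedirectCheck <url>");
            return;
        }
        String target = redirectTarget(args[0]);
        System.out.println(target == null ? "No redirect" : "Redirects to " + target);
    }
}
```

Keeping the status-code check in its own method means it can be exercised without any network access, and it is the one place to extend if you also want to treat 307 or 308 as redirects.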
In Groovy, you could do what Joachim suggests like this:
String location = "url-of-webpage-A"
boolean wasRedirected = false
String pageContent = null
while( location ) {
    new URL( location ).openConnection().with { con ->
        // We'll do redirects ourselves
        con.instanceFollowRedirects = false
        // Get the location to jump to (null if the header is absent, which ends the loop)
        location = con.getHeaderField( "Location" )
        if( !wasRedirected && location ) {
            wasRedirected = true
        }
        // Read the HTML and close the inputstream
        pageContent = con.inputStream.withReader { it.text }
    }
}
println "wasRedirected:$wasRedirected contentLength:${pageContent.length()}"
If you don't want to be redirected, and want the contents of the first page, you simply need to do:
String location = "url-of-webpage-A"
String pageContent = new URL( location ).openConnection().with { con ->
    // Don't follow redirects automatically
    con.instanceFollowRedirects = false
    // Get the location to jump to (in case of a redirect)
    location = con.getHeaderField( "Location" )
    // Read the HTML and close the inputstream
    con.inputStream.withReader { it.text }
}
if( location ) {
    println "Page wanted to redirect to $location"
}
println "Content was:"
println pageContent