Question
I am trying to scrape a website written in PHP to extract some information from a particular table. Here is the scenario.
On the landing page there is a form that takes a query from the user and searches based on it. If I leave those fields empty and click "Submit", it returns the full result set, which is what I am interested in. At first I did not know about the HttpWebRequest class and simply passed the URL to the HtmlWeb.Load(URL) method of the HtmlAgilityPack library, which obviously was not the way to go.
Then I searched for HttpWebRequest and found an example like this:
Dim cookies As New CookieContainer
Dim postData As String = "postData obtained using the Live HTTP Headers plugin in Firefox"
Dim encoding As New UTF8Encoding
Dim byteData As Byte() = encoding.GetBytes(postData)
Dim postRequest As HttpWebRequest = DirectCast(WebRequest.Create("URL"), HttpWebRequest)
postRequest.Method = "POST"
postRequest.KeepAlive = True
postRequest.CookieContainer = cookies
postRequest.ContentType = "application/x-www-form-urlencoded"
postRequest.ContentLength = byteData.Length
postRequest.Referer = "Referer Page"
postRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.3) Gecko/20100401 Firefox/4.0 (.NET CLR 3.5.30729)"
Dim postreqstream As Stream = postRequest.GetRequestStream()
postreqstream.Write(byteData, 0, byteData.Length)
postreqstream.Close()
Dim postresponse As HttpWebResponse
postresponse = DirectCast(postRequest.GetResponse(), HttpWebResponse)
cookies.Add(postresponse.Cookies)
Dim postreqreader As New StreamReader(postresponse.GetResponseStream())
Dim thepage As String = postreqreader.ReadToEnd
Now when I output the thepage variable to a browser control on a VB form, I can see the page that I want (containing the tables). At this point I simply passed the URL of that page to HtmlAgilityPack like so:
Dim web As New HtmlAgilityPack.HtmlWeb()
Dim htmlDoc As HtmlAgilityPack.HtmlDocument = web.Load("URL")
Dim tabletag As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//table")
Dim tablenode As HtmlNode = htmlDoc.DocumentNode.SelectSingleNode("//table[@summary='List of services']")
If Not tabletag Is Nothing Then
Console.WriteLine("YES")
End If
But the tabletag variable is Nothing. Where am I going wrong? Also, is there any way to get the URL straight from the HttpWebResponse so I can pass it into the web.Load method?
Thank you.
Answer 1:
If the content you want is built by JavaScript, you cannot get it through the HtmlAgilityPack Load method or any simple URL loader client like WebRequest. They do not execute scripts or interact with web pages the way browsers do. Otherwise, you could just load directly from your response stream like this:
Dim htmlDoc As New HtmlAgilityPack.HtmlDocument
htmlDoc.Load(postresponse.GetResponseStream())
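In the question's code the response stream has already been read to the end by the StreamReader, so reading the same stream a second time would not return the page again. A minimal sketch of parsing the already-downloaded string instead, assuming the HtmlAgilityPack package is referenced; FindServicesTable and the sample markup are hypothetical and exist only for illustration:

```vbnet
Imports System
Imports HtmlAgilityPack

Module ParseFromString

    ' Hypothetical helper: parses an HTML string and returns the table
    ' the question is looking for, or Nothing if it is not present.
    Function FindServicesTable(html As String) As HtmlNode
        Dim doc As New HtmlDocument()
        ' LoadHtml parses a string; Load expects a stream, reader, or file path.
        doc.LoadHtml(html)
        Return doc.DocumentNode.SelectSingleNode("//table[@summary='List of services']")
    End Function

    Sub Main()
        ' Stand-in for the thepage variable from the question (made-up markup).
        Dim thepage As String =
            "<html><body><table summary='List of services'>" &
            "<tr><td>Service A</td></tr></table></body></html>"

        Dim table As HtmlNode = FindServicesTable(thepage)
        If table Is Nothing Then
            Console.WriteLine("Table not found in the returned HTML.")
            Return
        End If

        ' SelectNodes returns Nothing when there are no matches, so guard it.
        Dim cells As HtmlNodeCollection = table.SelectNodes(".//td")
        If cells IsNot Nothing Then
            For Each cell As HtmlNode In cells
                Console.WriteLine(cell.InnerText.Trim())
            Next
        End If
    End Sub

End Module
```

HttpWebResponse also exposes the final address as ResponseUri, but passing that to web.Load issues a fresh GET without the posted form data, so parsing the response you already have is usually the safer route.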
First suggestion: You can load the form page URL in a WebBrowser control, then fill in the form and click the submit button programmatically by accessing the HTMLDocument via the DOM. More info in posts like this and this.
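A rough sketch of the WebBrowser approach, under several assumptions that are not in the original answer: the search form is the first form on the page, submitting it with empty fields returns the full table, and "URL" is the same placeholder used in the question. ScrapeForm is a hypothetical class name.

```vbnet
Imports System
Imports System.Windows.Forms

Public Class ScrapeForm
    Inherits Form

    Private ReadOnly browser As New WebBrowser()
    Private submitted As Boolean = False

    Public Sub New()
        browser.Dock = DockStyle.Fill
        browser.ScriptErrorsSuppressed = True
        Controls.Add(browser)
        AddHandler browser.DocumentCompleted, AddressOf OnDocumentCompleted
        ' Placeholder address, as in the question.
        browser.Navigate("URL")
    End Sub

    ' Fires after each document (including frames) finishes loading.
    Private Sub OnDocumentCompleted(sender As Object, e As WebBrowserDocumentCompletedEventArgs)
        If browser.Document Is Nothing Then Return

        If Not submitted Then
            ' Assumption: the search form is the first form on the landing page.
            If browser.Document.Forms.Count > 0 Then
                submitted = True
                browser.Document.Forms(0).InvokeMember("submit")
            End If
        ElseIf browser.Document.Body IsNot Nothing Then
            ' Hand the rendered DOM, scripts already executed, to HtmlAgilityPack.
            Dim doc As New HtmlAgilityPack.HtmlDocument()
            doc.LoadHtml(browser.Document.Body.OuterHtml)
            Dim table As HtmlAgilityPack.HtmlNode =
                doc.DocumentNode.SelectSingleNode("//table[@summary='List of services']")
            Console.WriteLine(If(table Is Nothing, "Table not found", "Table found"))
        End If
    End Sub

    <STAThread()>
    Public Shared Sub Main()
        Application.Run(New ScrapeForm())
    End Sub

End Class
```

Calling submit on the form element skips any onsubmit script handlers; if the page depends on them, invoking "click" on the submit button element may behave more like a real user.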
Second suggestion: WebBrowser gets a little tricky to handle when you do not want a visual, event-driven control on your screen or, in the worst case, when you want to manipulate web pages in background threads. In that case, you can use the STAThread solution here and here, or use a browser automation tool or so-called headless browser such as Selenium, HtmlUnit, or WatiN and do the same using its DOM access.
Source: https://stackoverflow.com/questions/11491170/unexpected-behaviour-while-using-httpwebrequest-on-a-form-to-obtain-a-table-for