Unexpected behaviour while using HttpWebRequest on a form to obtain a table for scraping

回眸只為那壹抹淺笑 submitted on 2019-12-13 03:57:37

Question


I am trying to scrape a website written in PHP to extract some information from a particular table. Here is the scenario.

On the landing page there is a form that takes queries from the user and searches for results based on them. If I leave those fields empty and click "Submit", it returns the whole result set (which is what I am interested in). At first I did not know about the HttpWebRequest class, so I was simply passing the URL to the HtmlWeb.Load(URL) method in the HtmlAgilityPack library, which obviously was not the way to go.

Then I searched for HttpWebRequest and found an example like this:

    ' Requires: Imports System.IO, Imports System.Net, Imports System.Text
    Dim cookies As New CookieContainer
    Dim postData As String = "postData obtained using the Live HTTP Headers plugin in Firefox"
    Dim encoding As New UTF8Encoding
    Dim byteData As Byte() = encoding.GetBytes(postData)

    ' Build the POST request that mimics submitting the form
    Dim postRequest As HttpWebRequest = DirectCast(WebRequest.Create("URL"), HttpWebRequest)
    postRequest.Method = "POST"
    postRequest.KeepAlive = True
    postRequest.CookieContainer = cookies
    postRequest.ContentType = "application/x-www-form-urlencoded"
    postRequest.ContentLength = byteData.Length
    postRequest.Referer = "Referer Page"
    postRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.3) Gecko/20100401 Firefox/4.0 (.NET CLR 3.5.30729)"

    ' Write the form data into the request body
    Dim postreqstream As Stream = postRequest.GetRequestStream()
    postreqstream.Write(byteData, 0, byteData.Length)
    postreqstream.Close()

    ' Send the request and read the whole response body into a string
    Dim postresponse As HttpWebResponse
    postresponse = DirectCast(postRequest.GetResponse(), HttpWebResponse)
    cookies.Add(postresponse.Cookies)
    Dim postreqreader As New StreamReader(postresponse.GetResponseStream())

    Dim thepage As String = postreqreader.ReadToEnd

Now, when I output the thepage variable to a browser control in a VB form, I can see the page that I want (containing the tables). At this point I simply passed the URL of that page to HtmlAgilityPack, like so:

    Dim web As New HtmlAgilityPack.HtmlWeb()
    Dim htmlDoc As HtmlAgilityPack.HtmlDocument = web.Load("URL")
    Dim tabletag As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//table")
    Dim tablenode As HtmlNode = htmlDoc.DocumentNode.SelectSingleNode("//table[@summary='List of services']")

    If Not tabletag Is Nothing Then

        Console.WriteLine("YES")

    End If

But the tabletag variable is Nothing. Where am I going wrong? Also, is there any way to get the URL straight from the HttpWebResponse so I can pass it to the web.Load method?

Thank you.


Answer 1:


If the content you want is built by JavaScript, neither HtmlAgilityPack's Load method nor a simple URL loader client like WebRequest can help: they only download the markup and do not execute scripts or interact with web pages the way browsers do. Otherwise, you can load the document directly from your response stream, instead of reading it with a StreamReader first, like this:

    Dim htmlDoc As New HtmlAgilityPack.HtmlDocument
    htmlDoc.Load(postresponse.GetResponseStream())
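
One caveat when combining this with the question's code: the response stream can only be read once, so after ReadToEnd has run there is nothing left for Load to parse. A minimal sketch, assuming the page has already been read into a string such as thepage, is to parse that string with LoadHtml. The module and function names here are hypothetical, chosen only for illustration; the XPath expression is the one from the question.

```vb
Imports HtmlAgilityPack

Module TableParsing
    ' Hypothetical helper: parses HTML that has already been downloaded
    ' (for example the thepage string from the question) and returns the
    ' table whose summary attribute is 'List of services', or Nothing
    ' if no such table exists in the markup.
    Function FindServicesTable(html As String) As HtmlNode
        Dim htmlDoc As New HtmlDocument()
        htmlDoc.LoadHtml(html) ' parses the in-memory string; no second HTTP request
        Return htmlDoc.DocumentNode.SelectSingleNode("//table[@summary='List of services']")
    End Function
End Module
```

It could be called as FindServicesTable(thepage). As for the URL: postresponse.ResponseUri does return the address that answered the request, but passing it to HtmlWeb.Load issues a fresh GET without the posted form data, which is likely why the table was missing in the first place.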

First suggestion: Load the form page URL in a WebBrowser control, then fill in the form and click the submit button programmatically by accessing the HTMLDocument through the DOM. More info in posts like this and this.
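
A rough sketch of the first suggestion, under some loud assumptions: the element id "submit" is hypothetical (the real id or name has to be looked up in the form's markup), "URL" is the same placeholder used in the question, and the class name ScrapeForm is made up for this example. DocumentCompleted can also fire more than once on pages with frames, which this sketch does not handle.

```vb
Imports System.Windows.Forms

Public Class ScrapeForm
    Inherits Form

    Private WithEvents browser As New WebBrowser()
    Private submitted As Boolean = False

    ' Rendered HTML of the results page, set once that page has loaded.
    Public Property ResultHtml As String

    Public Sub New()
        browser.Dock = DockStyle.Fill
        Controls.Add(browser)
        browser.Navigate("URL") ' placeholder for the form page, as in the question
    End Sub

    Private Sub browser_DocumentCompleted(sender As Object,
            e As WebBrowserDocumentCompletedEventArgs) Handles browser.DocumentCompleted
        If browser.Document Is Nothing Then Return

        If Not submitted Then
            ' First load: the form page. "submit" is a hypothetical element id.
            Dim button As HtmlElement = browser.Document.GetElementById("submit")
            If button IsNot Nothing Then
                submitted = True
                button.InvokeMember("click") ' same effect as the user clicking Submit
            End If
        ElseIf browser.Document.Body IsNot Nothing Then
            ' Later load: the results page, after scripts have run.
            ResultHtml = browser.Document.Body.OuterHtml
        End If
    End Sub
End Class
```

The form would be shown with Application.Run(New ScrapeForm()), and the captured ResultHtml string can then be handed to HtmlAgilityPack's LoadHtml for the table lookup.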

Second suggestion: WebBrowser gets a little tricky to handle when you don't want a visual, event-driven control on your screen or, in the worst case, when you want to manipulate web pages in background threads. In that case, you can use the STAThread solution here and here, or use one of the so-called headless browsers like Selenium, HtmlUnit, or WatiN and do the same using their DOM access.
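
The STA-thread idea can be sketched roughly as follows. This is an illustration of the general pattern rather than the code from the linked posts: the module and function names are hypothetical, it only loads a page (it does not submit the form), and it stops at the first DocumentCompleted event, which may be too early on pages with frames.

```vb
Imports System.Threading
Imports System.Windows.Forms

Module BackgroundBrowser
    ' Hypothetical helper: loads a page in a WebBrowser on a dedicated STA
    ' thread and returns the rendered body HTML, or Nothing if none was captured.
    Function LoadRenderedHtml(url As String) As String
        Dim result As String = Nothing

        Dim worker As New Thread(
            Sub()
                Dim browser As New WebBrowser()
                browser.ScriptErrorsSuppressed = True
                AddHandler browser.DocumentCompleted,
                    Sub(sender, e)
                        If browser.Document IsNot Nothing AndAlso
                           browser.Document.Body IsNot Nothing Then
                            result = browser.Document.Body.OuterHtml
                        End If
                        Application.ExitThread() ' ends the message loop started below
                    End Sub
                browser.Navigate(url)
                Application.Run() ' message loop; the control needs one to raise events
                browser.Dispose()
            End Sub)

        ' WebBrowser wraps an ActiveX control, so its thread must be single-threaded apartment.
        worker.SetApartmentState(ApartmentState.STA)
        worker.Start()
        worker.Join() ' blocks the caller until the page has loaded
        Return result
    End Function
End Module
```

Blocking with Join keeps the example short; a real application would more likely signal completion through an event or a task so the calling thread stays responsive.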



Source: https://stackoverflow.com/questions/11491170/unexpected-behaviour-while-using-httpwebrequest-on-a-form-to-obtain-a-table-for
