HtmlAgilityPack and dynamic content issue

Submitted by 两盒软妹~` on 2019-11-26 13:08:23

Question


I want to create a web scraper application, and I want to do it with the WebBrowser control, HtmlAgilityPack and XPath.

So far I have managed to create an XPath generator (I used the WebBrowser control for this), which works fine, but sometimes I cannot grab content that is generated dynamically (via JavaScript or AJAX). I also found that the WebBrowser control (actually the IE browser) generates some extra tags such as "tbody", which HtmlAgilityPack does not see when I load the page with `htmlWeb.Load(webBrowser.DocumentStream);`.

Another note: I found that the following code actually grabs the current web page source, but I couldn't feed it to HtmlAgilityPack: `(mshtml.IHTMLDocument3)webBrowser.Document.DomDocument;`

Can you please help me with it?


Answer 1:


I just spent hours trying to get HtmlAgilityPack to pick up some dynamic AJAX content from a web page, going from one useless post to another, until I found this one.

The answer is hidden in a comment under the initial post, so I thought I should spell it out.

This is the method I used initially, which didn't work:

private void LoadTraditionalWay(String url)
{
    WebRequest myWebRequest = WebRequest.Create(url);
    WebResponse myWebResponse = myWebRequest.GetResponse();
    Stream ReceiveStream = myWebResponse.GetResponseStream();
    Encoding encode = System.Text.Encoding.GetEncoding("utf-8");
    TextReader reader = new StreamReader(ReceiveStream, encode);
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.Load(reader);
    reader.Close();
}

WebRequest only downloads the raw HTML. It does not execute the JavaScript or the AJAX requests that produce the missing content.

This is the solution that worked:

private void LoadHtmlWithBrowser(String url)
{
    webBrowser1.ScriptErrorsSuppressed = true;
    webBrowser1.Navigate(url);

    waitTillLoad(this.webBrowser1);

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    var documentAsIHtmlDocument3 = (mshtml.IHTMLDocument3)webBrowser1.Document.DomDocument; 
    StringReader sr = new StringReader(documentAsIHtmlDocument3.documentElement.outerHTML); 
    doc.Load(sr);
}

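private void waitTillLoad(WebBrowser webBrControl)
{
    WebBrowserReadyState loadStatus;
    int waittime = 100000; // maximum number of polling iterations, not a time unit
    int counter = 0;

    // Phase 1: pump messages until navigation has actually started
    // (ReadyState is no longer Complete), or give up after waittime iterations.
    while (true)
    {
        loadStatus = webBrControl.ReadyState;
        Application.DoEvents();
        if ((counter > waittime) || (loadStatus == WebBrowserReadyState.Uninitialized) || (loadStatus == WebBrowserReadyState.Loading) || (loadStatus == WebBrowserReadyState.Interactive))
        {
            break;
        }
        counter++;
    }

    // Phase 2: pump messages until the document is complete and the control is idle.
    // Note: this loop has no timeout and keeps spinning if the page never finishes loading.
    while (true)
    {
        loadStatus = webBrControl.ReadyState;
        Application.DoEvents();
        if (loadStatus == WebBrowserReadyState.Complete && !webBrControl.IsBusy)
        {
            break;
        }
    }
}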

The idea is to load the page in the WebBrowser control, which can execute the scripts that produce the AJAX content, and to wait until the page has fully rendered. Only then is the live DOM read through the Microsoft.mshtml library and its HTML re-parsed with the Agility Pack.

This was the only way I could get access to the dynamic data.

Hope it helps someone
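
For what it's worth, the same idea can probably be written without the busy-wait loop by listening for the control's DocumentCompleted event. The following is only a sketch under a few assumptions: the class and method names (RenderedPageLoader, NavigateAndGetHtmlAsync, Parse) are made up for illustration, the code runs on the UI thread of a WinForms app, and the page has finished its script work by the time the top-level document completes. Pages that keep loading data afterwards may need an extra delay or a check for a specific element.

```csharp
using System;
using System.Threading.Tasks;
using System.Windows.Forms;

// Hypothetical helper, not part of HtmlAgilityPack or WinForms.
public static class RenderedPageLoader
{
    // Navigates the control and completes with the live DOM's HTML
    // once the top-level document has fired DocumentCompleted.
    public static Task<string> NavigateAndGetHtmlAsync(WebBrowser browser, string url)
    {
        var tcs = new TaskCompletionSource<string>();
        WebBrowserDocumentCompletedEventHandler handler = null;
        handler = (sender, e) =>
        {
            // DocumentCompleted also fires for every frame; skip those.
            if (!e.Url.Equals(browser.Url)) return;

            browser.DocumentCompleted -= handler;
            var roots = browser.Document.GetElementsByTagName("html");
            // OuterHtml reflects the DOM after scripts have run.
            tcs.TrySetResult(roots.Count > 0 ? roots[0].OuterHtml : string.Empty);
        };

        browser.ScriptErrorsSuppressed = true;
        browser.DocumentCompleted += handler;
        browser.Navigate(url);
        return tcs.Task;
    }

    // Parses an HTML string with HtmlAgilityPack. The type is fully qualified
    // because System.Windows.Forms also defines an HtmlDocument class.
    public static HtmlAgilityPack.HtmlDocument Parse(string html)
    {
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }
}
```

From an async event handler on the form it could be used roughly like this: `var doc = RenderedPageLoader.Parse(await RenderedPageLoader.NavigateAndGetHtmlAsync(webBrowser1, url));`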




Answer 2:


Would Selenium do the trick? As far as I am aware, it drives real browser instances, so it should allow the JavaScript to be executed and let you read the resulting, manipulated DOM.
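
A rough sketch of what that might look like in C#, assuming the Selenium.WebDriver and Selenium.Support NuGet packages, a locally installed Chrome, and HtmlAgilityPack. The class name SeleniumLoader, the method LoadRendered and the readySelector parameter are made up for this example, and the 10 second timeout is arbitrary.

```csharp
using System;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

// Hypothetical helper for illustration only.
public static class SeleniumLoader
{
    // Opens the page in Chrome, waits until an element matching readySelector
    // exists, then hands the browser's current page source to HtmlAgilityPack.
    public static HtmlDocument LoadRendered(string url, string readySelector)
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl(url);

            // Poll until the dynamically generated element shows up;
            // throws WebDriverTimeoutException if it never does.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
            wait.Until(d => d.FindElements(By.CssSelector(readySelector)).Count > 0);

            var doc = new HtmlDocument();
            doc.LoadHtml(driver.PageSource);
            return doc;
        }
    }
}
```

The CSS selector should point at something that only exists after the AJAX call has finished, for example a row of the table you want to scrape.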




Answer 3:


Use the following method of the HtmlAgilityPack document:

htmlAgilityPackDocument.LoadHtml(this.browser.DocumentText);

Or:

if (this.browser.Document.GetElementsByTagName("html")[0] != null)
    htmlAgilityPackDocument.LoadHtml(this.browser.Document.GetElementsByTagName("html")[0].OuterHtml);


Source: https://stackoverflow.com/questions/10169484/htmlagilitypack-and-dynamic-content-issue
