Get current WebBrowser DOM as HTML

無奈伤痛 2021-01-13 14:22

I want to use the HTML Agility Pack on a WebBrowser that has already loaded everything I need (it clicks a button with code to load every video on the channel) (It loads a YouTu

1 Answer
  • 2021-01-13 14:48

    If the target website relies heavily on AJAX (as YouTube does), it is hard, if not impossible, to determine when the page has finished loading and running all of its dynamic scripts. You can get close by handling the window.onload event and then allowing an extra second or two for non-deterministic AJAX calls. After that, read webBrowser.Document.DomDocument.documentElement.outerHTML via dynamic to get the currently rendered HTML.

    Example:

    private void Form1_Load(object sender, EventArgs e)
    {
        DownloadAsync("http://www.example.com").ContinueWith(
            (task) => MessageBox.Show(task.Result),
            TaskScheduler.FromCurrentSynchronizationContext());
    }
    
    async Task<string> DownloadAsync(string url)
    {
        TaskCompletionSource<bool> onloadTcs = new TaskCompletionSource<bool>();
        WebBrowserDocumentCompletedEventHandler handler = null;
    
        handler = delegate
        {
            this.webBrowser.DocumentCompleted -= handler;
    
            // subscribe to the DOM window.onload event
            this.webBrowser.Document.Window.AttachEventHandler("onload", delegate
            {
                // each navigation has its own TaskCompletionSource
                if (onloadTcs.Task.IsCompleted)
                    return; // this should not be happening
                // signal the completion of the page loading
                onloadTcs.SetResult(true);
            });
        };
    
        // register DocumentCompleted handler
        this.webBrowser.DocumentCompleted += handler;
    
        // Navigate to url
        this.webBrowser.Navigate(url);
    
        // continue upon onload
        await onloadTcs.Task;
    
        // artificial delay for AJAX
        await Task.Delay(1000);
    
        // the document should be loaded by now; read the rendered DOM
        return ((dynamic)this.webBrowser.Document.DomDocument).documentElement.outerHTML;
    }
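
    The fixed one-second delay is a guess. As a possible alternative (not part of the original answer), the sketch below polls a snapshot function until two consecutive results are identical, or gives up after a maximum number of checks. The names DomPolling and WaitUntilStableAsync, and the interval and maxChecks parameters, are hypothetical helpers introduced only for illustration.

    ```csharp
    using System;
    using System.Threading.Tasks;

    static class DomPolling
    {
        // Hypothetical helper, not part of the WebBrowser API.
        // Calls `snapshot` repeatedly and returns as soon as two consecutive
        // results are equal, or after `maxChecks` polls at the latest.
        public static async Task<string> WaitUntilStableAsync(
            Func<string> snapshot, TimeSpan interval, int maxChecks)
        {
            string previous = snapshot();
            for (int i = 0; i < maxChecks; i++)
            {
                await Task.Delay(interval);
                string current = snapshot();
                if (current == previous)
                    return current; // unchanged since the last poll
                previous = current;
            }
            return previous; // gave up waiting; return the latest snapshot
        }
    }
    ```

    Inside DownloadAsync, the delay and the final return could then be replaced by something like `return await DomPolling.WaitUntilStableAsync(() => (string)((dynamic)this.webBrowser.Document.DomDocument).documentElement.outerHTML, TimeSpan.FromMilliseconds(500), 20);`. Because the await resumes on the UI synchronization context when called from the form, the snapshot runs on the UI thread, which WebBrowser requires. A page that keeps mutating its DOM (for example, with animations or timers) may never look stable, so the maxChecks cap still matters.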
    

    [EDITED] Here's the final piece of code that helped to solve the OP's problem:

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(((dynamic)this.webBrowser1.Document.DomDocument).documentElement.outerHTML);
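
    Once the rendered HTML is in an HtmlAgilityPack document, it can be queried with XPath. The sketch below is only an illustration: the LinkExtractor class and the `//a[@href]` query are assumptions, since the thread does not show which elements the OP needed or what the page markup looks like. The type is written as HtmlAgilityPack.HtmlDocument because a WinForms project also has System.Windows.Forms.HtmlDocument in scope.

    ```csharp
    using System;
    using System.Collections.Generic;

    static class LinkExtractor
    {
        // Hypothetical helper: returns the href value of every <a> element
        // that has an href attribute. Requires the HtmlAgilityPack NuGet package.
        public static List<string> ExtractHrefs(string html)
        {
            var doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(html);

            var hrefs = new List<string>();

            // SelectNodes returns null, not an empty collection, when nothing matches.
            var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
            if (anchors == null)
                return hrefs;

            foreach (HtmlAgilityPack.HtmlNode anchor in anchors)
                hrefs.Add(anchor.GetAttributeValue("href", string.Empty));

            return hrefs;
        }
    }
    ```

    The string passed to ExtractHrefs would be the outerHTML value obtained from the WebBrowser control as shown above.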
    