HTML traversal is very slow

Submitted by 半世苍凉 on 2019-12-10 11:16:57

Question


I found that simply iterating through MSHTML elements from C# is horribly slow. Here is a small example that iterates through the document.all collection three times. It is a blank WPF application with a WebBrowser control named Browser:

using System.Diagnostics;
using System.Windows.Navigation;
using mshtml;

public partial class MainWindow
{
    public MainWindow()
    {
        InitializeComponent();

        Browser.LoadCompleted += DocumentLoaded;
        Browser.Navigate("http://google.com");
    }

    private IHTMLElementCollection _items;

    private void DocumentLoaded(object sender, NavigationEventArgs e)
    {
        var dc = (HTMLDocument)Browser.Document;
        _items = dc.all;

        Test();
        Test();
        Test();
    }

    private void Test()
    {
        var sw = new Stopwatch();
        sw.Start();

        int i;
        for (i = 0; i < _items.length; i++)
        {
            _items.item(i);
        }

        sw.Stop();

        Debug.WriteLine("Items: {0}, Time: {1}", i, sw.Elapsed);
    }
}

The output is:

Items: 274, Time: 00:00:01.0573245
Items: 274, Time: 00:00:00.0011637
Items: 274, Time: 00:00:00.0006619

The performance difference between the first and second runs is horrible. I rewrote the same code in unmanaged C++ with COM and saw no performance issue at all: the unmanaged code runs 1200 times faster. Unfortunately, going unmanaged is not an option, because the real project is more complex than simple iteration.

I understand that on the first pass the runtime creates an RCW for each referenced HTML element, since each one is a COM object. But can it be THAT slow? That is roughly 300 items per second with 100% load on one core of a 3.2 GHz CPU.

Performance analysis of the code above: [profiler screenshot not preserved]


Answer 1:


Enumerate the all collection with foreach instead of calling document.all.item(index) (use IHTMLElementCollection::get__newEnum if you switch to C++).

Suggested reading: IE + JavaScript Performance Recommendations - Part 1
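
A minimal sketch of the foreach approach, for illustration only. It relies on the fact that IHTMLElementCollection derives from IEnumerable (as the interface declaration in this thread shows), so it can be passed to any method that takes a non-generic IEnumerable. The class and method names (CollectionTiming, CountWithForEach, TimeForEach) are hypothetical, and whether this removes the first-pass delay from the question has not been measured here.

```csharp
using System;
using System.Collections;
using System.Diagnostics;

// Hypothetical helper, not part of MSHTML or the original question.
public static class CollectionTiming
{
    // Walks any non-generic IEnumerable with foreach and returns the item count.
    // For a COM collection the enumerator comes from the collection itself,
    // so item(index) is never called.
    public static int CountWithForEach(IEnumerable items)
    {
        int count = 0;
        foreach (object item in items)
        {
            count++;
        }
        return count;
    }

    // Times one pass and writes a line in the same format as Test() in the question.
    public static int TimeForEach(IEnumerable items)
    {
        var sw = Stopwatch.StartNew();
        int count = CountWithForEach(items);
        sw.Stop();

        Debug.WriteLine("Items: {0}, Time: {1}", count, sw.Elapsed);
        return count;
    }
}
```

In the question's code, the body of Test() could then be replaced by a call such as CollectionTiming.TimeForEach(_items), which makes the two access patterns directly comparable.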




Answer 2:


The source of the poor performance is that collection items are declared as dynamic objects in the MSHTML interop assembly.

public interface IHTMLElementCollection : IEnumerable
{
    ...
    [DispId(0)]
    dynamic item(object name = Type.Missing, object index = Type.Missing);
    ...
}

If we rewrite that interface so that item returns a plain object marshaled as IDispatch, the lag disappears.

public interface IHTMLElementCollection : IEnumerable
{
    ...
    [DispId(0)]
    [return: MarshalAs(UnmanagedType.IDispatch)]
    object item(object name = Type.Missing, object index = Type.Missing);
    ...
}

New output:

Items: 246, Time: 00:00:00.0034520
Items: 246, Time: 00:00:00.0029398
Items: 246, Time: 00:00:00.0029968


Source: https://stackoverflow.com/questions/14666302/html-traversal-is-very-slow
