How to cancel Task await after a timeout period

情深已故 2020-11-22 03:33

I am using this method to instantiate a web browser programmatically, navigate to a url and return a result when the document has completed.

How would I be able to s

3 Answers
  • 2020-11-22 04:10

    Updated: the latest version of the WebBrowser-based console web scraper can be found on GitHub.

    Updated: Adding a pool of WebBrowser objects for multiple parallel downloads.

    "Do you have an example of how to do this in a console app by any chance? Also, I don't think webBrowser can be a class variable, because I am running the whole thing in a parallel foreach, iterating thousands of URLs."

    Below is an implementation of a more or less generic WebBrowser-based web scraper, which works as a console application. It's a consolidation of some of my previous WebBrowser-related efforts, including the code referenced in the question:

    • Capturing an image of the web page with opacity

    • Loading a page with dynamic AJAX content

    • Creating an STA message loop thread for WebBrowser

    • Loading a set of URLs, one after another

    • Printing a set of URLs with WebBrowser

    • Web page UI automation

    A few points:

    • The reusable MessageLoopApartment class starts and runs a WinForms STA thread with its own message pump. It can be used from a console application, as below. This class exposes a TPL task scheduler (obtained via FromCurrentSynchronizationContext) and a set of Task.Factory.StartNew wrappers that use this task scheduler.

    • This makes async/await a great tool for running WebBrowser navigation tasks on that separate STA thread. This way, a WebBrowser object gets created, navigated and destroyed on that thread. MessageLoopApartment itself is not tied to WebBrowser specifically, though.

    • It's important to enable HTML5 rendering using Browser Feature Control, as otherwise the WebBrowser object runs in IE7 emulation mode by default. That's what SetFeatureBrowserEmulation does below.

    • It may not always be possible to determine with certainty when a web page has finished rendering. Some pages are quite complex and use continuous AJAX updates. Yet we can get quite close by handling the DocumentCompleted event first, then polling the page's current HTML snapshot for changes and checking the WebBrowser.IsBusy property. That's what NavigateAsync does below.

    • Timeout logic is layered on top of the above, in case the page rendering never ends (note CancellationTokenSource and CreateLinkedTokenSource).
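
    The timeout point above can be shown without WebBrowser. The following is a minimal sketch, not part of the original answer: TimeoutSketch, RunWithTimeoutAsync and the work delegate are hypothetical names. It links a per-operation timeout to the caller's token and tells a timeout apart from a cancellation requested by the caller.

    ```csharp
    using System;
    using System.Threading;
    using System.Threading.Tasks;

    public static class TimeoutSketch
    {
        // Hypothetical helper: runs 'work' with a timeout linked to the caller's token.
        // Returns the result, or null if the timeout elapsed first.
        // Rethrows OperationCanceledException if the caller cancelled.
        public static async Task<string> RunWithTimeoutAsync(
            Func<CancellationToken, Task<string>> work,
            TimeSpan timeout,
            CancellationToken callerToken)
        {
            using (var cts = CancellationTokenSource.CreateLinkedTokenSource(callerToken))
            {
                cts.CancelAfter(timeout); // cancel on timeout or when callerToken is signalled
                try
                {
                    return await work(cts.Token);
                }
                catch (OperationCanceledException)
                {
                    if (callerToken.IsCancellationRequested)
                        throw;   // the caller asked to stop everything
                    return null; // only the timeout fired
                }
            }
        }
    }
    ```

    The full listing below applies the same linked-token idea per URL with a 30-second limit. As written there, a timeout on one URL surfaces as an OperationCanceledException from the await and ends the loop; catching it per URL, as in this sketch, would let the remaining URLs proceed.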

    using Microsoft.Win32;
    using System;
    using System.Threading;
    using System.Threading.Tasks;
    using System.Windows.Forms;
    
    namespace Console_22239357
    {
        class Program
        {
            // by Noseratio - https://stackoverflow.com/a/22262976/1768303
    
            // main logic
            static async Task ScrapSitesAsync(string[] urls, CancellationToken token)
            {
                using (var apartment = new MessageLoopApartment())
                {
                    // create WebBrowser inside MessageLoopApartment
                    var webBrowser = apartment.Invoke(() => new WebBrowser());
                    try
                    {
                        foreach (var url in urls)
                        {
                            Console.WriteLine("URL:\n" + url);
    
                            // cancel in 30s or when the main token is signalled
                            var navigationCts = CancellationTokenSource.CreateLinkedTokenSource(token);
                            navigationCts.CancelAfter((int)TimeSpan.FromSeconds(30).TotalMilliseconds);
                            var navigationToken = navigationCts.Token;
    
                            // run the navigation task inside MessageLoopApartment
                            string html = await apartment.Run(() =>
                                webBrowser.NavigateAsync(url, navigationToken), navigationToken);
    
                            Console.WriteLine("HTML:\n" + html);
                        }
                    }
                    finally
                    {
                        // dispose of WebBrowser inside MessageLoopApartment
                        apartment.Invoke(() => webBrowser.Dispose());
                    }
                }
            }
    
            // entry point
            static void Main(string[] args)
            {
                try
                {
                    WebBrowserExt.SetFeatureBrowserEmulation(); // enable HTML5
    
                    var cts = new CancellationTokenSource((int)TimeSpan.FromMinutes(3).TotalMilliseconds);
    
                    var task = ScrapSitesAsync(
                        new[] { "http://example.com", "http://example.org", "http://example.net" },
                        cts.Token);
    
                    task.Wait();
    
                    Console.WriteLine("Press Enter to exit...");
                    Console.ReadLine();
                }
                catch (Exception ex)
                {
                    while (ex is AggregateException && ex.InnerException != null)
                        ex = ex.InnerException;
                    Console.WriteLine(ex.Message);
                    Environment.Exit(-1);
                }
            }
        }
    
        /// <summary>
        /// WebBrowserExt - WebBrowser extensions
        /// by Noseratio - https://stackoverflow.com/a/22262976/1768303
        /// </summary>
        public static class WebBrowserExt
        {
            const int POLL_DELAY = 500;
    
            // navigate and download 
            public static async Task<string> NavigateAsync(this WebBrowser webBrowser, string url, CancellationToken token)
            {
                // navigate and await DocumentCompleted
                var tcs = new TaskCompletionSource<bool>();
                WebBrowserDocumentCompletedEventHandler handler = (s, arg) =>
                    tcs.TrySetResult(true);
    
                using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true))
                {
                    webBrowser.DocumentCompleted += handler;
                    try
                    {
                        webBrowser.Navigate(url);
                        await tcs.Task; // wait for DocumentCompleted
                    }
                    finally
                    {
                        webBrowser.DocumentCompleted -= handler;
                    }
                }
    
                // get the root element
                var documentElement = webBrowser.Document.GetElementsByTagName("html")[0];
    
                // poll the current HTML for changes asynchronously
                var html = documentElement.OuterHtml;
                while (true)
                {
                    // wait asynchronously, this will throw if cancellation requested
                    await Task.Delay(POLL_DELAY, token);
    
                    // continue polling if the WebBrowser is still busy
                    if (webBrowser.IsBusy)
                        continue;
    
                    var htmlNow = documentElement.OuterHtml;
                    if (html == htmlNow)
                        break; // no changes detected, end the poll loop
    
                    html = htmlNow;
                }
    
                // consider the page fully rendered 
                token.ThrowIfCancellationRequested();
                return html;
            }
    
            // enable HTML5 (assuming we're running IE10+)
            // more info: https://stackoverflow.com/a/18333982/1768303
            public static void SetFeatureBrowserEmulation()
            {
                if (System.ComponentModel.LicenseManager.UsageMode != System.ComponentModel.LicenseUsageMode.Runtime)
                    return;
                var appName = System.IO.Path.GetFileName(System.Diagnostics.Process.GetCurrentProcess().MainModule.FileName);
                Registry.SetValue(@"HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION",
                    appName, 10000, RegistryValueKind.DWord);
            }
        }
    
        /// <summary>
        /// MessageLoopApartment
        /// STA thread with message pump for serial execution of tasks
        /// by Noseratio - https://stackoverflow.com/a/22262976/1768303
        /// </summary>
        public class MessageLoopApartment : IDisposable
        {
            Thread _thread; // the STA thread
    
            TaskScheduler _taskScheduler; // the STA thread's task scheduler
    
            public TaskScheduler TaskScheduler { get { return _taskScheduler; } }
    
            /// <summary>MessageLoopApartment constructor</summary>
            public MessageLoopApartment()
            {
                var tcs = new TaskCompletionSource<TaskScheduler>();
    
                // start an STA thread and get a task scheduler
                _thread = new Thread(startArg =>
                {
                    EventHandler idleHandler = null;
    
                    idleHandler = (s, e) =>
                    {
                        // handle Application.Idle just once
                        Application.Idle -= idleHandler;
                        // return the task scheduler
                        tcs.SetResult(TaskScheduler.FromCurrentSynchronizationContext());
                    };
    
                    // handle Application.Idle just once
                    // to make sure we're inside the message loop
                    // and SynchronizationContext has been correctly installed
                    Application.Idle += idleHandler;
                    Application.Run();
                });
    
                _thread.SetApartmentState(ApartmentState.STA);
                _thread.IsBackground = true;
                _thread.Start();
                _taskScheduler = tcs.Task.Result;
            }
    
            /// <summary>shutdown the STA thread</summary>
            public void Dispose()
            {
                if (_taskScheduler != null)
                {
                    var taskScheduler = _taskScheduler;
                    _taskScheduler = null;
    
                    // execute Application.ExitThread() on the STA thread
                    Task.Factory.StartNew(
                        () => Application.ExitThread(),
                        CancellationToken.None,
                        TaskCreationOptions.None,
                        taskScheduler).Wait();
    
                    _thread.Join();
                    _thread = null;
                }
            }
    
            /// <summary>Task.Factory.StartNew wrappers</summary>
            public void Invoke(Action action)
            {
                Task.Factory.StartNew(action,
                    CancellationToken.None, TaskCreationOptions.None, _taskScheduler).Wait();
            }
    
            public TResult Invoke<TResult>(Func<TResult> action)
            {
                return Task.Factory.StartNew(action,
                    CancellationToken.None, TaskCreationOptions.None, _taskScheduler).Result;
            }
    
            public Task Run(Action action, CancellationToken token)
            {
                return Task.Factory.StartNew(action, token, TaskCreationOptions.None, _taskScheduler);
            }
    
            public Task<TResult> Run<TResult>(Func<TResult> action, CancellationToken token)
            {
                return Task.Factory.StartNew(action, token, TaskCreationOptions.None, _taskScheduler);
            }
    
            public Task Run(Func<Task> action, CancellationToken token)
            {
                return Task.Factory.StartNew(action, token, TaskCreationOptions.None, _taskScheduler).Unwrap();
            }
    
            public Task<TResult> Run<TResult>(Func<Task<TResult>> action, CancellationToken token)
            {
                return Task.Factory.StartNew(action, token, TaskCreationOptions.None, _taskScheduler).Unwrap();
            }
        }
    }
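
    The polling step of NavigateAsync above can be isolated from WebBrowser. A minimal sketch, assuming a hypothetical getSnapshot delegate in place of documentElement.OuterHtml and a hypothetical isBusy delegate in place of WebBrowser.IsBusy (PollingSketch and WaitUntilStableAsync are illustrative names as well):

    ```csharp
    using System;
    using System.Threading;
    using System.Threading.Tasks;

    public static class PollingSketch
    {
        // Polls getSnapshot until two consecutive non-busy reads are equal,
        // then returns that value. Task.Delay throws OperationCanceledException
        // when the token is cancelled, which ends the loop.
        public static async Task<string> WaitUntilStableAsync(
            Func<string> getSnapshot,
            Func<bool> isBusy,
            int pollDelayMs,
            CancellationToken token)
        {
            var last = getSnapshot();
            while (true)
            {
                await Task.Delay(pollDelayMs, token);

                if (isBusy())
                    continue; // still loading, poll again

                var now = getSnapshot();
                if (now == last)
                    return now; // no change since the previous read

                last = now;
            }
        }
    }
    ```

    For the WebBrowser case the call would look like WaitUntilStableAsync(() => documentElement.OuterHtml, () => webBrowser.IsBusy, 500, token), made on the STA thread so that each await resumes on the WinForms synchronization context.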
    
  • 2020-11-22 04:17

    I suspect running a processing loop on another thread will not work out well, since WebBrowser is a UI component that hosts an ActiveX control.

    When you're writing TAP over EAP wrappers, I recommend using extension methods to keep the code clean:

    public static Task<string> NavigateAsync(this WebBrowser @this, string url)
    {
      var tcs = new TaskCompletionSource<string>();
      WebBrowserDocumentCompletedEventHandler subscription = null;
      subscription = (_, args) =>
      {
        @this.DocumentCompleted -= subscription;
        tcs.TrySetResult(args.Url.ToString());
      };
      @this.DocumentCompleted += subscription;
      @this.Navigate(url);
      return tcs.Task;
    }
    

    Now your code can easily apply a timeout:

    async Task<string> GetUrlAsync(string url)
    {
      using (var wb = new WebBrowser())
      {
        var navigate = wb.NavigateAsync(url);
        var timeout = Task.Delay(TimeSpan.FromSeconds(5));
        var completed = await Task.WhenAny(navigate, timeout);
        if (completed == navigate)
          return await navigate;
        return null;
      }
    }
    

    which can be consumed as such:

    private async Task<Uri> GetFinalUrlAsync(PortalMerchant portalMerchant)
    {
      SetBrowserFeatureControl();
      if (string.IsNullOrEmpty(portalMerchant.Url))
        return null;
      var result = await GetUrlAsync(portalMerchant.Url);
      if (!String.IsNullOrEmpty(result))
        return new Uri(result);
      throw new Exception("Parsing Failed");
    }
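
    The Task.WhenAny timeout used in GetUrlAsync above can be factored into a reusable extension method. This is a sketch under assumptions rather than part of the original answer: WhenAnySketch and OrDefaultAfter are hypothetical names, and the method also cancels the timer once the task wins, which the original does not do.

    ```csharp
    using System;
    using System.Threading;
    using System.Threading.Tasks;

    public static class WhenAnySketch
    {
        // Returns the task's result, or 'fallback' if the timeout elapses first.
        // The task itself keeps running after a timeout; only the wait is abandoned.
        public static async Task<T> OrDefaultAfter<T>(
            this Task<T> task, TimeSpan timeout, T fallback)
        {
            using (var timerCts = new CancellationTokenSource())
            {
                var delay = Task.Delay(timeout, timerCts.Token);
                var completed = await Task.WhenAny(task, delay);
                if (completed != task)
                    return fallback;

                timerCts.Cancel(); // stop the timer, it is no longer needed
                return await task; // propagate the result or the exception
            }
        }
    }
    ```

    With this helper, the body of GetUrlAsync could be written as return await wb.NavigateAsync(url).OrDefaultAfter(TimeSpan.FromSeconds(5), null);. Note that a timeout here does not stop the navigation; it only stops waiting for it.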
    
  • 2020-11-22 04:20

    I'm trying to combine Noseratio's solution with the advice from Stephen Cleary.

    Here is Stephen's code, updated to include Noseratio's AJAX polling tip.

    First part: the Task NavigateAsync advised by Stephen

    public static Task<string> NavigateAsync(this WebBrowser @this, string url)
    {
      var tcs = new TaskCompletionSource<string>();
      WebBrowserDocumentCompletedEventHandler subscription = null;
      subscription = (_, args) =>
      {
        @this.DocumentCompleted -= subscription;
        tcs.TrySetResult(args.Url.ToString());
      };
      @this.DocumentCompleted += subscription;
      @this.Navigate(url);
      return tcs.Task;
    }
    

    Second part: a new Task NavAjaxAsync to run the tip for AJAX (based on Noseratio's code)

    // polling interval in milliseconds (same value as in Noseratio's code)
    const int POLL_DELAY = 500;

    public static async Task<string> NavAjaxAsync(this WebBrowser @this)
    {
      // get the root element
      var documentElement = @this.Document.GetElementsByTagName("html")[0];
    
      // poll the current HTML for changes asynchronously
      var html = documentElement.OuterHtml;
    
      while (true)
      {
        // wait asynchronously
        await Task.Delay(POLL_DELAY);
    
        // continue polling if the WebBrowser is still busy
        if (@this.IsBusy)
          continue;
    
        var htmlNow = documentElement.OuterHtml;
        if (html == htmlNow)
          break; // no changes detected, end the poll loop
    
        html = htmlNow;
      }
    
      return @this.Document.Url.ToString();
    }
    

    Third part: a new Task NavAndAjaxAsync to run the navigation followed by the AJAX polling

    public static async Task<string> NavAndAjaxAsync(this WebBrowser @this, string url)
    {
      await @this.NavigateAsync(url);
      // return the final URL so that GetUrlAsync can await a Task<string>
      return await @this.NavAjaxAsync();
    }
    

    Fourth and last part: the updated Task GetUrlAsync from Stephen with Noseratio's code for AJAX

    async Task<string> GetUrlAsync(string url)
    {
      using (var wb = new WebBrowser())
      {
        var navigate = wb.NavAndAjaxAsync(url);
        var timeout = Task.Delay(TimeSpan.FromSeconds(5));
        var completed = await Task.WhenAny(navigate, timeout);
        if (completed == navigate)
          return await navigate;
        return null;
      }
    }
    

    I'd like to know if this is the right approach.
