What is the best way to get the HTML for HTML Agility Pack to process?

Submitted by 爷,独闯天下 on 2019-12-31 05:17:27

Question


I can't seem to get the HTML from a few sites, but can from many others. Here are 2 sites I am having issues with:

https://www.rei.com

https://www.homedepot.com

I am building an app that will get meta tag info from a URL that the user enters. Once I have the HTML, I process it using HTML Agility Pack and it works perfectly. The problem is getting the HTML from various websites.

I have tried various ways to get the HTML (HtmlWeb, HttpWebRequest and others), all with the user-agent set (same agent string as Chrome), along with headers, cookies, auto-redirect and gzip, in seemingly every combination. I verified everything by looking at Fiddler, but I can't figure out why I can't get the HTML from some sites: the requests just time out, even though I can pull up the same URL in my browser just fine. The headers that I send look the same in Fiddler. Does anyone know what is causing these URLs to not return the HTML/data? Or does anyone know of a NuGet package or framework that handles all the nuances of getting the HTML page/document, whether the website uses SSL, is gzipped, requires cookies, redirects, etc.?

Going into this project, I thought the hardest part would be processing the HTML, not getting it, so any help would be appreciated.

UPDATE 1:

I tried but I just can't seem to get it to work... I must be missing something easy... here is an updated example with some of the suggested changes.

https://dotnetfiddle.net/tQyav7

I had to comment out the ServerCertificateValidationCallback on dotnetfiddle because it was throwing an error there, but it doesn't on my dev box. I also had to set the timeout to only 5 seconds; I have it at 20 on my dev box. Any help would be appreciated.


Answer 1:


This is your helper class, refactored to support most of the web responses that an HttpWebResponse can handle.

A note: never do this kind of setup unless you have Option Explicit and Option Strict set to On; you'll never get it right otherwise. Automatic inference is not your friend here (well, it actually never is; you really need to know what objects you're dealing with).

What has been modified and what is important to handle:

  • Tls handling: extended support for Tls 1.1, Tls 1.2 and the maximum protocol version that the current framework can handle:

    System.Enum.GetValues(GetType(SecurityProtocolType)).OfType(Of SecurityProtocolType)().Max()
    
  • WebRequest.ServicePoint.Expect100Continue = False: you never want this kind of response, unless you're ready to comply. But it's never necessary.

  • AutomaticDecompression is required, unless you want to handle the GZip or Deflate streams manually. Handling them manually is almost never needed (only if you want to analyze the original stream before decompressing it).

  • The CookieContainer is rebuilt every time. This has not been modified, but you could store a static object and reuse the Cookies with each request: some sites may set the cookies when the Tls handshake is performed and redirect to a login page. A WebRequest can be used to POST authentication parameters (except captchas), but you need to preserve the Cookies, otherwise any further request won't be authenticated.

  • The Response Stream ReadToEnd() call is also left as is, but you should modify it to read into a buffer. That would let you show the download progress, for example, and also cancel the operation if required.

  • Important: the UserAgent cannot be set to a recent version of any existing browser. Some web sites, when they detect that a User Agent supports the HSTS protocol, will activate it and wait for interaction. WebRequest knows nothing about HSTS and will time out. I set the UserAgent to Internet Explorer 11. It works fine with all sites.

  • Http Redirection is set to automatic, but sometimes it's necessary to follow it manually. This could improve the reliability of this procedure. You could, for example, forbid redirections to out-of-scope destinations, or to an HTTP protocol change that you don't support.
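
To illustrate the ReadToEnd() point in the list above, here is a minimal sketch of a buffered read. The module name, the function name ReadInChunks, the 8 KB chunk size and the onProgress callback are all hypothetical and not part of the original class. It assumes the caller passes the stream returned by GetResponseStream() and knows the text encoding.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Threading

Public Module BufferedReadSketch
    ' Hypothetical helper: reads a stream in fixed-size chunks instead of
    ' calling ReadToEnd(), so progress can be reported and the read cancelled.
    Public Function ReadInChunks(source As Stream,
                                 textEncoding As Encoding,
                                 token As CancellationToken,
                                 Optional onProgress As Action(Of Long) = Nothing) As String
        Dim chunk(8191) As Byte   ' 8192-byte buffer (assumed size)
        Dim totalRead As Long = 0
        Dim bytesRead As Integer

        Using collected As New MemoryStream()
            Do
                ' Throws OperationCanceledException if cancellation was requested.
                token.ThrowIfCancellationRequested()
                bytesRead = source.Read(chunk, 0, chunk.Length)
                If bytesRead > 0 Then
                    collected.Write(chunk, 0, bytesRead)
                    totalRead += bytesRead
                    If onProgress IsNot Nothing Then onProgress(totalRead)
                End If
            Loop While bytesRead > 0

            ' Decode once at the end, so multi-byte characters that were
            ' split across two chunks are still decoded correctly.
            Return textEncoding.GetString(collected.ToArray())
        End Using
    End Function
End Module
```

Inside GetSiteResponse, a call such as ReadInChunks(responseStream, Encoding.UTF8, CancellationToken.None) could stand in for the StreamReader, assuming the page is UTF-8 encoded.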

A suggestion: this class would benefit from the async versions of the HttpWebRequest methods: you'd be able to issue a number of concurrent requests instead of waiting for each of them to complete synchronously.
Only a few modifications are required to turn this class into an async version.
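
As a rough sketch of what such an async version might look like (not the author's implementation): the module and function names below are hypothetical, UTF-8 is an assumed encoding, and .NET Framework 4.5 or later is assumed for GetResponseAsync and Await. Only a few of the request settings are repeated here.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports System.Net
Imports System.Text
Imports System.Threading.Tasks

Public Module AsyncRequestSketch
    ' Hypothetical async counterpart of GetSiteResponse.
    Public Async Function GetHtmlAsync(siteUri As Uri, userAgent As String) As Task(Of String)
        Dim request As HttpWebRequest = WebRequest.CreateHttp(siteUri)
        request.UserAgent = userAgent
        request.AutomaticDecompression = DecompressionMethods.GZip Or DecompressionMethods.Deflate
        request.CookieContainer = New CookieContainer()
        ' Note: HttpWebRequest.Timeout has no effect on asynchronous requests,
        ' so a real implementation would need its own timeout handling.

        Using response As WebResponse = Await request.GetResponseAsync()
            Using reader As New StreamReader(response.GetResponseStream(), Encoding.UTF8)
                Return Await reader.ReadToEndAsync()
            End Using
        End Using
    End Function

    ' Starts all requests at once and waits for every one of them.
    Public Async Function GetManyAsync(uris As IEnumerable(Of Uri), userAgent As String) As Task(Of String())
        Dim pending As Task(Of String)() =
            uris.Select(Function(u As Uri) GetHtmlAsync(u, userAgent)).ToArray()
        Return Await Task.WhenAll(pending)
    End Function
End Module
```

Task.WhenAll rethrows if any single request fails, so a caller that wants partial results would have to catch the WebException inside GetHtmlAsync instead.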

This class should now support most Html pages that don't use Scripts to build the content asynchronously.
As already described in comments, a Lazy HttpClient can handle some (not all) of these pages, but it requires a completely different setup.

Imports System
Imports System.IO
Imports System.Linq
Imports System.Net
Imports System.Net.Security
Imports System.Security.Cryptography.X509Certificates
Imports System.Text

Public Class WebRequestHelper
    Private m_ResponseUri As Uri
    Private m_StatusCode As HttpStatusCode
    Private m_StatusDescription As String
    Private m_ContentSize As Long
    Private m_WebException As WebExceptionStatus
    Public Property SiteCookies As CookieContainer
    Public Property UserAgent As String = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
    Public Property Timeout As Integer = 30000
    Public ReadOnly Property ContentSize As Long
        Get
            Return m_ContentSize
        End Get
    End Property

    Public ReadOnly Property ResponseUri As Uri
        Get
            Return m_ResponseUri
        End Get
    End Property

    Public ReadOnly Property StatusCode As Integer
        Get
            Return m_StatusCode
        End Get
    End Property

    Public ReadOnly Property StatusDescription As String
        Get
            Return m_StatusDescription
        End Get
    End Property

    Public ReadOnly Property WebException As Integer
        Get
            Return m_WebException
        End Get
    End Property


    Sub New()
        SiteCookies = New CookieContainer()
    End Sub

    Public Function GetSiteResponse(ByVal siteUri As Uri) As String
        Dim response As String = String.Empty

        ServicePointManager.DefaultConnectionLimit = 50
        Dim maxFWValue As SecurityProtocolType = System.Enum.GetValues(GetType(SecurityProtocolType)).OfType(Of SecurityProtocolType)().Max()
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls11 Or SecurityProtocolType.Tls12 Or maxFWValue
        ServicePointManager.ServerCertificateValidationCallback = AddressOf TlsValidationCallback

        Dim Http As HttpWebRequest = WebRequest.CreateHttp(siteUri.ToString)
        With Http
            .Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
            .AllowAutoRedirect = True
            .AutomaticDecompression = DecompressionMethods.GZip Or DecompressionMethods.Deflate
            .CookieContainer = Me.SiteCookies
            .Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip, deflate")
            .Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.7")
            .Headers.Add(HttpRequestHeader.CacheControl, "no-cache")
            .KeepAlive = True
            .MaximumAutomaticRedirections = 50
            .ServicePoint.Expect100Continue = False
            .ServicePoint.MaxIdleTime = Me.Timeout
            .Timeout = Me.Timeout
            .UserAgent = Me.UserAgent
        End With

        Try
            Using webResponse As HttpWebResponse = DirectCast(Http.GetResponse, HttpWebResponse)
                Me.m_ResponseUri = webResponse.ResponseUri
                Me.m_StatusCode = webResponse.StatusCode
                Me.m_StatusDescription = webResponse.StatusDescription
                Dim contentLength As String = webResponse.Headers.Get("Content-Length")
                Me.m_ContentSize = If(String.IsNullOrEmpty(contentLength), 0, Convert.ToInt64(contentLength))

                Using responseStream As Stream = webResponse.GetResponseStream()
                    If webResponse.StatusCode = HttpStatusCode.OK Then
                        Dim reader As StreamReader = New StreamReader(responseStream, Encoding.Default)
                        Me.m_ContentSize = webResponse.ContentLength
                        response = reader.ReadToEnd()
                        Me.m_ContentSize = If(Me.m_ContentSize = -1, response.Length, Me.m_ContentSize)
                    End If
                End Using
            End Using
        Catch exW As WebException
            If exW.Response IsNot Nothing Then
                Me.m_StatusCode = CType(exW.Response, HttpWebResponse).StatusCode
            End If
            Me.m_StatusDescription = "WebException: " & exW.Message
            Me.m_WebException = exW.Status
        End Try
        Return response
    End Function

    ' Certificate validation callback. Note: it accepts a chain whose only
    ' problem is an untrusted root; any other chain error rejects the certificate.
    Private Function TlsValidationCallback(sender As Object, CACert As X509Certificate, CAChain As X509Chain, SslPolicyErrors As SslPolicyErrors) As Boolean
        If SslPolicyErrors = SslPolicyErrors.None Then Return True
        Dim Certificate As New X509Certificate2(CACert)

        CAChain.Build(Certificate)
        For Each CACStatus As X509ChainStatus In CAChain.ChainStatus
            If (CACStatus.Status <> X509ChainStatusFlags.NoError) AndAlso
                (CACStatus.Status <> X509ChainStatusFlags.UntrustedRoot) Then
                Return False
            End If
        Next
        Return True
    End Function

End Class
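
Since the question is about reading meta tags, here is a hedged sketch of how the returned HTML might be passed on to HTML Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced; the module name and ExtractMetaTags are hypothetical, and keying each tag by its name or property attribute is an illustrative choice.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Public Module MetaTagSketch
    ' Hypothetical helper: collects <meta> tags into a dictionary keyed by
    ' the "name" attribute, falling back to "property" (as used by Open Graph tags).
    Public Function ExtractMetaTags(html As String) As Dictionary(Of String, String)
        Dim result As New Dictionary(Of String, String)(StringComparer.OrdinalIgnoreCase)
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when no node matches the XPath expression.
        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//meta")
        If nodes Is Nothing Then Return result

        For Each node As HtmlNode In nodes
            Dim key As String = node.GetAttributeValue("name", String.Empty)
            If key.Length = 0 Then key = node.GetAttributeValue("property", String.Empty)
            ' If the same key appears twice, the later tag wins.
            If key.Length > 0 Then result(key) = node.GetAttributeValue("content", String.Empty)
        Next
        Return result
    End Function
End Module
```

The string returned by GetSiteResponse can be passed to ExtractMetaTags directly; checking StatusCode first avoids parsing an empty response.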


Source: https://stackoverflow.com/questions/55565710/what-is-the-best-way-to-get-the-html-for-html-agiligy-pack-to-process
