Question
I can't seem to get the HTML from a few sites, but can from many others. Here are 2 sites I am having issues with:
https://www.rei.com
https://www.homedepot.com
I am building an app that will get meta tag info from a URL that the user enters. Once I have the HTML, I process it using HTML Agility Pack and it works perfectly. The problem is with getting the HTML from various websites.
I have tried various ways to get the HTML (HtmlWeb, HttpWebRequest and others), all with setting the user-agent (same agent tag as Chrome), headers, cookies, auto-redirect and gzip, in seemingly every combination. All of it was verified by looking at Fiddler, but I can't figure out why I can't get the HTML from some sites: they just time out, while I can pull up that same URL in my browser just fine. The headers that I send look the same as in Fiddler.
Does anyone know what is causing the URLs to not return the HTML/data? Or does anyone have a NuGet package or framework that handles all the nuances of getting the HTML page/document, whether the website uses SSL, is gzip'ed, requires cookies, redirects, etc.?
Going into this project I thought the hardest part would be processing the HTML, not getting it, so any help would be appreciated.
UPDATE 1:
I tried but I just can't seem to get it to work... I must be missing something easy... here is an updated example with some of the suggested changes.
https://dotnetfiddle.net/tQyav7
I had to comment out the ServerCertificateValidationCallback on dotnetfiddle because it was throwing an error there, but it doesn't on my dev box. I also had to set the timeout to only 5 seconds... I have it at 20 on my dev box. Any help would be appreciated.
Answer 1:
This is your helper class, refactored to support most of the web responses that an HttpWebResponse can handle.
A note: never do this kind of setup unless Option Explicit and Option Strict are set to On: you'll never get it right otherwise. Automatic inference is not your friend here (well, it actually never is; you really need to know what objects you're dealing with).
What has been modified and what is important to handle:
- TLS handling: extended support for TLS 1.1, TLS 1.2 and the maximum protocol version that the current framework can handle: System.Enum.GetValues(GetType(SecurityProtocolType)).OfType(Of SecurityProtocolType)().Max()
- WebRequest.ServicePoint.Expect100Continue = False: you never want this kind of response unless you're ready to comply with it, and it's never necessary.
- AutomaticDecompression is required, unless you want to handle the GZip or Deflate streams manually. Manual handling is almost never required (only if you want to analyze the original stream before decompressing it).
- The CookieContainer is rebuilt every time. This has not been modified, but you could store a static object and reuse the cookies with each request: some sites may set the cookies when the TLS handshake is performed and redirect to a login page. A WebRequest can be used to POST authentication parameters (except captchas), but you need to preserve the cookies, otherwise any further request won't be authenticated.
- The response stream's ReadToEnd() call is also left as is, but you should modify it to read into a buffer. That would allow you to show the download progress, for example, and also to cancel the operation if required.
- Important: the UserAgent cannot be set to a recent version of any existing browser. Some web sites, when they detect that a User Agent supports the HSTS protocol, will activate it and wait for interaction. WebRequest knows nothing about HSTS and will time out. I set the UserAgent to Internet Explorer 11. It works fine with all sites.
- HTTP redirection is set to automatic, but sometimes it's necessary to follow it manually. This could improve the reliability of this procedure. You could, for example, forbid redirections to out-of-scope destinations, or to a change of HTTP protocol that you don't support.
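The buffered read suggested in the list above could look roughly like this. It is a minimal sketch, not part of the original class: ReadAllBuffered, its parameters and the 8 KB chunk size are hypothetical choices, and a MemoryStream stands in for the stream returned by webResponse.GetResponseStream().

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Threading

Module BufferedReadSketch
    ' Hypothetical helper: reads the whole stream in fixed-size chunks,
    ' reports the running byte count and checks for cancellation between chunks.
    Public Function ReadAllBuffered(source As Stream,
                                    textEncoding As Encoding,
                                    progress As IProgress(Of Long),
                                    token As CancellationToken) As String
        Dim chunk(8191) As Byte ' 8192 bytes; an arbitrary size
        Dim total As Long = 0
        Using collected As New MemoryStream()
            Do
                token.ThrowIfCancellationRequested()
                Dim bytesRead As Integer = source.Read(chunk, 0, chunk.Length)
                If bytesRead = 0 Then Exit Do
                collected.Write(chunk, 0, bytesRead)
                total += bytesRead
                If progress IsNot Nothing Then progress.Report(total)
            Loop
            ' Decode once at the end so multi-byte characters are never split.
            Return textEncoding.GetString(collected.ToArray())
        End Using
    End Function

    Sub Main()
        ' Stand-in for the real response stream.
        Dim payload As Byte() = Encoding.UTF8.GetBytes("<html><head><title>demo</title></head></html>")
        Using fakeResponse As New MemoryStream(payload)
            Dim html As String = ReadAllBuffered(fakeResponse, Encoding.UTF8, Nothing, CancellationToken.None)
            Console.WriteLine(html)
        End Using
    End Sub
End Module
```

In the helper class, a call like this would replace reader.ReadToEnd(); passing a real CancellationToken and an IProgress(Of Long) implementation is what enables cancellation and progress display.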
A suggestion: this class would benefit from the async version of the HttpWebRequest methods: you'd be able to issue a number of concurrent requests instead of waiting for each of them to complete synchronously.
Only a few modifications are required to turn this class into an async version.
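As a rough illustration of the async variant, the sketch below uses HttpWebRequest.GetResponseAsync and StreamReader.ReadToEndAsync. The names GetHtmlAsync and GetManyAsync are hypothetical, UTF-8 decoding is an assumption, and most of the settings of the full class are omitted for brevity. Note that HttpWebRequest.Timeout does not apply to the asynchronous methods, so a real version would need its own timeout handling.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports System.Net
Imports System.Text
Imports System.Threading.Tasks

Module AsyncRequestSketch
    ' Hypothetical async counterpart of GetSiteResponse (reduced settings).
    Public Async Function GetHtmlAsync(siteUri As Uri) As Task(Of String)
        Dim request As HttpWebRequest = WebRequest.CreateHttp(siteUri)
        request.AutomaticDecompression = DecompressionMethods.GZip Or DecompressionMethods.Deflate
        request.CookieContainer = New CookieContainer()
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

        Using response As WebResponse = Await request.GetResponseAsync()
            ' Assumption: the page is UTF-8 encoded.
            Using reader As New StreamReader(response.GetResponseStream(), Encoding.UTF8)
                Return Await reader.ReadToEndAsync()
            End Using
        End Using
    End Function

    ' Starts all requests at once and waits for every one of them to finish.
    Public Async Function GetManyAsync(uris As IEnumerable(Of Uri)) As Task(Of String())
        Return Await Task.WhenAll(uris.Select(Function(u) GetHtmlAsync(u)))
    End Function
End Module
```

Task.WhenAll fails as a whole if any single request throws, so a production version would probably catch WebException inside GetHtmlAsync, as the synchronous class does.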
This class should now support most HTML pages that don't use scripts to build their content asynchronously.
As already described in comments, a Lazy HttpClient can handle some (not all) of these pages, but it requires a completely different setup.
    Imports System
    Imports System.IO
    Imports System.Linq
    Imports System.Net
    Imports System.Net.Security
    Imports System.Security.Cryptography.X509Certificates
    Imports System.Text

    Public Class WebRequestHelper
        Private m_ResponseUri As Uri
        Private m_StatusCode As HttpStatusCode
        Private m_StatusDescription As String
        Private m_ContentSize As Long
        Private m_WebException As WebExceptionStatus

        Public Property SiteCookies As CookieContainer
        ' Internet Explorer 11 User-Agent string
        Public Property UserAgent As String = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
        ' Timeout in milliseconds
        Public Property Timeout As Integer = 30000

        Public ReadOnly Property ContentSize As Long
            Get
                Return m_ContentSize
            End Get
        End Property

        Public ReadOnly Property ResponseUri As Uri
            Get
                Return m_ResponseUri
            End Get
        End Property

        Public ReadOnly Property StatusCode As Integer
            Get
                Return m_StatusCode
            End Get
        End Property

        Public ReadOnly Property StatusDescription As String
            Get
                Return m_StatusDescription
            End Get
        End Property

        Public ReadOnly Property WebException As Integer
            Get
                Return m_WebException
            End Get
        End Property

        Sub New()
            SiteCookies = New CookieContainer()
        End Sub

        Public Function GetSiteResponse(ByVal siteUri As Uri) As String
            Dim response As String = String.Empty
            ServicePointManager.DefaultConnectionLimit = 50

            ' Enable TLS 1.1, TLS 1.2 and the highest protocol version the current framework defines
            Dim maxFWValue As SecurityProtocolType = System.Enum.GetValues(GetType(SecurityProtocolType)).OfType(Of SecurityProtocolType)().Max()
            ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls11 Or SecurityProtocolType.Tls12 Or maxFWValue
            ServicePointManager.ServerCertificateValidationCallback = AddressOf TlsValidationCallback

            Dim Http As HttpWebRequest = WebRequest.CreateHttp(siteUri.ToString)
            With Http
                .Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
                .AllowAutoRedirect = True
                .AutomaticDecompression = DecompressionMethods.GZip Or DecompressionMethods.Deflate
                .CookieContainer = Me.SiteCookies
                .Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip, deflate")
                .Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.7")
                .Headers.Add(HttpRequestHeader.CacheControl, "no-cache")
                .KeepAlive = True
                .MaximumAutomaticRedirections = 50
                .ServicePoint.Expect100Continue = False
                .ServicePoint.MaxIdleTime = Me.Timeout
                .Timeout = Me.Timeout
                .UserAgent = Me.UserAgent
            End With

            Try
                Using webResponse As HttpWebResponse = DirectCast(Http.GetResponse, HttpWebResponse)
                    Me.m_ResponseUri = webResponse.ResponseUri
                    Me.m_StatusCode = webResponse.StatusCode
                    Me.m_StatusDescription = webResponse.StatusDescription
                    Dim contentLength As String = webResponse.Headers.Get("Content-Length")
                    Me.m_ContentSize = If(String.IsNullOrEmpty(contentLength), 0, Convert.ToInt64(contentLength))

                    Using responseStream As Stream = webResponse.GetResponseStream()
                        If webResponse.StatusCode = HttpStatusCode.OK Then
                            ' Encoding.Default is the system ANSI code page on .NET Framework;
                            ' pages served as UTF-8 may need Encoding.UTF8 instead.
                            Dim reader As StreamReader = New StreamReader(responseStream, Encoding.Default)
                            Me.m_ContentSize = webResponse.ContentLength
                            response = reader.ReadToEnd()
                            ' ContentLength is -1 when the server does not send a Content-Length header
                            Me.m_ContentSize = If(Me.m_ContentSize = -1, response.Length, Me.m_ContentSize)
                        End If
                    End Using
                End Using
            Catch exW As WebException
                If exW.Response IsNot Nothing Then
                    Me.m_StatusCode = CType(exW.Response, HttpWebResponse).StatusCode
                End If
                Me.m_StatusDescription = "WebException: " & exW.Message
                Me.m_WebException = exW.Status
            End Try
            Return response
        End Function

        ' Accepts certificates whose only chain problem is an untrusted root; rejects any other chain error
        Private Function TlsValidationCallback(sender As Object, CACert As X509Certificate, CAChain As X509Chain, SslPolicyErrors As SslPolicyErrors) As Boolean
            If SslPolicyErrors = SslPolicyErrors.None Then Return True
            Dim Certificate As New X509Certificate2(CACert)
            CAChain.Build(Certificate)
            For Each CACStatus As X509ChainStatus In CAChain.ChainStatus
                If (CACStatus.Status <> X509ChainStatusFlags.NoError) And
                   (CACStatus.Status <> X509ChainStatusFlags.UntrustedRoot) Then
                    Return False
                End If
            Next
            Return True
        End Function
    End Class
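For context, here is one way the helper class above might be called to obtain the meta tag information the question asks about. This is only a sketch: it assumes the WebRequestHelper class is compiled into the same project and that the HtmlAgilityPack NuGet package is referenced. The module name and the XPath filter are illustrative choices.

```vbnet
Imports System
Imports System.Net
Imports HtmlAgilityPack

Module MetaTagSketch
    Sub Main()
        Dim helper As New WebRequestHelper()
        Dim html As String = helper.GetSiteResponse(New Uri("https://www.rei.com"))

        ' StatusCode is exposed as an Integer by the helper class.
        If helper.StatusCode <> CInt(HttpStatusCode.OK) Then
            Console.WriteLine("Request failed: " & helper.StatusDescription)
            Return
        End If

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when no node matches the XPath expression.
        Dim metaNodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//meta[@name and @content]")
        If metaNodes Is Nothing Then Return

        For Each node As HtmlNode In metaNodes
            Console.WriteLine(node.GetAttributeValue("name", "") & " = " & node.GetAttributeValue("content", ""))
        Next
    End Sub
End Module
```

Reusing the same helper instance for several requests keeps its SiteCookies container alive between calls, which matches the cookie advice given earlier in the answer.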
Source: https://stackoverflow.com/questions/55565710/what-is-the-best-way-to-get-the-html-for-html-agiligy-pack-to-process