HtmlAgilityPack WebGet.Load gives error “Object reference not set to an instance of an object”

逝去的感伤 2020-12-18 11:12

I am working on a project that collects new car prices from dealers' websites. I can fetch the HTML of most sites, but when I try to load one of them, the WebGet.Load(url) method gives the error "Object reference not set to an instance of an object".

1 answer
  •  隐瞒了意图╮
    2020-12-18 11:49

    The actual problem is in the HtmlAgilityPack internals. The failing page declares a meta content type in which charset=8859-9 appears to be incorrect. HtmlAgilityPack internally tries to resolve an encoding for this string with something like Encoding.GetEncoding("8859-9"), and that call throws (I think the actual encoding name should be iso-8859-9).
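
    To check how a given charset name resolves, here is a minimal sketch. The TryGetEncoding helper is hypothetical and not part of HtmlAgilityPack. It assumes .NET Framework, where code-page encodings such as iso-8859-9 are built in; on .NET Core and later, that lookup would probably also need Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).

```csharp
using System;
using System.Text;

static class EncodingProbe
{
    // Hypothetical helper: returns null instead of throwing
    // when the runtime does not recognize the encoding name.
    public static Encoding TryGetEncoding(string name)
    {
        try
        {
            return Encoding.GetEncoding(name);
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws ArgumentException for unknown names.
            return null;
        }
    }

    static void Main()
    {
        // Compare the charset declared by the page with the full ISO name.
        // Whether "iso-8859-9" resolves depends on the runtime (see the note above).
        foreach (var name in new[] { "8859-9", "iso-8859-9" })
        {
            var encoding = TryGetEncoding(name);
            Console.WriteLine(name + ": " + (encoding == null ? "not recognized" : encoding.WebName));
        }
    }
}
```

    A null result here corresponds to the case where the parser ends up with no declared encoding for the page.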

    Actually, all you need is to tell HtmlAgilityPack not to read the encoding for the HtmlDocument (just set HtmlDocument.OptionReadEncoding = false), but this seems to be impossible with HtmlWeb.Load (setting HtmlWeb.AutoDetectEncoding doesn't work here). So the workaround is to read the URL manually (the simplest way):

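    // Requires: using System; using System.Net; using System.Text; using HtmlAgilityPack;
    var document = new HtmlDocument();
    // Do not let the parser resolve the charset declared in the page's meta tag.
    document.OptionReadEncoding = false;
    
    var url = 
       new Uri("http://www.fiat.com.tr/Pages/tr/otomobiller/grandepunto_fiyat.aspx");
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "GET";
    using (var response = (HttpWebResponse)request.GetResponse())
    using (var stream = response.GetResponseStream())
    {
        // Decode the response with an explicitly chosen encoding.
        document.Load(stream, Encoding.GetEncoding("iso-8859-9"));
    }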
    

    This works, and successfully parses the page.

    EDIT: @Simon Mourier: yes, it raises a NullReferenceException because it catches the ArgumentException and sets _declaredencoding = null there. The _declaredencoding.WindowsCodePage line then throws the null reference exception.

    Here is the relevant code block from HtmlDocument.cs, in the ReadDocumentEncoding method:

    try
    {
        _declaredencoding = Encoding.GetEncoding(charset);
    }
    catch (ArgumentException)
    {
        _declaredencoding = null;
    }
    if (_onlyDetectEncoding)
    {
        throw new EncodingFoundException(_declaredencoding);
    }
    
    if (_streamencoding != null)
    {
        if (_declaredencoding.WindowsCodePage != _streamencoding.WindowsCodePage)
        {
            AddError(
                HtmlParseErrorCode.CharsetMismatch,
                _line, _lineposition,
                _index, node.OuterHtml,
                "Encoding mismatch between StreamEncoding: " +
                _streamencoding.WebName + " and DeclaredEncoding: " +
                _declaredencoding.WebName);
        }
    }
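
    For illustration, the comparison above can be made null-safe along these lines. This is only a sketch of the idea, not the library's actual fix: CharsetCheck and DescribeMismatch are hypothetical names, and the method returns the message instead of calling AddError so that it stands alone.

```csharp
using System;
using System.Text;

static class CharsetCheck
{
    // Hypothetical stand-in for the mismatch test in ReadDocumentEncoding.
    // Returns an error message, or null when there is nothing to report.
    public static string DescribeMismatch(Encoding declared, Encoding stream)
    {
        // An unrecognized charset leaves 'declared' null, so skip the
        // comparison instead of dereferencing it.
        if (declared == null || stream == null)
        {
            return null;
        }

        if (declared.WindowsCodePage != stream.WindowsCodePage)
        {
            return "Encoding mismatch between StreamEncoding: " +
                   stream.WebName + " and DeclaredEncoding: " +
                   declared.WebName;
        }

        return null;
    }
}
```

    With a guard like this, a page declaring an unknown charset would simply skip the mismatch check rather than crash the parse.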
    

    And here is my stack trace:

    System.NullReferenceException was unhandled
      Message=Object reference not set to an instance of an object.
      Source=HtmlAgilityPack
      StackTrace:
           at HtmlAgilityPack.HtmlDocument.ReadDocumentEncoding(HtmlNode node) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 1916
           at HtmlAgilityPack.HtmlDocument.PushNodeEnd(Int32 index, Boolean close) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 1805
           at HtmlAgilityPack.HtmlDocument.Parse() in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 1468
           at HtmlAgilityPack.HtmlDocument.Load(TextReader reader) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 769
           at HtmlAgilityPack.HtmlDocument.Load(Stream stream, Boolean detectEncodingFromByteOrderMarks) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 597
           at HtmlAgilityPack.HtmlWeb.Get(Uri uri, String method, String path, HtmlDocument doc, IWebProxy proxy, ICredentials creds) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlWeb.cs:line 1515
           at HtmlAgilityPack.HtmlWeb.LoadUrl(Uri uri, String method, WebProxy proxy, NetworkCredential creds) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlWeb.cs:line 1563
           at HtmlAgilityPack.HtmlWeb.Load(String url, String method) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlWeb.cs:line 1152
           at HtmlAgilityPack.HtmlWeb.Load(String url) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlWeb.cs:line 1107
           at test.console.Program.Main(String[] args) in W:\Projects\Me\test.console\test.console\Program.cs:line 54
           at System.AppDomain._nExecuteAssembly(RuntimeAssembly assembly, String[] args)
           at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)
           at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
           at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
           at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx)
           at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
           at System.Threading.ThreadHelper.ThreadStart()
      InnerException: 
    
