Download all PDF files from crawled links

倾然丶 夕夏残阳落幕 submitted on 2020-01-16 08:27:33

Question


When I run the code, ProductListPage is null; an exception is thrown at that point and execution does not continue.

Any ideas on how to solve this? Should I wait until //div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a is found, or do something else?

Here is my current code:

HtmlDocument htmlDoc = new HtmlWeb().Load("https://example.com/");
HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a");
foreach (HtmlNode src in ProductListPage)
{
    htmlDoc = new HtmlWeb().Load(src.Attributes["href"].Value);

    HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='row padt6 padb4']//a");
    if (LinkTester != null)
    {
        foreach (var dllink in LinkTester)
        {
            string LinkURL = dllink.Attributes["href"].Value;
            Console.WriteLine(LinkURL);

            string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/"));
            var DLClient = new WebClient();

            DLClient.DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
        }
    }
}

EDIT:

The code seems to work without a VPN connection, but it does not work over VPN. I have an alternative written in Python with BeautifulSoup, and it works regardless of the VPN connection. Any idea why C# with HtmlAgilityPack does not do the trick?


EDIT2:

I have noticed that over the VPN connection the page loads with a slight delay: the page itself loads first, and the content arrives afterwards.


Answer 1:


Make sure you have access to the site (perhaps a firewall or another application is blocking access).

When I ran your code, in both Visual Basic and .NET, I could reach the sub-pages and even find the PDF links. I would recommend the following checks:

  1. Check whether you can access the site in your browser.
  2. If you can, use the debugger to inspect the InnerHtml of htmlDoc.DocumentNode.
  3. If you get data back, copy it into Notepad and check whether the expected tags are there. You should have a complete HTML document.
  4. If you are behind a proxy server, add the proxy details to the Load call: https://stackoverflow.com/a/12099646/1390548
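
The checks above could be sketched roughly as follows. This is only an illustration, assuming the HtmlAgilityPack NuGet package is referenced. The class name CrawlDiagnostics, the helper ExtractHrefs, the dump file name, and the proxy host, port and credentials in the commented-out line are placeholders that do not come from the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class CrawlDiagnostics
{
    // Hypothetical helper: returns the href values matched by the XPath,
    // or an empty list when SelectNodes finds nothing (it returns null then).
    public static List<string> ExtractHrefs(HtmlDocument doc, string xpath)
    {
        var hrefs = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return hrefs;
        }
        foreach (HtmlNode node in nodes)
        {
            string href = node.GetAttributeValue("href", "");
            if (href.Length > 0)
            {
                hrefs.Add(href);
            }
        }
        return hrefs;
    }

    public static void Main()
    {
        const string startUrl = "https://example.com/";
        const string listXPath =
            "//div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a";

        var web = new HtmlWeb();

        // Direct load, as in the question.
        HtmlDocument htmlDoc = web.Load(startUrl);

        // Assumption: behind a proxy, this overload could be used instead.
        // Host, port, user and password below are placeholders.
        // HtmlDocument htmlDoc = web.Load(startUrl, "proxy.example.local", 8080, "user", "password");

        Console.WriteLine("HTTP status: " + web.StatusCode);

        // Save what was actually received so it can be inspected in an editor.
        string dumpPath = Path.Combine(Path.GetTempPath(), "startpage-dump.html");
        File.WriteAllText(dumpPath, htmlDoc.DocumentNode.OuterHtml);

        List<string> productLinks = ExtractHrefs(htmlDoc, listXPath);
        if (productLinks.Count == 0)
        {
            Console.WriteLine("No product links matched; inspect " + dumpPath);
            return;
        }
        foreach (string link in productLinks)
        {
            Console.WriteLine(link);
        }
    }
}
```

If the dump contains a proxy login page or an almost empty document instead of the product list, the XPath is probably fine and the request itself is being intercepted or answered with different content.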



Answer 2:


After about 2 months of searching and reading, I finally found a solution. Adding this to app.config worked for me without the need for any code changes:

<system.net>
   <defaultProxy useDefaultCredentials="true" />
</system.net>

So my app.config now looks like this:

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
    <startup>
        <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.7.2" />
    </startup>
    <system.net>
        <defaultProxy useDefaultCredentials="true" />
    </system.net>
</configuration>

Please give credit to the original answer for this: https://stackoverflow.com/a/40900485/7202022
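
If editing app.config is not an option, a similar effect can probably be achieved in code. The sketch below is an assumption rather than something taken from the linked answer: it sets the credentials of the process-wide default proxy to the current user's default credentials, and it relies on the library issuing its requests through System.Net's default proxy. ProxySetup and UseDefaultProxyCredentials are hypothetical names.

```csharp
using System;
using System.Net;

public static class ProxySetup
{
    // Hypothetical helper: call once at startup, before any web request is made.
    public static void UseDefaultProxyCredentials()
    {
        IWebProxy proxy = WebRequest.DefaultWebProxy;
        if (proxy != null)
        {
            // Authenticate to the proxy as the currently logged-on user.
            proxy.Credentials = CredentialCache.DefaultCredentials;
        }
    }

    public static void Main()
    {
        UseDefaultProxyCredentials();
        Console.WriteLine("Default proxy credentials configured.");
    }
}
```

The configuration-file approach has the advantage that it needs no rebuild and applies before any code runs, so the code variant is mainly useful when the application cannot ship its own config file.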



Source: https://stackoverflow.com/questions/59628313/download-all-pdf-files-from-crawled-links
