Download all PDF files from crawled links

Question

When I run the code, ProductListPage is null, so the foreach loop throws an error and the program does not proceed any further.

Any ideas how to solve this? Should I wait until //div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a is found, or is there another approach?

Here is my current code:

HtmlDocument htmlDoc = new HtmlWeb().Load("https://example.com/");
HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a");
foreach (HtmlNode src in ProductListPage)
{
    htmlDoc = new HtmlWeb().Load(src.Attributes["href"].Value);

    HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='row padt6 padb4']//a");
    if (LinkTester != null)
    {
        foreach (var dllink in LinkTester)
        {
            string LinkURL = dllink.Attributes["href"].Value;
            Console.WriteLine(LinkURL);

            string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/"));
            var DLClient = new WebClient();

            DLClient.DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
        }
    }
}

EDIT:

The code seems to work without a VPN connection, but it does not work over VPN. I have an alternative written in Python with BeautifulSoup, and it works regardless of the VPN connection. Any idea why C# with HtmlAgilityPack does not do the trick?

EDIT2:

I have noticed that over the VPN connection the page loads with a slight delay: the page itself loads first, and the content arrives afterwards.

Answer 1:

Make sure you have access to the site; a firewall or another application may be blocking it.

When I ran your code, in both Visual Basic and .NET, I could reach the sub-pages and even find the PDF links. I would recommend the following checks:

1. Check whether you can access the site in your browser.
2. If you can, use the debugger to inspect the InnerHtml of htmlDoc.DocumentNode.
3. If you get data back, copy it to Notepad and check that the expected tags are there. You should have a complete HTML document.
4. If you are behind a proxy server, pass the proxy details to the Load call: https://stackoverflow.com/a/12099646/1390548
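
The checks above can also be sketched in code. This is only a minimal sketch: the proxy address http://proxy.example.local:8080 is a made-up placeholder, the class and method names are illustrative, and it assumes the HtmlWeb.Load(url, method, proxy, credentials) overload is available in the HtmlAgilityPack build you reference (it is present in the .NET Framework builds as far as I know).

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

static class ProxyLoadSketch
{
    // Hypothetical proxy address; replace it with the one your network or VPN uses.
    const string ProxyAddress = "http://proxy.example.local:8080";

    // Builds a proxy that authenticates with the current user's credentials.
    public static WebProxy BuildProxy(string address)
    {
        return new WebProxy(address) { UseDefaultCredentials = true };
    }

    static void Main()
    {
        WebProxy proxy = BuildProxy(ProxyAddress);

        // Assumption: this overload exists in the HtmlAgilityPack version in use.
        HtmlDocument doc = new HtmlWeb().Load(
            "https://example.com/", "GET", proxy, CredentialCache.DefaultNetworkCredentials);

        // Print how much HTML actually arrived, as a quick stand-in for the debugger check.
        Console.WriteLine("Received {0} characters", doc.DocumentNode.OuterHtml.Length);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes(
            "//div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a");
        if (links == null)
        {
            Console.WriteLine("No product links found; inspect the HTML that was returned.");
            return;
        }

        foreach (HtmlNode link in links)
        {
            Console.WriteLine(link.GetAttributeValue("href", ""));
        }
    }
}
```

If the character count is small or the product links are missing, the response is most likely an error or login page from the proxy rather than the site itself.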

Answer 2:

After about two months of searching and reading, I finally found a solution. Adding this to app.config worked for me without the need for any code changes:

<system.net>
   <defaultProxy useDefaultCredentials="true" />
</system.net>

So my app.config now looks like this:

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
    <startup>
        <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.7.2" />
    </startup>
    <system.net>
        <defaultProxy useDefaultCredentials="true" />
    </system.net>
</configuration>

Please give the original answer credit for this: https://stackoverflow.com/a/40900485/7202022
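
If editing app.config is not an option, a similar effect can usually be achieved in code by giving the process-wide default proxy the current user's credentials before the first request is made. This is a sketch under that assumption, not a verified equivalent of the config setting; the class and method names are illustrative.

```csharp
using System;
using System.Net;

static class DefaultProxySketch
{
    // Assigns the logged-in user's credentials to the system default proxy, if there is one.
    public static void UseDefaultProxyCredentials()
    {
        IWebProxy proxy = WebRequest.DefaultWebProxy;
        if (proxy != null)
        {
            proxy.Credentials = CredentialCache.DefaultCredentials;
        }
    }

    static void Main()
    {
        // Call this once at startup, before any HtmlWeb or WebClient request.
        UseDefaultProxyCredentials();
        Console.WriteLine(WebRequest.DefaultWebProxy != null
            ? "Default proxy credentials set."
            : "No default proxy configured.");
    }
}
```

Because the setting is process-wide, requests that go through the default proxy, including WebClient downloads, should pick it up without further changes.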

Source: https://stackoverflow.com/questions/59628313/download-all-pdf-files-from-crawled-links

Tags

web-scraping

web-crawler

html-agility-pack