Question
When I run my code, it reports that ProductListPage is null, and after throwing the error it does not proceed any further.
Any ideas how to solve this? Should I wait until //div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a is found, or do something else?
Here is my current code:
    HtmlDocument htmlDoc = new HtmlWeb().Load("https://example.com/");
    HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a");
    foreach (HtmlNode src in ProductListPage)
    {
        htmlDoc = new HtmlWeb().Load(src.Attributes["href"].Value);
        HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='row padt6 padb4']//a");
        if (LinkTester != null)
        {
            foreach (var dllink in LinkTester)
            {
                string LinkURL = dllink.Attributes["href"].Value;
                Console.WriteLine(LinkURL);
                string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/"));
                var DLClient = new WebClient();
                DLClient.DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
            }
        }
    }
EDIT:
The code seems to work without a VPN connection, but it does not work over VPN. I have an alternative written in Python with BeautifulSoup, and it works regardless of the VPN connection. Any idea why C# and HtmlAgilityPack do not do the trick?
EDIT2:
I have noticed that over the VPN connection the page loads with a slight delay: the page itself loads first, and the content arrives afterwards.
Answer 1:
Make sure you have access to the site; a firewall or another application may be blocking it.
When I ran your code, in both Visual Basic and .NET, I could reach the subpages and even find the PDF links. I would recommend the following checks:
- Check whether you can access the site in your browser.
- If you can access the site, use the debugger to see what InnerHtml you have for htmlDoc.DocumentNode.
- If you get data back, copy it to Notepad and check whether the tags are there. You should have a complete HTML document.
- If you are behind a proxy server, add the proxy info to the Load call. https://stackoverflow.com/a/12099646/1390548
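The checks above can be sketched in code. This is a minimal sketch, not a definitive fix: the class and method names (ScrapeDiagnostics, SelectOrEmpty, LoadAndSelect, LoadViaProxy) are made up for illustration, and the proxy host, port, and credentials are placeholders. It relies on the fact that SelectNodes returns null rather than an empty collection when the XPath matches nothing, and it assumes the HtmlWeb.Load overload that takes proxy settings is available in your HtmlAgilityPack version.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class ScrapeDiagnostics
{
    // SelectNodes returns null (not an empty collection) when the XPath
    // matches nothing, so normalize that case to an empty array.
    public static HtmlNode[] SelectOrEmpty(HtmlDocument doc, string xpath)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? new HtmlNode[0] : nodes.ToArray();
    }

    // Loads a page and, when nothing matches, prints what actually came back
    // so it can be inspected for the expected tags (or a proxy/login page).
    public static HtmlNode[] LoadAndSelect(string url, string xpath)
    {
        HtmlDocument doc = new HtmlWeb().Load(url);
        HtmlNode[] links = SelectOrEmpty(doc, xpath);
        if (links.Length == 0)
        {
            Console.WriteLine("No nodes matched. Document received:");
            Console.WriteLine(doc.DocumentNode.InnerHtml);
        }
        return links;
    }

    // Assumption: all four proxy values below are hypothetical placeholders.
    public static HtmlDocument LoadViaProxy(string url)
    {
        return new HtmlWeb().Load(url, "proxy.example.local", 8080, "user", "password");
    }
}
```

With this in place, the outer loop in the question could iterate over ScrapeDiagnostics.LoadAndSelect("https://example.com/", "//div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a"). An empty result then skips the loop instead of throwing.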
Answer 2:
After about 2 months of searching and reading, I finally found a solution. Adding this to app.config worked for me without the need for any code changes:
    <system.net>
      <defaultProxy useDefaultCredentials="true" />
    </system.net>
So my app.config now looks like this:
    <?xml version="1.0" encoding="utf-8" ?>
    <configuration>
      <startup>
        <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.7.2" />
      </startup>
      <system.net>
        <defaultProxy useDefaultCredentials="true" />
      </system.net>
    </configuration>
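If editing app.config is not an option, the same idea can be expressed in code. This is a minimal sketch, assuming .NET Framework and that HtmlWeb and WebClient pick up the process-wide default proxy. The ProxySetup class and its method name are made up for illustration, and I have not verified that it behaves identically to the config setting in every environment.

```csharp
using System.Net;

static class ProxySetup
{
    // Attaches the current user's default credentials to the system default
    // proxy, roughly what useDefaultCredentials="true" asks for in app.config.
    public static void UseDefaultProxyCredentials()
    {
        IWebProxy proxy = WebRequest.DefaultWebProxy;
        if (proxy != null)
        {
            proxy.Credentials = CredentialCache.DefaultCredentials;
        }
    }
}
```

Call ProxySetup.UseDefaultProxyCredentials() once at startup, before the first HtmlWeb().Load or WebClient download.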
Please give the original answer credit for this! https://stackoverflow.com/a/40900485/7202022
Source: https://stackoverflow.com/questions/59628313/download-all-pdf-files-from-crawled-links