HTML Agility Pack URL scraping: getting the full HTML link

Front-end · Unresolved · 2 answers · 1515 views
佛祖请我去吃肉 2021-01-21 20:41

Hi, I am using the HTML Agility Pack NuGet package in order to scrape a web page and get all of the URLs on the page. The code is shown below. However, the way it returns

2 Answers
  •  面向向阳花
    2021-01-21 20:51

You can check whether the HREF value is a relative or an absolute URL. Load the link into a `Uri` and test whether it is relative; if it is relative, converting it to absolute is the way to go.

    static void Main(string[] args)
    {
        List<string> linksToVisit = ParseLinks("https://www.facebook.com");
    }

    public static List<string> ParseLinks(string urlToCrawl)
    {
        WebClient webClient = new WebClient();

        byte[] data = webClient.DownloadData(urlToCrawl);
        // Use UTF-8 rather than ASCII so non-ASCII characters are not mangled.
        string download = Encoding.UTF8.GetString(data);

        // A HashSet avoids adding the same link twice.
        HashSet<string> list = new HashSet<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(download);

        // Select every anchor element that has an href attribute.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");

        foreach (var n in nodes)
        {
            string href = n.Attributes["href"].Value;
            list.Add(GetAbsoluteUrlString(urlToCrawl, href));
        }
        return list.ToList();
    }
    

    Function to convert a relative URL to an absolute one:

    static string GetAbsoluteUrlString(string baseUrl, string url)
    {
        var uri = new Uri(url, UriKind.RelativeOrAbsolute);

        // If the parsed URI is relative, resolve it against the base URL.
        if (!uri.IsAbsoluteUri)
            uri = new Uri(new Uri(baseUrl), uri);

        return uri.ToString();
    }
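
    As a quick sanity check (a minimal standalone sketch, not part of the original answer), the helper resolves a relative href against the base URL while leaving an absolute href untouched; `https://example.com/page` below is just an illustrative input:

    ```csharp
    using System;

    class Demo
    {
        static string GetAbsoluteUrlString(string baseUrl, string url)
        {
            var uri = new Uri(url, UriKind.RelativeOrAbsolute);
            if (!uri.IsAbsoluteUri)
                uri = new Uri(new Uri(baseUrl), uri);
            return uri.ToString();
        }

        static void Main()
        {
            // A relative href is resolved against the base URL.
            Console.WriteLine(GetAbsoluteUrlString("https://www.facebook.com", "/login"));
            // → https://www.facebook.com/login

            // An already-absolute href is returned unchanged.
            Console.WriteLine(GetAbsoluteUrlString("https://www.facebook.com", "https://example.com/page"));
            // → https://example.com/page
        }
    }
    ```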
    
