HTML Agility Pack URL scraping: getting the full HTML link

Front-end · Unresolved · 2 answers · 1515 views
佛祖请我去吃肉 2021-01-21 20:41

Hi, I am using the HTML Agility Pack NuGet package in order to scrape a web page and get all of the URLs on the page. The code is shown below. However, the way it returns

2 Answers
  •  面向向阳花
    2021-01-21 20:51

You can check whether the HREF value is a relative or an absolute URL. Load the link into a `Uri` and test whether it is relative; if it is relative, converting it to absolute is the way to go.

    static void Main(string[] args)
    {
        List<string> linksToVisit = ParseLinks("https://www.facebook.com");
    }

    public static List<string> ParseLinks(string urlToCrawl)
    {
        WebClient webClient = new WebClient();

        byte[] data = webClient.DownloadData(urlToCrawl);
        // Use UTF-8 rather than ASCII so non-ASCII characters are not mangled.
        string download = Encoding.UTF8.GetString(data);

        // A HashSet avoids adding the same link twice.
        HashSet<string> list = new HashSet<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(download);

        // Select every anchor element that has an href attribute.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");

        foreach (var n in nodes)
        {
            string href = n.Attributes["href"].Value;
            list.Add(GetAbsoluteUrlString(urlToCrawl, href));
        }
        return list.ToList();
    }
    

    Function to convert a relative URL to an absolute one:

    static string GetAbsoluteUrlString(string baseUrl, string url)
    {
        var uri = new Uri(url, UriKind.RelativeOrAbsolute);

        // If the parsed URI is relative, resolve it against the base URL.
        if (!uri.IsAbsoluteUri)
            uri = new Uri(new Uri(baseUrl), uri);

        return uri.ToString();
    }
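
    As a quick sanity check (a minimal standalone sketch, not part of the original answer), the helper resolves a relative href against the base URL while leaving an absolute href untouched; `https://example.com/page` below is just an illustrative input:

    ```csharp
    using System;

    class Demo
    {
        static string GetAbsoluteUrlString(string baseUrl, string url)
        {
            var uri = new Uri(url, UriKind.RelativeOrAbsolute);
            if (!uri.IsAbsoluteUri)
                uri = new Uri(new Uri(baseUrl), uri);
            return uri.ToString();
        }

        static void Main()
        {
            // A relative href is resolved against the base URL.
            Console.WriteLine(GetAbsoluteUrlString("https://www.facebook.com", "/login"));
            // → https://www.facebook.com/login

            // An already-absolute href is returned unchanged.
            Console.WriteLine(GetAbsoluteUrlString("https://www.facebook.com", "https://example.com/page"));
            // → https://example.com/page
        }
    }
    ```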
    
