How to extract HTML links from an HTML file in C#?

耶瑟儿~ 2020-12-21 05:52

Can anyone explain how to extract the URLs/links from an HTML file in C#?

3 answers
  • 2020-12-21 06:09

    You can use an HTQL COM object and query the page with the HTQL expression <a>:href, which returns the href attribute of each <a> tag:

    HTQLCOMLib.HtqlControl h = new HTQLCOMLib.HtqlControl();
    string page = "<html><body><a href='test1.html'>test1</a><a href='test2.html'>test2</a> </body></html>";
    h.setSourceData(page, page.Length);
    h.setQuery("<a>: href ");
    for (h.moveFirst(); 0 == h.isEOF(); h.moveNext() )
    {
         MessageBox.Show(h.getValueByIndex(1));
    }
    

    It shows one message box per link, in this order:

    test1.html

    test2.html

  • 2020-12-21 06:13

    Use Html Agility Pack:

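        // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
        private List<string> ParseLinks(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            // SelectNodes returns null (not an empty collection) when nothing matches.
            var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
            if (nodes == null) return new List<string>();
            // Read only the href attribute, not every attribute of the <a> element.
            return nodes.Select(n => n.GetAttributeValue("href", string.Empty)).ToList();
        }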
    

    It works for me.
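
Since the question is about an HTML file rather than a string, here is a minimal, self-contained sketch of how the method above might be wrapped to read from disk. It assumes the HtmlAgilityPack NuGet package is referenced. The names LinkExtractor and ParseLinksFromFile are made up for this illustration, not part of the library.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is installed

// Hypothetical helper class; the name is illustrative only.
public static class LinkExtractor
{
    // Returns the href value of every <a> element that has an href attribute.
    public static List<string> ParseLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when the XPath matches nothing.
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes
            .Select(n => n.GetAttributeValue("href", string.Empty))
            .ToList();
    }

    // Reads the whole file into memory, then parses it as a string.
    public static List<string> ParseLinksFromFile(string path)
    {
        return ParseLinks(File.ReadAllText(path));
    }
}
```

Calling LinkExtractor.ParseLinksFromFile("file.htm") would then return the raw href values exactly as written in the markup, so relative links stay relative.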

  • 2020-12-21 06:20

    Look at Html Agility Pack:

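    var links = new List<string>();
    HtmlDocument doc = new HtmlDocument();
    doc.Load("file.htm");
    // SelectNodes returns null when no <a href> exists, so guard before iterating.
    var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
    if (anchors != null)
    {
        foreach (HtmlNode link in anchors)
        {
            HtmlAttribute att = link.Attributes["href"];
            links.Add(att.Value);
        }
    }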
    