How to extract full url with HtmlAgilityPack - C#

悲&欢浪女 2020-12-06 02:58

With the code below, it extracts only the relative (referring) URL, like this

The extraction code (the selector was cut off in the original; "//a[@href]" is assumed):

foreach (HtmlNode link in hdDoc.DocumentNode.SelectNodes("//a[@href]"))
{
    // link.Attributes["href"].Value often comes back relative, e.g. "/Login.aspx"
}

2 Answers
  • 2020-12-06 03:23

    I can do it by checking whether the url contains http and, if not, prepending the domain value.

    That's what you should do. Html Agility Pack has nothing to help you with this:

    var url = new Uri(
        new Uri(new Uri(baseUrl).GetLeftPart(UriPartial.Path)),
        link.Attributes["href"].Value);
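    A minimal end-to-end sketch using only System.Uri (the page address and href value below are hypothetical placeholders):

```csharp
using System;

// Hypothetical page address and a relative href extracted from it.
var baseUrl = "http://example.com/path/page.aspx?x=1";
var href = "details.aspx?id=7";

// GetLeftPart(UriPartial.Path) drops the query string before resolving,
// so the relative link resolves against the page's path.
var baseUri = new Uri(new Uri(baseUrl).GetLeftPart(UriPartial.Path));
var url = new Uri(baseUri, href);

Console.WriteLine(url.AbsoluteUri); // http://example.com/path/details.aspx?id=7
```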
    
  • 2020-12-06 03:44

    Assuming you have the original url, you can combine it with the parsed url, something like this:

    // The address of the page you crawled
    var baseUrl = new Uri("http://example.com/path/to-page/here.aspx");
    
    // root relative
    var url = new Uri(baseUrl, "/Login.aspx");
    Console.WriteLine (url.AbsoluteUri); // prints 'http://example.com/Login.aspx'
    
    // relative
    url = new Uri(baseUrl, "../foo.aspx?q=1");
    Console.WriteLine (url.AbsoluteUri); // prints 'http://example.com/path/foo.aspx?q=1'
    
    // absolute
    url = new Uri(baseUrl, "http://stackoverflow.com/questions/7760286/");
    Console.WriteLine (url.AbsoluteUri); // prints 'http://stackoverflow.com/questions/7760286/'
    
    // other...
    url = new Uri(baseUrl, "javascript:void(0)");
    Console.WriteLine (url.AbsoluteUri); // prints 'javascript:void(0)'
    

    Note the use of AbsoluteUri rather than ToString(): ToString() decodes the URL (to make it more "human-readable"), which is typically not what you want.
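    The difference shows up with a percent-encoded path (a made-up URL; the exact unescaping ToString() performs can vary across .NET versions):

```csharp
using System;

var u = new Uri("http://example.com/a%20b?q=1");

// AbsoluteUri preserves the percent-encoding.
Console.WriteLine(u.AbsoluteUri);

// ToString() returns a "human-readable" form that may unescape
// sequences such as %20, which breaks round-tripping the URL.
Console.WriteLine(u.ToString());
```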
