With the approach below, the extraction only returns the href values as they appear in the page (often relative URLs). The extraction code:
foreach (HtmlNode link in hdDoc.DocumentNode.SelectNodes("//a[@href]")) // XPath assumed; the original line was truncated
I could do it by checking whether the URL contains "http" and, if not, prepending the domain value.
That's what you should do. Html Agility Pack has nothing to help you with this:
var url = new Uri(
    new Uri(new Uri(baseUrl).GetLeftPart(UriPartial.Path)),
    link.Attributes["href"].Value
);
Assuming you have the original URL, you can combine it with the parsed href, something like this:
// The address of the page you crawled
var baseUrl = new Uri("http://example.com/path/to-page/here.aspx");
// root relative
var url = new Uri(baseUrl, "/Login.aspx");
Console.WriteLine (url.AbsoluteUri); // prints 'http://example.com/Login.aspx'
// relative
url = new Uri(baseUrl, "../foo.aspx?q=1");
Console.WriteLine (url.AbsoluteUri); // prints 'http://example.com/path/foo.aspx?q=1'
// absolute
url = new Uri(baseUrl, "http://stackoverflow.com/questions/7760286/");
Console.WriteLine (url.AbsoluteUri); // prints 'http://stackoverflow.com/questions/7760286/'
// other...
url = new Uri(baseUrl, "javascript:void(0)");
Console.WriteLine (url.AbsoluteUri); // prints 'javascript:void(0)'
Note the use of AbsoluteUri rather than ToString(), because ToString() decodes the URL (to make it more "human-readable"), which is typically not what you want.
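To see the difference, here is a minimal sketch (the example URL is made up) comparing the two on a percent-encoded path:

```csharp
using System;

class Program
{
    static void Main()
    {
        // A URL containing a percent-encoded space (%20)
        var uri = new Uri("http://example.com/a%20b/page.aspx");

        // AbsoluteUri preserves the percent-encoding
        Console.WriteLine(uri.AbsoluteUri); // http://example.com/a%20b/page.aspx

        // ToString() unescapes it for readability
        Console.WriteLine(uri.ToString());  // http://example.com/a b/page.aspx
    }
}
```

If you feed the ToString() form back into a crawler or an HTTP request, the decoded space makes it an invalid URL, so prefer AbsoluteUri when storing or requesting the resolved links.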