I'm looking for a method that will allow me to get the title of a webpage and store it as a string.
However all the solutions I have found so far involve downloadin
As the <title> tag is in the HTML itself, there is no way to find "just the title" without downloading the file. You should be able to download a portion of the file until you've read in the <title> tag, or the </head> tag, and then stop, but you'll still need to download (at least a portion of) the file.
This can be accomplished with HttpWebRequest/HttpWebResponse, reading data from the response stream until we've either read in a <title></title> block or the </head> tag. I added the </head> tag check because, in valid HTML, the title block must appear within the head block, so with this check we will never parse the entire file in any case (unless there is no head block, of course).
The following should be able to accomplish this task:
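// requires: using System; using System.IO; using System.Net;
//           using System.Text; using System.Text.RegularExpressions;
string title = "";
try {
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (Stream stream = response.GetResponseStream()) {
        // compiled regex to check for a <title></title> block; Singleline lets the
        // title span line breaks, and [^>]* tolerates attributes on the tag
        Regex titleCheck = new Regex(@"<title\b[^>]*>\s*(.+?)\s*</title>",
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
        int bytesToRead = 8192;
        byte[] buffer = new byte[bytesToRead];
        // a Decoder keeps multi-byte UTF-8 characters intact across reads
        // (this still assumes the page is UTF-8 encoded)
        Decoder decoder = Encoding.UTF8.GetDecoder();
        char[] chars = new char[Encoding.UTF8.GetMaxCharCount(bytesToRead)];
        StringBuilder contents = new StringBuilder();
        int length = 0;
        while ((length = stream.Read(buffer, 0, bytesToRead)) > 0) {
            // decode the bytes and append them to the rest of the
            // contents that have been downloaded so far
            int charCount = decoder.GetChars(buffer, 0, length, chars, 0);
            contents.Append(chars, 0, charCount);
            string soFar = contents.ToString();
            Match m = titleCheck.Match(soFar);
            if (m.Success) {
                // we found a <title></title> match =]
                title = m.Groups[1].Value;
                break;
            } else if (soFar.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0) {
                // reached end of head-block; no title found =[
                break;
            }
        }
    }
} catch (Exception e) {
    Console.WriteLine(e);
}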
UPDATE: Updated the original source-example to use a compiled Regex and a using statement for the Stream for better efficiency and maintainability.
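For anyone on .NET 4.5 or later, the same stop-early idea can be sketched with HttpClient. This is only a rough sketch, not a drop-in replacement: the names PartialTitleReader, ReadTitleAsync and GetTitleAsync are made up for illustration, and it assumes the page is UTF-8 encoded. Passing HttpCompletionOption.ResponseHeadersRead should make GetAsync return once the headers arrive, so the body is only pulled in as the stream is read.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical helper class; the names are illustrative only.
public static class PartialTitleReader
{
    private static readonly HttpClient Client = new HttpClient();

    private static readonly Regex TitleCheck = new Regex(
        @"<title\b[^>]*>\s*(.+?)\s*</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Reads text from the stream chunk by chunk and stops at the first
    // complete <title> block or at </head>, whichever is seen first.
    // Returns "" when no title is found. Assumes UTF-8 content.
    public static async Task<string> ReadTitleAsync(Stream stream)
    {
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            var contents = new StringBuilder();
            var chunk = new char[4096];
            int read;
            while ((read = await reader.ReadAsync(chunk, 0, chunk.Length)) > 0)
            {
                contents.Append(chunk, 0, read);
                string soFar = contents.ToString();
                Match m = TitleCheck.Match(soFar);
                if (m.Success) return m.Groups[1].Value;
                if (soFar.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0) break;
            }
        }
        return "";
    }

    public static async Task<string> GetTitleAsync(string url)
    {
        // Only the headers are awaited here; the body is read lazily below.
        using (HttpResponseMessage response = await Client.GetAsync(
            url, HttpCompletionOption.ResponseHeadersRead))
        {
            response.EnsureSuccessStatusCode();
            using (Stream body = await response.Content.ReadAsStreamAsync())
            {
                return await ReadTitleAsync(body);
            }
        }
    }
}
```

Keeping the stream-reading part separate from the HTTP call means it can be tried against a MemoryStream without touching the network, e.g. `await PartialTitleReader.ReadTitleAsync(new MemoryStream(bytes))`.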
A simpler way to handle this would be to download it, then split:
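using System;
using System.Net.Http;
using System.Threading.Tasks;

// (inside a class) reuse one HttpClient instead of creating one per request
private static readonly HttpClient hc = new HttpClient();

// async Task rather than async void, so callers can await it and see exceptions
private async Task getSite(string url)
{
    using (HttpResponseMessage response = await hc.GetAsync(new Uri(url, UriKind.Absolute)))
    {
        response.EnsureSuccessStatusCode();
        string source = await response.Content.ReadAsStringAsync();
        //process the source here
    }
}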
To process the source, you can use the method described in the article on Getting Content From Between HTML Tags.
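As a rough idea of what that processing step might look like for the title specifically, here is a minimal sketch. It is not the method from the linked article: TitleParser and ExtractTitle are hypothetical names, and it assumes the page has a single, reasonably well-formed <title> block. It also runs the captured text through WebUtility.HtmlDecode, since titles often contain entities such as &amp;.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from the linked article.
public static class TitleParser
{
    private static readonly Regex TitleRegex = new Regex(
        @"<title\b[^>]*>\s*(.*?)\s*</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the text between <title> and </title> with HTML entities
    // decoded, or null when the source has no title block.
    public static string ExtractTitle(string source)
    {
        if (string.IsNullOrEmpty(source)) return null;
        Match m = TitleRegex.Match(source);
        if (!m.Success) return null;
        return WebUtility.HtmlDecode(m.Groups[1].Value);
    }
}
```

Inside getSite, this could be called as `string title = TitleParser.ExtractTitle(source);`. For anything beyond the title, a real HTML parser is probably a safer choice than a regex.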