I want to collect some data from the front page of a website. I can easily loop through each line, but only one specific line interests me. So I want to identify t
After downloading the contents, use an HTML parser such as HTML Agility Pack to find the span element that has the jix_channels_count class.
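As a rough sketch of that idea, assuming the HtmlAgilityPack NuGet package is referenced and the page has already been downloaded into a string: the class and method names below (ChannelCountParser, ReadChannelCount) are made up for illustration, and the XPath assumes the span's class attribute is exactly jix_channels_count.

```csharp
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class ChannelCountParser // hypothetical helper name
{
    // Returns the trimmed text of the first <span class="jix_channels_count">,
    // or null when the page contains no such element.
    public static string ReadChannelCount(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of malformed real-world HTML

        // Exact match on the class attribute (an assumption); a span carrying
        // several classes would need contains(@class, ...) instead.
        var span = doc.DocumentNode.SelectSingleNode("//span[@class='jix_channels_count']");
        return span?.InnerText.Trim();
    }
}
```

Calling ChannelCountParser.ReadChannelCount(html) returns whatever text sits inside the span, so any surrounding brackets still have to be stripped by the caller.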
Another option is SgmlReader.
You tagged your question with regex; I wholeheartedly advise you not to go in that direction.
The suggested approach (with SgmlReader) goes more or less like so:
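// Requires: using System.IO; using System.Net; using System.Text;
//           using System.Xml; using System.Xml.Linq; using Sgml;
var url = "http://www.that-website.com/foo/"; // WebRequest.Create needs an absolute URI, scheme included
var myRequest = (HttpWebRequest)WebRequest.Create(url);
myRequest.Method = "GET";
WebResponse myResponse = myRequest.GetResponse();
var responseStream = myResponse.GetResponseStream();
// Encoding.Default may not match the page's real encoding; adjust if characters come out garbled.
var sr = new StreamReader(responseStream, Encoding.Default);
// SgmlReader exposes lenient, real-world HTML as well-formed XML.
var reader = new SgmlReader
{
    DocType = "HTML",
    WhitespaceHandling = WhitespaceHandling.None,
    CaseFolding = CaseFolding.ToLower,
    InputStream = sr
};
var xmlDoc = new XmlDocument();
xmlDoc.Load(reader);
// Hand the parsed document over to LINQ to XML.
var nodeReader = new XmlNodeReader(xmlDoc);
XElement xml = XElement.Load(nodeReader);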
Now you can just use LINQ to XML to (recursively or otherwise) find the span element with a class attribute whose value equals jix_channels_count and read the value of that element.
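That last lookup could be sketched roughly as follows. The class and method names (SpanFinder, FindChannelCount) are hypothetical, and the code assumes the class attribute holds exactly jix_channels_count; it compares local names so that it still works if the parsed document ends up with an XML namespace.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class SpanFinder // hypothetical helper name
{
    // Searches the whole tree for the first <span> whose class attribute
    // equals "jix_channels_count" and returns its trimmed text, or null.
    public static string FindChannelCount(XElement root)
    {
        var span = root
            .DescendantsAndSelf()
            .FirstOrDefault(e => e.Name.LocalName == "span"
                              && (string)e.Attribute("class") == "jix_channels_count");

        // The cast above yields null for a missing attribute, so elements
        // without a class are simply skipped.
        return span?.Value.Trim();
    }
}
```

With the xml variable from the snippet above, SpanFinder.FindChannelCount(xml) would give the element's text, or null if the page layout has changed and the span is gone.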
Parsing a whole HTML page with regexes is generally a bad idea. Still, if you know the exact structure of a single line, you can apply a regex to it as plain text without treating it as HTML.
Assuming the number always sits inside parentheses, directly within the span that has the jix_channels_count class:
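// Requires: using System.Text.RegularExpressions;
// htmlLine is the single line of markup that contains the span.
Match match = Regex.Match(htmlLine, @"(\<span[^>]*class=""jix_channels_count[^>]*\>\()([^)]+)(\))", RegexOptions.IgnoreCase);
if (match.Success)
{
    // Group 2 holds the text between the parentheses.
    string number = match.Groups[2].Value;
}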