How do I extract info from a webpage?

I want to collect some data from the front page of a website. I can easily iterate over its lines, and only one specific line interests me. So I want to identify t

2 Answers

说谎

2021-01-25 05:29

After downloading the contents, use an HTML Parser such as HTML Agility Pack to identify the span element belonging to the jix_channels_count class.
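As a rough sketch of the HTML Agility Pack route, assuming the `HtmlAgilityPack` NuGet package is installed: the class and method names and the sample markup below are hypothetical, and the XPath assumes the span's `class` attribute is exactly `jix_channels_count`.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of the base class library

public static class AgilityPackSketch
{
    // Returns the inner text of the first matching span, or null if there is none.
    public static string FindChannelCount(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span[@class='jix_channels_count']");
        return span?.InnerText;
    }

    public static void Main()
    {
        // Hypothetical markup; the real page structure is not shown in the question.
        string html = "<html><body><span class=\"jix_channels_count\">(1234)</span></body></html>";
        Console.WriteLine(FindChannelCount(html) ?? "span not found");
    }
}
```

If the span carries several classes, an XPath such as `//span[contains(@class, 'jix_channels_count')]` may be a better fit.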

Another option is SgmlReader.

You tagged your question with regex. I wholeheartedly advise you not to take this direction.

The suggested approach (with SgmlReader) goes more or less like so:

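```
// Needs the SgmlReader library (namespace Sgml) and these namespaces:
// System.IO, System.Net, System.Text, System.Xml, System.Xml.Linq

// WebRequest.Create requires an absolute URI, so the scheme must be included.
var url = "http://www.that-website.com/foo/";
var myRequest = (HttpWebRequest)WebRequest.Create(url);
myRequest.Method = "GET";
WebResponse myResponse = myRequest.GetResponse();
var responseStream = myResponse.GetResponseStream();
var sr = new StreamReader(responseStream, Encoding.Default);

// SgmlReader turns loosely formed HTML into well-formed XML.
var reader = new SgmlReader
{
    DocType = "HTML",
    WhitespaceHandling = WhitespaceHandling.None,
    CaseFolding = CaseFolding.ToLower, // element names come out lowercase
    InputStream = sr
};

// Load the converted document, then expose it to LINQ to XML as an XElement.
var xmlDoc = new XmlDocument();
xmlDoc.Load(reader);
var nodeReader = new XmlNodeReader(xmlDoc);
XElement xml = XElement.Load(nodeReader);
```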

Now you can just use LINQ to XML to (recursively or otherwise) find the span element with an attribute class whose value equals jix_channels_count and read the value of that element.
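The LINQ to XML lookup could be sketched roughly as follows. The class and method names and the sample markup are hypothetical, and the sketch takes the first matching span; in real use the `XElement` would come from the SgmlReader code above rather than from `XElement.Parse`.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ChannelCount
{
    // Returns the text of the first <span class="jix_channels_count">, or null if none exists.
    public static string Find(XElement root)
    {
        return root
            .DescendantsAndSelf()
            // Compare local names so an XML namespace on the elements does not hide the match.
            .Where(e => e.Name.LocalName == "span")
            // Casting a missing attribute to string yields null instead of throwing.
            .Where(e => (string)e.Attribute("class") == "jix_channels_count")
            .Select(e => e.Value)
            .FirstOrDefault();
    }

    public static void Main()
    {
        // Hypothetical, already well-formed markup standing in for the converted page.
        XElement sample = XElement.Parse(
            "<html><body><span class=\"jix_channels_count\">(1234)</span></body></html>");
        Console.WriteLine(Find(sample) ?? "span not found"); // prints (1234)
    }
}
```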

再見小時候

2021-01-25 05:40
Parsing an HTML page with regexes is wrong. Still, if you know the exact structure of a single HTML line, you can apply a regex to it as plain text, without treating the line as HTML.

Assuming that the number is always inside parentheses, within the span that has the jix_channels_count class:
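```
// Requires: using System.Text.RegularExpressions;
// htmlLine holds the single line of HTML that contains the span.
Match match = Regex.Match(htmlLine, @"(\<span[^>]*class=""jix_channels_count[^>]*\>\()([^)]+)(\))", RegexOptions.IgnoreCase);
if (match.Success)
{
    // Group 2 is the text between the parentheses.
    string number = match.Groups[2].Value;
}
```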
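To make the capture groups concrete, here is the same pattern run against a made-up line. The sample markup and the `RegexDemo` and `ExtractNumber` names are assumptions for illustration; the actual line from the page is not shown in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexDemo
{
    // Returns the text between the parentheses, or null when the line does not match.
    public static string ExtractNumber(string htmlLine)
    {
        Match match = Regex.Match(
            htmlLine,
            @"(\<span[^>]*class=""jix_channels_count[^>]*\>\()([^)]+)(\))",
            RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[2].Value : null;
    }

    public static void Main()
    {
        // Hypothetical input line.
        string line = "<li><span class=\"jix_channels_count\">(1234)</span></li>";
        Console.WriteLine(ExtractNumber(line)); // prints 1234

        // A span with a different class does not match.
        Console.WriteLine(ExtractNumber("<span class=\"other\">(99)</span>") ?? "no match");
    }
}
```

Group 1 covers the opening tag up to and including the `(`, group 2 is the number, and group 3 is the closing `)`.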