How do I extract info from a webpage?

再見小時候 2021-01-25 04:55

I want to collect some data from the front page of a website. I can easily run through each line and it is only one specific one that I am interested in. So I want to identify t

2 answers
  •  说谎 (OP)
    2021-01-25 05:29

    After downloading the contents, use an HTML Parser such as HTML Agility Pack to identify the span element belonging to the jix_channels_count class.
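
    As a rough illustration of the HTML Agility Pack route, the sketch below assumes the HtmlAgilityPack NuGet package is installed and that the page has already been downloaded into a string. The class name ChannelCountScraper and the method name ExtractCount are hypothetical, chosen only for this example.

```csharp
using HtmlAgilityPack;

public static class ChannelCountScraper
{
    // Returns the trimmed text of the first <span> whose class attribute is
    // exactly "jix_channels_count", or null when the page has no such span.
    public static string ExtractCount(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        var span = doc.DocumentNode.SelectSingleNode(
            "//span[@class='jix_channels_count']");

        return span?.InnerText.Trim();
    }
}
```

    The XPath here matches the class attribute exactly; if the real page puts several classes on the span, the expression would need a contains() test instead.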

    Another option is SgmlReader.

    You tagged your question with regex. I wholeheartedly advise you not to take this direction.

    The suggested approach (with SgmlReader) goes more or less like so:

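    // Requires: using System.IO; using System.Net; using System.Text;
    //           using System.Xml; using System.Xml.Linq; using Sgml;
    // WebRequest.Create needs an absolute URI, so the scheme must be included.
    var url = "http://www.that-website.com/foo/";
    var myRequest = (HttpWebRequest)WebRequest.Create(url);
    myRequest.Method = "GET";
    XElement xml;
    // Dispose the response, stream and reader once the document is loaded.
    using (WebResponse myResponse = myRequest.GetResponse())
    using (var responseStream = myResponse.GetResponseStream())
    using (var sr = new StreamReader(responseStream, Encoding.Default))
    {
        // SgmlReader turns possibly malformed HTML into well-formed XML.
        var reader = new SgmlReader
                     {
                         DocType = "HTML",
                         WhitespaceHandling = WhitespaceHandling.None,
                         CaseFolding = CaseFolding.ToLower,
                         InputStream = sr
                     };
        var xmlDoc = new XmlDocument();
        xmlDoc.Load(reader);
        var nodeReader = new XmlNodeReader(xmlDoc);
        xml = XElement.Load(nodeReader);
    }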

    Now you can use LINQ to XML to find (recursively or otherwise) the span element whose class attribute equals jix_channels_count, and read the value of that element.
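
    The LINQ to XML lookup described above could look roughly like the following sketch. The helper class SpanFinder and its method FindSpanText are hypothetical names, not part of any library. The code compares local element names on the assumption that the parsed document may carry an XML namespace.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SpanFinder
{
    // Returns the text of the first <span> whose class attribute equals
    // className, or null when no such element exists under root.
    public static string FindSpanText(XElement root, string className)
    {
        return root
            .DescendantsAndSelf()
            // LocalName ignores any namespace the parser attached.
            .Where(e => e.Name.LocalName == "span")
            // The cast yields null when the attribute is missing.
            .Where(e => (string)e.Attribute("class") == className)
            .Select(e => e.Value)
            .FirstOrDefault();
    }
}
```

    With the xml element loaded as in the snippet above, the call would be SpanFinder.FindSpanText(xml, "jix_channels_count").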
