How do I extract info from a webpage?

再見小時候 2021-01-25 04:55

I want to collect some data from the front page of a website. I can easily run through each line, and only one specific line interests me. So I want to identify t

2 answers
  • 2021-01-25 05:29

    After downloading the page contents, use an HTML parser such as HTML Agility Pack to find the span element with the jix_channels_count class.
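
As a rough illustration of the HTML Agility Pack route, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `AgilityPackExample`, the helper `FindChannelCount`, and the sample markup are hypothetical and not part of the original answer.

```csharp
using System;
using HtmlAgilityPack;

public static class AgilityPackExample
{
    // Returns the inner text of the first <span> whose class attribute is exactly
    // "jix_channels_count", or null when the page contains no such element.
    public static string FindChannelCount(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of malformed real-world HTML

        // SelectSingleNode returns null when the XPath matches nothing.
        var span = doc.DocumentNode.SelectSingleNode("//span[@class='jix_channels_count']");
        return span?.InnerText;
    }

    public static void Main()
    {
        // Hypothetical markup standing in for the downloaded page.
        var html = "<html><body><p>Channels <span class=\"jix_channels_count\">(42)</span></p></body></html>";
        Console.WriteLine(FindChannelCount(html));
    }
}
```

The XPath test `@class='jix_channels_count'` only matches when the attribute holds exactly that value; a span carrying several classes would need a `contains(...)` test instead.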

    Another option is SgmlReader.

    You tagged your question with regex - I wholeheartedly advise against taking this direction.

    The suggested approach (with SgmlReader) goes more or less like so:

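    // Requires the SgmlReader library (namespace Sgml) and these usings:
    // System.IO, System.Net, System.Text, System.Xml, System.Xml.Linq.
    // WebRequest.Create needs an absolute URI, so the scheme is spelled out (http assumed).
    var url = "http://www.that-website.com/foo/";
    var myRequest = (HttpWebRequest)WebRequest.Create(url);
    myRequest.Method = "GET";
    WebResponse myResponse = myRequest.GetResponse();
    var responseStream = myResponse.GetResponseStream();
    var sr = new StreamReader(responseStream, Encoding.Default);
    // SgmlReader exposes the (possibly malformed) HTML as well-formed XML.
    var reader = new SgmlReader
                 {
                     DocType = "HTML",
                     WhitespaceHandling = WhitespaceHandling.None,
                     CaseFolding = CaseFolding.ToLower,
                     InputStream = sr
                 };
    // Load the converted markup, then re-read it as an XElement for LINQ to XML.
    var xmlDoc = new XmlDocument();
    xmlDoc.Load(reader);
    var nodeReader = new XmlNodeReader(xmlDoc);
    XElement xml = XElement.Load(nodeReader);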
    

    Now you can just use LINQ to XML to (recursively or otherwise) find the span element with an attribute class whose value equals jix_channels_count and read the value of that element.
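
The LINQ to XML lookup described above could look roughly like this. It is a sketch only: the class `ChannelCountFinder`, the method `FindChannelCount`, and the sample markup are hypothetical, and the `XElement` is parsed from a string here instead of coming from SgmlReader so the example runs without extra packages.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ChannelCountFinder
{
    // Returns the text of the first <span> whose class attribute equals
    // "jix_channels_count", or null when no such element exists.
    public static string FindChannelCount(XElement root)
    {
        return root
            .DescendantsAndSelf()
            // Compare the local name so an XHTML namespace on the page does not matter.
            .Where(e => e.Name.LocalName == "span")
            // The cast yields null when the attribute is missing, so no null check is needed.
            .Where(e => (string)e.Attribute("class") == "jix_channels_count")
            .Select(e => e.Value)
            .FirstOrDefault();
    }

    public static void Main()
    {
        // Hypothetical markup standing in for the XElement built from the page.
        var xml = XElement.Parse(
            "<html><body><div><span class=\"jix_channels_count\">(42)</span></div></body></html>");
        Console.WriteLine(FindChannelCount(xml));
    }
}
```

`DescendantsAndSelf` walks the whole tree, so no hand-written recursion is needed.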

  • 2021-01-25 05:40

    Parsing an HTML page with regular expressions is generally the wrong approach. Still, if you know the exact structure of a single HTML line, you can apply a regex to it as plain text, without treating the line as HTML.

    Assuming the number always appears in parentheses, inside the span with the jix_channels_count class:

    Match match = Regex.Match(htmlLine, @"(\<span[^>]*class=""jix_channels_count[^>]*\>\()([^)]+)(\))", RegexOptions.IgnoreCase);
    if (match.Success)
    {
        string number = match.Groups[2].Value;
    }
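
To see the pattern in action, here is a small self-contained sketch that wraps the same regex in a helper and runs it on a sample line. The class `RegexExample`, the method `ExtractNumber`, and the sample line are hypothetical; the real page markup may differ.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexExample
{
    // Returns the text between the parentheses in the jix_channels_count span,
    // or null when the line does not match the expected shape.
    public static string ExtractNumber(string htmlLine)
    {
        Match match = Regex.Match(
            htmlLine,
            @"(\<span[^>]*class=""jix_channels_count[^>]*\>\()([^)]+)(\))",
            RegexOptions.IgnoreCase);
        // Group 2 captures everything between "(" and ")".
        return match.Success ? match.Groups[2].Value : null;
    }

    public static void Main()
    {
        // Hypothetical line; assumes the opening parenthesis directly follows the tag.
        var line = "<span class=\"jix_channels_count\">(1234)</span>";
        Console.WriteLine(ExtractNumber(line));
    }
}
```

The pattern only matches when the class attribute is written with double quotes and the parenthesis follows the opening tag with no whitespace in between, which is why it depends on the exact line structure being known.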
    