How to get webpage title without downloading all the page source

小鲜肉 2020-12-31 23:29

I'm looking for a method that will allow me to get the title of a webpage and store it as a string.

However all the solutions I have found so far involve downloadin

2 answers
  • 2021-01-01 00:03

    As the <title> tag is in the HTML itself, there is no way to find "just the title" without downloading the file. You should be able to download a portion of the file until you've read in the <title> tag, or the </head> tag, and then stop, but you'll still need to download (at least a portion of) the file.

    This can be accomplished with HttpWebRequest/HttpWebResponse and reading in data from the response stream until we've either read in a <title></title> block, or the </head> tag. I added the </head> tag check because, in valid HTML, the title block must appear within the head block - so, with this check we will never parse the entire file in any case (unless there is no head block, of course).

    The following should be able to accomplish this task:

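    // required namespaces
    using System;
    using System.IO;
    using System.Net;
    using System.Text;
    using System.Text.RegularExpressions;
    
    // url is the address of the page whose title we want
    string title = "";
    try {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    
        // dispose the response as well as the stream so the connection is released
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream()) {
            // compiled regex to check for <title></title> block; Singleline lets the
            // title span several lines and [^>]* tolerates attributes on the tag
            Regex titleCheck = new Regex(@"<title[^>]*>\s*(.+?)\s*</title>",
                RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
            int bytesToRead = 8192;
            byte[] buffer = new byte[bytesToRead];
            char[] chars = new char[Encoding.UTF8.GetMaxCharCount(bytesToRead)];
            // a Decoder keeps state between calls, so a multi-byte character split
            // across two reads is still decoded correctly (assumes a UTF-8 page)
            Decoder decoder = Encoding.UTF8.GetDecoder();
            StringBuilder contents = new StringBuilder();
            int length = 0;
            while ((length = stream.Read(buffer, 0, bytesToRead)) > 0) {
                // decode the bytes just read and add them to the rest of the
                // contents that have been downloaded so far
                int charCount = decoder.GetChars(buffer, 0, length, chars, 0);
                contents.Append(chars, 0, charCount);
                string soFar = contents.ToString();
    
                Match m = titleCheck.Match(soFar);
                if (m.Success) {
                    // we found a <title></title> match =]
                    title = m.Groups[1].Value;
                    break;
                } else if (soFar.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0) {
                    // reached end of head-block; no title found =[
                    break;
                }
            }
        }
    } catch (Exception e) {
        Console.WriteLine(e);
    }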
    

    UPDATE: Updated the original source-example to use a compiled Regex and a using statement for the Stream for better efficiency and maintainability.
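
    On newer .NET versions, where HttpWebRequest is marked obsolete in favor of HttpClient, the same stop-early idea could be sketched roughly as below. This is only an illustration under assumptions: the TitleReader class and its ReadTitle and GetTitleAsync methods are hypothetical names, and the page is assumed to be UTF-8. The scanning loop is kept separate from the network call so it can be tried against any Stream.

    ```csharp
    using System;
    using System.IO;
    using System.Net.Http;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;

    // Hypothetical helper, not part of any library.
    public static class TitleReader
    {
        private static readonly Regex TitleCheck = new Regex(
            @"<title[^>]*>\s*(.*?)\s*</title>",
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Reads the stream in chunks and stops at the first <title> block,
        // or at </head> if no title has been seen by then. Assumes UTF-8.
        public static string ReadTitle(Stream stream)
        {
            using (var reader = new StreamReader(stream, Encoding.UTF8))
            {
                var contents = new StringBuilder();
                var chunk = new char[4096];
                int read;
                while ((read = reader.Read(chunk, 0, chunk.Length)) > 0)
                {
                    contents.Append(chunk, 0, read);
                    string soFar = contents.ToString();

                    Match m = TitleCheck.Match(soFar);
                    if (m.Success) return m.Groups[1].Value;
                    if (soFar.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0) break;
                }
            }
            return "";
        }

        public static async Task<string> GetTitleAsync(HttpClient client, string url)
        {
            // ResponseHeadersRead completes once the headers arrive instead of
            // buffering the whole body, so ReadTitle can stop reading early.
            using (HttpResponseMessage response =
                await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
            using (Stream stream = await response.Content.ReadAsStreamAsync())
            {
                return ReadTitle(stream);
            }
        }
    }
    ```

    A call would look like string title = await TitleReader.GetTitleAsync(client, url); with a shared HttpClient instance. Some bytes past the title may already have been received by the time reading stops, so this reduces the download rather than guaranteeing a minimum.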

  • 2021-01-01 00:19

    A simpler way to handle this would be to download the whole page, then split the title out of the source:

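        using System;
        using System.Net.Http;
        using System.Threading.Tasks;
    
        // reuse a single HttpClient rather than creating one per request
        private static readonly HttpClient hc = new HttpClient();
    
        // returns Task<string> instead of async void so that callers can
        // await the result and observe any exception
        private async Task<string> GetSite(string url)
        {
            using (HttpResponseMessage response = await hc.GetAsync(new Uri(url, UriKind.Absolute)))
            {
                response.EnsureSuccessStatusCode();
                string source = await response.Content.ReadAsStringAsync();
    
                //process the source here
    
                return source;
            }
        }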
    

    To process the source, you can use the method described in the article "Getting Content From Between HTML Tags".
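
    As a stand-in for that processing step, here is a minimal sketch that pulls the text between an opening and closing tag using plain string searches. It is not the article's method; the HtmlText class and GetBetweenTags method are hypothetical names, and the approach assumes reasonably well-formed markup (it takes the first tag whose name starts with the given text and does no real HTML parsing).

    ```csharp
    using System;

    // Hypothetical helper, not part of any library.
    public static class HtmlText
    {
        // Returns the trimmed text between the first <tagName ...> and the next
        // </tagName, or an empty string when either tag is missing.
        public static string GetBetweenTags(string source, string tagName)
        {
            int open = source.IndexOf("<" + tagName, StringComparison.OrdinalIgnoreCase);
            if (open < 0) return "";

            // skip past the end of the opening tag, including any attributes
            int start = source.IndexOf('>', open);
            if (start < 0) return "";
            start++;

            int end = source.IndexOf("</" + tagName, start, StringComparison.OrdinalIgnoreCase);
            if (end < 0) return "";

            return source.Substring(start, end - start).Trim();
        }
    }
    ```

    With the downloaded source in hand, the title would be read with string title = HtmlText.GetBetweenTags(source, "title");. For anything beyond a quick lookup, a real HTML parser is the safer choice.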
