How to use HTML Agility Pack

闹比i 2020-11-21 04:30

How do I use the HTML Agility Pack?

My XHTML document is not completely valid, which is why I want to use it. How do I use it in my C# project?

7 Answers
  • 2020-11-21 05:02

    I don't know if this will be of any help to you, but I have written a couple of articles which introduce the basics.

    • HtmlAgilityPack Article Series
    • Introduction To The HtmlAgilityPack Library
    • Easily extracting links from a snippet of html with HtmlAgilityPack

    The next article is 95% complete; I just have to write up explanations of the last few parts of the code. If you are interested, I will try to remember to post here when I publish it.

  • 2020-11-21 05:06

    The main HtmlAgilityPack-related code is as follows:

    using System;
    using System.Net;
    using System.Web;
    using System.Web.Services;
    using System.Web.Script.Services;
    using System.Text.RegularExpressions;
    using HtmlAgilityPack;
    
    namespace GetMetaData
    {
        /// <summary>
        /// Summary description for MetaDataWebService
        /// </summary>
        [WebService(Namespace = "http://tempuri.org/")]
        [WebServiceBinding(ConformsTo = WsiProfiles.BasicProfile1_1)]
        [System.ComponentModel.ToolboxItem(false)]
        // The following attribute allows this Web Service to be called from script using ASP.NET AJAX.
        [System.Web.Script.Services.ScriptService]
        public class MetaDataWebService: System.Web.Services.WebService
        {
            [WebMethod]
            [ScriptMethod(UseHttpGet = false)]
            public MetaData GetMetaData(string url)
            {
                MetaData objMetaData = new MetaData();
    
                // Download the page source and extract the title with a regular expression
                string html;
                using (var client = new WebClient())
                {
                    html = client.DownloadString(url);
                }
    
                objMetaData.PageTitle = Regex.Match(html,
                    @"<title\b[^>]*>\s*(?<Title>[\s\S]*?)</title>", RegexOptions.IgnoreCase).Groups["Title"].Value;
    
                //Method to get Meta Tags
                objMetaData.MetaDescription = GetMetaDescription(url);
                return objMetaData;
            }
    
            private string GetMetaDescription(string url)
            {
                string description = string.Empty;
    
                //Get Meta Tags
                var webGet = new HtmlWeb();
                var document = webGet.Load(url);
                var metaTags = document.DocumentNode.SelectNodes("//meta");
    
                // SelectNodes returns null (not an empty collection) when nothing matches
                if (metaTags != null)
                {
                    foreach (var tag in metaTags)
                    {
                        var name = tag.GetAttributeValue("name", string.Empty);
                        if (tag.Attributes["content"] != null && string.Equals(name, "description", StringComparison.OrdinalIgnoreCase))
                        {
                            description = tag.Attributes["content"].Value;
                        }
                    }
                }
                return description;
            }
        }
    }
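    
    The web service above fetches the page twice (once with WebClient, once with HtmlWeb.Load) and reads the title with a regular expression. A minimal sketch of an alternative, assuming the HTML is already available as a string, parses it once and reads both values from the same document. PageMetaReader and Read are hypothetical names introduced for illustration; they are not part of HtmlAgilityPack.

    ```csharp
    using System;
    using HtmlAgilityPack;

    // Hypothetical helper, not part of the library.
    public static class PageMetaReader
    {
        // Returns (title, description) read from a single parsed document.
        public static Tuple<string, string> Read(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // SelectSingleNode returns null when nothing matches.
            HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//title");
            string title = titleNode != null ? titleNode.InnerText.Trim() : string.Empty;

            string description = string.Empty;
            // SelectNodes also returns null (not an empty list) when nothing matches.
            var metaTags = doc.DocumentNode.SelectNodes("//meta[@name and @content]");
            if (metaTags != null)
            {
                foreach (HtmlNode tag in metaTags)
                {
                    string name = tag.GetAttributeValue("name", string.Empty);
                    if (string.Equals(name, "description", StringComparison.OrdinalIgnoreCase))
                    {
                        description = tag.GetAttributeValue("content", string.Empty);
                        break; // this sketch keeps the first match
                    }
                }
            }

            return Tuple.Create(title, description);
        }
    }
    ```

    Unlike the loop in the service, which keeps the last matching tag, this sketch stops at the first description it finds.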
    
  • 2020-11-21 05:11

    Try this:

    string htmlBody = ParseHtmlBody(dtViewDetails.Rows[0]["Body"].ToString());
    
    private string ParseHtmlBody(string html)
    {
        string body = string.Empty;
        try
        {
            var htmlDoc = new HtmlDocument();
            htmlDoc.LoadHtml(html);
            // SelectSingleNode returns null when the markup has no <body> element
            var htmlBody = htmlDoc.DocumentNode.SelectSingleNode("//body");
            if (htmlBody != null)
            {
                body = htmlBody.OuterHtml;
            }
        }
        catch (Exception ex)
        {
            dalPendingOrders.LogMessage("Error in ParseHtmlBody: " + ex.Message);
        }
        return body;
    }
    
  • 2020-11-21 05:16

    HtmlAgilityPack uses XPath syntax, and although many argue that it is poorly documented, I had no trouble using it with help from this XPath documentation: https://www.w3schools.com/xml/xpath_syntax.asp

    To parse

    <h2>
      <a href="">Jack</a>
    </h2>
    <ul>
      <li class="tel">
        <a href="">81 75 53 60</a>
      </li>
    </ul>
    <h2>
      <a href="">Roy</a>
    </h2>
    <ul>
      <li class="tel">
        <a href="">44 52 16 87</a>
      </li>
    </ul>
    

    I did this:

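    var names = new List<string>();
    var phones = new List<string>();
    
    string url = "http://website.com";
    var web = new HtmlWeb();
    var doc = web.Load(url);
    
    // SelectNodes returns null (not an empty collection) when nothing matches
    var nameNodes = doc.DocumentNode.SelectNodes("//h2//a");
    if (nameNodes != null)
    {
      foreach (HtmlNode node in nameNodes)
        names.Add(node.ChildNodes[0].InnerHtml);
    }
    
    var phoneNodes = doc.DocumentNode.SelectNodes("//li[@class='tel']//a");
    if (phoneNodes != null)
    {
      foreach (HtmlNode node in phoneNodes)
        phones.Add(node.ChildNodes[0].InnerHtml);
    }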
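    
    The two loops collect names and phone numbers into separate lists, so the pairing relies on both lists having the same order and length. As a rough sketch, assuming each name's h2 is followed by its own ul as in the sample markup, the following-sibling XPath axis can keep each pair together. ContactParser and Parse are hypothetical names, and the markup is passed in as a string so the example runs without a network call.

    ```csharp
    using System;
    using System.Collections.Generic;
    using HtmlAgilityPack;

    // Hypothetical helper, not part of the library.
    public static class ContactParser
    {
        // Returns (name, phone) pairs, one per <h2> that has a matching phone entry.
        public static List<KeyValuePair<string, string>> Parse(string html)
        {
            var contacts = new List<KeyValuePair<string, string>>();

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var headings = doc.DocumentNode.SelectNodes("//h2");
            if (headings == null)
            {
                return contacts; // no <h2> elements at all
            }

            foreach (HtmlNode h2 in headings)
            {
                // Paths starting with "." or an axis are relative to the current node.
                HtmlNode nameLink = h2.SelectSingleNode(".//a");
                // following-sibling::ul[1] is the first <ul> that comes after this <h2>.
                HtmlNode phoneLink = h2.SelectSingleNode("following-sibling::ul[1]/li[@class='tel']//a");

                if (nameLink == null || phoneLink == null)
                {
                    continue; // skip incomplete entries
                }

                contacts.Add(new KeyValuePair<string, string>(
                    nameLink.InnerText.Trim(),
                    phoneLink.InnerText.Trim()));
            }

            return contacts;
        }
    }
    ```

    Calling ContactParser.Parse on the sample markup should give one pair per person, with an entry skipped whenever either link is missing.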
    
  • 2020-11-21 05:18

    Getting Started - HTML Agility Pack

    // From a file (filePath is the path to a local HTML file)
    var fileDoc = new HtmlDocument();
    fileDoc.Load(filePath);
    
    // From a string (html holds the markup)
    var stringDoc = new HtmlDocument();
    stringDoc.LoadHtml(html);
    
    // From the web
    var url = "http://html-agility-pack.net/";
    var web = new HtmlWeb();
    var webDoc = web.Load(url);
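    
    Since the question is about markup that is not completely valid, here is a small sketch of what loading such markup might look like. The parser does not throw on unbalanced tags; it builds a tree anyway and records what it noticed in HtmlDocument.ParseErrors. MalformedMarkupDemo and its method names are hypothetical, and which errors get reported for a given input depends on the library version.

    ```csharp
    using System;
    using System.Collections.Generic;
    using HtmlAgilityPack;

    // Hypothetical helper, not part of the library.
    public static class MalformedMarkupDemo
    {
        // Returns the text of the first <a> element, or an empty string if there is none.
        public static string FirstLinkText(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html); // tolerates unclosed or stray tags

            HtmlNode link = doc.DocumentNode.SelectSingleNode("//a");
            return link != null ? link.InnerText.Trim() : string.Empty;
        }

        // Lists the problems the parser recorded while reading the markup.
        public static List<string> DescribeParseErrors(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var messages = new List<string>();
            foreach (HtmlParseError error in doc.ParseErrors)
            {
                messages.Add(string.Format("line {0}: {1}", error.Line, error.Reason));
            }
            return messages;
        }
    }
    ```

    For example, FirstLinkText("<div><span>unclosed <a href=\"/x\">link</a></div></em>") still finds the link even though the span is never closed and the closing em has no opening tag.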
    
  • 2020-11-21 05:21
    public string HtmlAgi(string url, string key)
    {
        var web = new HtmlWeb();
        var doc = web.Load(url);
    
        // Look up <meta name="key" ...>; SelectSingleNode returns null when nothing matches
        HtmlNode ourNode = doc.DocumentNode.SelectSingleNode(string.Format("//meta[@name='{0}']", key));
    
        if (ourNode != null)
        {
            // The second argument is the default returned when the attribute is missing
            return ourNode.GetAttributeValue("content", "");
        }
    
        return "not found";
    }
    