RegEx matching HTML tags and extracting text

生来不讨喜 2020-12-17 01:45

I have a string of text like this:

<customtag>hey</customtag>

I want to use a RegEx to modify the text between the "customtag" tags.

5 answers
  • 2020-12-17 01:49

    If there won't be any other tags between the two tags, this regex is a little safer, and more efficient:

    <customtag>[^<>]*</customtag>
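
As a minimal sketch of how this pattern could be used for the replacement the question asks about, here it is in Python's `re` module. The helper name `replace_customtag_text` and the sample strings are assumptions made for illustration; the pattern itself is the one from this answer.

```python
import re

# The pattern from the answer: it matches a tag pair only when
# no '<' or '>' character appears between the two tags.
PATTERN = re.compile(r"<customtag>[^<>]*</customtag>")


def replace_customtag_text(source: str, new_text: str) -> str:
    """Replace the text inside every simple <customtag> pair (hypothetical helper)."""
    # A function replacement avoids backslash-escape processing of new_text.
    return PATTERN.sub(lambda m: f"<customtag>{new_text}</customtag>", source)


simple = replace_customtag_text("a <customtag>hey</customtag> b", "changed")

# Content that contains another tag is left untouched, which is the
# "safer" behaviour the answer refers to.
nested = replace_customtag_text("<customtag>a<b>x</b></customtag>", "changed")
```

The negated character class can never run past the next tag, so a missed match is the worst case rather than an over-long one.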
    
  • 2020-12-17 01:56

    I wouldn't use regex for this either, but if you must, this expression should work: <customtag>(.+?)</customtag>
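
A short sketch of how the lazy capture group behaves, written with Python's `re` module. The helper name and the sample inputs are assumptions for illustration. One detail worth knowing: in Python, `.` does not match a newline unless the `re.DOTALL` flag is set, so multi-line content is silently skipped by default.

```python
import re

# The expression from the answer; the group captures the text between the tags.
LAZY = re.compile(r"<customtag>(.+?)</customtag>")
# Same expression, but '.' is also allowed to match newlines.
LAZY_DOTALL = re.compile(r"<customtag>(.+?)</customtag>", re.DOTALL)


def extract_customtag_text(source: str, pattern: re.Pattern = LAZY) -> list:
    """Return the captured text of every match (hypothetical helper)."""
    return pattern.findall(source)


# The lazy quantifier stops at the first closing tag, so two separate
# tag pairs produce two separate captures.
two_pairs = extract_customtag_text(
    "<customtag>one</customtag> and <customtag>two</customtag>"
)

# Content spanning several lines only matches with re.DOTALL.
multiline = "<customtag>line1\nline2</customtag>"
without_flag = extract_customtag_text(multiline)
with_flag = extract_customtag_text(multiline, LAZY_DOTALL)
```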

  • 2020-12-17 01:57

    I'd chew my own leg off before using a regular expression to parse and alter HTML.

    Use XSL or DOM.


    Two comments have asked me to clarify. The regular expression substitution works in the specific case in the OP's question, but in general regular expressions are not a good solution. Regular expressions can match regular languages, i.e. a sequence of input which can be accepted by a finite state machine. HTML can contain nested tags to any arbitrary depth, so it's not a regular language.

    What does this have to do with the question? Using a regular expression for the OP's question as it is written works, but what if the content between the <customtag> tags contains other tags? What if a literal < character occurs in the text? It has been 11 months since Jon Tackabury asked the question, and I'd guess that in that time, the complexity of his problem may have increased.
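
Both failure modes can be reproduced with a small sketch in Python. The two patterns are the ones suggested elsewhere in this thread; the sample inputs are made up for illustration.

```python
import re

LAZY = re.compile(r"<customtag>(.+?)</customtag>")
NO_ANGLE_BRACKETS = re.compile(r"<customtag>[^<>]*</customtag>")

# Case 1: the tag is nested inside itself. The lazy pattern pairs the
# outer opening tag with the inner closing tag, so the capture is not
# the content of either element.
nested = "<customtag>outer <customtag>inner</customtag> tail</customtag>"
captured = LAZY.findall(nested)

# Case 2: a literal '<' occurs in the text. The stricter pattern refuses
# to match at all, so the content is never replaced.
literal_lt = "<customtag>1 < 2</customtag>"
strict_match = NO_ANGLE_BRACKETS.search(literal_lt)
```

Neither pattern raises an error here; both return a wrong or empty result without complaint, which is what makes this kind of bug hard to notice.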

    Regular expressions are great tools and I do use them all the time. But using them in lieu of a real parser for input that needs one is going to work in only very simple cases. It's practically inevitable that these cases grow beyond what regular expressions can handle. When that happens, you'll be tempted to write a more complex regular expression, but these quickly become very laborious to develop and debug. Be ready to scrap the regular expression solution when the parsing requirements expand.

    XSL and DOM are two standard technologies designed to work with XML or XHTML markup. Both technologies know how to parse structured markup files, keep track of nested tags, and allow you to transform tags, attributes, or content.
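
To illustrate the parser-based approach, here is a rough sketch using Python's standard `xml.etree.ElementTree` module as a stand-in for the C# DOM APIs in the articles listed below. It assumes the input is well-formed XML or XHTML; the helper name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def set_customtag_text(xml_source: str, new_text: str) -> str:
    """Set the leading text of every <customtag> element (hypothetical helper).

    Assumes xml_source is well-formed; ET.fromstring raises ParseError otherwise.
    """
    root = ET.fromstring(xml_source)
    # iter() walks the whole tree, so nesting depth does not matter.
    for element in root.iter("customtag"):
        # .text is the text before the first child element; children are kept.
        element.text = new_text
    return ET.tostring(root, encoding="unicode")


doc = "<root><p>intro</p><customtag>hey <b>you</b></customtag></root>"
updated = set_customtag_text(doc, "changed")
```

The parser tracks element boundaries itself, so the nested `<b>` element survives the edit without any special handling.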

    Here are a couple of articles on how to use XSL with C#:

    • http://www.csharpfriends.com/Articles/getArticle.aspx?articleID=63
    • http://www.csharphelp.com/archives/archive78.html

    Here are a couple of articles on how to use DOM with C#:

    • http://msdn.microsoft.com/en-us/library/aa290341%28VS.71%29.aspx
    • http://blogs.msdn.com/tims/archive/2007/06/13/programming-html-with-c.aspx

    Here's a .NET library that assists DOM and XSL operations on HTML:

    • http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack
  • 2020-12-17 01:59

    Most people use HTML Agility Pack for HTML text parsing. However, I find it a little too heavyweight and complicated for my own needs. Instead, I create a web browser control in memory, load the page, and copy the text from it (see the examples linked below).

    You can find 3 simple examples here:

    http://jakemdrew.wordpress.com/2012/02/03/getting-only-the-text-displayed-on-a-webpage-using-c/
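
For readers without a browser control at hand, the same goal (only the displayed text) can be approximated with a streaming parser. The sketch below uses Python's standard `html.parser` module. It is a different technique from the in-memory browser described above: it does not run scripts or apply CSS. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes entities such as &nbsp;
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Chunks are joined with a space, so text split by inline tags
        # in the middle of a word would gain a space. Whitespace runs,
        # including non-breaking spaces, collapse to a single space.
        return " ".join(" ".join(self._chunks).split())


def visible_text(html_source: str) -> str:
    parser = TextExtractor()
    parser.feed(html_source)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Hello&nbsp;<b>world</b></p><script>var x = 1;</script></body></html>"
)
extracted = visible_text(sample)
```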

  • 2020-12-17 02:13
    // Strip all HTML tags (assumes `Content` holds the HTML string)
    var re = new RegExp("<[^>]*>", "g");
    var x2 = Content.replace(re, "");

    // Remove non-breaking spaces: both the &nbsp; entity and the decoded U+00A0 character
    var x3 = x2.replace(/&nbsp;|\u00a0/g, "");
    