RegEx to remove carriage returns between <p> tags

臣服心动 2021-01-07 07:57

I've stumped myself trying to figure out how to remove carriage returns that occur between <p> tags. (Technically I need to replace them with spaces, not

7 Answers
  • 2021-01-07 08:36

    This is the "almost good enough" lexing solution promised in my other answer, to sketch how it can be done. It makes a half-hearted attempt at coping with attributes, but not seriously. It also doesn't attempt to cope with unencoded "<" in attributes. These are relatively minor failings, and it does handle nested P tags, but as described in the comments it's totally unable to handle the case where someone doesn't close a P, because we can't do that without a thorough understanding of HTML. Considering how prevalent that practice still is, it's safe to declare this code "nearly useless". :)

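    #!/usr/bin/perl
    use strict;
    use warnings;

    # Read the whole document from STDIN in one go.
    my $html   = do { local $/; <STDIN> } // '';
    my $output = '';
    my $in_p   = 0;    # nesting depth of currently open P elements

    # \z (not \Z) so that a trailing newline is still copied to the output.
    while ($html !~ /\G\z/cg) {
      if ($html =~ /\G(<p\b[^>]*>)/cgi) {    # \b keeps <pre>, <param> etc. out
        $output .= $1;
        $in_p ++;
      } elsif ($html =~ m[\G(</p>)]cgi) {
        $output .= $1;
        $in_p --; # Woe unto anyone who doesn't provide a closing tag.
        # Tag soup parsers are good for this because they can generate an
        # "artificial" end to the P when they find an element that can't contain
        # a P, or the end of the enclosing element. We're not smart enough for that.
      } elsif ($html =~ /\G([^<]+)/cg) {
        my $text = $1;
        # Collapse each line break (and the whitespace around it) to one space.
        $text =~ s/\s*\n\s*/ /g if $in_p;
        $output .= $text;
      } elsif ($html =~ /\G(<)/cg) {
        # Any other "<" (another tag, or a stray character) is copied as-is.
        $output .= $1;
      } else {
        die "Can't happen, but not having an else is scary!";
      }
    }

    print $output;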
    
  • 2021-01-07 08:38

    Regular expressions are singularly unsuitable to deal with "balanced parentheses" kinds of problems, even though people persist in trying to shoehorn them there (and some implementations -- I'm thinking of very recent perl releases, for example -- try to cooperate with this widespread misconception by extending and stretching "regular expressions" well beyond the CS definition thereof;-).

    If you don't have to deal with nesting, it's comfortably doable in a two-pass approach -- grab each paragraph with e.g. <p>.*?</p> (possibly with parentheses for grouping), then perform the substitution within each paragraph thus identified.
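A minimal Python sketch of that two-pass idea, assuming paragraphs are never nested and every <p> is closed. The names `PARAGRAPH`, `LINE_BREAK` and `join_paragraph_lines` are made up for this illustration; the first pattern finds each paragraph, the second does the substitution inside it.

```python
import re

# Pass 1: one non-nested <p>...</p> block, attributes allowed.
# \b keeps <pre> and <param> from matching; DOTALL lets "." cross newlines.
PARAGRAPH = re.compile(r"<p\b[^>]*>.*?</p>", re.IGNORECASE | re.DOTALL)

# Pass 2: a line break (CRLF, CR or LF) plus the whitespace around it.
LINE_BREAK = re.compile(r"[ \t]*(?:\r\n|\r|\n)\s*")


def join_paragraph_lines(html: str) -> str:
    """Replace line breaks with single spaces, but only inside <p> blocks."""
    return PARAGRAPH.sub(lambda m: LINE_BREAK.sub(" ", m.group(0)), html)


sample = "<div>\n<p class='x'>one\r\ntwo\n  three</p>\n<pre>a\nb</pre>\n</div>"
print(join_paragraph_lines(sample))
```

Passing a function to `sub` keeps both passes in one call, so the paragraphs never have to be searched for a second time in the body.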

  • 2021-01-07 08:40

    A single-regex solution is basically impossible here. If you absolutely insist on not using an HTML parser, and you can count on your input being well-formed and predictable, then you can write a simple lexer that will do the job (and I can provide sample code) -- but it's still not a very good idea :)

    For reference:

    • Why shouldn't I parse XML or XHTML with a regex?
    • How can I parse HTML in my language of choice?
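For illustration only, here is one way such a "simple lexer" might look in Python. It is a sketch, not the answerer's code: `TOKEN`, `LINE_BREAK` and `join_lines_in_paragraphs` are hypothetical names, and it assumes well-formed input in which every <p> is closed. An unclosed <p> leaves the depth counter above zero, so all later text would be joined as well.

```python
import re

# Every character belongs to exactly one token: an opening <p ...>,
# a closing </p>, a run of text, or any other single "<".
TOKEN = re.compile(
    r"(?P<open><p\b[^>]*>)|(?P<close></p\s*>)|(?P<text>[^<]+)|(?P<lt><)",
    re.IGNORECASE,
)
LINE_BREAK = re.compile(r"\s*[\r\n]+\s*")


def join_lines_in_paragraphs(html: str) -> str:
    out = []
    depth = 0  # how many <p> elements are currently open
    for m in TOKEN.finditer(html):
        piece = m.group(0)
        if m.lastgroup == "open":
            depth += 1
        elif m.lastgroup == "close":
            depth = max(depth - 1, 0)
        elif m.lastgroup == "text" and depth > 0:
            # Only text inside a paragraph has its line breaks replaced.
            piece = LINE_BREAK.sub(" ", piece)
        out.append(piece)
    return "".join(out)


print(join_lines_in_paragraphs("<p>a\nb <b>c\nd</b></p>\n<div>e\nf</div>"))
```

Other tags are not recognized as such: their "<" is one token and the rest is treated as text, which is enough for this job because only the depth of <p> matters.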
  • 2021-01-07 08:45

    The standard answer is: don't try to process HTML (or SGML or XML) with a regex. Use a proper parser.
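As a rough sketch of the parser route, the example below uses `html.parser.HTMLParser` from the Python standard library. The class name `ParagraphLineJoiner` and the helper `join_paragraph_lines` are invented for this example. Note the assumptions: `HTMLParser` is an event-based tokenizer, so it does not repair an unclosed <p> the way a tree-building parser would, and this sketch re-emits end tags in lower case.

```python
import re
from html.parser import HTMLParser

LINE_BREAK = re.compile(r"\s*[\r\n]+\s*")


class ParagraphLineJoiner(HTMLParser):
    """Re-emit the document, joining lines of text found inside <p> elements."""

    def __init__(self):
        # convert_charrefs=False reports entities separately from text,
        # so they can be written back out in escaped form.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.depth = 0  # number of open <p> elements

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
        self.out.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(LINE_BREAK.sub(" ", data) if self.depth else data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def join_paragraph_lines(html: str) -> str:
    parser = ParagraphLineJoiner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(join_paragraph_lines('<div>\n<p class="x">one\ntwo <br/>three</p>\n</div>'))
```

Because the parser separates tags from text, an attribute value that happens to contain a newline or a ">" is left alone, which is the main thing the regex-only attempts get wrong.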

  • 2021-01-07 08:47

    I think it should work like this:

    1. get the whole paragraph (the text between <p> and </p>) from the body
    2. create a copy of this paragraph
    3. in the copy, replace \n with a space
    4. in the body, replace the paragraph with the modified copy

    You can do it using regex, but I think simple character scanning can be used.

    Some code in Python:

    import re

    # DOTALL lets "." match newlines so a paragraph can span several lines.
    # MULTILINE is not needed: the pattern uses neither ^ nor $.
    rx = re.compile(r'(<p>.*?</p>)', re.IGNORECASE | re.DOTALL)

    # Sample input for demonstration only (not from the original post).
    BODY = '<div>\n<p>first line\nsecond line</p>\n</div>\n'

    def get_paragraphs(body):
        # Collect every non-nested <p>...</p> block, in document order.
        return rx.findall(body)

    def replace_paragraphs(body):
        paragraphs = get_paragraphs(body)
        for par in paragraphs:
            # Handle Windows (\r\n) as well as Unix (\n) line endings.
            par_new = par.replace('\r\n', ' ').replace('\n', ' ')
            body = body.replace(par, par_new)
        return body

    def main():
        new_body = replace_paragraphs(BODY)
        print(new_body)

    main()
    
  • 2021-01-07 08:48

    Just use '\n', but make sure multi-line mode is enabled for the regex.
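The answer does not say which regex flavor it has in mind. Assuming Python's `re` module, the short check below shows what each flag changes: a literal `\n` matches without any flag, `re.MULTILINE` only affects `^` and `$`, and `re.DOTALL` is the flag that lets `.` cross a line break. The variable names are made up for this example.

```python
import re

text = "<p>first\nsecond</p>"

# A literal \n in the pattern matches a newline with no flags set.
joined = re.sub(r"\n", " ", text)

# re.MULTILINE only changes where ^ and $ may match (at every line boundary).
starts_default = re.findall(r"^\w+", "a\nb")
starts_multiline = re.findall(r"^\w+", "a\nb", re.MULTILINE)

# "." does not match a newline unless re.DOTALL is set, so a pattern that
# has to span a multi-line paragraph needs DOTALL rather than MULTILINE.
spans_default = re.search(r"<p>.*?</p>", text) is not None
spans_multiline = re.search(r"<p>.*?</p>", text, re.MULTILINE) is not None
spans_dotall = re.search(r"<p>.*?</p>", text, re.DOTALL) is not None

print(joined)                                         # <p>first second</p>
print(starts_default, starts_multiline)               # ['a'] ['a', 'b']
print(spans_default, spans_multiline, spans_dotall)   # False False True
```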
