RegEx to remove carriage returns between <p> tags


臣服心动

I've stumped myself trying to figure out how to remove carriage returns that occur between <p> tags. (Technically I need to replace them with spaces, not remove them.)


7 Answers

余生分开走

2021-01-07 08:36

This is the "almost good enough" lexing solution promised in my other answer, to sketch how it can be done. It makes a half-hearted attempt at coping with attributes, but not seriously. It also doesn't attempt to cope with unencoded "<" in attributes. These are relatively minor failings, and it does handle nested P tags, but as described in the comments it's totally unable to handle the case where someone doesn't close a P, because we can't do that without a thorough understanding of HTML. Considering how prevalent that practice still is, it's safe to declare this code "nearly useless". :)

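#!/usr/bin/perl
use strict;
use warnings;

# Slurp the whole document from STDIN (or from files named on the command line).
my $html   = do { local $/; <> };
my $output = '';
my $in_p   = 0;    # Nesting depth of open P elements.

# \G anchors each match where the previous one ended; /c keeps that position
# when a match fails, so the branches below are tried in turn at the same spot.
while ($html !~ /\G\z/cg) {
  if ($html =~ /\G(<p[^>]*>)/cg) {
    $output .= $1;
    $in_p++;
  } elsif ($html =~ m[\G(</p>)]cg) {
    $output .= $1;
    $in_p-- if $in_p > 0; # Woe unto anyone who doesn't provide a closing tag.
    # Tag soup parsers are good for this because they can generate an
    # "artificial" end to the P when they find an element that can't contain
    # a P, or the end of the enclosing element. We're not smart enough for that.
  } elsif ($html =~ /\G([^<]+)/cg) {
    my $text = $1;
    # Inside a P, collapse each line break and its surrounding whitespace.
    $text =~ s/\s*\n\s*/ /g if $in_p;
    $output .= $text;
  } elsif ($html =~ /\G(<)/cg) {
    # Any other tag: pass the "<" through; the rest is picked up as text.
    $output .= $1;
  } else {
    die "Can't happen, but not having an else is scary!";
  }
}

print $output;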


爱一瞬间的悲伤

2021-01-07 08:38

Regular expressions are singularly unsuitable to deal with "balanced parentheses" kinds of problems, even though people persist in trying to shoehorn them there (and some implementations -- I'm thinking of very recent perl releases, for example -- try to cooperate with this widespread misconception by extending and stretching "regular expressions" well beyond the CS definition thereof;-).

If you don't have to deal with nesting, it's comfortably doable in a two-pass approach -- grab each paragraph with e.g. <p>.*?</p> (possibly with parentheses for grouping), then perform the substitution within each paragraph thus identified.
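
The two-pass idea above can be sketched in Python roughly as follows. This is only an illustration under the same assumption (no nested or unclosed P tags); the names `PARAGRAPH`, `LINE_BREAK` and `join_lines_two_pass` are made up for this example.

```python
import re

# Pass 1: find each non-nested <p ...>...</p> block. DOTALL lets "." cross
# line breaks; IGNORECASE accepts <P> as well. \b keeps <pre> from matching.
PARAGRAPH = re.compile(r"<p\b[^>]*>.*?</p>", re.IGNORECASE | re.DOTALL)

# Pass 2: a run of whitespace that contains at least one line break.
LINE_BREAK = re.compile(r"\s*(?:\r\n|\r|\n)\s*")


def join_lines_two_pass(html):
    """Replace line breaks inside each <p>...</p> block with one space."""
    def fix(match):
        # Second pass, applied only to the text of one paragraph.
        return LINE_BREAK.sub(" ", match.group(0))
    return PARAGRAPH.sub(fix, html)


sample = "<div>\n<p>first\r\nsecond</p>\n<p>third</p>\n</div>"
print(join_lines_two_pass(sample))
```

Line breaks outside the paragraphs are left alone, because the second substitution only ever sees the text matched by the first pattern.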

爱一瞬间的悲伤

2021-01-07 08:40
A single-regex solution is basically impossible here. If you absolutely insist on not using an HTML parser, and you can count on your input being well-formed and predictable then you can write a simple lexer that will do the job (and I can provide sample code) -- but it's still not a very good idea :)

For reference:
- Why shouldn't I parse XML or XHTML with a regex?
- How can I parse HTML in my language of choice?

感动是毒

2021-01-07 08:45

The standard answer is: don't try to process HTML (or SGML or XML) with a regex. Use a proper parser.
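
As one possible illustration of the parser route, here is a minimal sketch using Python's standard-library `html.parser`. The class name `ParagraphJoiner` and the function `join_lines_with_parser` are invented for this example. It re-emits the document as it goes, so end tags come out normalized to lowercase, and it does not repair unclosed P elements; a full tag-soup parser would be needed for that.

```python
import re
from html.parser import HTMLParser


class ParagraphJoiner(HTMLParser):
    """Re-emit HTML, replacing line breaks in text inside <p> with a space."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written,
        # so they can be passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.p_depth = 0  # how many <p> elements are currently open

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.p_depth += 1
        self.out.append(self.get_starttag_text())  # raw tag text

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag == "p" and self.p_depth > 0:
            self.p_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.p_depth:
            # Collapse any whitespace run containing a line break.
            data = re.sub(r"\s*[\r\n]+\s*", " ", data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def join_lines_with_parser(html):
    parser = ParagraphJoiner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = '<div>\n<p class="x">first\nsecond &amp; third</p>\n</div>'
print(join_lines_with_parser(sample))
```

Unlike a single regex, the parser copes with attributes on the P tag and with nested P elements, because it tracks the nesting depth rather than matching text.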


野性不改

2021-01-07 08:47

I think it should work like this:

get the whole paragraph (the text between <p> and </p>) from the body
create a copy of this paragraph
in the copy, replace \n with a space
in the body, replace the paragraph with the modified copy

You can do it using regex, but I think simple character scanning can be used.

Some code in Python:

import re

# DOTALL lets "." match newlines, so a paragraph may span several lines.
rx = re.compile(r'(<p>.*?</p>)', re.IGNORECASE | re.DOTALL)

# Sample input for illustration only; replace with the real document text.
BODY = "<div>\n<p>first line\nsecond line</p>\n</div>"

def get_paragraphs(body):
    # Return every <p>...</p> block found in body, in document order.
    return rx.findall(body)

def replace_paragraphs(body):
    for par in get_paragraphs(body):
        # Swap each line break inside the paragraph for a space.
        par_new = par.replace('\n', ' ')
        body = body.replace(par, par_new)
    return body

def main():
    new_body = replace_paragraphs(BODY)
    print(new_body)

main()
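
The "simple character scanning" alternative mentioned above could look roughly like this. It is a sketch under the same assumptions as the regex version (bare `<p>` tags with no attributes, no nesting), and the function name `join_lines_by_scanning` is made up for the example.

```python
def join_lines_by_scanning(body):
    """Scan once; inside <p>...</p> turn each line break into a space."""
    out = []
    in_p = False
    i = 0
    while i < len(body):
        if body[i:i + 3].lower() == "<p>":
            in_p = True
            out.append(body[i:i + 3])
            i += 3
        elif body[i:i + 4].lower() == "</p>":
            in_p = False
            out.append(body[i:i + 4])
            i += 4
        elif in_p and body[i] in "\r\n":
            # Treat "\r\n" as a single line break.
            i += 2 if body.startswith("\r\n", i) else 1
            out.append(" ")
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


print(join_lines_by_scanning("<p>one\r\ntwo\nthree</p>\nfour\n"))
```

A single pass like this avoids the repeated `body.replace` calls, which would otherwise rescan the whole body once per paragraph.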


南旧

2021-01-07 08:48

Just match '\n', but make sure multi-line matching is enabled in your regex engine.
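
Which flag matters depends on the engine. As a small illustration in Python's `re` module (other engines may differ): a literal `\n` matches without any flag, `re.MULTILINE` only changes what `^` and `$` match, and `re.DOTALL` is the flag that lets `.` cross a line break.

```python
import re

text = "<p>one\ntwo</p>"

# A literal "\n" in the pattern matches a line feed with no flags at all.
print(re.sub(r"\n", " ", text))

# Without DOTALL, "." stops at the line break, so the paragraph is not found.
print(re.search(r"<p>.*?</p>", text))

# With DOTALL, "." also matches "\n" and the whole paragraph is matched.
print(re.search(r"<p>.*?</p>", text, re.DOTALL).group(0))
```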
