tags
I\'ve stumped myself trying to figure out how to remove carriage returns that occur between tags. (Technically I need to replace them with spaces, not
This is the "almost good enough" lexing solution promised in my other answer, to sketch how it can be done. It makes a half-hearted attempt at coping with attributes, but not seriously. It also doesn't attempt to cope with unencoded "<" in attributes. These are relatively minor failings, and it does handle nested P tags, but as described in the comments it's totally unable to handle the case where someone doesn't close a P, because we can't do that without a thorough understanding of HTML. Considering how prevalent that practice still is, it's safe to declare this code "nearly useless". :)
#!/usr/bin/perl
use strict;
use warnings;
while ($html !~ /\G\Z/cg) {
if ($html =~ /\G(<p[^>]*>)/cg) {
$output .= $1;
$in_p ++;
} elsif ($html =~ m[\G(</p>)]cg) {
$output .= $1;
$in_p --; # Woe unto anyone who doesn't provide a closing tag.
# Tag soup parsers are good for this because they can generate an
# "artificial" end to the P when they find an element that can't contain
# a P, or the end of the enclosing element. We're not smart enough for that.
} elsif ($html =~ /\G([^<]+)/cg) {
my $text = $1;
$text =~ s/\s*\n\s*/ /g if $in_p;
$output .= $text;
} elsif ($html =~ /\G(<)/cg) {
$output .= $1;
} else {
die "Can't happen, but not having an else is scary!";
}
}
Regular expressions are singularly unsuitable to deal with "balanced parentheses" kinds of problems, even though people persist in trying to shoehorn them there (and some implementations -- I'm thinking of very recent perl releases, for example -- try to cooperate with this widespread misconception by extending and stretching "regular expressions" well beyond the CS definition thereof;-).
If you don't have to deal with nesting, it's comfortably doable in a two-pass approach -- grab each paragraph with e.g. <p>.*?</p>
(possibly with parentheses for grouping), then perform the substitution within each paragraph thus identified.
A single-regex solution is basically impossible here. If you absolutely insist on not using an HTML parser, and you can count on your input being well-formed and predictable then you can write a simple lexer that will do the job (and I can provide sample code) -- but it's still not a very good idea :)
For reference:
The standard answer is: don't try to process HTML (or SGML or XML) with a regex. Use a proper parser.
I think it should work like this:
You can do it using regex, but I think simple character scanning can be used.
Some code in Python:
rx = re.compile(r'(<p>.*?</p>)', re.IGNORECASE | re.MULTILINE | re.DOTALL)
def get_paragraphs(body):
paragraphs = []
body_copy = body
rxx = rx.search(body_copy)
while rxx:
paragraphs.append(rxx.group(1))
body_copy = body_copy[rxx.end(1):]
rxx = rx.search(body_copy)
return paragraphs
def replace_paragraphs(body):
paragraphs = get_paragraphs(body)
for par in paragraphs:
par_new = par.replace('\n', ' ')
body = body.replace(par, par_new)
return body
def main():
new_body = replace_paragraphs(BODY)
print(new_body)
main()
Just use '\n
' but ensure that you enable multiple line regex.