Provisional solution
This extends the 'initial offering' below and handles cases 1, 2, 5, 6, 8, 9. It does not handle the case where there are one or more complete <Name>…</Name> entries and also a starting <Name> without the matching </Name> on the same line. Frankly, I'm not even sure how to start tackling that scenario.
The unhandled cases 3, 4, 7 are not valid XML, and I'm not convinced they're valid HTML (or XHTML) either. I believe they can be handled by a similar (but simpler) mechanism to the one shown here for the full <Name>…</Name> version. I'm leaving that as an exercise for the reader (beware the < in the character class; it would need to become a /).
script.sed
/<Name>/! b
/<Name>.*<\/Name>/{
: l1
s/\(<Name>[[:space:]]*\(X[X[[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/
t l1
b
}
/<Name>/,/<\/Name>/{
# Append up to 4 more lines until the end-name tag is in the pattern space
/<\/Name>/! N
/<\/Name>/! N
/<\/Name>/! N
/<\/Name>/! N
# s/^/ZZ/; s/$/AA/p
# s/^ZZ//; s/AA$//
: l2
s/\(<Name>[[:space:]]*\(X[X[[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/
t l2
}
The first line 'skips' processing of lines not containing <Name> (they get printed and the next line is read). The next 6 lines are the script from the 'initial offering' except that there's a b to jump to the end of processing.
The new section is the /<Name>/,/<\/Name>/ code. This looks for <Name> on its own, and appends up to 4 further lines (one N command each) until a </Name> is included in the pattern space. The two comment lines were used for debugging; they allowed me to see what was being treated as a unit. Except for the use of the label l2 in place of l1, the remainder is exactly the same as in the initial offering, because sed regexes already accommodate newlines.
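The line-joining step described above can be sketched outside sed. The following is a minimal Python illustration, not part of script.sed; the function name units and the parameter max_extra are hypothetical, and it assumes the input is already split into lines without trailing newlines. It shows which lines end up treated as one unit, much as the debugging comments did.

```python
def units(lines, max_extra=4):
    """Group lines roughly the way the four N commands do.

    A line containing <Name> but no </Name> absorbs following lines,
    at most max_extra of them, until </Name> appears in the unit.
    """
    it = iter(lines)
    for line in it:
        unit = line
        if "<Name>" in unit:
            extra = 0
            while "</Name>" not in unit and extra < max_extra:
                nxt = next(it, None)
                if nxt is None:          # ran out of input
                    break
                unit += "\n" + nxt       # same effect as sed's N
                extra += 1
        yield unit


if __name__ == "__main__":
    sample = ["<Name>", "Jim", "</Name>", "unrelated line"]
    for unit in units(sample):
        print(repr(unit))
```

A line holding a complete element followed by an unterminated <Name> is not joined here either, which mirrors the limitation noted at the top of this answer.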
This is heavy-duty sed scripting and not what I'd want to use or maintain. I would go with a Perl solution using an XML parser (because I know Perl better than Python), but Python would do the job fine too with an appropriate XML parser.
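For comparison, here is a rough Python sketch of the parser-based route, using the standard library's xml.etree.ElementTree. It is an illustration under assumptions, not a tested replacement for the sed script: the function name mask_names is hypothetical, the input fragment is assumed to be well-formed once wrapped in a synthetic root element, and the default parser discards XML comments, so the output is re-serialized text rather than a byte-for-byte copy of the input.

```python
import re
import xml.etree.ElementTree as ET


def mask_names(fragment):
    """Replace each non-whitespace character inside <Name> elements with X."""
    # Assumption: wrapping in a synthetic root makes the fragment parseable.
    root = ET.fromstring("<root>" + fragment + "</root>")
    for name in root.iter("Name"):
        if name.text:
            # Whitespace (including newlines) is left alone.
            name.text = re.sub(r"\S", "X", name.text)
    xml = ET.tostring(root, encoding="unicode")
    # Strip the synthetic wrapper again (assumes a non-empty fragment).
    return xml[len("<root>"):-len("</root>")]


if __name__ == "__main__":
    print(mask_names("<Name> Elijah </Name> <Name> Dennis\n</Name>"))
```

Because the parser works on elements rather than lines, the case of a complete entry followed by an unterminated <Name> on the same line needs no special treatment in this sketch.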
data
A slightly extended data file.
<Name> Jason </Name>
<Name>Jim</Name>
<Name> Jason Bourne </Name>
<Name> Elijah </Name> <Name> Dennis </Name>
<Name> Elijah Wood </Name> <Name> Dennis The Menace </Name>
<Name>Elijah Wood</Name> <Name>Dennis The Menace</Name>
<Name> Jason
</Name>
<Name>
Jim</Name>
<Name>
Jim
</Name>
<Name> Jason
Bourne </Name>
<Name>
Jason
Bourne
</Name>
<Name> Elijah </Name>
<Name>
Dennis
</Name>
<Name> Elijah
Wood </Name>
<Name> Dennis
The Menace </Name>
<Name>Elijah
Wood</Name>
<Name>Dennis The
Menace</Name>
<Name> Jason </Name>
to
<Name> XXXXX </Name>
2. (see no space)
<Name>Jim</Name>
to
<Name>XXX</Name>
3.
<!--Name Jason /-->
to
<!--Name XXXXX /-->`
4.
<!--Name Jas /-->
to
<!--Name XXX /-->
starting tag, value and closing tag can all come in different line
5.
<Name>Jim
</Name>
to
<Name>XXX
</Name>
6.
<Name>
Jim
</Name>
to
<Name>
XXX
</Name>
7.
<!--Name
Jim
/-->
to
<!--Name
XXX
/-->
8.
<Name> Jason </Name> <Name> Ignacio </Name>
to
<Name> XXXXX </Name> <Name> XXXXXX </Name>
9.
<Name> Jason Ignacio </Name>
to
<Name> XXXXX XXXXXXX </Name>
or
<Name> XXXXXXXXXXXXX </Name>
No claims are made that the data file contains a minimal set of cases; it is repetitious. It includes the material from the question, except that the 'unorthodox' XML elements like <Name Value /> are converted into XML comments <!--Name Value /-->. The mapping actually isn't crucial; the opening part doesn't match <Name> (and the tail doesn't match </Name>) so they'd not be processed anyway.
Output
$ sed -f script.sed data
<Name> XXXXX </Name>
<Name>XXX</Name>
<Name> XXXXX XXXXXX </Name>
<Name> XXXXXX </Name> <Name> XXXXXX </Name>
<Name> XXXXXX XXXX </Name> <Name> XXXXXX XXX XXXXXX </Name>
<Name>XXXXXX XXXX</Name> <Name>XXXXXX XXX XXXXXX</Name>
<Name> XXXXX
</Name>
<Name>
XXX</Name>
<Name>
XXX
</Name>
<Name> XXXXX
XXXXXX </Name>
<Name>
XXXXX
XXXXXX
</Name>
<Name> XXXXXX </Name>
<Name>
XXXXXX
</Name>
<Name> XXXXXX
XXXX </Name>
<Name> XXXXXX
XXX XXXXXX </Name>
<Name>XXXXXX
XXXX</Name>
<Name>XXXXXX XXX
XXXXXX</Name>
<Name> XXXXX </Name>
to
<Name> XXXXX </Name>
2. (see no space)
<Name>XXX</Name>
to
<Name>XXX</Name>
3.
<!--Name Jason /-->
to
<!--Name XXXXX /-->`
4.
<!--Name Jas /-->
to
<!--Name XXX /-->
starting tag, value and closing tag can all come in different line
5.
<Name>XXX
</Name>
to
<Name>XXX
</Name>
6.
<Name>
XXX
</Name>
to
<Name>
XXX
</Name>
7.
<!--Name
Jim
/-->
to
<!--Name
XXX
/-->
8.
<Name> XXXXX </Name> <Name> XXXXXXX </Name>
to
<Name> XXXXX </Name> <Name> XXXXXX </Name>
9.
<Name> XXXXX XXXXXXX </Name>
to
<Name> XXXXX XXXXXXX </Name>
or
<Name> XXXXXXXXXXXXX </Name>
$
Initial offering
A partial answer — but it illustrates the problems you face. Dealing with cases 1 & 2 in the question, plus the multi-word variations, you can use the script:
script.sed
/<Name>.*<\/Name>/{
: l1
s/\(<Name>[[:space:]]*\(X[X[[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/
t l1
}
That is pretty contorted, to be polite about it. It looks for <Name> followed by zero or more spaces. That can be followed by \(X[X[[:space:]]*\)\{0,1\}, which means zero or one occurrences of an X followed by a sequence of X's or spaces. All of that is captured as \1 in the replacement. Then there's a single character that isn't an X, < or space, followed by zero or more of any character, zero or more spaces, and </Name>; that tail is captured as \3. The single character in the middle is replaced by an X. The whole replacement is repeated until there are no more matches via the label : l1 and the conditional branch t l1. All that operates only on a line with both <Name> and </Name>.
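To watch the one-character-per-pass behaviour, the loop can be imitated in Python. This is a close but not character-for-character rendering of the sed substitution: \s stands in for [[:space:]], Python's backtracking engine replaces POSIX matching, and the names STEP and mask_line are hypothetical.

```python
import re

# Prefix: <Name>, optional spaces, optional run of already-written X's and
# spaces. Then one character to replace. Then the tail up to </Name>.
STEP = re.compile(r"(<Name>\s*(?:X[X\s]*)?)[^X<\s](.*\s*</Name>)")


def mask_line(line):
    """Return every intermediate state, ending with the fully masked line."""
    states = [line]
    while True:
        # One substitution per pass, like a single run of the s command.
        line, count = STEP.subn(r"\g<1>X\g<2>", line, count=1)
        if count == 0:          # same role as 't l1' not branching
            return states
        states.append(line)


if __name__ == "__main__":
    for state in mask_line("<Name> Jim </Name>"):
        print(state)
```

Each pass converts exactly one character, so a name of n letters takes n substitutions plus one final failed attempt that ends the loop.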
data
<Name> Jason </Name>
<Name>Jim</Name>
<Name> Jason Bourne </Name>
<Name> Elijah </Name> <Name> Dennis </Name>
<Name> Elijah Wood </Name> <Name> Dennis The Menace </Name>
<Name>Elijah Wood</Name> <Name>Dennis The Menace</Name>
<Name> Jason
</Name>
<Name>
Jim</Name>
<Name> Jason
Bourne </Name>
<Name> Elijah </Name> <Name> Dennis
</Name>
<Name> Elijah
Wood </Name> <Name> Dennis
The Menace </Name>
<Name>Elijah
Wood</Name> <Name>Dennis The
Menace</Name>
Output
$ sed -f script.sed data
<Name> XXXXX </Name>
<Name>XXX</Name>
<Name> XXXXX XXXXXX </Name>
<Name> XXXXXX </Name> <Name> XXXXXX </Name>
<Name> XXXXXX XXXX </Name> <Name> XXXXXX XXX XXXXXX </Name>
<Name>XXXXXX XXXX</Name> <Name>XXXXXX XXX XXXXXX</Name>
<Name> Jason
</Name>
<Name>
Jim</Name>
<Name> Jason
Bourne </Name>
<Name> XXXXXX </Name> <Name> Dennis
</Name>
<Name> Elijah
Wood </Name> <Name> Dennis
The Menace </Name>
<Name>Elijah
Wood</Name> <Name>Dennis The
Menace</Name>
$
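Note the replacement part-way through the output, near the end: in the line <Name> XXXXXX </Name> <Name> Dennis, the complete element is masked while the unterminated one is left untouched. That line is going to cause headaches for anything more ambitious.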
I've not worked out how the script would handle the various split-line cases, beyond that it would almost certainly need to join lines until the </Name> is caught. It would then do processing closely related to that already shown, but it would need to allow for newlines in the matched material.