I want to transform this text (remove <math>...</math>) with sed, awk or perl:
{|
|-
| colspan=\"2\"|
:
This should do it:
perl -0777 -pe 's!<math>.*?</math>!!sg' dirt-math.txt
-p says we're doing a sed-like readline/printline loop, -0777 says each "line" is actually the whole input file, and -e specifies the code to run (on each "line" (file)).
If your text files are too big to fit into memory (?!), you can try this:
perl -pe 's!<math>.*?</math>!!g; if ($cut) { if (s!^.*?</math>!!) { $cut = 0 } else { $_ = "" } } if (!$cut && s!<math>.*!!s) { $cut = 1 }' dirt-math.txt
or (slightly more readable):
perl -pe '
    s!<math>.*?</math>!!g;
    if ($cut) {
        if (s!^.*?</math>!!) { $cut = 0 }
        else { $_ = "" }
    }
    if (!$cut && s!<math>.*!!s) { $cut = 1 }
' dirt-math.txt
This is effectively a little state machine. $cut records whether we're in an unclosed <math> tag (and so need to cut out input). If so, we check whether we were able to find/remove </math>. If so, we're done cutting (we found a closing </math> tag); otherwise we overwrite the "current line" with the empty string ($_ = ""; this is the actual cutting part).
If, after this, we're not cutting (we're not using else to handle the case where ... </math> not math text <math> appears on a single line), we try to remove <math>... from the input. If so, we've just seen an opening <math> tag and need to start cutting.
This isn't quite a one-liner, but it does what you're looking for. As always, there are many ways of doing this. Here I am using '|' as the record separator and ':' as the field separator. That allows me to iterate over the fields in a record that contains math and print only the fields that don't contain <math></math>.
BEGIN {RS="|";FS=":";ORS=""}
/math/ {
    for (i=1;i<=NF;i++) {
        if ($i ~ /math/) {print ":\n"}
        else {print $i}
    }
    print "|";next;
}
/^\}/ {
    print "}";
    next;
}
{
    print $0"|"
}
END {print "\n"}
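If the separator choice is hard to picture, this sketch shows how RS="|" and FS=":" carve up a made-up string into records and fields. The helper name show_records and the input are assumptions; this is not the question's data.

```shell
# show_records is a hypothetical helper: it prints the record number,
# the field count and the first field for every '|'-separated record.
show_records() {
    awk 'BEGIN { RS = "|"; FS = ":" } { print NR ": " NF " fields, first=" $1 }'
}

# Made-up input with no trailing newline, so the last record is exactly "c:d".
printf 'a:b|c:d' | show_records
```

With a trailing newline in the input, that newline would become part of the last field, which is why the full script above sets ORS="" and prints the separators itself.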
If all the data is as nicely formatted as in your example, then your solution is very close. I modified it only slightly in AWK:
sub(/<math>.*/, "") {print; cut=1}
/<\/math>/ {cut=0; next}
!cut
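Here is a short trial run of those three rules on a made-up input where <math> ends a line and </math> sits on a line of its own. The wrapper name cut_math and the sample are assumptions for illustration.

```shell
# cut_math is a hypothetical wrapper around the three awk rules.
cut_math() {
    awk '
        sub(/<math>.*/, "") { print; cut = 1 }   # opening tag: print what precedes it, start cutting
        /<\/math>/          { cut = 0; next }    # closing tag: stop cutting, drop this line
        !cut                                     # outside math: print the line unchanged
    '
}

# Made-up input: the element opens at the end of line 2 and closes on line 4.
printf 'a\n: <math>\nx\n</math>\nb\n' | cut_math
```

This relies on the tidy layout: if <math> and </math> share a line, the sub() removes the closing tag as well, so cut stays set and the following lines are dropped until a later </math> shows up.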
This can also be done using the .. flip-flop (not range) operator, without reading the whole file into memory, and removing <math> from the starting line, like:
perl -wlne 'unless(((/.*<math>/../<\/math>/)||0) > 1){s/<math>//;print}' your-file