How to cut an HTML tag and its content from a very large multiline text file using perl, sed or awk?

伪装坚强ぢ 2021-01-28 08:10

I want to transform this text (remove <math>.*?</math>) with sed, awk or perl:

{|
|-
| colspan=\"2\"|
: 
[\\underbrace{\\col         


        
4 answers
  • 2021-01-28 08:54

    This should do it:

    perl -0777 -pe 's!<math>.*?</math>!!sg' dirt-math.txt
    

    -p says we're doing a sed-like readline/printline loop, -0777 says each "line" is actually the whole input file, and -e specifies the code to run (on each "line" (file)).


    If your text files are too big to fit into memory (?!), you can try this:

    perl -pe 's!<math>.*?</math>!!g; if ($cut) { if (s!^.*?</math>!!) { $cut = 0 } else { $_ = "" } } if (!$cut && s!<math>.*!!s) { $cut = 1 }' dirt-math.txt
    

    or (slightly more readable):

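    perl -pe '
        # Drop every <math>...</math> pair that is complete on this line.
        s!<math>.*?</math>!!g;
        if ($cut) {
            # Inside an open block: cut up to the closing tag, else drop the line.
            if (s!^.*?</math>!!) { $cut = 0 }
            else { $_ = "" }
        }
        # An opening tag without a close starts (or restarts) cutting.
        if (!$cut && s!<math>.*!!s) { $cut = 1 }
    ' dirt-math.txt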
    

    This is effectively a little state machine.

    $cut records whether we're inside an unclosed <math> tag (and so need to cut out input). If we are, we try to find and remove everything up to the closing </math>. If that succeeds, we're done cutting; otherwise we overwrite the current line with the empty string ($_ = ""; this is the actual cutting part).

    If, after this, we're not cutting, we try to remove <math>... from the input. If that succeeds, we've just seen an opening <math> tag and need to start cutting. This is a separate if rather than an else so that a line such as ... </math> not math text <math> ..., which closes one block and opens another, is handled correctly.

  • 2021-01-28 09:01

    This isn't quite a one-liner, but it does what you're looking for. As always, there are many ways of doing this. Here I am using '|' as the record separator and ':' as the field separator. That lets me iterate over the fields of any record that contains math and print only the fields that don't contain <math></math>.

    BEGIN {RS="|";FS=":";ORS=""}
    
    /math/ {
        for (i=1;i<=NF;i++) {
            if ($i ~ /math/) {print ":\n"}
            else {print $i}
        }
        print "|";next;
    }
    
    /^\}/ {
        print "}";
        next;
    }
    
    {
        print $0"|"
    }
    
    END {print "\n"}
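
    To illustrate how RS and FS split the input in a script like this, here is a minimal sketch on a made-up string (the input a:b|c:d and the variable name records are assumptions for illustration):

```shell
# Hypothetical input: two records separated by "|", fields separated by ":".
# For each record, print its number, its field count and its first field.
records=$(printf 'a:b|c:d' | awk 'BEGIN { RS = "|"; FS = ":" } { print NR, NF, $1 }')
printf '%s\n' "$records"
```

    Each '|' ends a record, so a multi-line table cell arrives as a single record, newlines included.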
    
  • 2021-01-28 09:09

    If all your data is as nicely formatted as in your example, then your solution is very close. I modified it only slightly, in AWK:

    sub(/<math>.*/, "") {print; cut=1}
    /<\/math>/          {cut=0; next}
    !cut
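
    A small sketch of how this behaves on "nicely formatted" input, meaning <math> ends its line and </math> sits on a line of its own. The sample below is made up for illustration and is not the question's data:

```shell
# Hypothetical, nicely formatted sample.
sample=$(printf '%s\n' '{|' '| colspan="2"|' ': <math>' 'x + y' '</math>' '|}')

result=$(printf '%s\n' "$sample" | awk '
    sub(/<math>.*/, "") {print; cut=1}
    /<\/math>/          {cut=0; next}
    !cut
')
printf '%s\n' "$result"
```

    Note that the whole line holding </math> is dropped, so any text after the closing tag on that line would be lost.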
    
  • 2021-01-28 09:10

    This can also be done with the .. operator, used here as a flip-flop rather than as a list range, without reading the whole file into memory. The opening line of each block is kept, with its <math> tag removed:

    perl -wlne 'unless(((/.*<math>/../<\/math>/)||0) > 1){s/<math>//;print}' your-file
    