How to cut an HTML tag and its content from a very large multiline text file using perl, sed or awk?

伪装坚强ぢ 2021-01-28 08:10

I want to transform this text (remove <math>.*?</math>) with sed, awk or perl:

{|
|-
| colspan=\"2\"|
: 
[\\underbrace{\\col         


        
4 Answers
  •  臣服心动
    2021-01-28 08:54

    This should do it:

    perl -0777 -pe 's!<math>.*?</math>!!sg' dirt-math.txt
    

    -p says we're doing a sed-like readline/printline loop, -0777 says each "line" is actually the whole input file, and -e specifies the code to run (on each "line" (file)).


    If your text files are too big to fit into memory (?!), you can try this:

    perl -pe 's!<math>.*?</math>!!g; if ($cut) { if (s!^.*?</math>!!) { $cut = 0 } else { $_ = "" } } if (!$cut && s!<math>.*!!s) { $cut = 1 }' dirt-math.txt
    

    or (slightly more readable):

    perl -pe '
        s!<math>.*?</math>!!g;
        if ($cut) {
            if (s!^.*?</math>!!) { $cut = 0 }
            else { $_ = "" }
        }
        if (!$cut && s!<math>.*!!s) { $cut = 1 }
    ' dirt-math.txt
    

    This is effectively a little state machine.

    $cut records whether we're inside an unclosed <math> tag (and so need to cut out input). If so, we check whether we were able to find and remove everything up to a closing </math>. If that worked, we're done cutting (we found the closing tag); otherwise we overwrite the "current line" with the empty string ($_ = ""; this is the actual cutting part).

    If, after this, we're not cutting, we try to remove <math> and everything after it from the input. (We're not using else here, so that a line like ...</math> not math text <math>... is handled in a single pass.) If the removal succeeded, we've just seen an opening tag and need to start cutting.
