Remove XML comments using Regex in bash

清酒与你 2021-01-06 19:40

I want to remove XML comments in bash using regex (awk, sed, grep...). I have looked at other questions about this, but they are missing something. Here's my XML code:

<
4 answers
  • 2021-01-06 20:10

    In the end, you're going to have to recommend to your client/friend/instructor that they need to install some kind of XML processor. xmlstarlet is a good command line tool, but there are any number (or at least some number greater than 2) of implementations of XSLT which can be compiled for any standard Unix, and in most cases also for Windows. You really cannot do much XML processing with regex-based tools, and whatever you do will be hard to read, harder to maintain, and likely to fail on corner cases, sometimes with disastrous consequences.

    I haven't spent a lot of time polishing or reviewing the following little awk program. I think it will remove comments from compliant xml documents. Note that the following comment is not compliant:

    <!-- XML comments cannot include -- so this comment is illegal -->
    

    and it will not be treated correctly by my script.

    The following is also illegal, but since I've seen it in the wild and it wasn't hard to deal with, I did so:

    <!-------------- This comment is ill-formed but... -------------->
    

    Here it is. No guarantees. I know that it's hard to read, and I wouldn't want to maintain it. It may well fail on arbitrary corner cases.

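    awk '# Inside a multi-line comment and this line closes it: drop up to and including "-->".
         in_comment&&/-->/{sub(/([^-]|-[^-])*--+>/,"");in_comment=0}
         # Still inside a multi-line comment: skip the line.
         in_comment{next}
         # Remove complete comments, then cut off a comment that opens here but does not close.
         {gsub(/<!--+([^-]|-[^-])*--+>/,"");
          in_comment=sub(/<!--+.*/,"");
          print}'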
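
    To try the awk program above on a small input, it can be wrapped in a shell function that reads XML on stdin. This is only a sketch: the function name strip_xml_comments_awk and the sample document are made up for illustration.

```shell
# Hypothetical wrapper around the awk program from this answer; reads stdin.
strip_xml_comments_awk() {
  awk 'in_comment&&/-->/{sub(/([^-]|-[^-])*--+>/,"");in_comment=0}
       in_comment{next}
       {gsub(/<!--+([^-]|-[^-])*--+>/,"");
        in_comment=sub(/<!--+.*/,"");
        print}'
}

# Sample input: one single-line comment and one comment spanning two lines.
printf '%s\n' \
  '<a><!-- one-line comment --><b/>' \
  '<!-- multi' \
  'line -->' \
  '</a>' | strip_xml_comments_awk
```

    Note that the lines on which a multi-line comment opens and closes are still printed, so they show up as blank lines in the output.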
    
  • 2021-01-06 20:16

    You can use perl together with xmllint to get this job done:

    cat yourFile.xml | perl -e 'while (<>) { next if (/Start.*End/ );if (/Start/) { while (<>) {last if (/End/) }}else {print "$_"; }} ' | xmllint --format -
    

    Here Start is your opening comment delimiter (in our case <!--) and End is your closing comment delimiter (in our case -->).

    I tried to use grep -vP without any good results, because I could not find a way to make grep's dot match newlines (the s modifier).

  • 2021-01-06 20:23
    xmlstarlet ed -d '//comment()' file.xml
    
  • 2021-01-06 20:25

    The simplest solution I could come up with to remove all comments from a text file is:

    sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' | grep -zv '^<!--' | tr -d '\0'
    

    To explain:

    The sed will put in a null char like this:

    <Table>
        \0<!--
       to be removed bla bla bla bla bla bl............
    
        removeee
    
        to be removeffffffffd
        -->\0
    
    <row>
            <column name="example"  value="1" ></column>
        </row>
    </Table>
    

    Then grep -z will treat that character as the line separator, so it sees three records:

    • <Table>\n
    • <!--\n to be removed bla bla bla bla bla bl............\n\n removeee\n\n to be removeffffffffd\n -->
    • \n\n<row>\n <column name="example" value="1" ></column>\n </row>\n</Table>\n

    grep -v will remove the middle part.

    And finally tr -d will remove the \0 again.
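
    The three stages can be checked on a small input. This is a sketch that assumes GNU sed (for the \x0 escape) and GNU grep (for -z); the sample document is invented for the purpose.

```shell
# Assumes GNU sed (\x0 escape) and GNU grep (-z: NUL-terminated records).
out=$(
  printf '%s\n' '<Table>' '<!--' 'to be removed' '-->' '<row/>' '</Table>' |
    sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' |   # wrap each comment in NUL bytes
    grep -zv '^<!--' |                         # drop records that start with <!--
    tr -d '\0'                                 # remove the NUL bytes again
)
printf '%s\n' "$out"
```

    The newline that followed the closing --> is not part of the comment record, so it stays behind as an empty line.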


    In this case it should be applied to both files before comparing, e.g.:

    diff <(sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' file1.xml | grep -zv '^<!--' | tr -d '\0') <(sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' file2.xml | grep -zv '^<!--' | tr -d '\0')
    

    or more readable with a function:

    stripcomments() { cat "$@" | sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' | grep -zv '^<!--' | tr -d '\0'; }
    
    diff <(stripcomments file1.xml) <(stripcomments file2.xml)
    

    In theory there might be some issues with CDATA blocks, since they can contain unbalanced comment delimiters, and they are more likely to contain significant null characters, but I have never seen such an XML file in real life.

    So for most valid XML files this should work.
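
    The CDATA caveat can be illustrated with a short sketch (again assuming GNU sed and GNU grep; the sample line is made up). The text inside the CDATA section only looks like a comment, but the pipeline removes it anyway:

```shell
# Assumes GNU sed and GNU grep. The "comment" here is literal CDATA text,
# not an XML comment, yet the pipeline cannot tell the difference.
cdata_out=$(
  printf '%s\n' '<a><![CDATA[ <!-- not a comment --> ]]></a>' |
    sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' |
    grep -zv '^<!--' |
    tr -d '\0'
)
printf '%s\n' "$cdata_out"
```

    Only the two spaces around the removed text are left inside the CDATA section, so its content has changed.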
