How to remove XML tags from Unix command line?

再見小時候 2021-01-31 18:43

I am grepping an XML File, which gives me output like this:

data
more data
...

Note, this is a fl

5 Answers
  • 2021-01-31 19:10

    Use the html2text command-line tool, which converts HTML into plain text.

    Alternatively, you can do it with ex:

    ex -s +'%s/<[^>].\{-}>//ge' +%p +q! file.txt
    

    or, reading from standard input:

    cat file.txt | ex -s +'%s/<[^>].\{-}>//ge' +%p +q! /dev/stdin
    
  • 2021-01-31 19:12

    Give this a try:

    grep -Po '<.*?>\K.*?(?=<.*?>)' inputfile
    

    Explanation:

    Using Perl Compatible Regular Expressions (-P) and outputting only the specified matches (-o):

    • <.*?> - Non-greedy match of any characters within angle brackets
    • \K - Don't include the preceding match in the output (reset match start - similar to positive look-behind, but it works with variable-length matches)
    • .*? - Non-greedy match stopping at the next match (this part will be output)
    • (?=<.*?>) - Non-greedy match of any characters within angle brackets and don't include the match in the output (positive look-ahead - works with variable-length matches)
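
    To see the pattern at work, here is a small sketch on made-up input. The element names name and note are hypothetical, and -P assumes a GNU grep built with PCRE support. It only prints text that sits between two tags on the same line:

```shell
# Hypothetical sample input: one element per line.
sample='<name>data</name>
<note>more data</note>'

# -P enables PCRE; -o prints only the matched text,
# which \K trims so that it starts after the opening tag.
stripped=$(printf '%s\n' "$sample" | grep -Po '<.*?>\K.*?(?=<.*?>)')
printf '%s\n' "$stripped"
```

    Text outside any pair of tags, or text split across lines, is not matched by this pattern.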
  • 2021-01-31 19:15

    I know this is not a "perlgolf contest", but I used to use this trick.

    Set the record separator (RS) to < or >, then print only the odd-numbered records:

    awk -vRS='<|>' NR%2 file.xml
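
    Why the odd records: splitting on every < and > makes text and tag names alternate, so the text lands in records 1, 3, 5 and so on, provided the markup is well-formed and the text contains no stray angle brackets. Below is a sketch on hypothetical input, assuming an awk that accepts a regular expression as RS (gawk and mawk do; POSIX leaves a multi-character RS unspecified). The second awk is an addition, not part of the trick itself; it drops the blank records left by the newlines between elements.

```shell
# Hypothetical sample input: one element per line.
sample='<name>data</name>
<note>more data</note>'

# Records alternate: text, tag name, text, tag name, ...
# NR%2 keeps the odd (text) records; 'NF' then drops whitespace-only lines.
text=$(printf '%s\n' "$sample" | awk -v RS='<|>' 'NR%2' | awk 'NF')
printf '%s\n' "$text"
```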
    
  • 2021-01-31 19:17

    Using awk:

    awk '{gsub(/<[^>]*>/,"")};1' file.xml
    
  • 2021-01-31 19:20

    If your file looks just like that, then sed can help you:

    sed -e 's/<[^>]*>//g' file.xml
    

    Of course, you should not rely on regular expressions to parse XML in general: comments, CDATA sections, and tags that span several lines all defeat a simple pattern like this one.
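
    A quick sketch of where this approach breaks down, on hypothetical input (strip_tags is just an illustrative wrapper name around the same sed substitution): a > inside an attribute value, which is legal XML, ends the match early, so part of the tag leaks into the output.

```shell
# Illustrative helper: the same substitution, reading from stdin.
strip_tags() {
    sed -e 's/<[^>]*>//g'
}

# Simple markup: the tags are removed cleanly.
ok=$(printf '%s\n' '<name>data</name>' | strip_tags)

# A literal > inside an attribute value cuts the match short,
# leaving the tail of the opening tag in the result.
broken=$(printf '%s\n' '<name title="a>b">data</name>' | strip_tags)

printf '%s\n' "$ok" "$broken"
```

    For input that may contain such constructs, a real XML parser is the safer choice.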
