Get content between a pair of HTML tags using Bash

野趣味 2020-11-30 09:53

I need to extract the HTML content between a given pair of tags using a Bash script. For example, given the HTML code below:





        
6 Answers
  • 2020-11-30 10:38

    Bash is probably the wrong tool for this. Try a Python script using the Beautiful Soup library instead.

    It will be more work upfront but in the long run (here: after one hour), the time savings will make up for the additional effort.

  • 2020-11-30 10:46

    Another option is to use the multi-platform xidel utility (home page on SourceForge, GitHub repository), which can handle both XML and HTML:

    xidel in.html  -e '/html/body/node()' --printed-node-format=html
    
  • 2020-11-30 10:47

    Setting Bash aside because of its limitations, you can use nokogiri as a command-line utility, as explained here.

    Example:

    curl -s http://example.com/ | nokogiri -e 'puts $_.search('\''a'\'')'
    
  • 2020-11-30 10:52

    Plain-text processing is a poor fit for parsing HTML/XML. I hope this gives you some ideas:

    kent$  xmllint --xpath "//body" f.html 
    <body>
     text
      <div>
      text2
        <div>
            text3
        </div>
      </div>
    </body>
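
The command above prints the <body> element together with its tags. A minimal sketch of one way to get only what sits between the tags is to select the child nodes with the XPath expression //body/node(). The function name body_content is hypothetical, and the sketch assumes an xmllint build that supports both --html and --xpath; the exact whitespace between printed nodes can differ between libxml2 versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the children of every <body> element in an
# HTML file, leaving out the <body> and </body> tags themselves.
#   --html  : parse with libxml2's lenient HTML parser rather than the XML parser
#   --xpath : evaluate the expression and print the matching nodes
# Parser warnings about malformed markup go to stderr and are discarded here.
body_content() {
  xmllint --html --xpath '//body/node()' "$1" 2>/dev/null
}
```

It could be called as body_content f.html. Because --html switches to the HTML parser, the input does not have to be well-formed XML.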
    
  • 2020-11-30 10:53

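    This uses sed from the shell, so nothing extra needs to be installed. Note that it prints whole lines, including the lines that hold the opening and closing tags, and it only matches an opening tag written exactly as <body>, with no attributes.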

    tag=body
    sed -n "/<$tag>/,/<\/$tag>/p" file
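
As a rough sketch, the same sed range can be extended so that the tag lines themselves are dropped and an opening tag with attributes still matches. The function name inner_lines is hypothetical. The sketch assumes the opening and closing tags each sit on their own line and that the tag is not nested inside itself; it is still line-based text matching, not HTML parsing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the lines strictly between <tag ...> and </tag>.
# Assumptions: the opening and closing tags are each on their own line,
# and the tag does not contain another element of the same name.
inner_lines() {
  local tag=$1 file=$2
  # Inside the range, delete the line with the opening tag (with or without
  # attributes) and the line with the closing tag, then print the rest.
  sed -n "/<${tag}[ >]/,/<\/${tag}>/{
    /<${tag}[ >]/d
    /<\/${tag}>/d
    p
  }" "$file"
}
```

It could be called as inner_lines body file. For anything beyond simple, regularly formatted input, a real HTML parser remains the safer choice.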
    
  • 2020-11-30 10:57

    Personally, I find the hxselect command (often together with hxclean) from the html-xml-utils package very useful. hxclean turns a (sometimes broken) HTML file into well-formed XML, and hxselect lets you use CSS selectors to pick the node(s) you need. With the -c option, hxselect strips the surrounding tags. All of these commands read from stdin and write to stdout. So in your case you would run:

    $ hxselect -c body <<HTML
    <html>
    <head>
    </head>
    <body>
      text
      <div>
        text2
        <div>
          text3
        </div>
      </div>
    </body>
    </html>
    HTML
    

    to get what you need. Plain and simple.
