How can I validate large numbers of files with search and replace?

Submitted by 徘徊边缘 on 2019-12-05 17:33:39

See the questions I asked in the comment at the top.

Assuming you're using GNU sed, and that you're trying to add the trailing / to your tags to make them the XML-compliant <img /> and <input />, replace the sed expression in your command with this one and it should do the trick: '1h;1!H;${;g;s/\(img\|input\)\( [^>]*[^/]\)>/\1\2\/>/g;p;}'

Here it is on a simple test file:

$ cat test.html
This is an <img tag> without closing slash.
Here is an <img tag /> with closing slash.
This is an <input tag > without closing slash.
And here one <input attrib="1" 
    > that spans multiple lines.
Finally one <input
  attrib="1" /> with closing slash.

$ sed -n '1h;1!H;${;g;s/\(img\|input\)\( [^>]*[^/]\)>/\1\2\/>/g;p;}' test.html
This is an <img tag/> without closing slash.
Here is an <img tag /> with closing slash.
This is an <input tag /> without closing slash.
And here one <input attrib="1" 
    /> that spans multiple lines.
Finally one <input
  attrib="1" /> with closing slash.

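The expression uses GNU sed regex syntax (the \| alternation is a GNU extension), and the buffering is what makes the multiline search/replace work: 1h copies the first line into the hold space, 1!H appends every later line to it, and on the last line ($) g copies the whole accumulated file back into the pattern space, where the s command runs once across all lines before p prints the result.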
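
Since the goal is to process a large number of files, it may help to do a dry run first and see which files the expression would actually change. The following is only a sketch: fix_tags and list_changed are hypothetical helper names, and the '*.html' filter is an assumption about which files matter.

```shell
#!/bin/sh
# Hypothetical helper: print FILE with the trailing slash added to
# <img>/<input> tags. Needs GNU sed (\| is a GNU extension).
fix_tags() {
    sed -n '1h;1!H;${;g;s/\(img\|input\)\( [^>]*[^/]\)>/\1\2\/>/g;p;}' "$1"
}

# Hypothetical dry run: list the files under DIR (default: .) whose
# content would differ after the substitution. Nothing is modified.
# Assumes file names contain no newlines.
list_changed() {
    find "${1:-.}" -type f -name '*.html' | while IFS= read -r f; do
        # cmp -s is silent and returns non-zero when the inputs differ
        fix_tags "$f" | cmp -s - "$f" || printf '%s\n' "$f"
    done
}
```

Running list_changed on a directory prints one path per file that the expression would rewrite; an empty result means the expression finds nothing left to fix.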

Alternatively, you could use something like Tidy, which is designed for sanitizing bad HTML. That's what I'd do if I were doing anything more complicated than a couple of simple search/replaces. Tidy's options get complicated fast, so it's usually better to write a script in your scripting language of choice (Python, Perl) that calls libtidy and sets whatever options you need.

Try this. It'll go through your files, make a .orig backup of each file (perl's -i option), and replace <img> and <input> tags with <img /> and <input />.

find . \! -path '*.svn*' -type f -exec perl -pi.orig -e 's{ ( <(?:img|input)\b ([^>]*?) ) \ ?/?> }{$1\ />}sgxi' {} \;

Given input:

<img>  <img/>  <img src="..">  <img src="" >
<input>  <input/>  <input id="..">  <input id="" >

It changes the file to:

<img />  <img />  <img src=".." />  <img src="" />
<input />  <input />  <input id=".." />  <input id="" />

Here's what the regexp is doing:

s{(<(?:img|input)\b ([^>]*?)) # capture "<img" or "<input" followed by non-">" chars
  \ ?/?>}                     # optional space, optional slash, followed by ">"
{$1\ />}sgxi                  # replace with: captured text, plus " />"