Question
I need to remove the first instance of the <P> and </P> tags in multiple .htm files, all in a single directory, using a batch command. Any suggestions?
Edit - I just realized that there may be multiple DIVs in the .htm files, and so I would need to remove only the 1st instance of the <P> and </P> tags in each DIV (if any). And to clarify, I only want the tags removed, but do want the content/text in between the tags to remain. Thanks for the answers/comments thus far!!!
As for why: long story, but I work for an agency that has a contract with a vendor who did not test the version we paid for with IE11. As a result, when a page has more than one paragraph, the first paragraph tag makes all the text display 15 pixels lower than expected. I cannot change or modify the vendor's code; however, I can modify the output after the elearning course has been exported, which is what I need this batch file for. If I remove only the first instance of the paragraph tag on each page, the entire text displays as expected.
Answer 1:
The safest solution (albeit perhaps the slowest and most complicated) would be to parse your HTML files as HTML and remove the first paragraph from the DOM. This would give you the benefit of not being restricted to any sort of dependable formatting of the HTML source. Comments are properly skipped, line breaks are handled correctly, and life is all sunshine and daisies. Parsing the HTML DOM can be done using an InternetExplorer.Application COM object. Here's a batch / JScript hybrid example:
@if (@CodeSection == @Batch) @then
@echo off
setlocal
for %%I in ("*.htm") do (
cscript /nologo /e:JScript "%~f0" "%%~fI"
)
rem // end main runtime
goto :EOF
@end
// end batch / begin JScript chimera
WSH.Echo(WSH.Arguments(0));
var fso = WSH.CreateObject('scripting.filesystemobject'),
IE = WSH.CreateObject('InternetExplorer.Application'),
htmlfile = fso.GetAbsolutePathName(WSH.Arguments(0));
IE.Visible = 0;
IE.Navigate('file:///' + htmlfile.replace(/\\/g, '/'));
while (IE.Busy || IE.ReadyState != 4) WSH.Sleep(25);
var p = IE.document.getElementsByTagName('p');
if (p && p[0]) {
/* If you want to remove the surrounding <p></p> only
while keeping the paragraph's inner content, uncomment
the following line: */
// while (p[0].hasChildNodes()) p[0].parentNode.insertBefore(p[0].firstChild, p[0]);
p[0].parentNode.removeChild(p[0]);
htmlfile = fso.CreateTextFile(htmlfile, 1);
htmlfile.Write('<!DOCTYPE html>\n'
+ '<html>\n'
+ IE.document.documentElement.innerHTML
+ '\n</html>');
htmlfile.Close();
}
IE.Quit();
try { while (IE && IE.Busy) WSH.Sleep(25); }
catch(e) {}
And because you're working with the DOM, additional tweaks are made easier. To remove the first <p> tag pair within each <div> element while keeping its contents (just as a wild example, not that anyone would ever want this), navigate the DOM as you would in browser-based JavaScript.
@if (@CodeSection == @Batch) @then
@echo off
setlocal
for %%I in ("*.htm") do (
echo Batch section: "%%~fI"
cscript /nologo /e:JScript "%~f0" "%%~fI"
)
rem // end main runtime
goto :EOF
@end
// end batch / begin JScript chimera
WSH.Echo('JScript section: "' + WSH.Arguments(0) + '"');
var fso = WSH.CreateObject('scripting.filesystemobject'),
IE = WSH.CreateObject('InternetExplorer.Application'),
htmlfile = fso.GetAbsolutePathName(WSH.Arguments(0)),
changed;
IE.Visible = 0;
IE.Navigate('file:///' + htmlfile.replace(/\\/g, '/'));
while (IE.Busy || IE.ReadyState != 4) WSH.Sleep(25);
for (var d = IE.document.getElementsByTagName('div'), i = 0; i < d.length; i++) {
var p = d[i].getElementsByTagName('p');
if (p && p[0]) {
// move contents of p node up to parent
while (p[0].hasChildNodes()) p[0].parentNode.insertBefore(p[0].firstChild, p[0]);
// delete now empty p node
p[0].parentNode.removeChild(p[0]);
changed = true;
}
}
if (changed) {
htmlfile = fso.CreateTextFile(htmlfile, 1);
htmlfile.Write('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n'
+ '<HTML xmlns:t= "urn:schemas-microsoft-com:time" xmlns:control>\n'
+ IE.document.documentElement.innerHTML
+ '\n</HTML>');
htmlfile.Close();
}
IE.Quit();
try { while (IE && IE.Busy) WSH.Sleep(25); }
catch(e) {}
Answer 2:
The solution you were probably expecting, a pure batch solution, would involve a bunch of for loops. This example will strip the entire line(s) from the first <p> to the first </p>.
I'm sure npocmaka, MC ND, Aacini, jeb or dbenham can accomplish this with half the code and ten times the efficiency. *shrug*
This is the middle-of-the-road solution, offering more tolerance for line breaks within your <p> tag than the PowerShell regexp replacement, but not quite as safe as the InternetExplorer.Application COM object JScript hybrid.
@echo off
setlocal
for %%I in ("*.htm") do (
set p_on_line=
rem // get line number of first <p> tag
for /f "tokens=1 delims=:" %%n in (
'findstr /i /n "<p[^ar]" "%%~fI"'
) do if not defined p_on_line set "p_on_line=%%n"
if defined p_on_line (
rem // process file line-by-line
setlocal enabledelayedexpansion
for /f "delims=" %%L in ('findstr /n "^" "%%~fI"') do (
call :split num line "%%L"
rem // If <p> has not yet been reached, copy line to new file
if !num! lss !p_on_line! (
>>"%%~dpnI.new" echo(!line!
) else (
rem // If </p> has been reached, resume writing.
if not "!line!"=="!line:</p>=!" set p_on_line=2147483647
)
)
endlocal
if exist "%%~dpnI.new" move /y "%%~dpnI.new" "%%~fI" >NUL
)
)
goto :EOF
:split <num_var> <line_var> <string>
setlocal disabledelayedexpansion
set "line=%~3"
for /f "tokens=1 delims=:" %%I in ("%~3") do set "num=%%I"
set "line=%line:*:=%"
endlocal & set "%~1=%num%" & set "%~2=%line%"
goto :EOF
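For anyone who finds the nested for loops hard to follow, here is a rough sketch of the same line-skipping logic in plain JavaScript. The function name stripFirstParagraphLines is made up for illustration, and the sketch works on an array of lines rather than on files, so treat it as a model of the idea and not a drop-in replacement for the batch script.

```javascript
// Hypothetical helper, not part of the batch script above.
// Copies lines until the first line containing a <p ...> tag, drops every
// line from there through the first line containing </p>, then copies the rest.
function stripFirstParagraphLines(lines) {
    var out = [];
    var state = 'before';                 // 'before' -> 'inside' -> 'after'
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i];
        // Same pattern the findstr call uses: <p followed by anything
        // except "a" or "r", so <param> and <pre> are not mistaken for <p>.
        if (state === 'before' && /<p[^ar]/i.test(line)) {
            state = 'inside';
        }
        if (state !== 'inside') {
            out.push(line);
        }
        // The line holding </p> is dropped too; copying resumes on the next line.
        if (state === 'inside' && /<\/p>/i.test(line)) {
            state = 'after';
        }
    }
    return out;
}
```

As with the batch version, whole lines are removed, so any other markup sharing a line with the first <p> or </p> goes with them.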
Answer 3:
@ECHO Off
SETLOCAL
SET "sourcedir=U:\sourcedir"
SET "destdir=U:\destdir"
PUSHD "%sourcedir%"
FOR /f "delims=" %%f IN ('dir /b /a-d "q28443084*" ') DO ((
SET "zap=<P>"
FOR /f "usebackqdelims=" %%a IN ("%%f") DO (
IF DEFINED zap (
SET "line=%%a"
CALL :process
IF DEFINED keep (ECHO(%%a) ELSE (iF DEFINED line CALL ECHO(%%line%%)
) ELSE (ECHO(%%a)
)
)>"%destdir%\%%f"
)
popd
GOTO :EOF
:process
SET "keep="
CALL SET "line2=%%line:%zap%=%%"
IF "%line%" equ "%line2%" SET "keep=y"&GOTO :EOF
SET "line=%line2%"
IF "%zap%"=="</P>" SET "zap="&GOTO :EOF
SET "zap=</P>"
IF NOT DEFINED line GOTO :EOF
SET "line=%line2:</P>=%"
IF "%line%" neq "%line2%" SET "zap="
GOTO :eof
This may work - it will suppress empty lines.
I chose to process files matching the mask q28443084* in directory u:\sourcedir to matching filenames in u:\destdir - you would need to change these settings to suit.
The process revolves around the setting of zap, which may be set to either <P>, </P> or nothing. The incoming line is examined, and either kept as-is if it does not contain zap, or output in modified form with zap adjusted to the next value. If zap is nothing, the input is simply reproduced to the output.
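The zap idea is a small state machine, and it may be easier to see outside of batch syntax. Below is a minimal sketch in plain JavaScript; removeFirstTagPair and the TAGS array are illustrative names, not anything from the batch file. It also simplifies slightly: it removes only the first occurrence of each tag and keeps empty lines, which the batch version does not.

```javascript
// Hypothetical sketch of the zap idea; names are illustrative.
// TAGS[state] is the tag still waiting to be removed (the role zap plays);
// once state reaches TAGS.length there is nothing left to zap.
var TAGS = [/<p>/i, /<\/p>/i];

function removeFirstTagPair(lines) {
    var state = 0;
    var out = [];
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i];
        var from = 0;   // only look to the right of the tag just removed
        // One line may hold both the opening and the closing tag.
        while (state < TAGS.length) {
            var m = TAGS[state].exec(line.slice(from));
            if (!m) break;                          // current tag not on this line
            var at = from + m.index;
            line = line.slice(0, at) + line.slice(at + m[0].length);
            from = at;
            state++;                                // advance to the next tag
        }
        out.push(line);
    }
    return out;
}
```

Feeding it the lines of one file and writing the result back out would mirror what the batch loop does with its redirect to destdir.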
Answer 4:
The shortest solution would be to use a PowerShell one-liner.
powershell -command "gci '*.html' | %{ ([regex]'<p\W.*?</p>').replace([IO.File]::ReadAllText($_),'',1) | sc $_ }"
Please note that this will only work if there are no line breaks within the first paragraph. If there's a line break between <p> and </p>, this will keep searching until it finds a paragraph that doesn't have a line break. You might be better off trying to fix the vendor's broken CSS than this hackish workaround.
Anyway, the command above roughly translates thusly:
- In the current directory, get child items matching *.html
- For each matching html file (the % is an alias for foreach-object):
  - Create a regex object matching from <p to shining </p>
  - Call that regex object's replace method with the following params:
    - use the HTML file contents as the haystack,
    - replace the needle with nothing,
    - and do this 1 time.
  - Set the content of the HTML file to be the result.
I used [IO.File]::ReadAllText($_) rather than gc $_ to preserve line breaks. Using get-content with [regex].replace mashes everything together into one line. I used a [regex] object rather than the simpler -replace operator because -replace is global.
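For comparison, here is a sketch of the same "replace only the first match" idea in plain JavaScript. It is a variation on the one-liner, not a translation of it: it keeps the paragraph's content and removes only the tags, and it uses [\s\S] so a line break between <p> and </p> does not stop the match. The name unwrapFirstParagraph is made up for this example.

```javascript
// Hypothetical helper: remove the first <p ...>...</p> tag pair in a string
// while keeping whatever text sits between the tags.
function unwrapFirstParagraph(html) {
    // No "g" flag, so replace() rewrites only the first match.
    // \b keeps <pre> and <param> from matching; [\s\S]*? spans line breaks
    // and stops at the nearest </p>.
    return html.replace(/<p\b[^>]*>([\s\S]*?)<\/p>/i, '$1');
}
```

It is still a regex working on HTML, so comments or attribute values that contain a > character can fool it, the same as any of the other non-DOM approaches here.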
Answer 5:
Here's a similar solution to the HTML DOM answer. If your HTML is valid, you could try to parse it as XML. The advantage here is, where the InternetExplorer.Application COM object loads an entire fully-bloated instance of Internet Explorer for each page load, instead you're loading only a dll (msxml3.dll). This should hopefully handle multiple files more efficiently. The downside is that the XML parser is finicky about the validity of your tag structure. If, for example, you have an unordered list where the list items are not closed:
<ul>
<li>Item 1
<li>Item 2
</ul>
... a web browser would understand that just fine, but the XML parser will probably error. Anyway, it's worth a shot. I just tested this on a directory of 500 identical HTML files, and it worked through them in less than a minute.
@if (@CodeSection == @Batch) @then
@echo off
setlocal
for %%I in ("*.htm") do (
cscript /nologo /e:JScript "%~f0" "%%~fI"
)
rem // end main runtime
goto :EOF
@end
// end batch / begin JScript chimera
WSH.StdOut.Write('Checking ' + WSH.Arguments(0) + '... ');
var fso = WSH.CreateObject('scripting.filesystemobject'),
DOM = WSH.CreateObject('Microsoft.XMLDOM'),
htmlfile = fso.OpenTextFile(WSH.Arguments(0), 1),
html = htmlfile.ReadAll().split(/<\/head\b.*?>/i),
head = html[0] + '</head>',
body = html[1].replace(/<\/html\b.*?>/i,''),
changed;
htmlfile.Close();
// attempt to massage body string into valid XHTML
var self_closing_tags = ['area','base','br','col',
'command','comment','embed','hr','img','input',
'keygen','link','meta','param','source','track','wbr'];
body = body.replace(/<\/?\w+/g, function(m) { return m.toLowerCase(); }).replace(
RegExp([ // should match <br>
'<(',
'(' + self_closing_tags.join('|') + ')\\b', // \b stops "col" from also matching <colgroup>
'([^>]+[^\/])?', // for tags with properties, tag is unclosed
')>'
].join(''), 'ig'), "<$1 />"
);
DOM.async = false; // configure the parser before loading
DOM.loadXML(body);
if (DOM.parseError.errorCode) {
WSH.Echo(DOM.parseError.reason);
WSH.Quit(0);
}
for (var d = DOM.documentElement.getElementsByTagName('div'), i = 0; i < d.length; i++) {
var p = d[i].getElementsByTagName('p');
if (p && p[0]) {
// move contents of p node up to parent
while (p[0].hasChildNodes()) p[0].parentNode.insertBefore(p[0].firstChild, p[0]);
// delete now empty p node
p[0].parentNode.removeChild(p[0]);
changed = true;
}
}
html = head + DOM.documentElement.xml + '</html>';
if (changed) {
htmlfile = fso.CreateTextFile(WSH.Arguments(0), 1);
htmlfile.Write(html);
htmlfile.Close();
WSH.Echo('Fixed!');
}
else WSH.Echo('Nothing to change.');
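The "massage into valid XHTML" step is the fragile part, so here is that piece on its own as a plain JavaScript sketch that can be tried against sample markup. The function name closeVoidTags is an assumption for illustration, and the regex is a slightly different one from the script's: it normalizes <br>, <br/> and <br /> alike. The tag list is copied from the script above.

```javascript
// Hypothetical standalone version of the self-closing step.
var VOID_TAGS = ['area', 'base', 'br', 'col',
    'command', 'comment', 'embed', 'hr', 'img', 'input',
    'keygen', 'link', 'meta', 'param', 'source', 'track', 'wbr'];

function closeVoidTags(html) {
    var pattern = new RegExp(
        '<(' +
        '(?:' + VOID_TAGS.join('|') + ')\\b' +  // tag name, whole word only
        '[^>]*?' +                              // attributes, if any
        ')\\s*/?>',                             // optional existing " /"
        'ig');
    // Rewrite every match as <tag attrs /> so an XML parser sees it as closed.
    return html.replace(pattern, '<$1 />');
}
```

Attribute values containing a literal > will still trip it up, and it does nothing about unclosed <li> or <p> elements, which is why the XML route only suits fairly tidy HTML.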
Answer 6:
For posterity, I found another solution. O.P. was having problems with browser security and group policy restrictions preventing the InternetExplorer.Application COM object from behaving as expected, and the HTML he's fixing cannot reasonably be massaged into valid XML for the Microsoft.XMLDOM parser. But I'm optimistic that the htmlfile COM object won't suffer from these same infirmities.
As I emailed the O.P.:
Peppered around Google searches I found occasional references to a mysterious COM object called "htmlfile". It appears to be a way to build and interact with the HTML DOM without using the IE engine. I can't find any documentation on it on MSDN, but I managed to scrape together enough methods and properties from trial and error to make the script work.
I've since discovered that there's more to the htmlfile COM object than meets the eye -- htmlfileObj.parentWindow.clipboardData for example (MSDN reference).
Anyway, I was most optimistic about this solution, but O.P. has stopped returning my emails. Perhaps it'll be useful to someone else though.
@if (@CodeSection == @Batch) @then
@echo off
setlocal
for %%I in ("*.htm") do cscript /nologo /e:JScript "%~f0" "%%~fI"
rem // end main runtime
goto :EOF
@end
// end batch / begin JScript chimera
WSH.StdOut.Write(WSH.Arguments(0) + ': ');
var fso = WSH.CreateObject('scripting.filesystemobject'),
DOM = WSH.CreateObject('htmlfile'),
htmlfile = fso.OpenTextFile(WSH.Arguments(0), 1),
html = htmlfile.ReadAll(),
head = html.split(/<body\b.*?>/i)[0],
bodyTag = html.match(/<body\b.*?>/i)[0],
changed;
DOM.write(html);
htmlfile.Close();
if (DOM.getElementsByName('p_tag_fixed').length) {
WSH.Echo('fix already applied.');
WSH.Quit(0);
}
for (var d = DOM.body.getElementsByTagName('div'), i = 0; i < d.length; i++) {
var p = d[i].getElementsByTagName('p');
if (p && p[0]) {
// move contents of p node up to parent
while (p[0].hasChildNodes()) p[0].parentNode.insertBefore(p[0].firstChild, p[0]);
// delete now empty p node
p[0].parentNode.removeChild(p[0]);
changed = true;
}
}
if (changed) {
htmlfile = fso.CreateTextFile(WSH.Arguments(0), 1);
htmlfile.Write(
head
+ '<meta name="p_tag_fixed" />'
+ bodyTag
+ DOM.body.innerHTML
+ '</body></html>'
);
htmlfile.Close();
WSH.Echo('Fixed!');
}
else WSH.Echo('unchanged.');
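The script above avoids processing a file twice by looking for a <meta name="p_tag_fixed" /> element and writing one whenever it changes a file. Here is that guard by itself as a plain JavaScript sketch. The names MARKER, isAlreadyFixed and markAsFixed are made up, and the check is a simple string search where the script queries the DOM with getElementsByName.

```javascript
// Hypothetical sketch of the "already fixed" guard.
var MARKER = '<meta name="p_tag_fixed" />';

function isAlreadyFixed(html) {
    return html.indexOf(MARKER) !== -1;
}

function markAsFixed(html) {
    if (isAlreadyFixed(html)) return html;      // second run is a no-op
    // Put the marker directly before the opening <body ...> tag,
    // which is where the script writes it.
    return html.replace(/<body\b[^>]*>/i, function (bodyTag) {
        return MARKER + bodyTag;
    });
}
```

Without a guard like this, running the batch file a second time would strip the next <p> in each <div>.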
Source: https://stackoverflow.com/questions/28443084/how-to-edit-1st-instance-of-text-in-multiple-htm-files-using-batch-command