How to edit 1st instance of text in multiple htm files using batch command?

Question

I need to remove the first instance of the <p> and </p> tags in multiple .htm files, all in a single directory, using a batch command. Any suggestions?

Edit: I just realized that there may be multiple DIVs in the .htm files, so I need to remove only the first instance of the <p> and </p> tags in each DIV (if any). To clarify, I want only the tags removed; the content/text between the tags should remain. Thanks for the answers and comments so far!

As for why: it is a long story, but I work for an agency that has a contract with a vendor who did not test the version we paid for with IE11. As a result, when a page has more than one paragraph, the first paragraph tag makes all of the text display 15 pixels lower than expected. I cannot modify the vendor's code, but I can modify its output after the e-learning course has been exported, which is what I need this batch file for. If I remove only the first instance of the paragraph tag on each page, the entire text displays as expected.

Answer 1:

The safest solution (albeit perhaps the slowest and most complicated) would be to parse your HTML files as HTML and remove the first paragraph from the DOM. This would give you the benefit of not being restricted to any sort of dependable formatting of the HTML source. Comments are properly skipped, line breaks are handled correctly, and life is all sunshine and daisies. Parsing the HTML DOM can be done using an InternetExplorer.Application COM object. Here's a batch / JScript hybrid example:

@if (@CodeSection == @Batch) @then

@echo off
setlocal

for %%I in ("*.htm") do (
    cscript /nologo /e:JScript "%~f0" "%%~fI"
)

rem // end main runtime
goto :EOF

@end
// end batch / begin JScript chimera

WSH.Echo(WSH.Arguments(0));

var fso = WSH.CreateObject('scripting.filesystemobject'),
    IE = WSH.CreateObject('InternetExplorer.Application'),
    htmlfile = fso.GetAbsolutePathName(WSH.Arguments(0));

IE.Visible = 0;
IE.Navigate('file:///' + htmlfile.replace(/\\/g, '/'));
while (IE.Busy || IE.ReadyState != 4) WSH.Sleep(25);

var p = IE.document.getElementsByTagName('p');

if (p && p[0]) {

    /* If you want to remove the surrounding <p></p> only
    while keeping the paragraph's inner content, uncomment
    the following line: */

    // while (p[0].hasChildNodes()) p[0].parentNode.insertBefore(p[0].firstChild, p[0]);

    p[0].parentNode.removeChild(p[0]);
    htmlfile = fso.CreateTextFile(htmlfile, 1);
    htmlfile.Write('<!DOCTYPE html>\n'
        + '<html>\n'
        + IE.document.documentElement.innerHTML
        + '\n</html>');
    htmlfile.Close();
}

IE.Quit();
try { while (IE && IE.Busy) WSH.Sleep(25); }
catch(e) {}

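And because you're working with the DOM, additional tweaks are made easier. To unwrap the first <p> element within each <div> element (just as a wild example, not that anyone would ever want this), navigate the DOM as you would in browser-based JavaScript. The version below moves the paragraph's contents up to its parent before deleting the emptied <p> node, so the text is kept.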

@if (@CodeSection == @Batch) @then

@echo off
setlocal

for %%I in ("*.htm") do (
    echo Batch section: "%%~fI"
    cscript /nologo /e:JScript "%~f0" "%%~fI"
)

rem // end main runtime
goto :EOF

@end
// end batch / begin JScript chimera

WSH.Echo('JScript section: "' + WSH.Arguments(0) + '"');

var fso = WSH.CreateObject('scripting.filesystemobject'),
    IE = WSH.CreateObject('InternetExplorer.Application'),
    htmlfile = fso.GetAbsolutePathName(WSH.Arguments(0)),
    changed;

IE.Visible = 0;
IE.Navigate('file:///' + htmlfile.replace(/\\/g, '/'));
while (IE.Busy || IE.ReadyState != 4) WSH.Sleep(25);

for (var d = IE.document.getElementsByTagName('div'), i = 0; i < d.length; i++) {

    var p = d[i].getElementsByTagName('p');
    if (p && p[0]) {

        // move contents of p node up to parent
        while (p[0].hasChildNodes()) p[0].parentNode.insertBefore(p[0].firstChild, p[0]);

        // delete now empty p node
        p[0].parentNode.removeChild(p[0]);
        changed = true;
    }
}

if (changed) {
    htmlfile = fso.CreateTextFile(htmlfile, 1);
    htmlfile.Write('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n'
        + '<HTML xmlns:t= "urn:schemas-microsoft-com:time" xmlns:control>\n'
        + IE.document.documentElement.innerHTML
        + '\n</HTML>');
    htmlfile.Close();
}

IE.Quit();
try { while (IE && IE.Busy) WSH.Sleep(25); }
catch(e) {}

Answer 2:

The solution you were probably expecting, a pure batch solution, would involve a bunch of for loops. This example will strip the entire line(s) from the first <p> to the first </p>.

I'm sure npocmaka, MC ND, Aacini, jeb or dbenham can accomplish this with half the code and ten times the efficiency. *shrug*

This is the middle-of-the-road solution, offering more tolerance for line breaks within your <p> tag than the PowerShell regexp replacement, but not quite as safe as the InternetExplorer.Application COM object JScript hybrid.

@echo off
setlocal

for %%I in ("*.htm") do (

    set p_on_line=

    rem // get line number of first <p> tag
    for /f "tokens=1 delims=:" %%n in (
        'findstr /i /n "<p[^ar]" "%%~fI"'
    ) do if not defined p_on_line set "p_on_line=%%n"

    if defined p_on_line (

        rem // process file line-by-line
        setlocal enabledelayedexpansion
        for /f "delims=" %%L in ('findstr /n "^" "%%~fI"') do (
            call :split num line "%%L"

            rem // If <p> has not yet been reached, copy line to new file
            if !num! lss !p_on_line! (
                >>"%%~dpnI.new" echo(!line!
            ) else (
                rem // If </p> has been reached, resume writing.
                if not "!line!"=="!line:</p>=!" set p_on_line=2147483647
            )
        )
        endlocal
        if exist "%%~dpnI.new" move /y "%%~dpnI.new" "%%~fI" >NUL
    )
)

goto :EOF

:split <num_var> <line_var> <string>
setlocal disabledelayedexpansion
set "line=%~3"
for /f "tokens=1 delims=:" %%I in ("%~3") do set "num=%%I"
set "line=%line:*:=%"
endlocal & set "%~1=%num%" & set "%~2=%line%"
goto :EOF
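
For anyone who finds the nested loops hard to follow, here is a minimal sketch of the same line-based logic in plain JavaScript (ES3-style syntax, so it should also run under JScript). The function name stripFirstParagraphLines is made up for illustration, and the sketch works on a string rather than on files.

```javascript
// Hypothetical helper mirroring the batch logic: drop every line from the
// first one containing a <p ...> tag through the first one containing </p>.
function stripFirstParagraphLines(text) {
    var lines = text.split(/\r?\n/),
        out = [],
        state = 'before'; // 'before' -> 'inside' -> 'after'

    for (var i = 0; i < lines.length; i++) {
        var line = lines[i];

        // Same idea as findstr "<p[^ar]": ignore <param> and <pre>.
        if (state === 'before' && /<p[^ar]/i.test(line)) state = 'inside';

        if (state !== 'inside') {
            out.push(line);
        } else if (/<\/p>/i.test(line)) {
            // The closing line itself is dropped; writing resumes afterwards.
            state = 'after';
        }
    }
    return out.join('\n');
}
```

As with the batch script, anything sharing a line with the first <p> or </p> is lost along with the tags, so this only suits files where the paragraph sits on its own lines.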

Answer 3:

@ECHO Off
SETLOCAL
SET "sourcedir=U:\sourcedir"
SET "destdir=U:\destdir"
PUSHD "%sourcedir%"
FOR /f "delims=" %%f IN ('dir /b /a-d "q28443084*" ') DO ((
 SET "zap=<P>"
 FOR /f "usebackqdelims=" %%a IN ("%%f") DO (
  IF DEFINED zap (
   SET "line=%%a"
   CALL :process
   IF DEFINED keep (ECHO(%%a) ELSE (iF DEFINED line CALL ECHO(%%line%%)
  ) ELSE (ECHO(%%a)
 )
 )>"%destdir%\%%f"
)
popd

GOTO :EOF

:process
SET "keep="
CALL SET "line2=%%line:%zap%=%%"
IF "%line%" equ "%line2%" SET "keep=y"&GOTO :EOF
SET "line=%line2%"
IF "%zap%"=="</P>" SET "zap="&GOTO :EOF 
SET "zap=</P>"
IF NOT DEFINED line GOTO :EOF 
SET "line=%line2:</P>=%"
IF "%line%" neq "%line2%" SET "zap="
GOTO :eof

This may work - it will suppress empty lines.

I chose to process files matching the mask q28443084* in directory u:\sourcedir to matching filenames in u:\destdir - you would need to change these settings to suit.

The process revolves around the setting of zap, which may be set to either <P>, </P> or nothing. The incoming line is examined, and is either kept as-is if it does not contain zap, or is output in modified form with zap adjusted to the next value. If zap is nothing, then the input is simply reproduced to the output.
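
The zap state machine may be easier to see outside batch syntax. Here is a rough sketch in plain JavaScript (ES3-style); the function name zapFirstParagraphTags and the 'open'/'close' state labels are mine, not part of the script above. Like the batch substring substitution, it removes every occurrence of the current tag on the line, ignoring case, and it only recognises a bare <P> without attributes.

```javascript
// Hypothetical model of the zap state machine: 'open' -> 'close' -> null.
function zapFirstParagraphTags(lines) {
    var patterns = { open: /<p>/gi, close: /<\/p>/gi },
        zap = 'open',
        out = [];

    for (var i = 0; i < lines.length; i++) {
        var line = lines[i];

        // Nothing left to remove: copy input to output.
        if (zap === null) { out.push(line); continue; }

        var stripped = line.replace(patterns[zap], '');

        // The current tag is not on this line: keep the line as-is.
        if (stripped === line) { out.push(line); continue; }

        if (zap === 'open') {
            zap = 'close';
            // The closing tag may sit on the same line as the opening one.
            var closed = stripped.replace(patterns.close, '');
            if (closed !== stripped) { zap = null; stripped = closed; }
        } else {
            zap = null;
        }

        // Lines emptied by the removal are dropped.
        if (stripped !== '') out.push(stripped);
    }
    return out;
}
```

One difference from the batch version: for /f skips blank input lines entirely, which this sketch does not model.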

Answer 4:

The shortest solution would be to use a PowerShell one-liner.

powershell -command "gci '*.htm' | %{ ([regex]'<p\W.*?</p>').replace([IO.File]::ReadAllText($_),'',1) | sc $_ }"

Please note that this will only work if there are no line breaks within the first paragraph. If there's a line break between <p> and </p>, this will keep searching until it finds a paragraph that doesn't have a line break. Note also that the match is replaced with nothing, so the paragraph's content is removed along with its tags. You might be better off trying to fix the vendor's broken CSS than this hackish workaround.

Anyway, the command above roughly translates thusly:

In the current directory, get child items matching *.htm
For each matching .htm file (the % is an alias for foreach-object):
- Create a regex object matching from <p to shining </p>
- Call that regex object's replace method with the following params:
 - use the HTML file contents as the haystack,
 - replace the needle with nothing,
 - and do this 1 time.
- Set the content of the HTML file to be the result.

I used [IO.File]::ReadAllText($_) rather than gc $_ to preserve line breaks. Using get-content with [regex].replace mashes everything together into one line. I used a [regex] object rather than a simpler -replace switch because -replace is global.
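
To illustrate the two regex points above (replace once, and "." stopping at line breaks), here is a small sketch in plain JavaScript (ES3-style), whose regex engine behaves similarly for these patterns. Both function names are made up for illustration. The second variant is my own assumption about how the line-break limitation could be lifted and is not part of the one-liner.

```javascript
// Hypothetical helper: remove only the first match, comparable to passing
// a count of 1 to the regex object's replace method.
function removeFirstParagraph(html) {
    // No "g" flag, so replace() stops after the first match.
    // "." does not match "\n", so a paragraph spanning lines is skipped.
    return html.replace(/<p\W.*?<\/p>/, '');
}

// Assumed variant: [\s\S] matches any character, including line breaks,
// so the first paragraph is matched even when it spans several lines.
function removeFirstParagraphAcrossLines(html) {
    return html.replace(/<p\W[\s\S]*?<\/p>/, '');
}
```

With the input '<p>a\nb</p>\n<p>c</p>', the first function skips the two-line paragraph and removes <p>c</p> instead, which is exactly the caveat described above; the second removes the two-line paragraph.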

Answer 5:

Here's a similar solution to the HTML DOM answer. If your HTML is valid, you could try to parse it as XML. The advantage is that, whereas the InternetExplorer.Application COM object loads an entire fully-bloated instance of Internet Explorer for each page load, this approach loads only a DLL (msxml3.dll). This should hopefully handle multiple files more efficiently. The downside is that the XML parser is finicky about the validity of your tag structure. If, for example, you have an unordered list where the list items are not closed:

<ul>
    <li>Item 1
    <li>Item 2
</ul>

... a web browser would understand that just fine, but the XML parser will probably error. Anyway, it's worth a shot. I just tested this on a directory of 500 identical HTML files, and it worked through them in less than a minute.

@if (@CodeSection == @Batch) @then

@echo off
setlocal

for %%I in ("*.htm") do (
    cscript /nologo /e:JScript "%~f0" "%%~fI"
)

rem // end main runtime
goto :EOF

@end
// end batch / begin JScript chimera

WSH.StdOut.Write('Checking ' + WSH.Arguments(0) + '... ');

var fso = WSH.CreateObject('scripting.filesystemobject'),
    DOM = WSH.CreateObject('Microsoft.XMLDOM'),
    htmlfile = fso.OpenTextFile(WSH.Arguments(0), 1),
    html = htmlfile.ReadAll().split(/<\/head\b.*?>/i),  
    head = html[0] + '</head>',
    body = html[1].replace(/<\/html\b.*?>/i,''),
    changed;

htmlfile.Close();

// attempt to massage body string into valid XHTML
var self_closing_tags = ['area','base','br','col',
    'command','comment','embed','hr','img','input',
    'keygen','link','meta','param','source','track','wbr'];

body = body.replace(/<\/?\w+/g, function(m) { return m.toLowerCase(); }).replace(
    RegExp([    // should match <br>
        '<(',
            '(' + self_closing_tags.join('|') + ')\\b',    // \b keeps <colgroup> from matching as <col>
            '([^>]+[^\/])?',    // for tags with properties, tag is unclosed
        ')>'
    ].join(''), 'ig'), "<$1 />"
);

// set synchronous mode before loading, so parseError is populated when checked
DOM.async = false;
DOM.loadXML(body);

if (DOM.parseError.errorCode) {
   WSH.Echo(DOM.parseError.reason);
   WSH.Quit(0);
}

for (var d = DOM.documentElement.getElementsByTagName('div'), i = 0; i < d.length; i++) {

    var p = d[i].getElementsByTagName('p');
    if (p && p[0]) {

        // move contents of p node up to parent
        while (p[0].hasChildNodes()) p[0].parentNode.insertBefore(p[0].firstChild, p[0]);

        // delete now empty p node
        p[0].parentNode.removeChild(p[0]);
        changed = true;
    }
}

html = head + DOM.documentElement.xml + '</html>';

if (changed) {
    htmlfile = fso.CreateTextFile(WSH.Arguments(0), 1);
    htmlfile.Write(html);
    htmlfile.Close();
    WSH.Echo('Fixed!');
}
else WSH.Echo('Nothing to change.');
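
To show what the "massage into valid XHTML" step does in isolation, here is a sketch in plain JavaScript (ES3-style). The function name toXhtmlish and the shortened tag list are illustrative only. It also uses a slightly different pattern from the script: a word boundary after the tag name and a looser attribute group, both my own assumptions, so that <colgroup> is not mistaken for <col>.

```javascript
// Illustrative subset of elements that never take a closing tag.
var VOID_TAGS = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
    'link', 'meta', 'param', 'source', 'track', 'wbr'];

// Hypothetical helper: lowercase tag names, then self-close void elements
// that are not already written as <tag/> or <tag />.
function toXhtmlish(body) {
    var voidTag = new RegExp(
        '<((' + VOID_TAGS.join('|') + ')\\b([^>]*[^>\\/])?)>', 'gi');

    return body
        .replace(/<\/?\w+/g, function (m) { return m.toLowerCase(); })
        .replace(voidTag, '<$1 />');
}
```

This only covers void elements. Unclosed <li> items like the example above would still make the XML parser fail, since fixing those needs a real HTML parser.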

Answer 6:

For posterity, I found another solution. O.P. was having problems with browser security and group policy restrictions preventing the InternetExplorer.Application COM object from behaving as expected, and the HTML he's fixing cannot reasonably be massaged into valid XML for the Microsoft.XMLDOM parser. But I'm optimistic that the htmlfile COM object won't suffer from these same infirmities.

As I emailed the O.P.:

Peppered around Google searches I found occasional references to a mysterious COM object called "htmlfile". It appears to be a way to build and interact with the HTML DOM without using the IE engine. I can't find any documentation on it on MSDN, but I managed to scrape together enough methods and properties from trial and error to make the script work.

I've since discovered that there's more to the htmlfile COM object than meets the eye -- htmlfileObj.parentWindow.clipboardData for example (MSDN reference).

Anyway, I was most optimistic about this solution, but O.P. has stopped returning my emails. Perhaps it'll be useful to someone else though.

@if (@CodeSection == @Batch) @then

@echo off
setlocal

for %%I in ("*.htm") do cscript /nologo /e:JScript "%~f0" "%%~fI"

rem // end main runtime
goto :EOF

@end
// end batch / begin JScript chimera

WSH.StdOut.Write(WSH.Arguments(0) + ': ');

var fso = WSH.CreateObject('scripting.filesystemobject'),
    DOM = WSH.CreateObject('htmlfile'),
    htmlfile = fso.OpenTextFile(WSH.Arguments(0), 1),
    html = htmlfile.ReadAll(),
    head = html.split(/<body\b.*?>/i)[0],
    bodyTag = html.match(/<body\b.*?>/i)[0],
    changed;

DOM.write(html);
htmlfile.Close();

if (DOM.getElementsByName('p_tag_fixed').length) {
    WSH.Echo('fix already applied.');
    WSH.Quit(0);
}

for (var d = DOM.body.getElementsByTagName('div'), i = 0; i < d.length; i++) {

    var p = d[i].getElementsByTagName('p');
    if (p && p[0]) {

        // move contents of p node up to parent
        while (p[0].hasChildNodes()) p[0].parentNode.insertBefore(p[0].firstChild, p[0]);

        // delete now empty p node
        p[0].parentNode.removeChild(p[0]);

        changed = true;
    }
}

if (changed) {
    htmlfile = fso.CreateTextFile(WSH.Arguments(0), 1);
    htmlfile.Write(
        head
        + '<meta name="p_tag_fixed" />'
        + bodyTag
        + DOM.body.innerHTML
        + '</body></html>'
    );
    htmlfile.Close();
    WSH.Echo('Fixed!');
}
else WSH.Echo('unchanged.');

Source: https://stackoverflow.com/questions/28443084/how-to-edit-1st-instance-of-text-in-multiple-htm-files-using-batch-command

Tags

batch-file

text

edit

internet-explorer-11