Replacing multiple new lines in a file with just one

轻奢々 2021-01-22 12:00

This function is supposed to search through a text file for the new line character. When it finds the newline character, it increments the newLine counter, and when

5 answers
  • 2021-01-22 12:46

    [EDITED] The minimal change is:

    if ( newLine <= 2)
    

    Please disregard the code I posted previously.

    A slightly simpler alternative:

    int c;
    int duplicates=0;
    while ((c = fgetc(fileContents)) != EOF)
    {
        if (c == '\n') {
            if (duplicates > 1) continue;
            duplicates++;
        }
        else {
            duplicates=0;
        }
        putchar(c);
    }
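
    The same idea can be sketched as a function that works on an in-memory string, which makes it easier to try out. The name squeeze_newlines and the max_run parameter are assumptions made for this illustration, not part of the question's code; max_run = 2 matches the loop above (at most one blank line survives).

```c
#include <stddef.h>

/* Copy `in` to `out`, keeping at most `max_run` consecutive '\n' characters.
 * `out` must have room for strlen(in) + 1 bytes. Returns the output length.
 * Hypothetical helper: the name and max_run are not from the original code. */
size_t squeeze_newlines(const char *in, char *out, int max_run)
{
    size_t n = 0;
    int run = 0;                /* newlines seen in the current run */

    for (; *in != '\0'; in++) {
        if (*in == '\n') {
            if (run >= max_run)
                continue;       /* run already at the limit: drop this one */
            run++;
        } else {
            run = 0;            /* any other character ends the run */
        }
        out[n++] = *in;
    }
    out[n] = '\0';
    return n;
}
```

    With max_run = 1 every run of newlines collapses to a single '\n', so no blank lines remain.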
    
  • 2021-01-22 12:48

    Diagnosis

    The logic looks correct if you have Unix line endings. If you have Windows CRLF line endings but are processing the file on Unix, you have a CR before each LF, and the CR resets newLine to zero, so you get the message for each newline.

    This would explain what you're seeing.

    It would also explain why everyone else is saying your logic is correct (it is — provided that the lines end with just LF and not CRLF) but you are seeing an unexpected result.
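
    One way to check this diagnosis is to count which line endings a buffer actually contains. The following is a minimal sketch; struct eol_counts and count_line_endings are names invented for this example, and it assumes the file contents have already been read into memory in binary mode.

```c
#include <stddef.h>

/* Tally of the three line-ending styles (hypothetical type for this sketch). */
struct eol_counts {
    size_t lf;      /* bare "\n"  (Unix)          */
    size_t crlf;    /* "\r\n"     (DOS/Windows)   */
    size_t cr;      /* bare "\r"  (old Mac style) */
};

/* Count the line endings found in buf[0..len). */
struct eol_counts count_line_endings(const char *buf, size_t len)
{
    struct eol_counts k = {0, 0, 0};

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\r') {
            if (i + 1 < len && buf[i + 1] == '\n') {
                k.crlf++;
                i++;            /* consume the LF of the CRLF pair */
            } else {
                k.cr++;
            }
        } else if (buf[i] == '\n') {
            k.lf++;
        }
    }
    return k;
}
```

    A non-zero crlf count on a Unix system would suggest that the CR characters are what keeps resetting newLine.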

    How to resolve it?

    Fair question. One major option is to use dos2unix or an equivalent mechanism to convert the DOS file into a Unix file. There are many questions on the subject on SO.

    If you don't need the CR ('\r' in C) characters at all, you can simply delete (not print, and not zero newLine) those.

    If you need to preserve the CRLF line endings, you'll need to be a bit more careful. You'll have to record that you got a CR, then check that you get an LF, then print the pair, and then check whether you get any more CRLF sequences and suppress those, etc.

    Working code — dupnl.c

    This program only reads from standard input; this is more flexible than only reading from a fixed file name. Learn to avoid writing code which only works with one file name; it will save you lots of recompilation over time. The code handles Unix-style files with newlines ("\n") only at the end; it also handles DOS files with CRLF ("\r\n") endings; and it also handles (old style) Mac (Mac OS 9 and earlier) files with CR ("\r") line endings. In fact, it handles arbitrary interleavings of the different line ending styles. If you want enforcement of a single mode, you have to do some work to decide which mode, and then use an appropriate subset of this code.

    #include <stdio.h>
    
    int main(void)
    {
        FILE *fp = stdin;       // Instead of fopen()
        int newLine = 1;
        int c; 
    
        while ((c = fgetc(fp)) != EOF)
        {   
            if (c == '\n')
            {
                /* Unix NL line ending */
                if (newLine++ == 0)
                    putchar(c); 
            }
            else if (c == '\r')
            {
                int c1 = fgetc(fp);
                if (c1 == '\n')
                {
                    /* DOS CRLF line ending */
                    if (newLine++ == 0)
                    {
                        putchar(c);
                        putchar(c1);
                    }
                }
                else
                {
                    /* MAC CR line ending */
                    if (newLine++ == 0)
                        putchar(c);
                    if (c1 != EOF && c1 != '\r')
                        ungetc(c1, fp);     /* push back onto the stream we read from */
                }
            }
            else
            {
                putchar(c); 
                newLine = 0;
            }
        }
    
        return 0;
    }
    

    Example run — inputs and outputs

    $ cat test.unx
    
    
    data long enough to be seen 1 - Unix
    
    data long enough to be seen 2 - Unix
    data long enough to be seen 3 - Unix
    data long enough to be seen 4 - Unix
    
    
    
    data long enough to be seen 5 - Unix
    
    
    $ sed 's/Unix/DOS/g' test.unx | ule -d > test.dos
    $ cat test.dos
    
    
    data long enough to be seen 1 - DOS
    
    data long enough to be seen 2 - DOS
    data long enough to be seen 3 - DOS
    data long enough to be seen 4 - DOS
    
    
    
    data long enough to be seen 5 - DOS
    
    
    $ sed 's/Unix/Mac/g' test.unx | ule -m > test.mac
    $ cat test.mac
    $ ta long enough to be seen 5 - Mac
    $ odx test.mac
    0x0000: 0D 0D 64 61 74 61 20 6C 6F 6E 67 20 65 6E 6F 75   ..data long enou
    0x0010: 67 68 20 74 6F 20 62 65 20 73 65 65 6E 20 31 20   gh to be seen 1 
    0x0020: 2D 20 4D 61 63 0D 0D 64 61 74 61 20 6C 6F 6E 67   - Mac..data long
    0x0030: 20 65 6E 6F 75 67 68 20 74 6F 20 62 65 20 73 65    enough to be se
    0x0040: 65 6E 20 32 20 2D 20 4D 61 63 0D 64 61 74 61 20   en 2 - Mac.data 
    0x0050: 6C 6F 6E 67 20 65 6E 6F 75 67 68 20 74 6F 20 62   long enough to b
    0x0060: 65 20 73 65 65 6E 20 33 20 2D 20 4D 61 63 0D 64   e seen 3 - Mac.d
    0x0070: 61 74 61 20 6C 6F 6E 67 20 65 6E 6F 75 67 68 20   ata long enough 
    0x0080: 74 6F 20 62 65 20 73 65 65 6E 20 34 20 2D 20 4D   to be seen 4 - M
    0x0090: 61 63 0D 0D 0D 0D 64 61 74 61 20 6C 6F 6E 67 20   ac....data long 
    0x00A0: 65 6E 6F 75 67 68 20 74 6F 20 62 65 20 73 65 65   enough to be see
    0x00B0: 6E 20 35 20 2D 20 4D 61 63 0D 0D 0D               n 5 - Mac...
    0x00BC:
    $ dupnl < test.unx
    data long enough to be seen 1 - Unix
    data long enough to be seen 2 - Unix
    data long enough to be seen 3 - Unix
    data long enough to be seen 4 - Unix
    data long enough to be seen 5 - Unix
    $ dupnl < test.dos
    data long enough to be seen 1 - DOS
    data long enough to be seen 2 - DOS
    data long enough to be seen 3 - DOS
    data long enough to be seen 4 - DOS
    data long enough to be seen 5 - DOS
    $ dupnl < test.mac
    $ ta long enough to be seen 5 - Mac
    $ dupnl < test.mac | odx
    0x0000: 64 61 74 61 20 6C 6F 6E 67 20 65 6E 6F 75 67 68   data long enough
    0x0010: 20 74 6F 20 62 65 20 73 65 65 6E 20 31 20 2D 20    to be seen 1 - 
    0x0020: 4D 61 63 0D 64 61 74 61 20 6C 6F 6E 67 20 65 6E   Mac.data long en
    0x0030: 6F 75 67 68 20 74 6F 20 62 65 20 73 65 65 6E 20   ough to be seen 
    0x0040: 32 20 2D 20 4D 61 63 0D 64 61 74 61 20 6C 6F 6E   2 - Mac.data lon
    0x0050: 67 20 65 6E 6F 75 67 68 20 74 6F 20 62 65 20 73   g enough to be s
    0x0060: 65 65 6E 20 33 20 2D 20 4D 61 63 0D 64 61 74 61   een 3 - Mac.data
    0x0070: 20 6C 6F 6E 67 20 65 6E 6F 75 67 68 20 74 6F 20    long enough to 
    0x0080: 62 65 20 73 65 65 6E 20 34 20 2D 20 4D 61 63 0D   be seen 4 - Mac.
    0x0090: 64 61 74 61 20 6C 6F 6E 67 20 65 6E 6F 75 67 68   data long enough
    0x00A0: 20 74 6F 20 62 65 20 73 65 65 6E 20 35 20 2D 20    to be seen 5 - 
    0x00B0: 4D 61 63 0D                                       Mac.
    0x00B4:
    $
    

    The lines starting $ ta are where the prompt overwrites the previous output (and the 'long enough to be seen' part is because my prompt is normally longer than just $).

    odx is a hex dump program. ule is for 'uniform line endings' and analyzes or transforms data so it has uniform line endings.

    Usage: ule [-cdhmnsuzV] [file ...]
      -c  Check line endings (default)
      -d  Convert to DOS (CRLF) line endings
      -h  Print this help and exit
      -m  Convert to MAC (CR) line endings
      -n  Ensure line ending at end of file
      -s  Write output to standard output (default)
      -u  Convert to Unix (LF) line endings
      -z  Check for zero (null) bytes
      -V  Print version information and exit
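
    For readers without such a tool, converting mixed line endings to Unix (LF) endings can be sketched in a few lines. This is not how ule is implemented (its source is not shown here); normalize_to_lf is a hypothetical helper that works on an in-memory string.

```c
#include <stddef.h>

/* Copy `in` to `out`, turning every CRLF pair and every bare CR into one LF.
 * `out` must have room for strlen(in) + 1 bytes. Returns the output length.
 * Hypothetical helper written for illustration only. */
size_t normalize_to_lf(const char *in, char *out)
{
    size_t n = 0;

    for (; *in != '\0'; in++) {
        if (*in == '\r') {
            out[n++] = '\n';    /* CR (alone or in CRLF) becomes LF */
            if (in[1] == '\n')
                in++;           /* CRLF: skip the LF we just accounted for */
        } else {
            out[n++] = *in;
        }
    }
    out[n] = '\0';
    return n;
}
```

    After normalizing, a simple LF-only newline counter no longer has to deal with stray CR characters.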
    
  • 2021-01-22 12:54

    What the sample code does is:

    1) Squeeze each run of consecutive '\n' characters down to a single '\n'.

    2) Remove any leading '\n' at the beginning of the input.

      input:   '\n\n\naa\nbb\n\ncc' 
      output:   aa'\n'    
                bb'\n' //notice, there is no blank line here
                cc
    

    If that was the aim, then your code's logic is correct.

    • Because newLine is initialized to 1, any leading '\n' in the input text is dropped.

    • When a '\n' survives the filtering, the code also prints "new line" as a hint.

    Back to the question itself: if the actual aim is to squeeze consecutive blank lines into a single blank line, you need to keep two consecutive '\n' characters, one to terminate the previous line and one for the blank line.

    1) First, let's confirm the input and the expected output.

    Input text:

    aaa'\n' //1st line, a '\n' is appended to 'aaa'
    '\n'    //2nd line, blank line
    bbb'\n' //3rd line, a '\n' is appended to 'bbb'
    '\n'    //4th line, blank line
    '\n'    //5th line, blank line
    '\n'    //6th line, blank line
    ccc     //7th line
    

    Expected Output text:

    aaa'\n' //1st line, a '\n' is appended to 'aaa'
    '\n'    //2nd line, blank line
    bbb'\n' //3rd line, a '\n' is appended to 'bbb'
    '\n'    //4th line, blank line
    ccc     //5th line
    

    2) If that is exactly what the program should do, then change the test as follows:

    if (c == '\n')
    {
        newLine++;
        if (newLine < 3) // here should be 3 to print '\n' twice,
                         // one for 'aaa\n', one for blank line 
        {
            //printf("new line");
            putchar(c); 
        }
    }
    

    3) If you have to process a Windows-format file (with \r\n line endings) under Cygwin, you could do the following:

    while ((c = fgetc(fileContents)) != EOF)
    {   
        if (c == '\r') continue; // add this line to discard any '\r'
        if (c == '\n')
        {
            newLine++;
            if (newLine < 3) // here should be 3 to print '\n' twice
            {
                //printf("new line"); // debug output, disabled as in step 2
                putchar(c); 
            }
        }
        else 
        {
            putchar(c); 
            newLine = 0;
        }
    }
    
  • 2021-01-22 12:54

    Dry run of the code, assuming the file starts with a newline character and newLine is initialized to 1:

    For the first iteration:

    if (c == '\n') //Will be evaluated as true for a new-line character. 
    {
        newLine++; //newLine becomes 2 before next if condition is evaluated.
        if (newLine < 2) //False, since newLine is not less than 2, but equal.
        {
            printf("new line");
            putchar(c); 
        }
    }
    else //Not entered
    {
        putchar(c); 
        newLine = 0;
    }
    

    On the second iteration: (Assume that it is a consecutive newline char case)

    if (c == '\n') //Will be evaluated as true for a new-line character.
    {
        newLine++; //newLine becomes 3 before next if condition is evaluated.
        if (newLine < 2) //False, since newLine is greater than 2.
        {
            printf("new line");
            putchar(c); 
        }
    }
    else //Not entered
    {
        putchar(c); 
        newLine = 0;
    }
    

    So,

    • Initialize newLine to 0.
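
    The effect of the starting value can be seen by running the same `newLine < 2` logic on a string with both initial values. This is only a sketch: collapse_newlines and its `initial` parameter are made up for the demonstration, and the debug printf from the question is left out.

```c
#include <stddef.h>

/* Apply the question's logic (increment, then print only if newLine < 2)
 * to a string, starting the counter at `initial`.
 * `out` must have room for strlen(in) + 1 bytes. Returns the output length.
 * Hypothetical helper, not the asker's actual function. */
size_t collapse_newlines(const char *in, char *out, int initial)
{
    size_t n = 0;
    int newLine = initial;

    for (; *in != '\0'; in++) {
        if (*in == '\n') {
            newLine++;
            if (newLine < 2)
                out[n++] = *in;   /* first newline of a run is kept */
        } else {
            out[n++] = *in;
            newLine = 0;          /* a non-newline character resets the count */
        }
    }
    out[n] = '\0';
    return n;
}
```

    For the input "\n\nabc\n\n\ndef", starting at 1 gives "abc\ndef" (all leading newlines dropped), while starting at 0 gives "\nabc\ndef" (one leading newline kept).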
  • 2021-01-22 12:56
    if newline > 2     
    

    That should be "greater than or equal to" if you want to get rid of the second line. Also, you have newline starting at one, then being incremented to two, then reset to zero. Instead, I recommend replacing the count with a boolean, such as

    bool firstNewlineFound = false; /* needs <stdbool.h> */
    

    Then, whenever you find a newline, set it to true; whenever it is already true, delete one newline and set it back to false.
