Why does my tool output overwrite itself and how do I fix it?

梦如初夏 2020-11-21 15:04

The intent of this question is to provide an answer to the daily questions whose answer is "you have DOS line endings" so we can simply close them as duplicates of this one.

3 Answers
  • 2020-11-21 15:29

    Run dos2unix. While you can manipulate the line endings with code you wrote yourself, there are utilities which exist in the Linux / Unix world which already do this for you.

    On a Fedora system, dnf install dos2unix will install the dos2unix tool if it is not already present.

    A similar dos2unix deb package is available for Debian-based systems.

    From a programming point of view, the conversion is simple. Search all the characters in a file for the sequence \r\n and replace it with \n.

    This means there are dozens of ways to convert from DOS to Unix using nearly every tool imaginable. One simple way is to use the command tr where you simply replace \r with nothing!

    tr -d '\r' < infile > outfile
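
    Because tr reads standard input and writes standard output, redirecting its output onto the input file would truncate that file before it is read. A minimal sketch of an in-place conversion through a temporary file follows. The function name strip_cr_inplace is hypothetical, mktemp is assumed to be available (it is common but not required by POSIX), and, unlike tr -d, only a \r at the end of a line is removed:

```shell
# strip_cr_inplace is a hypothetical helper name, not a standard utility.
strip_cr_inplace() {
    tmp=$(mktemp) || return 1
    # Remove a CR only when it is the last character before the newline.
    if awk '{ sub(/\r$/, "") } 1' "$1" > "$tmp"; then
        cat "$tmp" > "$1"   # rewrite the contents; the file keeps its permissions
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

    It would be called as strip_cr_inplace infile. Where dos2unix is installed, it is still the simpler choice.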
    
  • 2020-11-21 15:39

    You can use the \R shorthand character class in PCRE for files with unknown line endings. There are even more line endings to consider with Unicode or other platforms. The \R form is a recommended character class from the Unicode consortium to represent all forms of a generic newline.

    So if you have an 'extra' \r, you can find and remove it with the regex s/\R$/\n/, which will normalize any combination of line endings into \n. Alternatively, you can use s/\R/\n/g to capture any notion of 'line ending' and standardize it into a \n character.

    Given:

    $ printf "what\risgoingon\r\n" > file
    $ od -c file
    0000000    w   h   a   t  \r   i   s   g   o   i   n   g   o   n  \r  \n
    0000020
    

    Perl and Ruby and most flavors of PCRE implement \R combined with the end of string assertion $ (end of line in multi-line mode):

    $ perl -pe 's/\R$/\n/' file | od -c
    0000000    w   h   a   t  \r   i   s   g   o   i   n   g   o   n  \n    
    0000017
    $ ruby -pe '$_.sub!(/\R$/,"\n")' file | od -c
    0000000    w   h   a   t  \r   i   s   g   o   i   n   g   o   n  \n    
    0000017
    

    (Note the \r between the two words is correctly left alone)

    If you do not have \R you can use the equivalent of (?>\r\n|\v) in PCRE.

    With straight POSIX tools, your best bet is likely awk like so:

    $ awk '{sub(/\r$/,"")} 1' file | od -c
    0000000    w   h   a   t  \r   i   s   g   o   i   n   g   o   n  \n    
    0000017
    

    Things that kinda work (but know your limitations):

    tr deletes all \r even if used in another context (granted the use of \r is rare, and XML processing requires that \r be deleted, so tr is a great solution):

    $ tr -d "\r" < file | od -c
    0000000    w   h   a   t   i   s   g   o   i   n   g   o   n  \n        
    0000016
    

    GNU sed works, but not POSIX sed since \r and \x0D are not supported on POSIX.

    GNU sed only:

    $ sed 's/\x0D$//' file | od -c   # also sed 's/\r$//'
    0000000    w   h   a   t  \r   i   s   g   o   i   n   g   o   n  \n    
    0000017
    

    The Unicode Regular Expression Guide is probably the best bet for a definitive treatment of what a "newline" is.
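
    For the s/\R/\n/g variant mentioned earlier, a rough POSIX awk approximation is sketched below. It assumes only the three common endings (CRLF, lone CR, LF) matter, so it does not cover the other Unicode line terminators that \R matches, and the function name normalize_newlines is hypothetical. Unlike the examples above, it deliberately converts a \r in the middle of a line into a newline:

```shell
# normalize_newlines is a hypothetical helper name.
normalize_newlines() {
    awk '{
        sub(/\r$/, "")      # CRLF: drop the CR that precedes the newline
        gsub(/\r/, "\n")    # lone CR: turn it into a newline
        print
    }' "$@"
}
```

    It reads the named files, or standard input when none are given, for example normalize_newlines file > file.unix.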

  • 2020-11-21 15:42

    The problem is that your input file uses DOS line endings of CRLF instead of UNIX line endings of just LF, and you are running a UNIX tool on it, so the CR remains part of the data being operated on by the UNIX tool. CR is commonly denoted by \r and can be seen as a control-M (^M) when you run cat -vE on the file, while LF is \n and appears as $ with cat -vE.

    So your input file wasn't really just:

    what isgoingon
    

    it was actually:

    what isgoingon\r\n
    

    as you can see with cat -vE:

    $ cat -vE file
    what isgoingon^M$
    

    and od -c:

    $ od -c file
    0000000   w   h   a   t       i   s   g   o   i   n   g   o   n  \r  \n
    0000020
    

    so when you run a UNIX tool like awk (which treats \n as the line ending) on the file, the \n is consumed by the act of reading the line, but that leaves the 2 fields as:

    <what> <isgoingon\r>
    

    Note the \r at the end of the second field. \r means Carriage Return which is literally an instruction to return the cursor to the start of the line so when you do:

    print $2, $1
    

    awk will print isgoingon and then will return the cursor to the start of the line before printing what which is why the what appears to overwrite the start of isgoingon.
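
    The effect can be reproduced without a terminal getting in the way. In this small sketch the \r is translated to a visible # so that it shows up in the output; the function name swap_fields is only illustrative:

```shell
# swap_fields is an illustrative name for the awk command from the question.
swap_fields() {
    awk '{ print $2, $1 }'
}

# Translate the CR to '#' so it is visible instead of moving the cursor.
printf 'what isgoingon\r\n' | swap_fields | tr '\r' '#'
# prints: isgoingon# what
```

    The # marks the spot where the terminal would have jumped back to the start of the line.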

    To fix the problem, do any of these:

    dos2unix file
    sed 's/\r$//' file
    awk '{sub(/\r$/,"")}1' file
    perl -pe 's/\r$//' file
    

    Apparently dos2unix is aka fromdos (from the tofrodos package) in some UNIX variants (e.g. Ubuntu).

    Be careful if you decide to use tr -d '\r' as is often suggested as that will delete all \rs in your file, not just those at the end of each line.
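
    Before converting anything, it can be worth confirming that the file really has CRLF line endings. One possible check, sketched with POSIX grep, is shown here; the function name has_crlf is hypothetical:

```shell
# has_crlf is a hypothetical helper name, not a standard utility.
has_crlf() {
    cr=$(printf '\r')
    # Succeeds if at least one line ends in CR (a CR immediately before the LF).
    grep -q "${cr}\$" "$1"
}
```

    It returns success only when some line ends in \r, so a \r in the middle of a line does not count. It could guard a conversion, for example: if has_crlf file; then dos2unix file; fi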

    Note that GNU awk will let you parse files that have DOS line endings by simply setting RS appropriately:

    gawk -v RS='\r\n' '...' file
    

    but other awks will not allow that, as POSIX only requires awks to support a single-character RS, and most other awks will quietly truncate RS='\r\n' to RS='\r'. You may need to add -v BINMODE=3 for gawk to even see the \rs, though, as the underlying C primitives will strip them on some platforms, e.g. Cygwin.
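
    With an awk that lacks multi-character RS, one portable alternative is to strip the trailing \r at the top of the script itself, before any field is used. A minimal sketch follows, with the field swap from the question as the example body and a hypothetical function name:

```shell
# swap_fields_crlf_safe is a hypothetical name; the print is just an example body.
swap_fields_crlf_safe() {
    # Modifying $0 makes awk re-split the fields, so $2 no longer carries the CR.
    awk '{ sub(/\r$/, ""); print $2, $1 }' "$@"
}

printf 'what isgoingon\r\n' | swap_fields_crlf_safe
# prints: isgoingon what
```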

    One thing to watch out for is that CSVs created by Windows tools like Excel will use CRLF as the line endings but can have LFs embedded inside a specific field of the CSV, e.g.:

    "field1","field2.1
    field2.2","field3"
    

    is really:

    "field1","field2.1\nfield2.2","field3"\r\n
    

    so if you just convert \r\ns to \ns then you can no longer tell linefeeds within fields from linefeeds as line endings. If you want to do that, I recommend converting all of the intra-field linefeeds to something else first. For example, this would convert all intra-field LFs to tabs and convert all line-ending CRLFs to LFs:

    gawk -v RS='\r\n' '{gsub(/\n/,"\t")}1' file
    

    Doing something similar without GNU awk is left as an exercise, but with other awks it involves combining lines that do not end in CR as they're read.
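
    One way that exercise might go in POSIX awk is sketched below. It assumes every real record ends in CRLF and every line break inside a field is a bare LF; the function name join_csv_lines and the variable buf are hypothetical:

```shell
# join_csv_lines is a hypothetical helper name.
join_csv_lines() {
    awk '
        {
            if (sub(/\r$/, "")) {   # line ended in CR: the record is complete
                print buf $0
                buf = ""
            } else {                # bare LF inside a field: keep accumulating
                buf = buf $0 "\t"
            }
        }
        END {
            # Flush a final record that had no CRLF terminator, if any.
            if (buf != "") { sub(/\t$/, "", buf); print buf }
        }
    ' "$@"
}
```

    Run on the two-line CSV record above, it should print a single line with a tab between field2.1 and field2.2 and no trailing \r.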
