Why does my tool output overwrite itself and how do I fix it?

梦如初夏 2020-11-21 15:04

The intent of this question is to provide an answer to the daily questions whose answer is "you have DOS line endings" so we can simply close them as duplicates of this one.

3 answers

无人及你 (original poster)

2020-11-21 15:39
You can use the \R shorthand character class in PCRE for files with unknown line endings. There are even more line endings to consider with Unicode or other platforms. The \R form is a character class recommended by the Unicode Consortium to represent all forms of a generic newline.

So if you have an 'extra' \r, you can find and remove it with the regex s/\R$/\n/, which normalizes any combination of line endings at the end of a line into \n. Alternatively, you can use s/\R/\n/g to capture any notion of 'line ending' and standardize it into a \n character.

Given:
```
$ printf "what\risgoingon\r\n" > file
$ od -c file
0000000    w   h   a   t  \r   i   s   g   o   i   n   g   o   n  \r  \n
0000020
```
Perl, Ruby, and most flavors of PCRE implement \R, which can be combined with the end-of-string assertion $ (end of line in multi-line mode):
```
$ perl -pe 's/\R$/\n/' file | od -c
0000000    w   h   a   t  \r   i   s   g   o   i   n   g   o   n  \n    
0000017
$ ruby -pe '$_.sub!(/\R$/,"\n")' file | od -c
0000000    w   h   a   t  \r   i   s   g   o   i   n   g   o   n  \n    
0000017
```
(Note the \r between the two words is correctly left alone)

If you do not have \R you can use the equivalent of (?>\r\n|\v) in PCRE.

With straight POSIX tools, your best bet is likely awk like so:
```
$ awk '{sub(/\r$/,"")} 1' file | od -c
0000000    w   h   a   t  \r   i   s   g   o   i   n   g   o   n  \n    
0000017
```
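
To apply the awk fix to the file itself, one possible sketch is to count the affected lines first and then write through a temporary file, since awk as used here has no in-place mode. This assumes mktemp is available (it is on GNU and BSD systems); sample.txt is a placeholder name.

```shell
# Recreate the sample bytes (sample.txt is a placeholder name).
printf 'what\risgoingon\r\n' > sample.txt

# Count the lines that end in \r before changing anything.
awk '/\r$/ { n++ } END { print n + 0 }' sample.txt

# Write the cleaned copy to a temporary file and replace the original
# only if awk succeeded.
tmp=$(mktemp) &&
  awk '{ sub(/\r$/, "") } 1' sample.txt > "$tmp" &&
  mv "$tmp" sample.txt
```

Note that the replaced file takes on the permissions of the temporary file, so restore the original mode afterwards if that matters.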
Things that kinda work (but know your limitations):

tr deletes every \r, even one used in another context (granted, such use of \r is rare, and XML processing requires that \r be deleted, so tr is a great solution):
```
$ tr -d "\r" < file | od -c
0000000    w   h   a   t   i   s   g   o   i   n   g   o   n  \n        
0000016
```
GNU sed works, but POSIX sed does not, since POSIX does not support the \r and \x0D escapes.

GNU sed only:
```
$ sed 's/\x0D$//' file | od -c   # also sed 's/\r$//'
0000000    w   h   a   t  \r   i   s   g   o   i   n   g   o   n  \n    
0000017
```
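
If only a POSIX sed is available, a common workaround is to put a literal carriage return into the script through a shell variable instead of relying on an escape. This is a sketch under that assumption; the variable name cr and the file names are placeholders.

```shell
# Recreate the sample bytes (sample.txt is a placeholder name).
printf 'what\risgoingon\r\n' > sample.txt

# Capture a literal carriage return. Command substitution strips only
# trailing newlines, so the \r survives.
cr=$(printf '\r')

# Double quotes let $cr expand; \$ passes a literal $ to sed as the
# end-of-line anchor.
sed "s/$cr\$//" sample.txt > fixed.txt

# Inspect the result byte by byte.
od -c fixed.txt
```

As with the awk version, only the trailing \r is removed and the one between the two words is left alone.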
The Unicode Regular Expression Guide is probably the best candidate for a definitive treatment of what a "newline" is.