Question
I have a large (9 GiB), ASCII-encoded, pipe-delimited file with UNIX-style line endings (LF, 0x0A).
I want to sample the first 100 records into a file for investigation. The following will produce 100 records (1 header record and 99 data records). However, it changes the line endings to DOS/Windows style (CRLF, 0x0D0A).
Get-Content -Path .\wellmed_hce_elig_20191223.txt |
Select-Object -first 100 |
Out-File -FilePath .\elig.txt -Encoding ascii
I know about iconv, recode, and dos2unix. Those programs are not on my system and are not permitted to be installed. I have searched and found a number of places on how to get to CRLF. I have not found anything on getting to or keeping LF.
How can I produce the file with LF line endings instead of CRLF?
Answer 1:
You could join the lines from the Get-Content cmdlet with the Unix "`n" newline and save that.
Something like
((Get-Content -Path .\wellmed_hce_elig_20191223.txt |
Select-Object -first 100) -join "`n") |
Out-File -FilePath .\elig.txt -Encoding ascii -NoNewLine
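To check that the join approach above really leaves no CR bytes behind, the output can be inspected at the byte level. The following is only a minimal sketch: the temp file names and the sample rows are made up for illustration, and it assumes PowerShell 5.0 or later (for -NoNewline).

```powershell
# Hypothetical sample input: a small LF-delimited file in the temp directory.
$src = Join-Path ([IO.Path]::GetTempPath()) 'lf-sample-in.txt'
$dst = Join-Path ([IO.Path]::GetTempPath()) 'lf-sample-out.txt'
[IO.File]::WriteAllText($src, "h1|h2`nr1|a`nr2|b`nr3|c`n")

# Take the first 3 lines and join them with LF, as in the answer above.
((Get-Content -Path $src | Select-Object -First 3) -join "`n") |
  Out-File -FilePath $dst -Encoding ascii -NoNewline

# Inspect the raw bytes: 13 is CR (0x0D), 10 is LF (0x0A).
$bytes   = [IO.File]::ReadAllBytes($dst)
$hasCR   = $bytes -contains 13
$lfCount = @($bytes | Where-Object { $_ -eq 10 }).Count
"CR present: $hasCR; LF count: $lfCount"
```

Full paths are used because the .NET file methods resolve relative paths against the process working directory, which is not necessarily the current PowerShell location.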
Answer 2:
To complement Theo's helpful answer with a performance optimization based on the little-used -ReadCount parameter:
Set-Content -NoNewLine -Encoding ascii .\outfile.txt -Value (
  ((Get-Content -First 100 -ReadCount 100 .\file.txt) -join "`n") + "`n"
)
- -First 100 instructs Get-Content to read (at most) 100 lines.
- -ReadCount 100 causes these 100 lines to be read and emitted at once, as an array, which speeds up reading and subsequent processing.
  - Note: In PowerShell [Core] v7.0+ you can use shorthand -ReadCount 0 in combination with -First <n> to mean: read the requested <n> lines as a single array; due to a bug in earlier versions, including Windows PowerShell, -ReadCount 0 always reads the entire file, even in the presence of -First (aka -TotalCount aka -Head).
    Also, even as of PowerShell [Core] 7.0.0-rc.2 (current as of this writing), combining -ReadCount 0 with -Last <n> (aka -Tail) should be avoided (for now): while the output produced is correct, behind the scenes it is again the whole file that is read; see this GitHub issue.
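The effect of -ReadCount described above can be observed by counting how many lines each pipeline object carries. This is a rough sketch with a made-up five-line temp file, not a benchmark; it only illustrates that the requested lines arrive as one array rather than one by one.

```powershell
# Hypothetical sample input: five LF-delimited lines in the temp directory.
$src = Join-Path ([IO.Path]::GetTempPath()) 'readcount-sample.txt'
[IO.File]::WriteAllText($src, "a`nb`nc`nd`ne`n")

# Without -ReadCount, each line is its own pipeline object.
$single = @(Get-Content -Path $src -First 3 | ForEach-Object { @($_).Count })

# With -ReadCount 3, the three lines are emitted together as one array.
$batched = @(Get-Content -Path $src -First 3 -ReadCount 3 | ForEach-Object { $_.Count })

"objects without -ReadCount: $($single.Count)"
"objects with -ReadCount 3:  $($batched.Count), carrying $($batched[0]) lines"
```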
Note the + "`n", which ensures that the output file will have a trailing newline as well (which text files in the Unix world are expected to have).
While the above also works with -Last <n> (-Tail <n>) to extract from the end of the file, Theo's (slower) Select-Object solution offers more flexibility with respect to extracting arbitrary ranges of lines, thanks to the available parameters -Skip, -SkipLast, and -Index; however, offering these parameters directly on Get-Content as well, for superior performance, is being proposed in this GitHub feature request.
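As an illustration of the range flexibility mentioned above, here is a small sketch that combines -Skip and -First on Select-Object and still writes LF-only output with a trailing newline. The file names and rows are hypothetical and chosen only for the example.

```powershell
# Hypothetical sample input: a header plus four data rows, LF-delimited.
$src = Join-Path ([IO.Path]::GetTempPath()) 'lf-range-in.txt'
$dst = Join-Path ([IO.Path]::GetTempPath()) 'lf-range-out.txt'
[IO.File]::WriteAllText($src, "header`nrow1`nrow2`nrow3`nrow4`n")

# Skip the header and the first data row, then take the next two rows.
$lines = Get-Content -Path $src | Select-Object -Skip 2 -First 2

# Join with LF, append a trailing LF, and suppress the platform newline.
Set-Content -NoNewline -Encoding ascii -Path $dst -Value (($lines -join "`n") + "`n")

# Read the result back as one string for inspection.
$text = [IO.File]::ReadAllText($dst)
```

The same pattern should carry over to -SkipLast or -Index; only the Select-Object call changes, while the join and the write stay the same.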
Also note that I've used Set-Content instead of Out-File.
If you know you're writing text, Set-Content is sufficient and generally faster (though in this case this won't matter, given that the data to write is passed as a single value).
For a comprehensive overview of the differences between Set-Content and Out-File / >, see this answer.
Set-Content vs. Out-File benchmark:
Note: This benchmark compares the two cmdlets with respect to writing many input strings received via the pipeline to a file.
# Sample array of 100,000 lines.
$arr = (, 'foooooooooooooooooooooo') * 1e5
# Time writing the array lines to a file, first with Set-Content, then
# with Out-File.
$file = [IO.Path]::GetTempFileName()
{ $arr | Set-Content -Encoding Ascii $file },
{ $arr | Out-File -Encoding Ascii $file } | % { (Measure-Command $_).TotalSeconds }
Remove-Item $file
Sample timing in seconds from my Windows 10 VM with Windows PowerShell v5.1:
2.6637108 # Set-Content
5.1850954 # Out-File; took almost twice as long.
Source: https://stackoverflow.com/questions/60157755/how-can-i-keep-unix-lf-line-endings