In PowerShell, what's the most efficient way to split a large text file by record type?

清酒与你 2021-02-04 14:36

I am using PowerShell for some ETL work, reading in compressed text files and splitting them into separate files based on the first three characters of each line.

If I were just fi

2 Answers
  • 2021-02-04 14:39

    Given the size of the input files, you definitely want to process one line at a time. I wouldn't expect re-opening and closing the output files to be too big a performance hit. It certainly makes the implementation possible using the pipeline, even as a one-liner, and it is really not too different from your implementation. I wrapped it here to get rid of the horizontal scrollbar:

    gc foo.log | %{switch ($_.Substring(0,3)) {
        '001'{$input | out-file output001.txt -enc ascii -append} `
        '002'{$input | out-file output002.txt -enc ascii -append} `
        '003'{$input | out-file output003.txt -enc ascii -append}}}
    
  • 2021-02-04 14:42

    Reading

    As for reading and parsing the file, I would go with the switch statement:

    switch -file c:\temp\stackoverflow.testfile2.txt -regex {
      "^001" {Add-Content c:\temp\stackoverflow.testfile.001.txt $_}
      "^002" {Add-Content c:\temp\stackoverflow.testfile.002.txt $_}
      "^003" {Add-Content c:\temp\stackoverflow.testfile.003.txt $_}
    }
    

    I think it is the better approach because

    • it supports regex, so you don't have to call Substring (which might be expensive), and
    • the -file parameter is quite handy ;)

    Writing

    As for writing the output, I'll test StreamWriter; however, if the performance of Add-Content is decent enough for you, I would stick with it.

    Added: Keith proposed using the >> operator, but it turns out to be very slow. It also writes output as Unicode (UTF-16 in Windows PowerShell), which doubles the file size for ASCII text.
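
    The size doubling is easy to check without writing any files. The sketch below compares the byte count of one record in ASCII and in UTF-16 (which .NET calls Unicode). The sample record is made up for illustration. Note that this describes Windows PowerShell; PowerShell 6 and later default to UTF-8 for redirection.

```powershell
# Compare how many bytes one record occupies in ASCII versus UTF-16.
# The record text is a hypothetical example, not taken from the original data.
$line = '001,sample record'

$asciiBytes   = [System.Text.Encoding]::ASCII.GetByteCount($line)
$unicodeBytes = [System.Text.Encoding]::Unicode.GetByteCount($line)

"ASCII:   $asciiBytes bytes"
"Unicode: $unicodeBytes bytes"
```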

    Look at my test:

    [1]: (measure-command {
    >>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch ($_.Substring(0,3)) {
    >>             '001'{$c >> c:\temp\stackoverflow.testfile.001.txt} `
    >>             '002'{$c >> c:\temp\stackoverflow.testfile.002.txt} `
    >>             '003'{$c >> c:\temp\stackoverflow.testfile.003.txt}}}
    >> }).TotalSeconds
    >>
    159,1585874
    [2]: (measure-command {
    >>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch ($_.Substring(0,3)) {
    >>             '001'{$c | Add-content c:\temp\stackoverflow.testfile.001.txt} `
    >>             '002'{$c | Add-content c:\temp\stackoverflow.testfile.002.txt} `
    >>             '003'{$c | Add-content c:\temp\stackoverflow.testfile.003.txt}}}
    >> }).TotalSeconds
    >>
    9,2696923
    

    The difference is huge.

    Just for comparison:

    [3]: (measure-command {
    >>     $reader = new-object io.streamreader c:\temp\stackoverflow.testfile2.txt
    >>     while (!$reader.EndOfStream) {
    >>         $line = $reader.ReadLine();
    >>         switch ($line.substring(0,3)) {
    >>             "001" {Add-Content c:\temp\stackoverflow.testfile.001.txt $line}
    >>             "002" {Add-Content c:\temp\stackoverflow.testfile.002.txt $line}
    >>             "003" {Add-Content c:\temp\stackoverflow.testfile.003.txt $line}
    >>             }
    >>         }
    >>     $reader.close()
    >> }).TotalSeconds
    >>
    8,2454369
    [4]: (measure-command {
    >>     switch -file c:\temp\stackoverflow.testfile2.txt -regex {
    >>         "^001" {Add-Content c:\temp\stackoverflow.testfile.001.txt $_}
    >>         "^002" {Add-Content c:\temp\stackoverflow.testfile.002.txt $_}
    >>         "^003" {Add-Content c:\temp\stackoverflow.testfile.003.txt $_}
    >>     }
    >> }).TotalSeconds
    8,6755565
    

    Added: I was curious about the writing performance, and I was a little surprised:

    [8]: (measure-command {
    >>     $sw1 = new-object io.streamwriter c:\temp\stackoverflow.testfile.001.txt3b
    >>     $sw2 = new-object io.streamwriter c:\temp\stackoverflow.testfile.002.txt3b
    >>     $sw3 = new-object io.streamwriter c:\temp\stackoverflow.testfile.003.txt3b
    >>     switch -file c:\temp\stackoverflow.testfile2.txt -regex {
    >>         "^001" {$sw1.WriteLine($_)}
    >>         "^002" {$sw2.WriteLine($_)}
    >>         "^003" {$sw3.WriteLine($_)}
    >>     }
    >>     $sw1.Close()
    >>     $sw2.Close()
    >>     $sw3.Close()
    >>
    >> }).TotalSeconds
    >>
    0,1062315
    

    That is roughly 80 times faster. So you have to decide: if speed matters, use StreamWriter; if code clarity matters, use Add-Content.
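
    If you go the StreamWriter route, a slightly more general variant might keep one writer per record type in a hashtable and close them all in a finally block, so the files are released even if something throws midway. This is only a sketch: the working directory, file names, and sample records are hypothetical, and it assumes every record type is exactly three digits. Full paths are used because .NET resolves relative paths against the process working directory, which can differ from the PowerShell location.

```powershell
# Hypothetical working directory and sample input; replace with your own paths.
$workDir   = Join-Path ([System.IO.Path]::GetTempPath()) 'split-demo'
$null      = New-Item -ItemType Directory -Path $workDir -Force
$inputPath = Join-Path $workDir 'input.txt'
Set-Content -Path $inputPath -Value '001 first', '002 second', '001 third', '003 fourth' -Encoding ASCII

$writers = @{}   # record type (first three characters) -> StreamWriter
try {
    switch -File $inputPath -Regex {
        # Assumption: a record type is three digits at the start of the line.
        '^(\d{3})' {
            $type = $Matches[1]
            if (-not $writers.ContainsKey($type)) {
                # Open the output file lazily, overwriting any previous run.
                $outPath = Join-Path $workDir "output.$type.txt"
                $writers[$type] = New-Object System.IO.StreamWriter($outPath, $false, [System.Text.Encoding]::ASCII)
            }
            $writers[$type].WriteLine($_)
        }
    }
}
finally {
    # Flush and close every writer, even after an error.
    foreach ($w in $writers.Values) { $w.Dispose() }
}
```

    Lines that do not start with three digits are silently skipped here; add a default clause to the switch if you need to catch them.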


    Substring vs. Regex

    According to Keith, Substring is 20% faster. It depends, as always. In my case, however, the results look like this:

    [102]: (measure-command {
    >>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch ($_.Substring(0,3)) {
    >>             '001'{$c | Add-content c:\temp\stackoverflow.testfile.001.s.txt} `
    >>             '002'{$c | Add-content c:\temp\stackoverflow.testfile.002.s.txt} `
    >>             '003'{$c | Add-content c:\temp\stackoverflow.testfile.003.s.txt}}}
    >> }).TotalSeconds
    >>
    9,0654496
    [103]: (measure-command {
    >>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch -regex ($_) {
    >>             '^001'{$c | Add-content c:\temp\stackoverflow.testfile.001.r.txt} `
    >>             '^002'{$c | Add-content c:\temp\stackoverflow.testfile.002.r.txt} `
    >>             '^003'{$c | Add-content c:\temp\stackoverflow.testfile.003.r.txt}}}
    >> }).TotalSeconds
    >>
    9,2563681
    

    So the difference is negligible, and to me regexes are more readable.
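
    Apart from speed, the two approaches behave differently on blank or short lines: Substring(0,3) throws when the line has fewer than three characters, while a regex simply does not match. A small sketch with made-up input lines (the variable names are illustrative only) shows a guarded Substring next to the regex and StartsWith alternatives:

```powershell
# Hypothetical input containing a normal record, a blank line, and a short line.
$lines = '001 ok', '', '02'

$results = foreach ($line in $lines) {
    [pscustomobject]@{
        Line       = $line
        # A regex just fails to match when the line is too short.
        Regex      = $line -match '^001'
        # Substring(0, 3) would throw on short lines, so check the length first.
        Substring  = $line.Length -ge 3 -and $line.Substring(0, 3) -eq '001'
        # StartsWith also tolerates short and empty lines.
        StartsWith = $line.StartsWith('001', [System.StringComparison]::Ordinal)
    }
}

$results | Format-Table -AutoSize
```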
