Split a text file into multiple smaller text files using the command line

借酒劲吻你 2020-12-22 18:03

I have multiple text files with about 100,000 lines each, and I want to split them into smaller text files of 5,000 lines each.

I used:

split -l 5000 filena         


        
9 answers
  • 2020-12-22 18:32

    Here's an example in C# (because that's what I was searching for). I needed to split a 23 GB CSV file with around 175 million lines to be able to look at its contents. I split it into files of one million rows each. This code did it in about 5 minutes on my machine:

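    using System.Collections.Generic;
    using System.IO;
    
    var list = new List<string>();
    var fileSuffix = 0;
    
    using (var file = File.OpenRead(@"D:\Temp\file.csv"))
    using (var reader = new StreamReader(file))
    {
        while (!reader.EndOfStream)
        {
            list.Add(reader.ReadLine());
    
            // Flush a full chunk of one million lines to the next numbered file.
            if (list.Count >= 1000000)
            {
                File.WriteAllLines(@"D:\Temp\split" + (++fileSuffix) + ".csv", list);
                list = new List<string>();
            }
        }
    }
    
    // Write the remaining lines, if any. The check avoids an empty trailing
    // file when the line count is an exact multiple of the chunk size.
    if (list.Count > 0)
    {
        File.WriteAllLines(@"D:\Temp\split" + (++fileSuffix) + ".csv", list);
    }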
    
  • 2020-12-22 18:37

    I know the question was asked a long time ago, but I am surprised that nobody has given the most straightforward Unix answer:

    split -l 5000 -d --additional-suffix=.txt "$FileName" file
    
    • -l 5000: split file into files of 5,000 lines each.
    • -d: numerical suffix. This will make the suffix go from 00 to 99 by default instead of aa to zz.
    • --additional-suffix: appends an extra suffix to each output file name, here the .txt extension.
    • $FileName: name of the file to be split.
    • file: prefix to add to the resulting files.

    As always, check out man split for more details.
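
    To make the naming scheme concrete, here is a minimal sketch that runs the same options on a tiny generated file. It assumes GNU split is installed as split; the scratch directory, the sample file and the 5-line chunk size are illustrative choices, not part of the original command.

```shell
#!/usr/bin/env bash
# Sketch: requires GNU split; all paths below are hypothetical.
set -euo pipefail

workdir=$(mktemp -d)              # scratch directory (assumption)
FileName="$workdir/sample.txt"    # hypothetical input file

# Generate a 12-line sample so the result is easy to inspect.
seq 1 12 > "$FileName"

# 5 lines per part, numeric suffixes, .txt extension, output prefix "file".
split -l 5 -d --additional-suffix=.txt "$FileName" "$workdir/file"

# Lists file00.txt and file01.txt (5 lines each) and file02.txt (2 lines).
wc -l "$workdir"/file*.txt
```

    The prefix may include a directory, which keeps the parts out of the current working directory.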

    On macOS, the default version of split is apparently more limited and may not support these options. You can install the GNU version with Homebrew using the following command (the same package also provides the other GNU core utilities):

    brew install coreutils
    

    and then you can run the above command by replacing split with gsplit. Check out man gsplit for details.
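
    If a script has to run on both Linux and macOS, one option is to choose the binary at run time. The sketch below is only an illustration: the variable name split_cmd, the scratch directory and the sample file are assumptions, and it simply gives up when no GNU split is found.

```shell
#!/usr/bin/env bash
# Sketch: pick a GNU-compatible split binary; exits if none is found.
set -eu

if command -v gsplit >/dev/null 2>&1; then
    split_cmd=gsplit                      # name used by Homebrew coreutils
elif split --version 2>/dev/null | grep -q "GNU"; then
    split_cmd=split                       # system split is already GNU
else
    echo "GNU split not found; try: brew install coreutils" >&2
    exit 1
fi

workdir=$(mktemp -d)                      # scratch directory (assumption)
seq 1 7 > "$workdir/sample.txt"           # hypothetical 7-line input

# Same options as above, with 5 lines per part for the small sample.
"$split_cmd" -l 5 -d --additional-suffix=.txt "$workdir/sample.txt" "$workdir/file"
ls "$workdir"
```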

  • 2020-12-22 18:38

    My requirement was a bit different. I often work with comma-delimited and tab-delimited ASCII files where a single line is a single record of data. They're really big, so I need to split them into manageable parts (whilst preserving the header row).

    So I went back to my classic VBScript method and put together a small .vbs script that can be run on any Windows computer (it gets executed automatically by the WScript.exe script host engine on Windows).

    The benefit of this method is that it uses Text Streams, so the underlying data isn't loaded into memory (or, at least, not all at once). The result is that it's exceptionally fast and doesn't need much memory to run. The test file I just split using this script on my i7 was about 1 GB in size, had about 12 million lines of text, and produced 25 part files (about 500k lines each). The processing took about 2 minutes and never used more than 3 MB of memory.

    The caveat here is that it relies on the text file having "lines" (meaning each record is delimited with a CRLF) as the Text Stream object uses the "ReadLine" function to process a single line at a time. But hey, if you're working with TSV or CSV files, it's perfect.

    Option Explicit
    
    Private Const INPUT_TEXT_FILE = "c:\bigtextfile.txt"  'The full path to the big file
    Private Const REPEAT_HEADER_ROW = True                'Set to True to duplicate the header row in each part file
    Private Const LINES_PER_PART = 500000                 'The number of lines per part file
    
    Dim oFileSystem, oInputFile, oOutputFile, iOutputFile, iLineCounter, sHeaderLine, sLine, sFileExt, sStart
    
    sStart = Now()
    
    sFileExt = Right(INPUT_TEXT_FILE,Len(INPUT_TEXT_FILE)-InstrRev(INPUT_TEXT_FILE,".")+1)
    iLineCounter = 0
    iOutputFile = 1
    
    Set oFileSystem = CreateObject("Scripting.FileSystemObject")
    Set oInputFile = oFileSystem.OpenTextFile(INPUT_TEXT_FILE, 1, False)
    Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)
    
    If REPEAT_HEADER_ROW Then
        iLineCounter = 1
        sHeaderLine = oInputFile.ReadLine()
        Call oOutputFile.WriteLine(sHeaderLine)
    End If
    
    Do While Not oInputFile.AtEndOfStream
        sLine = oInputFile.ReadLine()
        Call oOutputFile.WriteLine(sLine)
        iLineCounter = iLineCounter + 1
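        'Start a new part only if more input remains, so no header-only file is left at the end
        If iLineCounter Mod LINES_PER_PART = 0 And Not oInputFile.AtEndOfStream Then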
            iOutputFile = iOutputFile + 1
            Call oOutputFile.Close()
            Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)
            If REPEAT_HEADER_ROW Then
                Call oOutputFile.WriteLine(sHeaderLine)
            End If
        End If
    Loop
    
    Call oInputFile.Close()
    Call oOutputFile.Close()
    Set oFileSystem = Nothing
    
    Call MsgBox("Done" & vbCrLf & "Lines Processed:" & iLineCounter & vbCrLf & "Part Files: " & iOutputFile & vbCrLf & "Start Time: " & sStart & vbCrLf & "Finish Time: " & Now())
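
    For readers on a Unix-like system, the same header-preserving idea can be sketched with awk, which also streams the input one line at a time. This is a different tool from the VBScript above, not a port of it: the file names, the part_ prefix and the 3-row chunk size are hypothetical, and here the chunk size counts data rows only, not the header.

```shell
#!/usr/bin/env bash
# Sketch: header-preserving split with awk; names and sizes are illustrative.
set -euo pipefail

workdir=$(mktemp -d)          # scratch directory (assumption)
input="$workdir/data.csv"     # hypothetical input file
lines_per_part=3              # data rows per part, header not counted

# Sample: one header row followed by 7 data rows.
{ echo "id,value"; for i in 1 2 3 4 5 6 7; do echo "$i,row$i"; done; } > "$input"

awk -v n="$lines_per_part" -v prefix="$workdir/part_" '
    NR == 1 { header = $0; next }          # remember the header row
    (NR - 2) % n == 0 {                    # first data row of a new part
        if (out != "") close(out)          # release the previous part file
        out = sprintf("%s%02d.csv", prefix, ++part)
        print header > out                 # repeat the header in every part
    }
    { print > out }                        # stream the current row
' "$input"

wc -l "$workdir"/part_*.csv
```

    Because a new part is opened only when another data row arrives, no header-only file is produced when the row count is an exact multiple of the chunk size.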
    