I have multiple text files with about 100,000 lines and I want to split them into smaller text files of 5000 lines each.
I used:
split -l 5000 filena
You can do something like this with awk:
awk '{outfile=sprintf("file%02d.txt",int((NR-1)/5000)+1);print > outfile}' yourfile
Basically, it calculates the name of the output file by taking the record number (NR), subtracting 1, dividing by 5000, taking the integer part, adding 1, and zero-padding the result to 2 places. Subtracting 1 first keeps line 5000 in the first file, so every file gets exactly 5000 lines.
By default, awk prints the entire input record when you don't specify anything else. So, print > outfile writes the entire input record to the output file.
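To make the file-name arithmetic concrete, here is a small Python sketch of the same mapping from line number to output file. The function name chunk_name is made up for this illustration, and it assumes 1-based line numbers like awk's NR.

```python
def chunk_name(line_number: int, lines_per_file: int = 5000) -> str:
    """Return the output file name for a 1-based line number."""
    # Subtract 1 first so that line 5000 still falls in the first file.
    index = (line_number - 1) // lines_per_file + 1
    # Zero-pad to 2 places, like %02d in the awk script.
    return f"file{index:02d}.txt"


if __name__ == "__main__":
    for n in (1, 5000, 5001, 100000):
        print(n, chunk_name(n))
```

With 100,000 input lines this yields file01.txt through file20.txt.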
As you are running on Windows, you can't use single quotes, because the Windows command prompt does not treat them as quotes. I think you have to put the script in a file and then tell awk to use the file, something like this:
awk -f script.awk yourfile
and script.awk will contain the script like this:
{outfile=sprintf("file%02d.txt",int((NR-1)/5000)+1);print > outfile}
Or, it may work if you do this:
awk "{outfile=sprintf(\"file%02d.txt\",int((NR-1)/5000)+1);print > outfile}" yourfile
Here is one in C# that doesn't run out of memory when splitting into large chunks! I needed to split a 95M file into files of 10M lines each.
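// Requires: using System.IO;
// filename holds the path of the file to split.
var fileSuffix = 0;
int lines = 0;
// File.Create truncates an existing output file; File.OpenWrite would leave old trailing bytes in place.
Stream fstream = File.Create($"{filename}.{(++fileSuffix)}");
StreamWriter sw = new StreamWriter(fstream);
using (var file = File.OpenRead(filename))
using (var reader = new StreamReader(file))
{
    while (!reader.EndOfStream)
    {
        // Copy one line at a time so memory use stays constant.
        sw.WriteLine(reader.ReadLine());
        lines++;
        // Roll over to the next output file once the current one is full,
        // but only if input remains, so no empty trailing file is created.
        if (lines >= 10000000 && !reader.EndOfStream)
        {
            sw.Close();
            fstream.Close();
            lines = 0;
            fstream = File.Create($"{filename}.{(++fileSuffix)}");
            sw = new StreamWriter(fstream);
        }
    }
}
sw.Close();
fstream.Close();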
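For comparison, the same streaming idea can be sketched in Python: read one line at a time and switch output files every N lines, so memory use does not grow with the file size. The function name split_file, the ".1", ".2", ... naming and the UTF-8 encoding are assumptions made for this example, not something taken from the answers here.

```python
from pathlib import Path


def split_file(path: str, lines_per_file: int = 5000) -> list:
    """Split a text file into numbered parts, reading one line at a time."""
    source = Path(path)
    parts = []
    out = None
    try:
        # newline="" keeps the original line endings untouched.
        with source.open("r", encoding="utf-8", newline="") as reader:
            for i, line in enumerate(reader):
                if i % lines_per_file == 0:
                    # Current part is full (or this is the first line): start a new part.
                    if out is not None:
                        out.close()
                    part = f"{source}.{len(parts) + 1}"
                    out = open(part, "w", encoding="utf-8", newline="")
                    parts.append(part)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return parts
```

Calling split_file("yourfile", 5000) would write yourfile.1, yourfile.2 and so on next to the input, and return the list of part names.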
I created a simple program for this, and your question helped me complete the solution. I added one more feature and a few configuration options: you can have it add a specific character or string after every few lines (configurable). Please go through the notes. The code files are here: https://github.com/mohitsharma779/FileSplit
@ECHO OFF
SETLOCAL
SET "sourcedir=U:\sourcedir"
SET /a fcount=100
SET /a llimit=5000
SET /a lcount=%llimit%
FOR /f "usebackqdelims=" %%a IN ("%sourcedir%\q25249516.txt") DO (
CALL :select
FOR /f "tokens=1*delims==" %%b IN ('set dfile') DO IF /i "%%b"=="dfile" >>"%%c" ECHO(%%a
)
GOTO :EOF
:select
SET /a lcount+=1
IF %lcount% lss %llimit% GOTO :EOF
SET /a lcount=0
SET /a fcount+=1
SET "dfile=%sourcedir%\file%fcount:~-2%.txt"
GOTO :EOF
Here's a native Windows batch script that should accomplish the task.
I won't claim that it's fast (less than 2 minutes for each 5K-line output file) or that it's immune to batch character sensitivities; that really depends on the characteristics of your target data.
I used a file named q25249516.txt containing 100K lines of data for my testing.
Revised quicker version
REM
@ECHO OFF
SETLOCAL
SET "sourcedir=U:\sourcedir"
SET /a fcount=199
SET /a llimit=5000
SET /a lcount=%llimit%
FOR /f "usebackqdelims=" %%a IN ("%sourcedir%\q25249516.txt") DO (
CALL :select
>>"%sourcedir%\file$$.txt" ECHO(%%a
)
SET /a lcount=%llimit%
:select
SET /a lcount+=1
IF %lcount% lss %llimit% GOTO :EOF
SET /a lcount=0
SET /a fcount+=1
MOVE /y "%sourcedir%\file$$.txt" "%sourcedir%\file%fcount:~-2%.txt" >NUL 2>nul
GOTO :EOF
Note that I used an llimit of 50000 for testing. The script will overwrite the early file numbers if the file has more than llimit*100 lines, because the two-digit suffix wraps around after 100 files (cure this by setting fcount to 1999 and using ~-3 in place of ~-2 in the file-renaming line).
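If you are not sure how many digits the suffix needs, a quick calculation helps. The Python sketch below is only an illustration; suffix_width is a made-up name, and it assumes the output files are numbered from 00 upwards with a minimum of two digits.

```python
def suffix_width(total_lines: int, lines_per_file: int) -> int:
    """Digits needed so that zero-based file numbers never wrap around."""
    # Ceiling division: a partly filled last file still needs its own number.
    file_count = max(1, (total_lines + lines_per_file - 1) // lines_per_file)
    # Files are numbered 0 .. file_count - 1; keep at least two digits.
    return max(2, len(str(file_count - 1)))


if __name__ == "__main__":
    print(suffix_width(100000, 5000))
    print(suffix_width(500001, 5000))
```

For 100,000 lines at 5000 lines per file, two digits are enough (20 files); anything over 500,000 lines at that limit needs three.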
Syntax looks like:
$ split [OPTION] [INPUT [PREFIX]]
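where the output files are named PREFIXaa, PREFIXab, ...
Just choose a suitable prefix and you're done, or use mv to rename the files afterwards.
Note that mv * *.txt will not work, because the shell expands both patterns before mv runs. A loop like
$ for f in PREFIX*; do mv "$f" "$f.txt"; done
should work, but test it first on a smaller scale.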
:)
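If no Unix shell is at hand (for example on Windows), the renaming step can also be sketched in Python. The function name add_extension and the default ".txt" are assumptions for this example; try it on a copy of the files first.

```python
from pathlib import Path


def add_extension(directory: str, prefix: str, extension: str = ".txt") -> list:
    """Append an extension to every file in directory whose name starts with prefix."""
    renamed = []
    # sorted() takes a snapshot of the listing, so renaming does not disturb the loop.
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.name.startswith(prefix) and path.suffix != extension:
            target = path.with_name(path.name + extension)
            path.rename(target)
            renamed.append(target.name)
    return renamed
```

For example, add_extension(".", "PREFIX") would turn PREFIXaa into PREFIXaa.txt, PREFIXab into PREFIXab.txt, and so on, leaving other files alone.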
This "File Splitter" Windows command line program works nicely: https://github.com/dubasdey/File-Splitter
It's open source, simple, documented, proven, and worked for me.
Example:
fsplit -split 50 mb mylargefile.txt
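For readers who want to see what splitting by size rather than by line count involves, here is a rough Python sketch. It is not how File Splitter itself is implemented; split_by_size, the ".1", ".2", ... naming and the 1 MB copy buffer are all assumptions for illustration. It cuts at exact byte offsets, so a line can end up split across two parts.

```python
def split_by_size(path: str, max_bytes: int, buffer_size: int = 1024 * 1024) -> list:
    """Split a file into parts of at most max_bytes bytes each."""
    parts = []
    with open(path, "rb") as reader:
        while True:
            # Read the first buffer of the next part; empty means end of input.
            data = reader.read(min(buffer_size, max_bytes))
            if not data:
                break
            part = f"{path}.{len(parts) + 1}"
            parts.append(part)
            with open(part, "wb") as writer:
                written = 0
                while data:
                    writer.write(data)
                    written += len(data)
                    remaining = max_bytes - written
                    if remaining <= 0:
                        break
                    # Copy in small buffers so a whole part is never held in memory.
                    data = reader.read(min(buffer_size, remaining))
    return parts
```

A call such as split_by_size("mylargefile.txt", 50 * 1024 * 1024) would produce parts of up to 50 MB each.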