I am trying to figure out a way to modify a text file (specifically, deleting particular lines) without reading a large part of the file into memory or rewriting the whole file. I am talking about files larger than main memory, roughly 15-50 GB.
P.S. I am using Linux.
You aren't going to get around making a new file, so just bite the bullet and do it. Use grep
with the appropriate options (-v inverts the match, -f reads the patterns from a file) and redirect the result to a second file:
$ grep -v -f patternsToExcludeFromInput input > output
Another approach is to load the patterns into, for example, a hash table (Perl), a dictionary (Python), or an unordered_map
(C++), and check each line of your input file against it.
If there is no match, print the line to the standard output stream (which you can pipe to a regular file). Your memory usage will be limited mostly to the hash table and the line of input you are querying.
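As a rough sketch of the hash-table idea without leaving the shell: awk's associative arrays are hash tables, so the same filter fits in one command. This swaps awk in for the Perl/Python/C++ options above, the file names are made up for illustration, and it matches whole lines exactly rather than as regexes.

```shell
# Hypothetical sample files, for illustration only.
printf 'alpha\nbeta\ngamma\nbeta\n' > input.txt
printf 'beta\ndelta\n' > exclude.txt

# While reading the first file (NR == FNR), store each line as a key in "skip".
# While reading the second file, print only lines that are not keys in "skip".
awk 'NR == FNR { skip[$0]; next } !($0 in skip)' exclude.txt input.txt > output.txt
```

Only exclude.txt has to fit in memory; input.txt is streamed one line at a time. Note that the NR == FNR trick assumes the exclude file is not empty.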
If the file is way larger than memory, sed
is your friend. It acts as a filter between your old file and a new file, and at the end you just have to rename the new file to the old name. The syntax is a bit strange for newcomers, but it is really powerful: it can select lines by number, by regex, or by range, and apply insertions, deletions, or string substitutions.
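Something along these lines should work as a minimal sketch of that filter-then-rename workflow. The file name and the DEBUG pattern are placeholders I made up, so adjust them to your data.

```shell
# Hypothetical sample file, for illustration only.
printf 'keep 1\nDEBUG noise\nkeep 2\nkeep 3\n' > big.log

# Delete every line starting with DEBUG (by regex) and line 1 (by number),
# write the result to a new file, then rename it over the old one.
sed -e '/^DEBUG/d' -e '1d' big.log > big.log.new && mv big.log.new big.log
```

sed reads one line at a time, so memory use stays small no matter how big the file is. GNU sed's -i option saves you the explicit rename, but it still writes a temporary copy behind the scenes, so you need the disk space either way.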
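You can open the file in "r+" mode (read and write; "rw" is not a valid fopen mode) and use fseek, fread, fwrite to read/write portions of it. You must take care not to overwrite the part you have not read yet. So to delete a line you read and write forward, and to insert a line you read and write backward (starting from the end of the file). After a deletion the file still has its old length, so you also have to shorten it afterwards, for example with ftruncate.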
To remove the first 100 bytes from the beginning of your file you could do something like:
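FILE *fp = fopen(filename, "r+");  /* "r+": read and write without truncating */
size_t BLOCK_SIZE = 1024;
char buffer[BLOCK_SIZE];
size_t offset = 100;
fseek(fp, 0, SEEK_END);            /* go to the end so ftell reports the file size */
size_t length = ftell(fp);         /* on 32-bit systems use fseeko/ftello for files over 2 GB */
for (size_t i = 0; i < (length - offset + BLOCK_SIZE - 1) / BLOCK_SIZE; ++i) {
    /* read the next block, starting just past the bytes being removed */
    fseek(fp, i * BLOCK_SIZE + offset, SEEK_SET);
    size_t count = fread(buffer, sizeof(char), BLOCK_SIZE, fp);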