Removing all duplicate lines from a file using C [closed]

Submitted by 。_饼干妹妹 on 2019-12-14 03:34:48

Question


In this question: Detecting duplicate lines on file using c I can detect duplicate lines, but how can we remove those lines from the file?

Thanks.

Edit: here is my code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct somehash {
    struct somehash *next;
    unsigned hash;
    char *mem;
};

#define THE_SIZE 100000

struct somehash *table[THE_SIZE] = { NULL,};

struct somehash **some_find(char *str, unsigned len);
static unsigned some_hash(char *str, unsigned len);

int main (void)
{
    char buffer[100];
    struct somehash **pp;
    size_t len;
    FILE * pFileIn;
    FILE * pFileOut;

    pFileIn  = fopen("in.csv", "r");
    pFileOut  = fopen("out.csv", "w+");

    if (pFileIn == NULL)  { perror("Error opening input file");  return 1; }
    if (pFileOut == NULL) { perror("Error opening output file"); return 1; }

    while (fgets(buffer, sizeof buffer, pFileIn)) {
        len = strlen(buffer);
        pp = some_find(buffer, len);
        if (*pp) {   /* found */
            fprintf(stderr, "Duplicate:%s\n", buffer);
        }
        else {       /* not found: create one */
            fprintf(stdout, "%s", buffer);
            fprintf(pFileOut, "%s", buffer);
            *pp = malloc(sizeof **pp);
            (*pp)->next = NULL;
            (*pp)->hash = some_hash(buffer, len);
            (*pp)->mem = malloc(1 + len);
            memcpy((*pp)->mem, buffer, 1 + len);
        }
    }

return 0;
}

struct somehash **some_find(char *str, unsigned len)
{
    unsigned hash;
    unsigned short slot;
    struct somehash **hnd;

    hash = some_hash(str,len);
    slot = hash % THE_SIZE;
    for (hnd = &table[slot]; *hnd; hnd = &(*hnd)->next) {
        if ((*hnd)->hash != hash) continue;
        if (strcmp((*hnd)->mem, str)) continue;
        break;
    }

    return hnd;
}

static unsigned some_hash(char *str, unsigned len)
{
    unsigned val;
    unsigned idx;

    if (!len) len = strlen(str);

    val = 0;
    for (idx = 0; idx < len; idx++) {
        val ^= (val >> 2) ^ (val << 5) ^ (val << 13) ^ str[idx] ^ 0x80001801;
    }

    return val;
}

But the first occurrence of each duplicated line still always ends up in the output file!

Edit 2: To clarify: the intent is to find all duplicated lines in the input file. When there is more than one instance of a line in the input, that line should not appear in the output at all. The goal is not just to deduplicate so that each line occurs only once, but to remove every instance of any line that is duplicated in the input.


Answer 1:


Essentially the only way to remove lines from a text file is to copy the file without those lines in the copy. The usual approach is something on this order:

while (fgets(buffer, size, infile))
    if (search(your_hashtable, buffer) == NOT_FOUND) {
        fputs(buffer, outfile);
        insert(your_hashtable, buffer);
    }

If you want to save some storage space, you might store hashes instead of complete lines. In theory that could fail due to a hash collision, but if you use a cryptographic hash like SHA-256, the chances of a collision are probably smaller than the chances of a string comparison coming out wrong due to a CPU error. Besides: if you find a collision with SHA-256, you can probably get at least a little fame (if not fortune) from that alone.
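For illustration, here is a minimal sketch of what storing hashes instead of complete lines could look like in the question's C setup. It assumes OpenSSL's SHA256() from <openssl/sha.h> is available (link with -lcrypto); the struct and function names are my own, not from the original answer. Each table node keeps only the 32-byte digest, and lookups compare digests with memcmp():

#include <openssl/sha.h>   /* SHA256(), SHA256_DIGEST_LENGTH */
#include <stdlib.h>
#include <string.h>

struct digest_node {
    struct digest_node *next;
    unsigned char digest[SHA256_DIGEST_LENGTH];   /* 32 bytes per line */
};

/* Returns 1 if this line's digest was already in the bucket, 0 otherwise
 * (and in that case records the digest).  `bucket` would be chosen the same
 * way the question's code picks a slot in its table. */
static int seen_before(struct digest_node **bucket, const char *line, size_t len)
{
    unsigned char d[SHA256_DIGEST_LENGTH];
    struct digest_node *p;

    SHA256((const unsigned char *)line, len, d);

    for (p = *bucket; p != NULL; p = p->next)
        if (memcmp(p->digest, d, sizeof d) == 0)
            return 1;                  /* (almost certainly) a duplicate */

    p = malloc(sizeof *p);             /* first sighting: remember the digest */
    memcpy(p->digest, d, sizeof d);
    p->next = *bucket;
    *bucket = p;
    return 0;
}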

Edit: As @Zack alluded to, the situation with hash size is basically a matter of deciding what chance of a collision you're willing to accept. With a cryptographic 256-bit hash, the chances are so remote it's hardly worth considering. If you reduce that to, say, a 128-bit hash, the chances go up quite a bit, but they're still small enough for most practical purposes. On the other hand, if you were to reduce it to, say, a 32-bit CRC, the chance of a collision is probably higher than I'd be happy accepting if the data mattered much.
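To put rough numbers on that (my back-of-the-envelope figures, not from the original answer): with n distinct lines and a b-bit hash, the birthday approximation puts the collision probability at roughly n^2 / 2^(b+1). For 10 million lines, a 32-bit hash gives 10^14 / 2^33, which is far above 1, so collisions are essentially guaranteed; a 128-bit hash gives about 10^14 / 2^129 ≈ 1.5 × 10^-25; and 256 bits shrinks that by another ~38 orders of magnitude.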

I should probably mention one more possibility: a hybrid. Store something like a 32-bit CRC (which is really fast to compute) along with the offset where that line starts in the file. If your file never exceeds 4 GB, you can store both in only 8 bytes.

In this case, you work just a little differently: you start by computing the CRC, and the vast majority of the time, when it's not in the table, you copy the line to the output and insert the CRC and offset into the hash table. When the CRC is already in the table, you seek back to the possibly-identical line, read it back in, and compare it to the current line. If they match, you go back to where you were and advance to the next line. If they don't match, you copy the current line to the output and add its CRC and offset to the hash table.
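As a rough illustration of that bookkeeping (my sketch, not code from the original answer), the following reuses the question's THE_SIZE, some_hash() and standard-library includes, with some_hash() standing in for a real CRC-32. For simplicity it chains full nodes rather than packing the 8-byte records described above:

struct line_ref {
    struct line_ref *next;
    unsigned hash;     /* fast 32-bit hash of the line            */
    long offset;       /* where the line starts in the input file */
};

static struct line_ref *ref_table[THE_SIZE];

/* Returns 1 if `line` already occurred earlier in the input, 0 otherwise
 * (in which case its hash and offset are recorded). */
static int already_present(FILE *in, const char *line, unsigned hash, long offset)
{
    char earlier[100];
    struct line_ref *p, *n;
    long here = ftell(in);                   /* remember the current read position */

    for (p = ref_table[hash % THE_SIZE]; p != NULL; p = p->next) {
        if (p->hash != hash)
            continue;
        /* hash matches: seek back and re-read the earlier line to be sure */
        fseek(in, p->offset, SEEK_SET);
        if (fgets(earlier, sizeof earlier, in) && strcmp(earlier, line) == 0) {
            fseek(in, here, SEEK_SET);       /* restore the read position */
            return 1;                        /* genuine duplicate */
        }
    }
    fseek(in, here, SEEK_SET);

    n = malloc(sizeof *n);                   /* first sighting: remember hash + offset */
    n->hash = hash;
    n->offset = offset;
    n->next = ref_table[hash % THE_SIZE];
    ref_table[hash % THE_SIZE] = n;
    return 0;
}

The caller would ftell() the input just before each fgets() and pass that value as offset, writing the line to the output only when already_present() returns 0.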

Edit 2: Let's assume for the moment that the file is small enough that you can reasonably fit the whole thing in memory. In that case, you can store a line, and a line number where it occurred. If a line is already stored, you can change its line number to -1, to indicate that it was duplicated and shouldn't appear in the output.

In C++ (since it defines the relevant data structures), it could look something like this:

std::string line;

typedef std::map<std::string, int> line_record;

line_record lines;
int line_number = 1;

while (std::getline(infile, line)) {
    line_record::iterator existing = lines.find(line);
    if (existing != lines.end()) // if it was already in the map
        existing->second = -1;    // indicate that it's duplicated
    else
        lines.insert(std::make_pair(line, line_number)); // otherwise, add it to the map
    ++line_number;
}

Okay, that reads in the lines, and for each line, it checks whether it's already in the map. If it is, it sets that line's stored line number to -1, to indicate that it shouldn't appear in the output. If it wasn't, it inserts it into the map along with its line number.

line_record::iterator pos;

std::vector<line_record::iterator> sortable_lines;

for (pos=lines.begin(); pos != lines.end(); ++pos)
    if (pos->second != -1)
        sortable_lines.push_back(pos);

This sets up sortable_lines as a vector of iterators into the map, so instead of copying entire lines, we just copy iterators (essentially pointers) to those lines. It copies an iterator into the vector only for lines whose stored line number isn't -1.

struct by_line_number {
     bool operator()(line_record::iterator a, line_record::iterator b) const {
         return a->second < b->second;
     }
};

std::sort(sortable_lines.begin(), sortable_lines.end(), by_line_number());

Then we sort those iterators by the line number.

for (int i=0; i<sortable_lines.size(); i++)
     outfile << sortable_lines[i]->first << "\n";

Finally, we copy each line to the output file, in order by their original line numbers.
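(These C++ snippets assume the usual surrounding setup, which the answer doesn't show: <string>, <map>, <vector>, <algorithm> and <fstream> included, with infile and outfile being an open std::ifstream and std::ofstream respectively, and the by_line_number functor defined before the std::sort call, as above.)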



Source: https://stackoverflow.com/questions/10200354/removing-all-duplicate-lines-from-a-file-using-c
