Count the reocurrence of words in text file

前端 未结 1 1995
一整个雨季
一整个雨季 2021-01-07 08:16

Expanding on my a previous exercise, I have a text file that is filled with one word per line.

hello
hi
hello
bonjour
bonjour
hello

As I re

相关标签:
1条回答
  • 2021-01-07 08:42

    You may be making things a bit harder than necessary, and you are certainly allocating 997 more structures than necessary in the case of your input file. There is no need to allocate all 1000 structs up front. (you are free to do so, it's just a memory management issue). The key is that you only need allocate a new struct each time a unique word is encountered. (in the case of your data file, 3-times). For all other cases, you are simply updating count to add the occurrence for a word you have already stored.

    Also, if there is no compelling reason to use a struct, it is just as easy to use an array of pointers-to-char as your pointers to each word, and then a simple array of int [1000] as your count (or frequency) array. Your choice. In the case of two arrays, you only need to allocate for each unique word and never need a separate allocation for each struct.

    Putting those pieces together, you could reduce your code (not including the file -- which can be handled by simple redirection) to the following:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    enum { MAXC = 128, MAXW = 1000 };
    
    struct wordfreq{
        int count;
        char *word;
    };
    
    int main (void) {
    
        struct wordfreq *words[MAXW] = {0};
        char tmp[MAXC] = "";
        int n = 0;
    
        /* while < MAXW unique words, read each word in file */
        while (n < MAXW && fscanf (stdin, " %s", tmp) == 1) {
            int i;
    
            for (i = 0; i < n; i++)         /* check against exising words */
                if (strcmp (words[i]->word, tmp) == 0)  /* if exists, break */
                    break;
    
            if (i < n) {            /* if exists */
                words[i]->count++;  /* update frequency */
                continue;           /* get next word */
            }
    
            /* new word found, allocate struct and
             * allocate storage for word (+ space for nul-byte) 
             */
            words[n] = malloc (sizeof *words[n]);
            words[n]->word = malloc (strlen (tmp) + 1);
            if (!words[n] || !words[n]->word) { /* validate ALL allocations */
                fprintf (stderr, "error: memory exhausted, words[%d].\n", n);
                break;
            }
            words[n]->count = 0;    /* initialize count */
    
            strcpy (words[n]->word, tmp); /* copy new word to words[n] */
            words[n]->count++;            /* update frequency to 1 */
            n++;                          /* increment word count */
        }
    
        for (int i = 0; i < n; i++) { /* for each word */
            printf ("%s = %d\n", words[i]->word, words[i]->count);
            free (words[i]->word);    /* free memory when no longer needed */
            free (words[i]);
        }
    
        return 0;
    }
    

    Example Input File

    $ cat dat/wordfile.txt
    hello
    hi
    hello
    bonjour
    bonjour
    hello
    

    Example Use/Output

    $ ./bin/filewordfreq <dat/wordfile.txt
    hello = 3
    hi = 1
    bonjour = 2
    

    As with any code that dynamically allocates memory, you will want to validate your use of the memory to insure you have not written beyond the bounds or based a conditional move or jump on an uninitialized value. In Linux, valgrind is the natural choice (there are similar programs for each OS). Just run you program through it, e.g.:

    $ valgrind ./bin/filewordfreqstruct <dat/wordfile.txt
    ==2000== Memcheck, a memory error detector
    ==2000== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
    ==2000== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
    ==2000== Command: ./bin/filewordfreqstruct
    ==2000==
    hello = 3
    hi = 1
    bonjour = 2
    ==2000==
    ==2000== HEAP SUMMARY:
    ==2000==     in use at exit: 0 bytes in 0 blocks
    ==2000==   total heap usage: 6 allocs, 6 frees, 65 bytes allocated
    ==2000==
    ==2000== All heap blocks were freed -- no leaks are possible
    ==2000==
    ==2000== For counts of detected and suppressed errors, rerun with: -v
    ==2000== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
    

    Verify that you free all memory you allocate and that there are no memory errors.

    Look things over and let me know if you have any further questions.


    Using 2-Arrays Instead of a struct

    As mentioned above, sometimes using a storage array and a frequency array can simplify accomplishing the same thing. Whenever you are faced with needing the frequency of any "set", your first thought should be a frequency array. It is nothing more than an array of the same size as the number of items in your "set", (initialized to 0 at the beginning). The same approach applies, when you add (or find a duplicate of an existing) element in your storage array, you increment the corresponding element in your frequency array by 1. When you are done, your frequency array elements hold the frequency the corresponding elements in your storage array appear.

    Here is an equivalent to the program above.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    enum { MAXC = 128, MAXW = 1000 };
    
    int main (void) {
    
        char *words[MAXW] = {NULL},   /* storage array of pointers to char* */
            tmp[MAXC] = "";
        int freq[MAXW] = {0}, n = 0;  /* simple integer frequency array */
    
        /* while < MAXW unique words, read each word in file */
        while (n < MAXW && fscanf (stdin, " %s", tmp) == 1) {
            int i;
    
            for (i = 0; words[i]; i++)   /* check against exising words */
                if (strcmp (words[i], tmp) == 0)    /* if exists, break */
                    break;
    
            if (words[i]) {     /* if exists */
                freq[i]++;      /* update frequency */
                continue;       /* get next word */
            }
    
            /* new word found, allocate storage (+ space for nul-byte) */
            words[n] = malloc (strlen (tmp) + 1);
            if (!words[n]) {    /* validate ALL allocations */
                fprintf (stderr, "error: memory exhausted, words[%d].\n", n);
                break;
            }
    
            strcpy (words[n], tmp); /* copy new word to words[n] */
            freq[n]++;              /* update frequency to 1 */
            n++;                    /* increment word count */
        }
    
        for (int i = 0; i < n; i++) {                 /* for each word */
            printf ("%s = %d\n", words[i], freq[i]);  /* output word + freq */
            free (words[i]);           /* free memory when no longer needed */
        }
    
        return 0;
    }
    

    Using this approach, you eliminate 1/2 of your memory allocations by using a statically declared frequency array for your count. Either way is fine, it is largely up to you.

    0 讨论(0)
提交回复
热议问题