How do I process a text file in C by chunks of lines?

滥情空心 2021-01-21 12:01

I'm writing a program in C that processes a text file and keeps track of each unique word (by using a struct that has a char array for the word and a count for its number of oc

3 Answers
  • 2021-01-21 12:45

    First try reading one line at a time. Scan the line buffer for word boundaries and fine-tune the word counting part. Using a hash table to store the words and counts seems a good approach. Make the output optional, so you can measure read/parse/lookup performance.
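
    The line-at-a-time approach with a hash table could look roughly like the sketch below. Every name in it (add_word, word_count, count_line, count_stream), the table size NBUCKETS and the 4096-byte line buffer are hypothetical choices, not the asker's code. It assumes a word is a run of ASCII letters or digits, so foo,bar counts as two words.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096            /* hypothetical table size, not tuned */

struct entry {                   /* one unique word and its count */
    struct entry *next;          /* next entry in the same bucket */
    size_t count;
    size_t len;
    char word[];                 /* flexible array member (C99) */
};

static struct entry *buckets[NBUCKETS];

/* 32-bit FNV-1a hash over len bytes, reduced to a bucket index */
static size_t hash(const char *s, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h % NBUCKETS;
}

/* Record one occurrence of the len-byte word at s.
   Returns 0 on success, -1 if memory runs out. */
int add_word(const char *s, size_t len)
{
    assert(s != NULL && len > 0);
    size_t b = hash(s, len);
    for (struct entry *e = buckets[b]; e; e = e->next) {
        if (e->len == len && memcmp(e->word, s, len) == 0) {
            e->count++;
            return 0;
        }
    }
    struct entry *e = malloc(sizeof *e + len + 1);
    if (!e)
        return -1;
    memcpy(e->word, s, len);
    e->word[len] = '\0';
    e->len = len;
    e->count = 1;
    e->next = buckets[b];
    buckets[b] = e;
    return 0;
}

/* Count stored for a NUL-terminated word, 0 if it was never seen */
size_t word_count(const char *w)
{
    size_t len = strlen(w);
    for (struct entry *e = buckets[hash(w, len)]; e; e = e->next)
        if (e->len == len && memcmp(e->word, w, len) == 0)
            return e->count;
    return 0;
}

/* Scan one line for word boundaries, lower-casing each word in place */
int count_line(char *line)
{
    char *p = line;
    while (*p) {
        while (*p && !isalnum((unsigned char)*p))
            p++;                              /* skip separators */
        char *start = p;
        while (isalnum((unsigned char)*p)) {
            *p = (char)tolower((unsigned char)*p);
            p++;
        }
        if (p > start && add_word(start, (size_t)(p - start)) != 0)
            return -1;
    }
    return 0;
}

/* Read a stream one line at a time; a line longer than the buffer is
   handled in pieces, which can split a word at the cut. */
int count_stream(FILE *fp)
{
    char line[4096];
    while (fgets(line, sizeof line, fp))
        if (count_line(line) != 0)
            return -1;
    return 0;
}
```

    Calling count_stream(fp) on an open file fills the table, and printing can then be kept behind a flag so the read/parse/lookup time can be measured on its own.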

    Then make another program that uses the same algorithm for the core part but uses mmap to read sizeable parts of the file and scan the block of memory. The tricky part is handling the block boundaries.
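
    As a rough illustration of the mmap variant, the sketch below maps the file one window at a time and only counts words, carrying an "inside a word" flag across window boundaries so a word split between two windows is counted once. The function name count_words_mmap and its window parameter are hypothetical. It assumes a POSIX system, a window size that is a multiple of the page size, and the same "run of ASCII letters or digits" definition of a word.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <ctype.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Count the words in the file at path by mapping it window by window.
   window must be a non-zero multiple of the page size, because mmap
   requires page-aligned file offsets. Returns -1 on error. */
long count_words_mmap(const char *path, size_t window)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0 && window > 0 && window % (size_t)page == 0);

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror(path);
        return -1;
    }
    struct stat st;
    if (fstat(fd, &st) != 0) {
        perror("fstat");
        close(fd);
        return -1;
    }

    long words = 0;
    int in_word = 0;     /* state carried across window boundaries */
    for (off_t off = 0; off < st.st_size; off += (off_t)window) {
        size_t len = (size_t)(st.st_size - off);
        if (len > window)
            len = window;
        unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
        if (p == MAP_FAILED) {
            perror("mmap");
            close(fd);
            return -1;
        }
        for (size_t i = 0; i < len; i++) {
            if (isalnum(p[i])) {
                if (!in_word) {          /* first byte of a new word */
                    words++;
                    in_word = 1;
                }
            } else {
                in_word = 0;
            }
        }
        munmap(p, len);
    }
    close(fd);
    return words;
}
```

    Storing the words rather than just counting them needs more care at the boundary, since the first half of a split word lives in a window that has already been unmapped. One option is to copy that partial word into a small side buffer before unmapping.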

    Compare the output from both programs on a set of huge files to ensure the counts are identical. You can create huge files by concatenating the same file many times.

    Compare timings too. Use the time command line utility. Disable output for this benchmark to focus on the read/parse/analysis part.

    Compare the timings with other programs such as wc or cat - > /dev/null. Once you get similar performance, the bottleneck is the speed of reading from mass storage and there is not much room left for improvement.

    EDIT: looking at your code, I have these remarks:

    • fscanf is probably not the right tool: at the very least you should protect against buffer overflow. And how should you handle foo,bar: as 1 word or 2 words?

    • I would suggest using fgets() or fread() and moving a pointer along the buffer, skipping the non-word bytes and converting the word bytes to lower case with an indirection through a 256-byte array, avoiding copies.

    • Make the locking stuff optional via a preprocessor variable. It is not needed if the words structure is only accessed by a single thread.

    • How did you implement add? What is q?
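
    The pointer-walking idea with a 256-byte array could be sketched as follows. The names word_byte, init_word_table and next_word are hypothetical, and the table treats only ASCII letters and digits (in the default "C" locale) as word bytes. A table entry of 0 marks a separator, and any other value is the lower-cased byte, so one lookup both classifies and converts.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* 0 means "separator"; any other value is the lower-cased word byte */
static unsigned char word_byte[256];

/* Fill the table once at startup */
void init_word_table(void)
{
    for (int c = 0; c < 256; c++)
        word_byte[c] = isalnum(c) ? (unsigned char)tolower(c) : 0;
}

/* Find the next word in [*pos, end), lower-case it in place and return
   a pointer to its first byte, with its length in *len. Returns NULL
   when no word is left. *pos is advanced past the word, so the caller
   can loop without copying anything out of the buffer. */
char *next_word(char **pos, char *end, size_t *len)
{
    assert(pos != NULL && *pos != NULL && end != NULL && len != NULL);
    unsigned char *p = (unsigned char *)*pos;
    unsigned char *e = (unsigned char *)end;

    while (p < e && !word_byte[*p])      /* skip the non-word bytes */
        p++;
    unsigned char *start = p;
    while (p < e && word_byte[*p]) {     /* convert the word bytes */
        *p = word_byte[*p];
        p++;
    }
    *pos = (char *)p;
    *len = (size_t)(p - start);
    return *len ? (char *)start : NULL;
}
```

    The returned words are not NUL-terminated, so whatever stores them has to take a pointer and a length. Because the buffer is modified in place, it must be writable, which rules out a read-only mapping.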

  • 2021-01-21 12:48

    How about getline()? Here is an example from the man page: http://man7.org/linux/man-pages/man3/getline.3.html

       #define _GNU_SOURCE
       #include <stdio.h>
       #include <stdlib.h>
    
       int
       main(void)
       {
           FILE *stream;
           char *line = NULL;
           size_t len = 0;
           ssize_t read;
    
           stream = fopen("/etc/motd", "r");
           if (stream == NULL)
               exit(EXIT_FAILURE);
    
           while ((read = getline(&line, &len, stream)) != -1) {
               /* read is a ssize_t (signed), so the matching format is %zd */
               printf("Retrieved line of length %zd :\n", read);
               printf("%s", line);
           }
    
           free(line);
           fclose(stream);
           exit(EXIT_SUCCESS);
       }
    
  • 2021-01-21 12:50

    This is best done by reading some manuals, but I can provide a head start.

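    #include <stdio.h>
    #include <string.h>

    #define GUESS_FOR_LINE_LENGTH 80
    #define CHUNK_SIZE (20000 * GUESS_FOR_LINE_LENGTH)

    static char buffer[CHUNK_SIZE];  /* static: 1.6 MB is too big for many stacks */

    int main(void)
    {
        FILE *fp = fopen("fileToRead.txt", "rb");
        if (!fp) { perror("fileToRead.txt"); return 1; }

        size_t carried = 0;  /* bytes of a partial line kept from the last chunk */
        size_t numRead;
        /* fill the part of the buffer after the carried bytes from the file */
        while ((numRead = fread(buffer + carried, 1, CHUNK_SIZE - carried, fp)) > 0)
        {
            size_t total = carried + numRead;  /* buffer now has total characters */
            size_t end = total;                /* one past the last '\n' */
            while (end > 0 && buffer[end - 1] != '\n') { --end; }
            if (end == 0) { end = total; }     /* no newline at all: take everything */
            /* process buffer[0] up to buffer[end - 1] here */
            carried = total - end;
            memmove(buffer, buffer + end, carried);  /* copy the remainder to the front */
        }
        /* last run: process the final carried bytes here (a last line without '\n') */
        fclose(fp);
        return 0;
    }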
    

    This is really more like pseudo-code. Since you mostly have a working program, you should use this as a guideline.
