cs50 pset4 Recover - why does writing entire file to memory fail check50?

问题

I'm working on Recover and am wondering if there's something fundamentally flawed with my approach. The walkthrough suggests using fread() to go through a file in 512 byte chunks, look for JPEG headers, and write to a new file - 00X.jpg - each time one is found. What I tried instead was creating an arbitrarily large temporary buffer with malloc(), using fread()'s return value to determine the size of the file, and writing the entirety of the file to an array of structs with two data types; .B for BYTE, to store the file, and .header for bool, to indicate where each JPEG header begins.

I'm running into two problems. One is that recovered images don't pass check50, and two is that trying to write more than one byte at a time from my array results in garbage bytes. Here's what I'm doing:

typedef uint8_t BYTE;
typedef struct
{
    BYTE B;
    bool header;
}
images;

This defines the data type BYTE and my struct using both bytes and bools.

BYTE *tmp_buffer = malloc(4000000 * sizeof(BYTE));
int counter = fread(tmp_buffer, sizeof(BYTE), 4000000, file);
images buffer[counter];

This creates the arbitrarily large buffer with malloc(), uses it and the return value of fread to determine the byte size of the file, and then creates a buffer in memory to work with.

for (int copy = 0; copy < counter; copy++)
{
    buffer[copy].header = false;
    buffer[copy].B = tmp_buffer[copy];
}
free(tmp_buffer);
fclose(file);
for (int check = 0; check < counter; check++)
{
    if (buffer[check].B == 0xff && buffer[check + 1].B == 0xd8 && buffer[check + 2].B == 0xff)
    {
        buffer[check].header = true;
    }
}

This copies every byte from the 'temporary' buffer to the permanent one, sets all of the headers to false, and then closes the file/frees the memory. Afterwards, it finds the JPEG headers and sets them to true. From here is me experimenting to see what works:

int headers_counter = 1;
for (int header_location = 0; header_location < counter; header_location+= 512)
{
    if (buffer[header_location].header == true)
    {
        printf("%i. %i\n", headers_counter, header_location);
        headers_counter++;
    }
}

This prints the number and array (not byte) position of every header in the original file, and it appears to work. I say 'appears' because the following code does recover an image:

int file_number = 0;
char file_name[8];
sprintf(file_name, "%03i.jpg", file_number);
FILE *img = fopen(file_name, "w");
for (int i = 1024; i < 115200; i++)
{
    fwrite(&buffer[i].B, sizeof(BYTE), 1, img);
}

This is not intended to solve the entire problem, i.e. recover all 50 images. It's only intended to recover 000.jpg by beginning at the the first byte of 000.jpg's header and ending at the last byte before 001.jpg's header (edit: it's a hard-coded example using the header locations printed to the terminal above, also an example). It appears to do so, but it fails check50 with the error "recovered image does not match."

My girlfriend is also taking the class, and she implemented her code the way the walkthrough suggests. We opened our 000.jpg files in hex output and compared. We didn't go through every byte, but the first and last few dozen rows appeared to be identical, slack space and all.

The other thing I mentioned is garbage characters when writing more than one byte at a time. If I change my final loop to this:

for (int i = 1024; i < 115200; i+= 512)
{
    fwrite(&buffer[i].B, sizeof(BYTE), 512, img);
}

Then it works even less and 000.jpg says it's an invalid or unsupported image format. I looked at the hex output, and this is what I see when comparing the first row of my original loop and the one above that increments by 512:

ff d8 ff e0 00 10 4a 46 49 46 00 01 01 00 00 01
ff 01 d8 00 ff 00 e0 00 00 00 10 00 4a 00 46 00

There's an extra byte in every other position! I'm at a loss here. At this point, it's more about understanding these behaviors. I'm sure there's a logical explanation for both, but it's driving me crazy! I tried doing an array of bytes instead of the struct with a bool added, and it did the same thing.

回答1:

As written above in the comments, by trying to use a struct and by trying to store each jpg to be written out as once -- you are making things much harder than need be. As the directions discuss the FAT filesystem (which was on the card where the images were taken from), stores chunks of each file in 512 byte sectors. To scan the card, all you need is a 512 byte buffer to handle the read and immediate write to its output file. No structures are needed and there is no need to dynamically allocate memory.

The way to approach the read is to read each 512 block of data from the file. You then need to check if the first 4-bytes of the block hold the jpg header. A short function to test for you could be written as:

#include <stdio.h>
#include <stdlib.h>

#define FNLEN 128       /* if you need a constant, #define one (or more) */
#define BLKSZ 512

/* check if first 4-bytes in buf match jpg header */
int chkjpgheader (const unsigned char *buf)
{
    return  buf[0] == 0xff && 
            buf[1] == 0xd8 && 
            buf[2] == 0xff && 
            buf[3] >> 4 == 0xe;
}

(you simply test if each condition is true returning the result of the conditional)

Thinking how to handle scanning for jpg headers and reading the file, you can do it all in a single loop that reads 512 bytes from input, and keeping a counter of the number of jpg headers found -- which you also use as a flag to indicate a header was found. You will read the block of data, test if it is a header, if so, if not the first header, close the output file for the last jpg file written, create a new filename, open the file (validating each step) and then write the data out as you loop checking the start of each 512 byte block for the header signature. Repeat until you run out of file.

You can implement that similar to:

/* find each jpg header and write contents to separate file_000x.jpg files.
 * returns the number of jpg files successfully recovered.
 */
int recoverjpgs (FILE *ifp)
{
    char jpgname[FNLEN] = "";       /* jpg output filename */
    unsigned char buf[BLKSZ];       /* read buffer */
    int jpgcnt = 0;                 /* found jpg header count*/
    size_t nbytes;                  /* no. of bytes read/written */
    FILE *fp = NULL;                /* FILE* pointer for jpg output */
    
    /* read until jpg header found */
    while ((nbytes = fread (buf, 1, BLKSZ, ifp)) > 0) {
        /* check if jpg header found */
        if (nbytes >= 4 && chkjpgheader(buf)) {
            /* if not 1st header, close current file */
            if (jpgcnt) {
                if (fclose (fp) == EOF) {   /* validate every close-after-write */
                    perror ("recoverjpg()-fclose");
                    return jpgcnt - 1;
                }
            }
            /* create output filename (e.g. file_0001.jpg) */
            sprintf (jpgname, "file_%04d.jpg", jpgcnt + 1);
            /* open next file/validate file open for writing */
            if ((fp = fopen (jpgname, "wb")) == NULL) {
                perror ("fopen-outfile");
                return jpgcnt;
            }
            jpgcnt += 1;    /* increment recovered jpg count */
        }
        /* if header found - write block in buf to output file */
        if (jpgcnt && fwrite (buf, 1, nbytes, fp) != nbytes) {
            perror ("recoverjpg()-fwrite");
            return jpgcnt - 1;
        }
    }
    /* if file opened, close final file */
    if (jpgcnt && fclose (fp) == EOF) {     /* validate every close-after-write */
        perror ("recoverjpg()-fclose");
        return jpgcnt - 1;
    }
    
    return jpgcnt;  /* return number of jpg files recovered */
}

(note: jpgcnt is used both as a counter and a flag to control when the first fclose() on a jpg file occurs and to control when the first write to the first file occurs.)

Look at the returns. Understand why jpgcnt or jpgcnt - 1 is being returned at different places in the function. Also understand why you always check the return of fclose() after-a-write has taken place. There a number of errors that can occur when the final data is flushed to the file and the file is closed -- which would not be caught by the last checking the last write. So rule -- always validate close-after-write. There is no need for the check when closing your input file.

That's all you need. In main() you will open the input file and simply pass the open filestream to the recoverjpgs() function saving the return to know how many jpg files were successfully recovered. It can be as simple as:

int main (int argc, char **argv) {
    
    FILE *fp = NULL;            /* input file stream pointer */
    int jpgcnt = 0;             /* count of jpg files recovered */
    
    if (argc < 2 ) {    /* validate 1 argument given for filename */
        fprintf (stderr, "error: insufficient input,\n"
                         "usage: %s filename\n", argv[0]);
        return 1;
    }
    
    /* open file/validate file open for reading */
    if ((fp = fopen (argv[1], "rb")) == NULL) {
        perror ("fopen-argv[1]");
        return 1;
    }
    
    if ((jpgcnt = recoverjpgs(fp)))
        printf ("recovered %d .jpg files.\n", jpgcnt);
    else
        puts ("no jpg files recovered.");
        
    fclose (fp);
}

That is the complete program, just copy/paste the 3-pieces together and give it a try.

Example Use/Output

$ ./bin/recover ~/doc/c/cs50/recover/card.raw
recovered 50 .jpg files.

(the 50 files, file_0001.jpg to file_0050.jpg will be created in the current directory -- and you can enjoy the balloons, flowers, girls, etc... shown in the jgp files.)

Look things over and let me know if you have further questions.

Edit Per-Comment Regarding Allocating and Storing Each File to Write Once

Even if you want to buffer each file fully before writing once, the idea of using a struct with a single uint8_t (byte) and a bool to flag whether that struct is a header byte doesn't make much sense. Why? It makes a mess out of the write routine. Which would have to check every struct in an allocated block large enough to hold the entire card.raw file when writing to catch the 4-struct sequence where each struct has its bool flag set true -- essentially duplicating all testing that was done during the read to find the header bytes and set your bool struct member true to begin with.

As mentioned, if there were zillions of files, you would want to scan through the input stream from card.raw and save the bytes for each jpg in your buffer so that they could be written once to the file while the process continues (you could even fork the write to a separate process so the read could continue without waiting for the write if you really wanted to tweak things.

Regardless, the approach will be the same. If you dynamically allocate for buf, you can fill it with each jpg file and when the next header is found -- write the current contents of buf up to the beginning of the next header to your file, (the move the next header read to the start of buf) and repeat until you run out of input to check.

You will reuse the allocated storage for buf throughout the process and only expanding if the current file requires more storage than currently allocated. (so buf ends up sized to hold the largest jpg found at the end of the day). This minimizes allocations and means the only reallocs required over all 50 files are the reallocs needed when a larger file is encountered. If the next 20 files all fit within the currently allocated buffer -- no adjustment is needed and you keep filling buf over and over again with the different jpg file contents as they are recovered from the "forensic image" (sounds important)

There are only the addition of a bufsz variable to track the current allocation size of buf and a total variable to track the total bytes read in each jpg file. Other than that you are just rearranging where the files are written so that you wait until one complete jpg has been read into buf before opening and writing those bytes to the file and then closing the file immediately after the file is written (a short function was written to handle that -- since it made sense to write a generic-reusable function to write a given number of bytes from a buffer to a file of a given name.

The complete file could be written as follows.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define FNLEN 128       /* if you need a constant, #define one (or more) */
#define BLKSZ 512
#define JPGSZ 1<<15     /* 32K initial allocation size */

/* write 'nbytes' from 'buf' to 'fname'. returns number of bytes
 * written on success, zero otherwise.
 */ 
size_t writebuf2file (const char *fname, void *buf, size_t nbytes)
{
    FILE *fp = NULL;    /* FILE* pointer for jpg output */
    
    /* open file/validate file open for writing */
    if ((fp = fopen (fname, "wb")) == NULL) {
        perror ("writebuf2file-fopen");
        return 0;
    }
    /* write buffer to file/validate bytes written */
    if (fwrite (buf, 1, nbytes, fp) != nbytes) {
        perror ("writebuf2file()-fwrite");
        return 0;
    }
    /* close file/validate every close-after-write */
    if (fclose (fp) == EOF) {
        perror ("writebuf2file-fclose");
        return 0;
    }
    
    return nbytes;
}

/* check if first 4-bytes in buf match jpg header */
int chkjpgheader (const unsigned char *buf)
{
    return  buf[0] == 0xff && 
            buf[1] == 0xd8 && 
            buf[2] == 0xff && 
            buf[3] >> 4 == 0xe;
}

/* find each jpg header and write contents to separate file_000x.jpg files.
 * returns the number of jpg files successfully recovered.
 */
int recoverjpgs (FILE *ifp)
{
    char jpgname[FNLEN] = "";                   /* jpg output filename */
    int jpgcnt = 0;                             /* found jpg header count*/
    size_t  nbytes,                             /* no. of bytes read/written */
            bufsz = JPGSZ,                      /* tracks current allocation of buf */
            total = 0;                          /* tracks total bytes in jpg file */
    uint8_t *buf = malloc (JPGSZ);              /* read buffer */
    
    if (!buf) { /* validate every allocation/reallocation */
        perror ("malloc-buf");
        return 0;
    }
    
    /* read until jpg header found */
    while ((nbytes = fread (buf + total, 1, BLKSZ, ifp)) > 0) {
        /* check if jpg header found */
        if (nbytes >= 4 && chkjpgheader(buf + total)) {
            /* if not 1st header, write buffer to file, reset for next file */
            if (jpgcnt) {
                /* create output filename (e.g. file_0001.jpg) */
                sprintf (jpgname, "file_%04d.jpg", jpgcnt);
                /* write current buf to file */
                if (!writebuf2file (jpgname, buf, total))
                    return jpgcnt - 1;
                /* move header block to start of buf */
                memmove (buf, buf + total, BLKSZ);
                total = 0;                  /* reset total for next file */
            }
            jpgcnt += 1;    /* increment recovered jpg count */
        }
        /* if header found - began accumulating blocks in buf */
        if (jpgcnt)
            total += nbytes;
        /* check if reallocation required before next read */
        if (total + BLKSZ > bufsz) {
            /* add a fixed 32K each time reallocaiton required
             * always realloc to a temporary pointer to prevent memory leak
             * on realloc failure.
             */
            void *tmp = realloc (buf, bufsz + (1 << 15));
            if (!tmp) {                     /* validate every reallocations */
                perror ("realloc-buf");
                return jpgcnt - 1;
            }
            buf = tmp;              /* assign reallocated block to buf */
            bufsz += 1 << 15;       /* update bufsz with new allocation size */
        }
    }
    /* write final buffer to file */
    if (jpgcnt) {
        /* create output filename (e.g. file_0001.jpg) */
        sprintf (jpgname, "file_%04d.jpg", jpgcnt);
        /* write current buf to file */
        if (!writebuf2file (jpgname, buf, total))
            return jpgcnt - 1;
    }
    
    free (buf);     /* free allocated memory */
    
    return jpgcnt;  /* return number of jpg files recovered */
}

int main (int argc, char **argv) {
    
    FILE *fp = NULL;            /* input file stream pointer */
    int jpgcnt = 0;             /* count of jpg files recovered */
    
    if (argc < 2 ) {    /* validate 1 argument given for filename */
        fprintf (stderr, "error: insufficient input,\n"
                         "usage: %s filename\n", argv[0]);
        return 1;
    }
    
    /* open file/validate file open for reading */
    if ((fp = fopen (argv[1], "rb")) == NULL) {
        perror ("fopen-argv[1]");
        return 1;
    }
    
    if ((jpgcnt = recoverjpgs(fp)))
        printf ("recovered %d .jpg files.\n", jpgcnt);
    else
        puts ("no jpg files recovered.");
        
    fclose (fp);
}

In any code you write that dynamically allocates memory, you have 2 responsibilities regarding any block of memory allocated: (1) always preserve a pointer to the starting address for the block of memory so, (2) it can be freed when it is no longer needed.

It is imperative that you use a memory error checking program to ensure you do not attempt to access memory or write beyond/outside the bounds of your allocated block, attempt to read or base a conditional jump on an uninitialized value, and finally, to confirm that you free all the memory you have allocated.

For Linux valgrind is the normal choice. There are similar memory checkers for every platform. They are all simple to use, just run your program through it.

Always confirm that you have freed all memory you have allocated and that there are no memory errors.

Take your time and go though the code. Let me know if you have further questions.

来源：https://stackoverflow.com/questions/62605698/cs50-pset4-recover-why-does-writing-entire-file-to-memory-fail-check50

标签

cs50