I\'m programming something that allows users to store documents and pictures on a webserver, to be stored and retrieved later. When users upload files to my server, PHP tells m
Many filetypes have "magic numbers" at the beginning of the file to identify them, You can read some bytes from the front of the file and compare them to a list of known magic numbers.
For an exact answer on how you could quickly do this in PHP, check out this question: How do I find the mime-type of a file with php?
Sort of. Most file types have some bytes reserved for marking them so that you don't have to rely on the extension. The site http://wotsit.org is a great resource for finding this out for a particular type.
If you are on a unix system, I believe that the file command doesn't rely on the extension, so you could shell out to it if you don't want to write the byte checking code.
For PNG (http://www.w3.org/TR/PNG-Rationale.html)
The first eight bytes of a PNG file always contain the following values:
(decimal) 137 80 78 71 13 10 26 10
(hexadecimal) 89 50 4e 47 0d 0a 1a 0a
(ASCII C notation) \211 P N G \r \n \032 \n
If you are only dealing with images, then getimagesize() should distinguish a valid image from a fake one.
$ php -r 'var_dump(getimagesize("b&n.jpg"));'
array(7) {
[0]=>
int(200)
[1]=>
int(200)
[2]=>
int(2)
[3]=>
string(24) "width="200" height="200""
["bits"]=>
int(8)
["channels"]=>
int(3)
["mime"]=>
string(10) "image/jpeg"
}
$ php -r 'var_dump(getimagesize("/etc/passwd"));'
bool(false)
A false value from getimagesize is not an image.
As a side note I ran into a similar problem where I had to do my own type checking. The front end interface to my application was done in flash. The files were being passed through flash to a php script. When I was attempting to do a MIME type check using php the type always returned was application/octetstream because it was coming from flash.
I had to implement a magic numbers type paradigm. I simply created an xml file that held the file type along with some defining patterns found within the beginning of the file. Once the file reached the server I did some pattern matching with the xml file and then accepted or rejected the file. I didn't noticed any real performance decrease either which I was expecting.
This is just a side note to anyone who may be using flash as there front end and trying to type check the file once it is uploaded.
As well as identifying the filetype, you might want to watch out for files with other files embedded or appended to them. This will unfortunately require a more indepth analysis of the file contents than just using "magic numbers".
For example, http://quantumrook.wordpress.com/2007/06/06/hide-a-rar-file-in-a-jpg-file/ (this particular type of data hiding can be easily worked around by loading and resaving into a new file the actual image data .. others will be more difficult.)