I am working on an ASP web page that handles file uploads. Only certain types of files are allowed to be uploaded, like .XLS, .XML, .CSV, .TXT, .PDF, .PPT, etc.
I
I know you said C#, but this could maybe be ported. Also, it has an XML file already containing many descriptors for common file types.
It's a Java library called JMimeMagic. It's here: http://jmimemagic.sourceforge.net/
One way would be to check for certain signatures or magic numbers in the files. This page has a handy list of known file signatures and seems quite up to date:
http://www.garykessler.net/library/file_sigs.html
In other words if a trojan.exe was renamed to harmless.pdf and uploaded, the application must be able to find out that the uploaded file is NOT a .PDF file.
That's not really a problem. If a .exe was uploaded as a .pdf and you correctly served it back up to the downloader as application/pdf, all the downloader would get would be a broken PDF. They would have to manually retype it to .exe to get harmed.
The real problems are:
Some browsers may sniff the content of the file and decide they know better than you about what type of file it is. IE is particularly bad at this, tending to prefer to render the file as HTML if it sees any HTML tags lurking near the start of the file. This is particulary unhelpful as it means script can be injected onto your site, potentially compromising any application-level security (cookie stealing et al). Workarounds include always serving the file as an attachment using Content-Disposition, and/or serving files from a different hostname, so it can't cross-site-script back onto your main site.
PDF files are not safe anyway! They can be full of scripting, and have had significant security holes. Exploitation of a hole in the PDF reader browser plugin is currently one of the most common means of installing trojans on the web. And there's almost nothing you can usually do to try to detect the exploits as they can be highly obfuscated.
Get the file headers of the "safe" file types - executables always have their own types of headers, and you can probably detect those. You'd have to be familiar with every format that you intend to accept, however.
On **NIX* systems we have an utility called file(1). Try to find something similar for Windows, but the file utility if self has been ported.
The following C++ code could help you:
//-1 : File Does not Exist or no access
//0 : not an office document
//1 : (General) MS office 2007
//2 : (General) MS office older than 2007
//3 : MS office 2003 PowerPoint presentation
//4 : MS office 2003 Excel spreadsheet
//5 : MS office applications or others
int IsOffice2007OrOlder(wchar_t * fileName)
{
int iRet = 0;
byte msgFormatChk2007[8] = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00}; //offset 0 for office 2007 documents
byte possibleMSOldOffice[8] = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}; //offset 0 for possible office 2003 documents
byte msgFormatChkXLSPPT[4] = {0xFD, 0xFF, 0xFF, 0xFF}; // offset 512: xls, ppt: FD FF FF FF
byte msgFormatChkOnlyPPT[4] = {0x00, 0x6E, 0x1E, 0xF0}; // offset 512: another ppt offset PPT
byte msgFormatChkOnlyDOC[4] = {0xEC, 0xA5, 0xC1, 0x00}; //offset 512: EC A5 C1 00
byte msgFormatChkOnlyXLS[8] = {0x09, 0x08, 0x10, 0x00, 0x00, 0x06, 0x05, 0x00}; //offset 512: XLS
int iMsgChk = 0;
HANDLE fileHandle = CreateFile(fileName, GENERIC_READ,
FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_READONLY, NULL );
if(INVALID_HANDLE_VALUE == fileHandle)
{
return -1;
}
byte buff[20];
DWORD bytesRead;
iMsgChk = 1;
if(0 == ReadFile(fileHandle, buff, 8, &bytesRead, NULL ))
{
return -1;
}
if(buff[0] == msgFormatChk2007[0])
{
while(buff[iMsgChk] == msgFormatChk2007[iMsgChk] && iMsgChk < 9)
iMsgChk++;
if(iMsgChk >= 8) {
iRet = 1; //office 2007 file format
}
}
else if(buff[0] == possibleMSOldOffice[0])
{
while(buff[iMsgChk] == possibleMSOldOffice[iMsgChk] && iMsgChk < 9)
iMsgChk++;
if(iMsgChk >= 8)
{
//old office file format, check 512 offset further in order to filter out real office format
iMsgChk = 1;
SetFilePointer(fileHandle, 512, NULL, FILE_BEGIN);
if(ReadFile(fileHandle, buff, 8, &bytesRead, NULL ) == 0) { return 0; }
if(buff[0] == msgFormatChkXLSPPT[0])
{
while(buff[iMsgChk] == msgFormatChkXLSPPT[iMsgChk] && iMsgChk < 5)
iMsgChk++;
if(iMsgChk == 4)
iRet = 2;
}
else if(buff[iMsgChk] == msgFormatChkOnlyDOC[iMsgChk])
{
while(buff[iMsgChk] == msgFormatChkOnlyDOC[iMsgChk] && iMsgChk < 5)
iMsgChk++;
if(iMsgChk == 4)
iRet = 2;
}
else if(buff[0] == msgFormatChkOnlyPPT[0])
{
while(buff[iMsgChk] == msgFormatChkOnlyPPT[iMsgChk] && iMsgChk < 5)
iMsgChk++;
if(iMsgChk == 4)
iRet = 3;
}
else if(buff[0] == msgFormatChkOnlyXLS[0])
{
while(buff[iMsgChk] == msgFormatChkOnlyXLS[iMsgChk] && iMsgChk < 9)
iMsgChk++;
if(iMsgChk == 9)
iRet = 4;
}
if(0 == iRet){
iRet = 5;
}
}
}
CloseHandle(fileHandle);
return iRet;
}