Compute entropy of different file extensions to find randomness of data?

问题

I have different file types which includes JPEG and jpg ,#mp3 ,#GIF,#MP4,#FLV, M4V ,exe, zip etc.

Take data in blocks , something like 4k block size, find randomness
Generate randomness score between 0 and 1
Try to put classes according to the randomness score.

How can we find entropy of different types of files as mentioned above and can we scale each of the file score between 0 and 1.

回答1:

+1 very interesting question here few untested ideas just from top of my head right now:

how about using correlation coefficient

with uniformly random data of the same size (4K) ? and or using FFT first and then correlate ...
Or compute statistical properties and infer from there somehow ...

I know this is vague description but might be worth digging into it...
use comprimation

for example compress your 4K data with Huffman coding and infere your coefficient form the ratio between uncompressed and compressed sizes and may be use logarithmic scale ...

I think the most easy to implement and plausible results will be from 3th approach as Huffman coding and entropy are closely related.

[edit1] using Shannon information entropy

Your suggestion is even better than Hufman encoding (even those 2 are closely related). Using Shannon information entropy H with binary digits as base will return the average amount of bits per word of your data (after Hufman encoding) needed to represent your data. So from there going to score <0..1> just divide by bits per word...

Here small C++/VCL example for entropy computation on BYTEs:

//$$---- Form CPP ----
//---------------------------------------------------------------------------
#include <vcl.h>
#include <math.h>
#pragma hdrstop
#include "Unit1.h"
//---------------------------------------------------------------------------
#pragma package(smart_init)
#pragma resource "*.dfm"
TForm1 *Form1;
//---------------------------------------------------------------------------
char txt0[]="text Text bla bla bla ..."; 
char txt1[]="abcdefghijklmnopqrstuvwxy";
char txt2[]="AAAAAAAbbbbbccccddddeeeff";
//---------------------------------------------------------------------------
double entropy(BYTE *dat,int siz)
    {
    if ((siz<=0)||(dat==NULL)) return 0.0;
    int i; double H=0.0,P[256],dp=1.0/siz,base=1.0/log(2.0);
    for (i=0;i<256;i++) P[i]=0.0;
    for (i=0;i<siz;i++) P[dat[i]]+=dp;
    for (i=0;i<256;i++)
        {
        if (P[i]==0.0) continue;    // skip non existing items
        if (P[i]==1.0) return 0.0;  // this means only 1 item type is present in data
        H-=P[i]*log(P[i])*base;     // shanon entropy (binary bits base)
        }
    return H;
    }
//---------------------------------------------------------------------------
__fastcall TForm1::TForm1(TComponent* Owner):TForm(Owner)
    {
    mm_log->Lines->Clear();
    mm_log->Lines->Add(AnsiString().sprintf("txt = \"%s\" , H = %.6lf , H/8 = %.6lf",txt0,entropy(txt0,sizeof(txt0)),entropy(txt0,sizeof(txt0))/8.0));
    mm_log->Lines->Add(AnsiString().sprintf("txt = \"%s\" , H = %.6lf , H/8 = %.6lf",txt1,entropy(txt1,sizeof(txt1)),entropy(txt1,sizeof(txt1))/8.0));
    mm_log->Lines->Add(AnsiString().sprintf("txt = \"%s\" , H = %.6lf , H/8 = %.6lf",txt2,entropy(txt2,sizeof(txt2)),entropy(txt2,sizeof(txt2))/8.0));
    }
//-------------------------------------------------------------------------

And results:

txt = "text Text bla bla bla ..." , H = 3.185667 , H/8 = 0.398208
txt = "abcdefghijklmnopqrstuvwxy" , H = 4.700440 , H/8 = 0.587555
txt = "AAAAAAAbbbbbccccddddeeeff" , H = 2.622901 , H/8 = 0.327863

来源：https://stackoverflow.com/questions/64781278/compute-entropy-of-different-file-extensions-to-find-randomness-of-data

标签

python

algorithm

file-io

entropy