Compute entropy of different file extensions to find randomness of data?

Submitted by 北城余情 on 2021-01-28 07:12:41

Question


I have different file types, including JPEG/JPG, MP3, GIF, MP4, FLV, M4V, EXE, ZIP, etc. I want to:

  1. Take the data in blocks (e.g. a 4K block size) and measure its randomness.

  2. Generate a randomness score between 0 and 1.

  3. Assign classes according to the randomness score.

How can we find the entropy of the different types of files mentioned above, and can we scale each file's score between 0 and 1?


Answer 1:


+1 Very interesting question. Here are a few untested ideas, just off the top of my head:

  1. How about using a correlation coefficient

    with uniformly random data of the same size (4K)? And/or apply an FFT first and then correlate ...

  2. Or compute statistical properties and infer from there somehow ...

    I know this is a vague description, but it might be worth digging into ...

  3. Use compression

    For example, compress your 4K data with Huffman coding and infer your coefficient from the ratio between the uncompressed and compressed sizes, perhaps on a logarithmic scale ...

I think the easiest to implement, with plausible results, will be the 3rd approach, as Huffman coding and entropy are closely related.

[edit1] using Shannon information entropy

Your suggestion is even better than Huffman encoding (even though the two are closely related). The Shannon information entropy H with base-2 logarithms returns the average number of bits per word needed to represent your data (roughly what Huffman encoding achieves). So to get from there to a score in <0..1>, just divide by the number of bits per word...

Here is a small C++/VCL example of entropy computation on BYTEs:

//$$---- Form CPP ----
//---------------------------------------------------------------------------
#include <vcl.h>
#include <math.h>
#pragma hdrstop
#include "Unit1.h"
//---------------------------------------------------------------------------
#pragma package(smart_init)
#pragma resource "*.dfm"
TForm1 *Form1;
//---------------------------------------------------------------------------
char txt0[]="text Text bla bla bla ..."; 
char txt1[]="abcdefghijklmnopqrstuvwxy";
char txt2[]="AAAAAAAbbbbbccccddddeeeff";
//---------------------------------------------------------------------------
double entropy(BYTE *dat,int siz)
    {
    if ((siz<=0)||(dat==NULL)) return 0.0;
    int i; double H=0.0,P[256],dp=1.0/siz,base=1.0/log(2.0);
    for (i=0;i<256;i++) P[i]=0.0;       // clear histogram
    for (i=0;i<siz;i++) P[dat[i]]+=dp;  // probability of each byte value
    for (i=0;i<256;i++)
        {
        if (P[i]==0.0) continue;    // skip byte values not present in the data
        if (P[i]==1.0) return 0.0;  // only one byte value present -> zero entropy
        H-=P[i]*log(P[i])*base;     // Shannon entropy (base-2 logarithm -> bits)
        }
    return H;
    }
//---------------------------------------------------------------------------
__fastcall TForm1::TForm1(TComponent* Owner):TForm(Owner)
    {
    mm_log->Lines->Clear();
    // note: sizeof(txtN) includes the terminating zero byte, so '\0' counts as a symbol
    mm_log->Lines->Add(AnsiString().sprintf("txt = \"%s\" , H = %.6lf , H/8 = %.6lf",txt0,entropy((BYTE*)txt0,sizeof(txt0)),entropy((BYTE*)txt0,sizeof(txt0))/8.0));
    mm_log->Lines->Add(AnsiString().sprintf("txt = \"%s\" , H = %.6lf , H/8 = %.6lf",txt1,entropy((BYTE*)txt1,sizeof(txt1)),entropy((BYTE*)txt1,sizeof(txt1))/8.0));
    mm_log->Lines->Add(AnsiString().sprintf("txt = \"%s\" , H = %.6lf , H/8 = %.6lf",txt2,entropy((BYTE*)txt2,sizeof(txt2)),entropy((BYTE*)txt2,sizeof(txt2))/8.0));
    }
//-------------------------------------------------------------------------

And results:

txt = "text Text bla bla bla ..." , H = 3.185667 , H/8 = 0.398208
txt = "abcdefghijklmnopqrstuvwxy" , H = 4.700440 , H/8 = 0.587555
txt = "AAAAAAAbbbbbccccddddeeeff" , H = 2.622901 , H/8 = 0.327863
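The middle result can be checked by hand: since sizeof(txt1) counts the terminating zero byte, the buffer holds 26 equally likely byte values (25 distinct letters plus '\0'):

```latex
H = -\sum_{i=1}^{26} \frac{1}{26}\log_2\frac{1}{26} = \log_2 26 \approx 4.700440,
\qquad \frac{H}{8} \approx 0.587555
```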


Source: https://stackoverflow.com/questions/64781278/compute-entropy-of-different-file-extensions-to-find-randomness-of-data
