information-theory

Is there an algorithm for “perfect” compression?

佐手、 submitted on 2019-12-20 23:32:14
Question: Let me clarify, I'm not talking about perfect compression in the sense of an algorithm that is able to compress any given source material; I realize that is impossible. What I'm trying to get at is an algorithm that is able to encode any source string of bits to its absolute maximum compressed state, as determined by its Shannon entropy. I believe I have heard some things about Huffman coding being in some sense optimal, so I believe that this encoding scheme might be based off that, but…
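
As background for the bound the question invokes, here is a minimal sketch (plain Python, standard library only, not from the thread) of the order-0 Shannon entropy of a string of symbols. Huffman coding gets within one bit per symbol of this bound, and arithmetic coding approaches it arbitrarily closely on long inputs:

    import math
    from collections import Counter

    def entropy_bits_per_symbol(data):
        # Order-0 Shannon entropy: -sum p(s) * log2 p(s) over symbol frequencies.
        counts = Counter(data)
        n = len(data)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    msg = b"abracadabra"
    h = entropy_bits_per_symbol(msg)
    print(f"{h:.3f} bits/symbol -> about {h * len(msg):.1f} bits under this source model")

Note that the entropy bound applies to a model of the source, not to one fixed string: the true "absolute maximum compressed state" of a single string is its Kolmogorov complexity, which is uncomputable, which is why no algorithm can achieve it in general.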

Entropy and Information Gain

感情迁移 submitted on 2019-12-14 02:17:35
Question: Simple question, I hope. If I have a set of data like this:

    Classification  attribute-1  attribute-2
    Correct         dog          dog
    Correct         dog          dog
    Wrong           dog          cat
    Correct         cat          cat
    Wrong           cat          dog
    Wrong           cat          dog

then what is the information gain of attribute-2 relative to attribute-1? I've computed the entropy of the whole data set: -(3/6)log2(3/6) - (3/6)log2(3/6) = 1. Then I'm stuck! I think you need to calculate entropies of attribute-1 and attribute-2 too? Then use these three calculations in an information gain…
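
For the standard ID3 definition of information gain with respect to the Classification label, a sketch over this exact toy table looks like the following (this assumes "gain of an attribute" means the usual decision-tree quantity, which may or may not be what the exercise intends by "relative to attribute-1"):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, attribute):
        # ID3-style gain: H(labels) minus the entropy remaining after
        # splitting the rows on the attribute's values, weighted by subset size.
        n = len(labels)
        remainder = 0.0
        for value in set(attribute):
            subset = [l for l, a in zip(labels, attribute) if a == value]
            remainder += len(subset) / n * entropy(subset)
        return entropy(labels) - remainder

    classification = ["Correct", "Correct", "Wrong", "Correct", "Wrong", "Wrong"]
    attr1 = ["dog", "dog", "dog", "cat", "cat", "cat"]
    attr2 = ["dog", "dog", "cat", "cat", "dog", "dog"]
    print(information_gain(classification, attr1))  # ~0.082 bits
    print(information_gain(classification, attr2))  # 0.0 bits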

How to compute the Shannon entropy and mutual information of N variables

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-13 18:28:39
Question: I need to compute the mutual information, and so the Shannon entropy, of N variables. I wrote a code that computes the Shannon entropy of a certain distribution. Let's say that I have a variable x, an array of numbers. Following the definition of Shannon entropy I need to compute the normalized probability density function, so using numpy.histogram it is easy to get it.

    import scipy.integrate as scint
    from numpy import *
    from scipy import *

    def shannon_entropy(a, bins):
        p, binedg = histogram(a, bins, normed…
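
A complete histogram-based sketch, not the asker's finished code: the snippet's normed= keyword is deprecated in modern numpy in favor of density=True, and the plug-in estimates below depend on the bin count. For more than two variables the same identity extends with np.histogramdd:

    import numpy as np

    def shannon_entropy(x, bins):
        # Plug-in estimate: histogram -> bin probabilities -> -sum p log2 p.
        density, edges = np.histogram(x, bins=bins, density=True)
        p = density * np.diff(edges)      # convert density to probability mass per bin
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def mutual_information(x, y, bins):
        # I(X;Y) = H(X) + H(Y) - H(X,Y), all estimated from (joint) histograms.
        pxy, _, _ = np.histogram2d(x, y, bins=bins)
        pxy /= pxy.sum()
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)

        def h(p):
            p = p[p > 0]
            return -np.sum(p * np.log2(p))

        return h(px) + h(py) - h(pxy)

    x = np.random.randn(10_000)
    y = x + 0.1 * np.random.randn(10_000)
    print(shannon_entropy(x, 50), mutual_information(x, y, 50))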

Measuring how a new sample contributes to the diversity of a dataset

不问归期 submitted on 2019-12-13 02:48:39
Question: I am working with a grayscale image dataset. Is there a way to determine whether a new grayscale image can contribute to the diversity of a grayscale image dataset? I would like to prevent the dataset from having too many similar samples.

Answer 1: Well, what do you see when you look at it? If you have information about the images in this dataset, you yourself can probably assess whether this new sample is a repetition of some pattern that is already included in the dataset, or if it is something unique.
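
One crude information-theoretic score, offered as a hypothetical first filter rather than anything from the thread: compare the candidate's intensity distribution against each existing image with the Jensen-Shannon distance. Histograms ignore spatial structure entirely, so two very different images can share a histogram; treat this only as a cheap pre-screen:

    import numpy as np
    from scipy.spatial.distance import jensenshannon

    def gray_histogram(img, bins=64):
        # Normalized intensity histogram of a grayscale image (2-D uint8 array).
        h, _ = np.histogram(img, bins=bins, range=(0, 255))
        return h / h.sum()

    def novelty(candidate, dataset, bins=64):
        # Smallest Jensen-Shannon distance between the candidate's intensity
        # distribution and any image already in the dataset; a value near 0
        # means the candidate closely duplicates an existing sample.
        cand = gray_histogram(candidate, bins)
        return min(jensenshannon(cand, gray_histogram(img, bins)) for img in dataset)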

Calculating Mutual Information For Selecting a Training Set in Java

别等时光非礼了梦想. submitted on 2019-12-12 14:34:14
Question: Scenario: I am attempting to implement supervised learning over a data set within a Java GUI application. The user will be given a list of items or 'reports' to inspect and will label them based on a set of available labels. Once the supervised learning is complete, the labelled instances will then be given to a learning algorithm. This will attempt to order the rest of the items by how likely it is that the user will want to view them. To get the most from the user's time I want to pre-select the…
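
The question targets Java, but the quantity itself is small enough to sketch language-neutrally. Here is a minimal plug-in estimate of the mutual information between two paired sequences of discrete values (in Python, matching the other snippets on this page; porting to Java is mechanical):

    import math
    from collections import Counter

    def mutual_information(xs, ys):
        # Plug-in estimate of I(X;Y) in bits from paired discrete observations:
        # sum over observed pairs of p(x,y) * log2( p(x,y) / (p(x) p(y)) ).
        n = len(xs)
        joint = Counter(zip(xs, ys))
        px, py = Counter(xs), Counter(ys)
        return sum(
            (c / n) * math.log2(c * n / (px[x] * py[y]))
            for (x, y), c in joint.items()
        )

    labels = [1, 1, 0, 0, 1, 0]
    feature = ["a", "a", "b", "b", "a", "b"]
    print(mutual_information(labels, feature))  # 1.0 bit: feature determines label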

Computing information content in Python

自作多情 submitted on 2019-12-12 03:32:45
Question: I need to compute information content from two Python lists. I understand that I can use the following formula, where the probabilities are computed from the histograms of the lists:

    Information content = [ sum_ij p(x_i, y_j) log_2( p(x_i, y_j) / (p(x_i) p(y_j)) ) ] / [ -sum_i p(y_i) log_2 p(y_i) ]

Is there any built-in Python API to compute information content? Thanks.

Answer 1: Check out the information_content function in the biopython library: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc303
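
For what it's worth, the numerator above is the mutual information I(X;Y) and the denominator is the entropy H(Y), so the whole expression is a normalized mutual information. There is nothing in the standard library, but scikit-learn ships close estimators; note the caveats in the comments, since its normalization convention is not exactly the question's:

    import math
    from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

    x = [0, 0, 1, 1, 2, 2]
    y = [0, 0, 1, 1, 1, 2]

    # mutual_info_score returns nats; divide by ln 2 to get bits.
    print(mutual_info_score(x, y) / math.log(2))

    # normalized_mutual_info_score divides MI by a mean of H(x) and H(y),
    # not by H(y) alone as in the question's formula.
    print(normalized_mutual_info_score(x, y))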

What's the most that GZIP or DEFLATE can increase a file size?

拥有回忆 submitted on 2019-12-05 04:31:00
It's well known that GZIP or DEFLATE (or any compression mechanism) can sometimes increase file size. Is there a maximum (either a percentage or a constant) by which a file can grow? What is it? If a file is X bytes and I'm going to gzip it, and I need to budget for file space in advance, what's the worst-case scenario?

UPDATE: There are two overheads: GZIP adds a header, typically 18 bytes but essentially arbitrarily long. What about DEFLATE? That can expand content by a multiplicative factor, which I don't know. Does anyone know what it is?

Answer 1: gzip will add a header and trailer of at least 18…
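
The worst case is easy to measure empirically with the standard library. A sketch feeding gzip incompressible input; the per-block and header figures in the comments are the commonly cited ones, not guarantees of any particular implementation:

    import gzip, os

    # Incompressible input: random bytes. DEFLATE then falls back to "stored"
    # blocks, commonly cited as costing about 5 bytes per 65,535-byte block,
    # on top of gzip's 10-byte header and 8-byte trailer (with no filename set).
    for size in (0, 100, 65_535, 1_000_000):
        blob = os.urandom(size)
        out = gzip.compress(blob)
        print(f"{size:>9} in -> {len(out):>9} out  (+{len(out) - size} bytes)")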

Algorithm for rating the monotonicity of an array (i.e. judging the “sortedness” of an array)

我们两清 submitted on 2019-12-04 18:56:41
Question: EDIT: Wow, many great responses. Yes, I am using this as a fitness function for judging the quality of a sort performed by a genetic algorithm, so cost of evaluation is important (i.e., it has to be fast, preferably O(n)). As part of an AI application I am toying with, I'd like to be able to rate a candidate array of integers based on its monotonicity, a.k.a. its "sortedness". At the moment, I'm using a heuristic that calculates the longest sorted run and then divides that by the length of…
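
The heuristic the asker describes fits in a few lines and meets the O(n) budget; a sketch assuming "sorted" means non-decreasing and the run is contiguous:

    def sortedness(arr):
        # Longest non-decreasing contiguous run, divided by the array length.
        # One pass, O(n); returns 1.0 for a fully sorted array.
        if not arr:
            return 1.0
        best = run = 1
        for prev, cur in zip(arr, arr[1:]):
            run = run + 1 if cur >= prev else 1
            best = max(best, run)
        return best / len(arr)

    print(sortedness([1, 2, 3, 2, 3, 4, 5]))  # 4/7 ~ 0.571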

What is the computer science definition of entropy?

不打扰是莪最后的温柔 submitted on 2019-12-02 15:42:35
Question: I've recently started a course on data compression at my university. However, I find the use of the term "entropy" as it applies to computer science rather ambiguous. As far as I can tell, it roughly translates to the "randomness" of a system or structure. What is the proper definition of computer science "entropy"?

Answer 1: Entropy can mean different things:

Computing: In computing, entropy is the randomness collected by an operating system or application for use in cryptography or other uses that require random data. This randomness is often collected from hardware sources, either pre-existing ones…
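
For the information-theoretic sense most relevant to a data-compression course, the standard Shannon definition for a discrete random variable X with distribution p is:

    H(X) = -\sum_{x} p(x) \log_2 p(x)

Measured in bits: a fair coin flip has H = 1 bit, a source that always emits the same symbol has H = 0, and H lower-bounds the average number of bits per symbol that any lossless code can achieve.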