information-theory

Is there an algorithm for “perfect” compression?

佐手、 submitted on 2019-12-20 23:32:14
Question: Let me clarify, I'm not talking about perfect compression in the sense of an algorithm that is able to compress any given source material; I realize that is impossible. What I'm trying to get at is an algorithm that is able to encode any source string of bits to its absolute maximum compressed state, as determined by its Shannon entropy. I believe I have heard some things about Huffman coding being in some sense optimal, so I believe that this encoding scheme might be based off that, but…
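
As background for the bound the question invokes, here is a minimal sketch (plain Python, standard library only, not from the thread) of the order-0 Shannon entropy of a string of symbols. Huffman coding gets within one bit per symbol of this bound, and arithmetic coding approaches it arbitrarily closely on long inputs:

    import math
    from collections import Counter

    def entropy_bits_per_symbol(data):
        # Order-0 Shannon entropy: -sum p(s) * log2 p(s) over symbol frequencies.
        counts = Counter(data)
        n = len(data)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    msg = b"abracadabra"
    h = entropy_bits_per_symbol(msg)
    print(f"{h:.3f} bits/symbol -> about {h * len(msg):.1f} bits under this source model")

Note that the entropy bound applies to a model of the source, not to one fixed string: the true "absolute maximum compressed state" of a single string is its Kolmogorov complexity, which is uncomputable, which is why no algorithm can achieve it in general.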

Entropy and Information Gain

感情迁移 submitted on 2019-12-14 02:17:35
Question: Simple question, I hope. If I have a set of data like this:

    Classification  attribute-1  attribute-2
    Correct         dog          dog
    Correct         dog          dog
    Wrong           dog          cat
    Correct         cat          cat
    Wrong           cat          dog
    Wrong           cat          dog

then what is the information gain of attribute-2 relative to attribute-1? I've computed the entropy of the whole data set: -(3/6)log2(3/6) - (3/6)log2(3/6) = 1. Then I'm stuck! I think you need to calculate entropies of attribute-1 and attribute-2 too? Then use these three calculations in an information gain…
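
For the standard ID3 definition of information gain with respect to the Classification label, a sketch over this exact toy table looks like the following (this assumes "gain of an attribute" means the usual decision-tree quantity, which may or may not be what the exercise intends by "relative to attribute-1"):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, attribute):
        # ID3-style gain: H(labels) minus the entropy remaining after
        # splitting the rows on the attribute's values, weighted by subset size.
        n = len(labels)
        remainder = 0.0
        for value in set(attribute):
            subset = [l for l, a in zip(labels, attribute) if a == value]
            remainder += len(subset) / n * entropy(subset)
        return entropy(labels) - remainder

    classification = ["Correct", "Correct", "Wrong", "Correct", "Wrong", "Wrong"]
    attr1 = ["dog", "dog", "dog", "cat", "cat", "cat"]
    attr2 = ["dog", "dog", "cat", "cat", "dog", "dog"]
    print(information_gain(classification, attr1))  # ~0.082 bits
    print(information_gain(classification, attr2))  # 0.0 bits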

How to compute the Shannon entropy and mutual information of N variables

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-13 18:28:39
Question: I need to compute the mutual information, and so the Shannon entropy, of N variables. I wrote a code that computes the Shannon entropy of a certain distribution. Let's say that I have a variable x, an array of numbers. Following the definition of Shannon entropy I need to compute the normalized probability density function, so using numpy.histogram it is easy to get it.

    import scipy.integrate as scint
    from numpy import *
    from scipy import *

    def shannon_entropy(a, bins):
        p, binedg = histogram(a, bins, normed…
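
A complete histogram-based sketch, not the asker's finished code: the snippet's normed= keyword is deprecated in modern numpy in favor of density=True, and the plug-in estimates below depend on the bin count. For more than two variables the same identity extends with np.histogramdd:

    import numpy as np

    def shannon_entropy(x, bins):
        # Plug-in estimate: histogram -> bin probabilities -> -sum p log2 p.
        density, edges = np.histogram(x, bins=bins, density=True)
        p = density * np.diff(edges)      # convert density to probability mass per bin
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def mutual_information(x, y, bins):
        # I(X;Y) = H(X) + H(Y) - H(X,Y), all estimated from (joint) histograms.
        pxy, _, _ = np.histogram2d(x, y, bins=bins)
        pxy /= pxy.sum()
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)

        def h(p):
            p = p[p > 0]
            return -np.sum(p * np.log2(p))

        return h(px) + h(py) - h(pxy)

    x = np.random.randn(10_000)
    y = x + 0.1 * np.random.randn(10_000)
    print(shannon_entropy(x, 50), mutual_information(x, y, 50))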

Measuring how a new sample contributes to the diversity of a dataset

不问归期 submitted on 2019-12-13 02:48:39
Question: I am working with a grayscale image dataset. Is there a way to determine whether a new grayscale image can contribute to the diversity of a grayscale image dataset? I would like to prevent the dataset from having too many similar samples.

Answer 1: Well, what do you see when you look at it? If you have information about the images in this dataset, you yourself can probably assess whether this new sample is a repetition of some pattern that is already included in the dataset, or if it is something unique.
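
One crude information-theoretic score, offered as a hypothetical first filter rather than anything from the thread: compare the candidate's intensity distribution against each existing image with the Jensen-Shannon distance. Histograms ignore spatial structure entirely, so two very different images can share a histogram; treat this only as a cheap pre-screen:

    import numpy as np
    from scipy.spatial.distance import jensenshannon

    def gray_histogram(img, bins=64):
        # Normalized intensity histogram of a grayscale image (2-D uint8 array).
        h, _ = np.histogram(img, bins=bins, range=(0, 255))
        return h / h.sum()

    def novelty(candidate, dataset, bins=64):
        # Smallest Jensen-Shannon distance between the candidate's intensity
        # distribution and any image already in the dataset; a value near 0
        # means the candidate closely duplicates an existing sample.
        cand = gray_histogram(candidate, bins)
        return min(jensenshannon(cand, gray_histogram(img, bins)) for img in dataset)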

Calculating Mutual Information For Selecting a Training Set in Java

别等时光非礼了梦想. submitted on 2019-12-12 14:34:14
Question: Scenario: I am attempting to implement supervised learning over a data set within a Java GUI application. The user will be given a list of items or 'reports' to inspect and will label them based on a set of available labels. Once the supervised learning is complete, the labelled instances will then be given to a learning algorithm. This will attempt to order the rest of the items by how likely it is that the user will want to view them. To get the most from the user's time I want to pre-select the…
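
The question targets Java, but the quantity itself is small enough to sketch language-neutrally. Here is a minimal plug-in estimate of the mutual information between two paired sequences of discrete values (in Python, matching the other snippets on this page; porting to Java is mechanical):

    import math
    from collections import Counter

    def mutual_information(xs, ys):
        # Plug-in estimate of I(X;Y) in bits from paired discrete observations:
        # sum over observed pairs of p(x,y) * log2( p(x,y) / (p(x) p(y)) ).
        n = len(xs)
        joint = Counter(zip(xs, ys))
        px, py = Counter(xs), Counter(ys)
        return sum(
            (c / n) * math.log2(c * n / (px[x] * py[y]))
            for (x, y), c in joint.items()
        )

    labels = [1, 1, 0, 0, 1, 0]
    feature = ["a", "a", "b", "b", "a", "b"]
    print(mutual_information(labels, feature))  # 1.0 bit: feature determines label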

Computing information content in Python

自作多情 submitted on 2019-12-12 03:32:45
Question: I need to compute information content from two Python lists. I understand that I can use the following formula, where the probabilities are computed from the histograms of the lists:

    Information content = [ sum_ij p(x_i, y_j) log_2( p(x_i, y_j) / (p(x_i) p(y_j)) ) ] / [ -sum_i p(y_i) log_2 p(y_i) ]

Is there any built-in Python API to compute information content? Thanks.

Answer 1: Check out the information_content function in the biopython library: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc303
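
For what it's worth, the numerator above is the mutual information I(X;Y) and the denominator is the entropy H(Y), so the whole expression is a normalized mutual information. There is nothing in the standard library, but scikit-learn ships close estimators; note the caveats in the comments, since its normalization convention is not exactly the question's:

    import math
    from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

    x = [0, 0, 1, 1, 2, 2]
    y = [0, 0, 1, 1, 1, 2]

    # mutual_info_score returns nats; divide by ln 2 to get bits.
    print(mutual_info_score(x, y) / math.log(2))

    # normalized_mutual_info_score divides MI by a mean of H(x) and H(y),
    # not by H(y) alone as in the question's formula.
    print(normalized_mutual_info_score(x, y))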

What's the most that GZIP or DEFLATE can increase a file size?

拥有回忆 submitted on 2019-12-05 04:31:00
It's well known that GZIP or DEFLATE (or any compression mechanism) can sometimes increase file size. Is there a maximum (either a percentage or a constant) by which a file can grow? What is it? If a file is X bytes and I'm going to gzip it, and I need to budget for file space in advance, what's the worst-case scenario?

UPDATE: There are two overheads: GZIP adds a header, typically 18 bytes but essentially arbitrarily long. What about DEFLATE? That can expand content by a multiplicative factor, which I don't know. Does anyone know what it is?

Answer 1: gzip will add a header and trailer of at least 18…
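
The worst case is easy to measure empirically with the standard library. A sketch feeding gzip incompressible input; the per-block and header figures in the comments are the commonly cited ones, not guarantees of any particular implementation:

    import gzip, os

    # Incompressible input: random bytes. DEFLATE then falls back to "stored"
    # blocks, commonly cited as costing about 5 bytes per 65,535-byte block,
    # on top of gzip's 10-byte header and 8-byte trailer (with no filename set).
    for size in (0, 100, 65_535, 1_000_000):
        blob = os.urandom(size)
        out = gzip.compress(blob)
        print(f"{size:>9} in -> {len(out):>9} out  (+{len(out) - size} bytes)")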

Algorithm for rating the monotonicity of an array (i.e. judging the “sortedness” of an array)

我们两清 submitted on 2019-12-04 18:56:41
Question: EDIT: Wow, many great responses. Yes, I am using this as a fitness function for judging the quality of a sort performed by a genetic algorithm, so cost of evaluation is important (i.e., it has to be fast, preferably O(n)). As part of an AI application I am toying with, I'd like to be able to rate a candidate array of integers based on its monotonicity, a.k.a. its "sortedness". At the moment, I'm using a heuristic that calculates the longest sorted run and then divides that by the length of…
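
The heuristic the asker describes fits in a few lines and meets the O(n) budget; a sketch assuming "sorted" means non-decreasing and the run is contiguous:

    def sortedness(arr):
        # Longest non-decreasing contiguous run, divided by the array length.
        # One pass, O(n); returns 1.0 for a fully sorted array.
        if not arr:
            return 1.0
        best = run = 1
        for prev, cur in zip(arr, arr[1:]):
            run = run + 1 if cur >= prev else 1
            best = max(best, run)
        return best / len(arr)

    print(sortedness([1, 2, 3, 2, 3, 4, 5]))  # 4/7 ~ 0.571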

What is the computer science definition of entropy?

不打扰是莪最后的温柔 submitted on 2019-12-02 15:42:35
Question: I've recently started a course on data compression at my university. However, I find the use of the term "entropy" as it applies to computer science rather ambiguous. As far as I can tell, it roughly translates to the "randomness" of a system or structure. What is the proper definition of computer science "entropy"?

Answer 1: Entropy can mean different things:

Computing: In computing, entropy is the randomness collected by an operating system or application for use in cryptography or other uses that require random data. This randomness is often collected from hardware sources, either pre-existing ones…
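
For the information-theoretic sense most relevant to a data-compression course, the standard Shannon definition for a discrete random variable X with distribution p is:

    H(X) = -\sum_{x} p(x) \log_2 p(x)

Measured in bits: a fair coin flip has H = 1 bit, a source that always emits the same symbol has H = 0, and H lower-bounds the average number of bits per symbol that any lossless code can achieve.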