binning

Python: Checking to which bin a value belongs

Submitted by 半腔热情 on 2019-11-27 16:18:06
Question: I have a list of values and a list of bin edges. Now I need to check, for every value, which bin it belongs to. Is there a more Pythonic way than iterating over the values and then over the bins, checking whether the value falls in the current bin, like:

    my_list = [3, 2, 56, 4, 32, 4, 7, 88, 4, 3, 4]
    bins = [0, 20, 40, 60, 80, 100]
    for i in my_list:
        for j in range(len(bins) - 1):
            if bins[j] < i < bins[j + 1]:
                # DO SOMETHING
                pass

This doesn't look very pretty to me. Thanks!

Answer 1: Probably too late, but for future …
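One standard-library way to find each value's bin without the nested loop is `bisect` (a sketch using the list and edges from the question; `bisect_right` does a binary search over the sorted edges):

```python
from bisect import bisect_right

my_list = [3, 2, 56, 4, 32, 4, 7, 88, 4, 3, 4]
bins = [0, 20, 40, 60, 80, 100]

# bisect_right returns the index of the first edge strictly above the value,
# so values in [0, 20) map to 1, [20, 40) to 2, and so on.
bin_indices = [bisect_right(bins, v) for v in my_list]
print(bin_indices)  # [1, 1, 3, 1, 2, 1, 1, 5, 1, 1, 1]
```

With NumPy available, `np.digitize(my_list, bins)` produces the same indices in one vectorized call.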

What is the fastest way to count elements in an array?

Submitted by 本秂侑毒 on 2019-11-27 14:29:36
Question: In my models, one of the most repeated tasks is counting the number of each element within an array. The counting is from a closed set, so I know there are X types of elements, and all or some of them populate the array, along with zeros that represent 'empty' cells. The array is not sorted in any way and can be quite long (about 1M elements), and this task is done thousands of times during one simulation (which is in turn part of hundreds of simulations). The result should be a vector r of size X, so that r(k) is the count of k in the array. Example: for X = 9, if I have the …
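The question uses MATLAB-style `r(k)` indexing, but the same closed-set count can be sketched in Python with `numpy.bincount`, which counts all element types in a single vectorized pass (the sample array here is a made-up small stand-in for the 1M-element case):

```python
import numpy as np

X = 9  # number of element types; 0 marks an 'empty' cell
arr = np.array([0, 3, 3, 1, 9, 0, 2, 3, 9, 1])

# bincount[k] is the number of occurrences of value k in arr;
# minlength guarantees a slot for every type even if it never appears.
counts = np.bincount(arr, minlength=X + 1)
r = counts[1:]  # drop index 0 (the empty cells)
print(r.tolist())  # [2, 1, 3, 0, 0, 0, 0, 0, 2]
```

In MATLAB itself, `histcounts` or `accumarray` plays the same role.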

Better binning in pandas

Submitted by 烈酒焚心 on 2019-11-27 13:55:53
Question: I've got a data frame and want to filter or bin by a range of values and then get the counts of values in each bin. Currently, I'm doing this:

    x = 5
    y = 17
    z = 33
    filter_values = [x, y, z]

    filtered_a = df[df.filtercol <= x]
    a_count = filtered_a.filtercol.count()

    filtered_b = df[df.filtercol > x]
    filtered_b = filtered_b[filtered_b.filtercol <= y]
    b_count = filtered_b.filtercol.count()

    filtered_c = df[df.filtercol > y]
    c_count = filtered_c.filtercol.count()

But is there a more concise way to accomplish …
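The three manual filters can collapse into a single `pd.cut` call followed by `value_counts` (a sketch; the sample values for the `filtercol` column are made up):

```python
import pandas as pd

df = pd.DataFrame({"filtercol": [2, 5, 9, 16, 17, 20, 33]})
x, y, z = 5, 17, 33

# cut assigns each row to one of the intervals (-inf, x], (x, y], (y, z]
binned = pd.cut(df["filtercol"], bins=[float("-inf"), x, y, z])
counts = binned.value_counts(sort=False)
print(counts.tolist())  # [2, 3, 2]
```

`sort=False` keeps the counts in bin order rather than by frequency.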

Reduce number of levels for large categorical variables

Submitted by 删除回忆录丶 on 2019-11-27 07:29:52
Question: Are there any ready-to-use libraries or packages for Python or R to reduce the number of levels of large categorical factors? I want to achieve something similar to R: "Binning" categorical variables, but encode into the top-k most frequent factors plus "other".

Answer 1: Here is an example in R using data.table a bit, but it should be easy without data.table as well.

    # Load data.table
    require(data.table)

    # Some data
    set.seed(1)
    dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace …
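On the Python side, the same top-k-plus-"other" recoding can be sketched with plain pandas (the sample series and the name "other" are illustrative choices):

```python
import pandas as pd

s = pd.Series(["A", "A", "B", "A", "C", "B", "D", "E", "A", "B"])
k = 2

# Keep the k most frequent levels; collapse everything else into "other"
top_k = s.value_counts().nlargest(k).index
reduced = s.where(s.isin(top_k), "other")
print(reduced.tolist())
# ['A', 'A', 'B', 'A', 'other', 'B', 'other', 'other', 'A', 'B']
```

Calling `reduced.astype("category")` afterwards gives back a factor-like dtype with only k + 1 levels.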

How to bin column of floats with pandas

Submitted by 时光毁灭记忆、已成空白 on 2019-11-27 07:21:01
Question: This code was working until I upgraded from Python 2.x to 3.x. I have a df consisting of columns ipk1, ipk2, ipk3 of floats in the range 0-4.0, which I would like to bin into strings. The data looks something like this:

       ipk1  ipk2  ipk3  ipk4  ipk5  jk
    0  3.25  3.31  3.31  3.31  3.34   P
    1  3.37  3.33  3.36  3.33  3.41   P
    2  3.41  3.47  3.59  3.55  3.60   P
    3  3.23  3.10  3.05  2.98  2.97   L
    4  3.24  3.40  3.22  3.23  3.25   L

On Python 2.x this code works, but after upgrading to Python 3 it doesn't. Is …
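A Python-3-friendly way to turn a float column into string bins is `pd.cut` with explicit labels (a sketch using the ipk1 values from the question; the bin edges and label names are assumptions, since the question is truncated before its binning rule):

```python
import pandas as pd

df = pd.DataFrame({"ipk1": [3.25, 3.37, 3.41, 3.23, 3.24]})

# Map GPA-like floats into labeled ranges; edges/labels are illustrative
labels = ["low", "medium", "high"]
df["ipk1_bin"] = pd.cut(df["ipk1"], bins=[0.0, 2.0, 3.0, 4.0],
                        labels=labels).astype(str)
print(df["ipk1_bin"].tolist())  # ['high', 'high', 'high', 'high', 'high']
```

The `.astype(str)` at the end converts the categorical result to plain strings, which behaves the same on Python 2 and 3.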

R code to categorize age into group/ bins/ breaks

Submitted by 旧城冷巷雨未停 on 2019-11-27 04:32:49
Question: I am trying to categorize age into groups so it will not be continuous. I have this code:

    data$agegrp(data$age>=40 & data$age<=49) <- 3
    data$agegrp(data$age>=30 & data$age<=39) <- 2
    data$agegrp(data$age>=20 & data$age<=29) <- 1

The above code is not working under the survival package. It's giving me: invalid function in complex assignment. Can you point me to where the error is? data is the dataframe I am using.

Answer 1 (A5C1D2H2I1M1N2O1R2T1): I would use findInterval() here. First, make up some sample data:

    set.seed(1)
    ages <- floor(runif(20, min = 20, max = 50))
    ages
    # [1] 27 31 37 47 26 46 48 39 38 21 26 25 40 …
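For reference, the same interval lookup that `findInterval()` performs in R can be sketched in Python with `numpy.digitize`, using the question's decade groups (ages taken from the answer's sample output):

```python
import numpy as np

ages = np.array([27, 31, 37, 47, 26, 46, 48, 39, 38, 21])

# Ages 20-29 -> group 1, 30-39 -> 2, 40-49 -> 3, matching the question
agegrp = np.digitize(ages, bins=[20, 30, 40])
print(agegrp.tolist())  # [1, 2, 2, 3, 1, 3, 3, 2, 2, 1]
```

Both functions return, for each value, the index of the interval between consecutive break points that contains it.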

Define and apply custom bins on a dataframe

Submitted by 社会主义新天地 on 2019-11-26 19:56:39
Question: Using Python I have created the following data frame, which contains similarity values:

      cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture    jaccard
    1       0.770     0.489        0.388  0.57500000 0.5845137    0.3920000 0.00000000
    2       0.067     0.496        0.912  0.13865546 0.6147309    0.6984127 0.00000000
    3       0.514     0.426        0.692  0.36440678 0.4787535    0.5198413 0.05882353
    4       0.102     0.430        0.739  0.11297071 0.5288008    0.5436508 0.00000000
    5       0.560     0.735        0.554  0.48148148 0.8168083    0.4603175 0.00000000
    6       0.029     0.302        0.558  0.08547009 0.3928234    0.4603175 0.00000000

I am trying to write an R script to generate another data frame that …
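One way to apply a single set of custom bins to every similarity column at once is `DataFrame.apply` with `pd.cut` (a sketch over two of the columns above; the bin edges and labels are assumptions, since the question is truncated before its binning rule):

```python
import pandas as pd

df = pd.DataFrame({
    "cosinFcolor": [0.770, 0.067, 0.514, 0.102, 0.560, 0.029],
    "cosinEdge":   [0.489, 0.496, 0.426, 0.430, 0.735, 0.302],
})

# Bin every column into the same three labeled intervals
edges = [0.0, 0.33, 0.66, 1.0]
binned = df.apply(lambda col: pd.cut(col, bins=edges,
                                     labels=["low", "mid", "high"]))
print(binned["cosinFcolor"].tolist())  # ['high', 'low', 'mid', 'low', 'mid', 'low']
```

Because the lambda returns a Series per column, `apply` reassembles the results into a data frame of the same shape with categorical labels in place of the floats.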

Getting data for histogram plot

Submitted by 巧了我就是萌 on 2019-11-26 10:07:13
Question: Is there a way to specify bin sizes in MySQL? Right now, I am trying the following SQL query:

    select total, count(total) from faults GROUP BY total;

The data being generated is good enough, but there are just too many rows. What I need is a way to group the data into predefined bins. I can do this from a scripting language, but is there a way to do it directly in SQL? Example:

    +-------+--------------+
    | total | count(total) |
    +-------+--------------+
    |    30 |            1 |
    |    31 |            2 |
    |    33 |            1 |
    …
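A common SQL idiom for fixed-width bins is to group on the bucket's lower edge, e.g. `GROUP BY FLOOR(total / 10) * 10`. The same bucketing logic, sketched in Python for clarity (the bin width of 10 and the sample totals are assumptions):

```python
from collections import Counter

totals = [30, 31, 31, 33, 45, 47, 52]
width = 10

# Each total maps to the lower edge of its bin, e.g. 33 -> 30
buckets = Counter((t // width) * width for t in totals)
print(sorted(buckets.items()))  # [(30, 4), (40, 2), (50, 1)]
```

The `FLOOR(...) * 10` expression in SQL plays exactly the role of `(t // width) * width` here: every row in the same 10-wide range collapses to one group key.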

Pandas: convert categories to numbers

Submitted by 霸气de小男生 on 2019-11-26 01:59:21
Question: Suppose I have a dataframe with countries that goes as:

    cc | temp
    US | 37.0
    CA | 12.0
    US | 35.0
    AU | 20.0

I know that there is a pd.get_dummies function to convert the countries to 'one-hot encodings'. However, I wish to convert them to indices instead, such that I get cc_index = [1,2,1,3]. I'm assuming there is a faster way than using get_dummies along with a numpy where clause, as shown below:

    [np.where(x) for x in df.cc.get_dummies().values]

This is somewhat easier …
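pandas has this built in: `pd.factorize` assigns one integer per distinct value in order of first appearance (a sketch on the dataframe from the question):

```python
import pandas as pd

df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

# codes holds the integer index of each row's country;
# uniques holds the distinct values in the order they were assigned
codes, uniques = pd.factorize(df["cc"])
print(codes.tolist())  # [0, 1, 0, 2]
print(list(uniques))   # ['US', 'CA', 'AU']
```

Note the codes are 0-based; add 1 to match the 1-based `cc_index = [1,2,1,3]` in the question. `df["cc"].astype("category").cat.codes` gives the same kind of mapping, sorted by category instead of by first appearance.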