binning

Mathematica fast 2D binning algorithm

邮差的信 submitted on 2019-11-30 02:21:46
I am having some trouble developing a suitably fast binning algorithm in Mathematica. I have a large (~100k element) data set of the form T = {{x1,y1,z1},{x2,y2,z2},...} and I want to bin it into a 2D array of around 100x100 bins, with each bin's value given by the sum of the z values that fall into it. Currently I iterate through each element of the table, use Select to decide which bin it belongs in based on lists of bin boundaries, and append the z value to a list of values occupying that bin. At the end I map Total over the list of bins, summing their contents.
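The per-element Select loop is the bottleneck; the same sum-per-bin result can be computed in a single vectorized pass. Below is a minimal sketch of that idea in Python/NumPy rather than Mathematica; the array T and the 100x100 bin count are placeholders standing in for the question's data:

    import numpy as np

    # Placeholder data: 100k points with x, y in [0, 1) and a z value to accumulate.
    rng = np.random.default_rng(0)
    T = rng.random((100_000, 3))
    x, y, z = T[:, 0], T[:, 1], T[:, 2]

    # With weights=z, histogram2d returns the *sum* of z in each of the
    # 100x100 bins, replacing the per-element Select with one vectorized pass.
    sums, xedges, yedges = np.histogram2d(x, y, bins=[100, 100], weights=z)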

resize with averaging or rebin a numpy 2d array

℡╲_俬逩灬. submitted on 2019-11-29 16:08:57
Question: I am trying to reimplement in Python an IDL function, http://star.pst.qub.ac.uk/idl/REBIN.html, which downsizes a 2D array by an integer factor, averaging over the merged samples. For example:

    >>> a = np.arange(24).reshape((4, 6))
    >>> a
    array([[ 0,  1,  2,  3,  4,  5],
           [ 6,  7,  8,  9, 10, 11],
           [12, 13, 14, 15, 16, 17],
           [18, 19, 20, 21, 22, 23]])

I would like to resize it to (2, 3) by taking the mean of the relevant samples; the expected output would be:

    >>> b = rebin(a, (2, 3))
    >>> b
    array([[ 3.5,  5.5,  7.5],
           [15.5, 17.5, 19.5]])
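One way to get this behavior, sketched under the assumption that each target dimension evenly divides the source dimension, is a reshape followed by a mean over the block axes (the rebin name comes from the question):

    import numpy as np

    def rebin(a, shape):
        # Split each output cell into its block of source samples, then average.
        # Assumes shape[0] divides a.shape[0] and shape[1] divides a.shape[1].
        sh = (shape[0], a.shape[0] // shape[0],
              shape[1], a.shape[1] // shape[1])
        return a.reshape(sh).mean(axis=(1, 3))

    a = np.arange(24).reshape((4, 6))
    print(rebin(a, (2, 3)))  # [[ 3.5  5.5  7.5] [15.5 17.5 19.5]]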

Pandas pd.cut() - binning datetime column / series

*爱你&永不变心* submitted on 2019-11-29 10:34:52
Question: I am attempting to do a bin using pd.cut(), but it is fairly elaborate. A colleague sends me multiple files with report dates such as:

    '03-16-2017 to 03-22-2017'
    '03-23-2017 to 03-29-2017'
    '03-30-2017 to 04-05-2017'

They are all combined into a single dataframe and given a column name, df['Filedate'], so that every record in the file has the correct filedate. The last day is a cutoff point, so I created a new column df['Filedate_bin'] which converts the last day to 3/22/2017, 3/29/2017, 4/05/2017.
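pd.cut accepts datetime bin edges directly, so one way to produce such a bin column is sketched below; the sample dates, edge list, and labels are illustrative, assuming the cutoffs from the report ranges above:

    import pandas as pd

    # Illustrative records falling inside the three report ranges.
    df = pd.DataFrame({"Filedate": pd.to_datetime(
        ["2017-03-18", "2017-03-25", "2017-04-02"])})

    # Datetime edges work with pd.cut; each label names the bin's last day.
    edges = pd.to_datetime(["2017-03-15", "2017-03-22",
                            "2017-03-29", "2017-04-05"])
    df["Filedate_bin"] = pd.cut(df["Filedate"], bins=edges,
                                labels=["3/22/2017", "3/29/2017", "4/05/2017"])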

Binning data in R

半腔热情 submitted on 2019-11-29 02:32:54
I have a vector with around 4000 values. I would just need to bin it into 60 equal intervals, for which I would then have to calculate the median (for each of the bins).

    v <- c(1:4000)

v is really just a vector. I read about cut, but that needs me to specify the breakpoints. I just want 60 equal intervals.

Use cut and tapply:

    > tapply(v, cut(v, 60), median)
    (-3,67.7] (67.7,134] (134,201] (201,268]
         34.0      101.0      167.5      234.0
    (268,334] (334,401] (401,468] (468,534]
        301.0      367.5      434.0      501.0
    (534,601] (601,668] (668,734] (734,801]
        567.5      634.0      701.0      767.5
    (801,867] (867,934] (934,1e+03] (1e+03,1.07e+03]
        834.0 ...
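For comparison, an equivalent sketch in Python/pandas (not part of the answer above): pd.cut with an integer argument creates equal-width intervals like R's cut(v, 60), and a groupby plays the role of tapply:

    import numpy as np
    import pandas as pd

    v = pd.Series(np.arange(1, 4001))

    # 60 equal-width bins; group on the interval labels and take each median.
    medians = v.groupby(pd.cut(v, 60), observed=True).median()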

Ternary heatmap in R

情到浓时终转凉″ submitted on 2019-11-29 02:25:26
I'm trying to come up with a way of plotting a ternary heatmap using R. I think ggtern should be able to do the trick, but I don't know how to do a binning function like stat_bin in vanilla ggplot2. Here's what I have so far:

    require(ggplot2)
    require(ggtern)
    require(MASS)
    require(scales)
    palette <- c("#FF9933", "#002C54", "#3375B2",
                 "#CCDDEC", "#BFBFBF", "#000000")
    Sigma <- matrix(c(1,2,3,4), 2, 2)
    data <- data.frame(mvrnorm(n=10000, rep(2, 2), Sigma))
    data$X1 <- data$X1/max(data$X1)
    data$X2 <- data$X2/max(data$X2)
    data$X1[which(data$X1<0)] <- runif(length(data$X1[which(data$X1<0)]))
    data$X2[which(data$X2<0)] <- runif(length(data$X2[which(data$X2<0)]))
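The binning step itself, counts per 2D cell, which is what ggplot2's stat_bin2d computes before drawing, can be sketched in Python/NumPy terms; this illustrates only the count-per-bin idea, not ggtern's ternary coordinates:

    import numpy as np

    # Count samples per 2D bin: the heatmap analogue of stat_bin2d.
    rng = np.random.default_rng(1)
    x, y = rng.random(10_000), rng.random(10_000)
    counts, xedges, yedges = np.histogram2d(x, y, bins=30)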

Better binning in pandas

爱⌒轻易说出口 submitted on 2019-11-28 21:19:55
I've got a data frame and want to filter or bin by a range of values and then get the counts of values in each bin. Currently, I'm doing this:

    x = 5
    y = 17
    z = 33
    filter_values = [x, y, z]
    filtered_a = df[df.filtercol <= x]
    a_count = filtered_a.filtercol.count()
    filtered_b = df[df.filtercol > x]
    filtered_b = filtered_b[filtered_b <= y]
    b_count = filtered_b.filtercol.count()
    filtered_c = df[df.filtercol > y]
    c_count = filtered_c.filtercol.count()

But is there a more concise way to accomplish the same thing?

Perhaps you are looking for pandas.cut:

    import pandas as pd
    import numpy as np
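The answer's snippet cuts off there; a completed sketch of the cut-based approach, with the question's x/y boundaries assumed and an illustrative frame standing in for df:

    import numpy as np
    import pandas as pd

    # Illustrative frame standing in for the question's df.
    df = pd.DataFrame({"filtercol": np.random.default_rng(2).integers(0, 40, 100)})

    # One cut call replaces the three manual filters; value_counts gives
    # the per-bin counts (a: <=5, b: (5, 17], c: >17).
    bins = [-np.inf, 5, 17, np.inf]
    counts = pd.cut(df["filtercol"], bins=bins).value_counts(sort=False)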

Reduce number of levels for large categorical variables

岁酱吖の submitted on 2019-11-28 13:03:22
Are there ready-to-use libraries or packages for Python or R to reduce the number of levels of large categorical factors? I want to achieve something similar to R: "Binning" categorical variables, but encode into the most frequent top-k factors and "other". Here is an example in R using data.table a bit, but it should be easy without data.table also:

    # Load data.table
    require(data.table)

    # Some data
    set.seed(1)
    dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace = T)),
                     weight = rnorm(n = 10e3, mean = 70, sd = 20))

    # Decide the minimum frequency a level needs...
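In pandas the top-k-plus-"other" recode can be sketched directly; the series and the k value below are placeholders:

    import pandas as pd

    s = pd.Series(list("ABCAABDEAB"))

    # Keep the k most frequent levels; lump everything else into "other".
    k = 2
    top = s.value_counts().nlargest(k).index
    s_binned = s.where(s.isin(top), other="other")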

Bin pandas dataframe by every X rows

╄→尐↘猪︶ㄣ submitted on 2019-11-27 22:58:16
I have a simple dataframe which I would like to bin for every 3 rows. It looks like this:

       col1
    0     2
    1     1
    2     3
    3     1
    4     0

and I would like to turn it into this:

       col1
    0     2
    1   0.5

I have already posted a similar question here, but I have no idea how to port the solution to my current use case. Can you help me out? Many thanks!

    >>> df.groupby(df.index / 3).mean()
       col1
    0   2.0
    1   0.5

ojunk: The answer from Roman Pekar was not working for me. I imagine that this is because of differences between Python 2 and Python 3. This worked for me in Python 3:

    >>> df.groupby(df.index // 3).mean()
       col1
    0   2.0
    1   0.5
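A self-contained version of the floor-division trick, with the data reproduced from the question:

    import pandas as pd

    df = pd.DataFrame({"col1": [2, 1, 3, 1, 0]})

    # Integer-dividing the positional index by 3 labels rows 0-2 as group 0
    # and rows 3-4 as group 1; mean() then averages each block of up to 3 rows.
    out = df.groupby(df.index // 3).mean()
    print(out)  # col1: 2.0, 0.5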

R calculate the average of one column corresponding to each bin of another column [duplicate]

江枫思渺然 submitted on 2019-11-27 22:49:21
Question: This question already has an answer here: R aggregate data in one column based on 2 other columns.

I have these data with two columns. As you can see in the graph, the data have too much noise. So I want to discretize column "r" with bin size 5, assign each row to its corresponding bin, and then calculate the average of f for each bin.

    > dr
             r      f
    1 65.06919 21.796
    2 62.36986 22.836
    3 59.81639 22.980
    4 57.42822 22.061
    5 55.22681 21.012
    6 53.23533 21.274
    7 51.47815 21.594
    8 49.98000 22...
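A pandas sketch of the bin-then-average step; the frame below is a stand-in for the question's dr, and the bin width of 5 comes from the question:

    import numpy as np
    import pandas as pd

    # Stand-in for the question's dr frame.
    dr = pd.DataFrame({"r": [65.07, 62.37, 59.82, 57.43, 55.23, 53.24, 51.48],
                       "f": [21.796, 22.836, 22.980, 22.061,
                             21.012, 21.274, 21.594]})

    # Width-5 bins over r, then the mean of f within each bin.
    edges = np.arange(0, dr["r"].max() + 5, 5)
    means = dr.groupby(pd.cut(dr["r"], bins=edges), observed=True)["f"].mean()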
