binning | 易学教程

r data.table usage in function call

Question: I want to perform a data.table task over and over in a function call: reduce the number of levels for large categorical variables. My problem is similar to "Data.table and get() command (R)" or "pass column name in data.table using variable in R", but I can't get it to work. Without a function call this works just fine:
# Load data.table
require(data.table)
# Some data
set.seed(1)
dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace = TRUE)), weight = rnorm(n = 10e3, mean = 70, sd = 20))
#

Collapse/mean data in Matlab with respect to a different set of data

Question: I have two sets of data, but the sets have different sizes. Each set contains the measurements themselves (MeasA and MeasB, both double) and the time points (TimeA and TimeB, datenum or Julian date) at which the measurements were taken. Now I want to match the smaller data set to the bigger one. To do this, I want to average the data points of the bigger set around the time points of the smaller set, so that I can then do some correlation analysis. Edit: a small example of how the data would look:
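The question is about Matlab, but the matching idea can be sketched in a language-neutral way. The Python below is only one possible reading of "averaging around the time points": each sample of the bigger set is assigned to the nearest time point of the smaller set, and the samples are then averaged per time point. The function name `mean_around` and the sample numbers are hypothetical and not taken from the question.

```python
from bisect import bisect_left
from collections import defaultdict


def mean_around(time_small, time_big, meas_big):
    """Average meas_big values grouped by the nearest entry of time_small.

    time_small must be sorted in ascending order. The result is aligned with
    time_small; a time point that attracts no samples yields None.
    """
    groups = defaultdict(list)
    for t, value in zip(time_big, meas_big):
        pos = bisect_left(time_small, t)
        # Only the neighbour on the left and the one on the right can be nearest.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(time_small)]
        nearest = min(candidates, key=lambda i: abs(time_small[i] - t))
        groups[nearest].append(value)
    result = []
    for i in range(len(time_small)):
        samples = groups.get(i)
        result.append(sum(samples) / len(samples) if samples else None)
    return result


# Hypothetical data: three coarse time points, five fine-grained samples.
time_a = [1.0, 2.0, 3.0]
time_b = [0.9, 1.1, 1.9, 2.2, 3.4]
meas_b = [10, 12, 20, 24, 30]
print(mean_around(time_a, time_b, meas_b))  # [11.0, 22.0, 30.0]
```

The averaged series has the same length as the smaller set, so the two can be correlated directly. A fixed time window around each point would be an equally valid choice; the question does not say which is intended.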

fit a function to a histogram created with frequency in gnuplot

Question: Intro: In gnuplot there is a solution for creating a histogram from a file named hist.dat that looks like 1 2 2 2 3 by using the commands
binwidth=1
set boxwidth binwidth
bin(x,width)=width*floor(x/width) + binwidth/2.0
plot [0:5][0:*] "hist.dat" u (bin($1,binwidth)):(1.0) smooth freq with boxes
which generates a histogram like this one from another SO page. Question: How can I fit my function to this histogram? I defined a Gaussian function and initialized its values with
f(x) = a*exp(-((x-m)/s)**2)
a=3; m=2.5; s=1
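To make clear what the histogram actually contains before any fitting, here is a minimal Python sketch of the binning step. It mirrors the `bin(x,width)` formula from the question and sums a weight of 1.0 per sample at each bin centre, which is how the `smooth freq` step is understood here. The helper names `bin_center` and `smooth_freq` are hypothetical, and the sketch uses `width` for the half-bin offset where the question's formula uses the global `binwidth` (the two are equal in the question).

```python
import math
from collections import Counter


def bin_center(x, width):
    """Mirror the question's bin(x,width): left edge of the bin plus half a bin."""
    return width * math.floor(x / width) + width / 2.0


def smooth_freq(values, width):
    """Sum a weight of 1.0 per sample at its bin centre; return sorted (x, y) pairs."""
    counts = Counter()
    for x in values:
        counts[bin_center(x, width)] += 1.0
    return sorted(counts.items())


# Contents of hist.dat as given in the question.
histogram = smooth_freq([1, 2, 2, 2, 3], width=1)
print(histogram)  # [(1.5, 1.0), (2.5, 3.0), (3.5, 1.0)]
```

These (centre, count) pairs are the points a Gaussian such as `f(x)` would have to be fitted to, rather than the raw values in hist.dat.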

How to find bin edges of given bin number returned by scipy.stats.binned_statistic_dd()?

Question: I have an Nx3 array mm. The function call c,edg,idx = scipy.stats.binned_statistic_dd(mm,[], statistic='count',bins=(30,20,10),rg=((3,5),(2,8),(4,6))) returns idx, which is a 1d array of ints that represents the bin in which each element of mm falls, and edg, which is a list of 3 arrays holding the bin edges. What I need is to find the bin edges of a given bin, given its bin number in idx. For example, given idx = [24,153,...,72], I want to find the edges of, say, bin 153, i.e. where that bin falls in edg. Of course I can find the elements in bin 153 by mm[153], but not the edges. I posted this Nx3 case
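A possible way to go from a linearized bin number back to edges is sketched below in plain Python, without SciPy. It rests on an assumption that should be checked against the SciPy documentation: that the flat bin number is a row-major index over (number of bins + 2) slots per dimension, with index 0 and the last index reserved for values outside the range. The helpers `unravel_bin`, `bin_bounds` and `linspace` are hypothetical names written for this illustration. SciPy's `expand_binnumbers=True` option is an alternative that returns per-dimension indices directly.

```python
def unravel_bin(flat_index, bins_per_dim):
    """Convert a linearized bin number into one index per dimension.

    Assumption: row-major order over (n + 2) slots per dimension, where the
    two extra slots hold values below and above the binned range.
    """
    shape = [n + 2 for n in bins_per_dim]
    indices = []
    for size in reversed(shape):
        flat_index, idx = divmod(flat_index, size)
        indices.append(idx)
    return tuple(reversed(indices))


def bin_bounds(flat_index, edges):
    """Return (low, high) per dimension, or None for an out-of-range slot."""
    bins_per_dim = [len(e) - 1 for e in edges]
    bounds = []
    for idx, e in zip(unravel_bin(flat_index, bins_per_dim), edges):
        in_range = 1 <= idx <= len(e) - 1
        bounds.append((e[idx - 1], e[idx]) if in_range else None)
    return bounds


def linspace(lo, hi, n_bins):
    """Equally spaced edges, standing in for the edge arrays in edg."""
    return [lo + (hi - lo) * k / n_bins for k in range(n_bins + 1)]


# Edges matching bins=(30,20,10) over the ranges (3,5), (2,8), (4,6).
edges = [linspace(3, 5, 30), linspace(2, 8, 20), linspace(4, 6, 10)]
print(unravel_bin(859, (30, 20, 10)))  # (3, 5, 7)
print(bin_bounds(859, edges))
```

A per-dimension index i in the range 1..n then corresponds to the interval between edge i-1 and edge i of that dimension.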

pd.qcut - ValueError: Bin edges must be unique

Question: My data is here. q = pd.qcut(df['loss_percent'], 10) ValueError: Bin edges must be unique: array([ 0.38461538, 0.38461538, 0.46153846, 0.46153846, 0.53846154, 0.53846154, 0.53846154, 0.61538462, 0.69230769, 0.76923077, 1. ]) I have read through why-use-pandas-qcut-return-valueerror; however, I am still confused. I imagine that one of my values has a high frequency of occurrence and that this is breaking qcut. The first step is to determine whether that is indeed the case, and which value is the problem. Lastly, what kind of solution is appropriate given my data?

piRSquared: Using the solution in the
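The first step asked about, checking whether a frequent value causes the repeated edges, can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the answer's method: `diagnose_duplicate_edges` is a hypothetical helper, the sample data are made up, and `statistics.quantiles` with `method="inclusive"` only approximates the interior edges qcut would compute. In pandas itself, `pd.qcut(..., duplicates='drop')` or cutting on ranks are commonly used workarounds.

```python
from collections import Counter
from statistics import quantiles


def diagnose_duplicate_edges(values, n_bins=10):
    """Return the interior decile cuts, the cuts that repeat, and the top values.

    A cut point that repeats means a single value spans more than one
    quantile, which is the situation that makes bin edges non-unique.
    """
    cuts = quantiles(values, n=n_bins, method="inclusive")
    repeated = sorted(c for c, k in Counter(cuts).items() if k > 1)
    most_common = Counter(values).most_common(3)
    return cuts, repeated, most_common


# Made-up data in which the value 2 accounts for 12 of 20 observations.
values = [1] * 2 + [2] * 12 + [3, 4, 5, 6, 7, 8]
cuts, repeated, most_common = diagnose_duplicate_edges(values)
print(repeated)        # [2.0]
print(most_common[0])  # (2, 12)
```

If the repeated cut matches the most frequent value, that value is the one collapsing several quantile edges onto each other.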

Binning longitude/latitude labeled data by census block ID

Question: I have two data sets: one of crimes in Chicago, labeled with longitude and latitude coordinates, and a shapefile of census blocks, also in Chicago. Is it possible in R to aggregate crimes within census blocks, given these two files? The purpose is to be able to map the crimes by census block. Location for download of Chicago census tract data: https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Census-Blocks-2000/uktd-fzhd Location for download of crime data: https://data

Binning time series in R?

Question: I'm new to R. My data has 600k objects defined by three attributes: Id, Date and TimeOfCall. TimeOfCall has a 00:00:00 format and ranges from 00:00:00 to 23:59:59. I want to bin the TimeOfCall attribute into 24 bins, each one representing an hourly slot (the first bin 00:00:00 to 00:59:59, and so on). Can someone talk me through how to do this? I tried using cut(), but apparently my format is not numeric. Thanks in advance!

Answer 1: While you could convert to a formal time representation, in this case
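Since the 24 bins line up exactly with the hour field of an HH:MM:SS string, one simple approach is to read the hour straight from the text. The Python sketch below illustrates that idea only; it is not taken from the truncated answer, and `hour_bin` and the sample call times are hypothetical.

```python
from collections import Counter


def hour_bin(time_of_call):
    """Return the hourly slot (0-23) of an 'HH:MM:SS' string."""
    hour = int(time_of_call.split(":")[0])
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range in {time_of_call!r}")
    return hour


# Made-up call times; slot 0 covers 00:00:00 to 00:59:59, and so on.
calls = ["00:00:00", "00:59:59", "01:00:00", "13:45:10", "23:59:59"]
counts = Counter(hour_bin(t) for t in calls)
per_hour = [counts.get(h, 0) for h in range(24)]
print(per_hour[0], per_hour[1], per_hour[13], per_hour[23])  # 2 1 1 1
```

Listing all 24 slots explicitly keeps empty hours in the result with a count of zero.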

R - faster alternative to hist(XX, plot=FALSE)$count

Question: I am looking for a faster alternative to R's hist(x, breaks=XXX, plot=FALSE)$count, as I don't need any of the other output it produces (I want to use it in an sapply call that requires 1 million iterations, each of which calls this function), e.g.
x = runif(100000000, 2.5, 2.6)
bincounts = hist(x, breaks=seq(0,3,length.out=100), plot=FALSE)$count
Any thoughts? A first attempt using table and cut:
table(cut(x, breaks=seq(0,3,length.out=100)))
It avoids the extra output, but takes about 34 seconds on my computer:
system.time(table(cut(x, breaks=seq(0,3,length.out=100)))
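When the breaks are equally spaced, as with `seq(0,3,length.out=100)`, the bin of each value can be computed arithmetically instead of by building a factor with `cut`. The Python sketch below only illustrates that arithmetic and is not a performance claim. `equal_width_counts` is a hypothetical helper, and the right-closed bins with the lowest break included are an assumption meant to mirror the defaults of `hist()`. Values that sit exactly on a break may land in a neighbouring bin because of floating-point rounding.

```python
import math


def equal_width_counts(values, lo, hi, n_bins):
    """Count values per equal-width bin over [lo, hi].

    Bins are right-closed, (a, b], and lo itself goes into the first bin.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in values:
        if x < lo or x > hi:
            raise ValueError(f"{x} lies outside [{lo}, {hi}]")
        idx = math.ceil((x - lo) / width) - 1
        # Clamp so that x == lo falls into the first bin.
        counts[min(max(idx, 0), n_bins - 1)] += 1
    return counts


print(equal_width_counts([0, 0.5, 1.0, 1.5, 2.0], lo=0, hi=2, n_bins=2))  # [3, 2]
```

Note that 100 breaks define 99 bins, so the example in the question would correspond to `n_bins=99`.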