binning

R data.table usage in function call

淺唱寂寞╮ submitted on 2019-12-10 22:59:25
Question: I want to perform a data.table task over and over in a function call: "Reduce number of levels for large categorical variables". My problem is similar to "Data.table and get() command (R)" or "pass column name in data.table using variable in R", but I can't get it to work. Without a function call, this works just fine:

    # Load data.table
    require(data.table)
    # Some data
    set.seed(1)
    dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace = TRUE)),
                     weight = rnorm(n = 10e3, mean = 70, sd = 20))
    #

Collapse/mean data in Matlab with respect to a different set of data

不打扰是莪最后的温柔 submitted on 2019-12-10 12:04:28
Question: I have two sets of data, but the sets have different sizes. Each set contains the measurements themselves (MeasA and MeasB, both double) and the time points (TimeA and TimeB, datenum or Julian date) at which the measurements were taken. Now I want to match the smaller data set to the bigger one. To do this, I want to average the data points of the bigger set around the data (i.e. time) points of the smaller set, and finally do some correlation analysis. Edit: a small example of what the data would look like:
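One possible reading of the averaging step, sketched in Python rather than Matlab: assign every sample of the bigger set to the nearest time point of the smaller set, then average per group. The nearest-neighbour rule, the function name `mean_around`, and the sample numbers are assumptions for illustration only; the question does not say how wide the window around each time point should be.

```python
from bisect import bisect_right


def mean_around(time_a, meas_a, time_b):
    """Average meas_a over the time_a samples nearest to each time_b point.

    time_b must be sorted in ascending order (assumption of this sketch).
    """
    # Boundaries halfway between consecutive time_b points.
    mids = [(lo + hi) / 2 for lo, hi in zip(time_b, time_b[1:])]
    sums = [0.0] * len(time_b)
    counts = [0] * len(time_b)
    for t, m in zip(time_a, meas_a):
        k = bisect_right(mids, t)  # index of the nearest time_b point
        sums[k] += m
        counts[k] += 1
    # Time points with no nearby samples get NaN instead of a mean.
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


# Hypothetical data: the bigger set (A) is sampled more densely than B.
time_a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
meas_a = [float(t) for t in time_a]
time_b = [1, 5, 8]

meas_a_on_b = mean_around(time_a, meas_a, time_b)
```

The result has one value per entry of `time_b`, so it can be correlated directly with MeasB. A sample exactly halfway between two time points goes to the later one in this sketch.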

pd.qcut - ValueError: Bin edges must be unique

六眼飞鱼酱① submitted on 2019-12-10 11:38:57
Question: My data is here.

    q = pd.qcut(df['loss_percent'], 10)

    ValueError: Bin edges must be unique: array([ 0.38461538, 0.38461538, 0.46153846, 0.46153846, 0.53846154, 0.53846154, 0.53846154, 0.61538462, 0.69230769, 0.76923077, 1. ])

I have read through why-use-pandas-qcut-return-valueerror; however, I am still confused. I imagine that one of my values has a high frequency of occurrence and that this is breaking qcut. The first step is: how do I determine if that is indeed the case, and which value is the

How to find bin edges of given bin number returned by scipy.stats.binned_statistic_dd()?

家住魔仙堡 submitted on 2019-12-08 05:11:18
Question: I have an Nx3 array mm. The function call

    c, edg, idx = scipy.stats.binned_statistic_dd(mm, [], statistic='count', bins=(30,20,10), range=((3,5),(2,8),(4,6)))

returns idx, which is a 1d array of ints that represents the bin in which each element of mm falls, and edg, which is a list of 3 arrays holding the bin edges. What I need is to find the bin edges of a given bin, given its bin number in idx. For example, given idx = [24,153,...,72], I want to find the edges of, say, bin 153, i.e. where that bin falls in

fit a function to a histogram created with frequency in gnuplot

ⅰ亾dé卋堺 submitted on 2019-12-07 07:51:15
Question:

Intro: In gnuplot there's a solution for creating a histogram from a file named hist.dat that looks like

    1
    2
    2
    2
    3

by using the commands

    binwidth=1
    set boxwidth binwidth
    bin(x,width)=width*floor(x/width) + width/2.0
    plot [0:5][0:*] "hist.dat" u (bin($1,binwidth)):(1.0) smooth freq with boxes

which generates a histogram like this one from another SO page.

Question: How can I fit my function to this histogram? I defined a Gaussian function and initialized its values by

    f(x) = a*exp(-((x-m)/s)**2)
    a=3; m=2.5; s=1
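To make explicit which points the histogram consists of (and therefore which points a fit would have to target), here is a minimal Python sketch of what `bin()` together with `smooth freq` computes for the sample file. The name `bin_center` and the in-memory list standing in for hist.dat are assumptions of this sketch, not part of the gnuplot session.

```python
import math
from collections import Counter

binwidth = 1


def bin_center(x, width):
    """Same arithmetic as the gnuplot bin(): centre of the bin containing x."""
    return width * math.floor(x / width) + width / 2.0


# Stand-in for the contents of hist.dat.
data = [1, 2, 2, 2, 3]

# `smooth freq` sums the y-values (1.0 each) of all points sharing an x-value,
# which amounts to counting the samples per bin centre.
freq = Counter(bin_center(v, binwidth) for v in data)

for center in sorted(freq):
    print(center, freq[center])
```

The (bin centre, count) pairs printed here are the data that f(x) would be fitted to.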

How to find bin edges of given bin number returned by scipy.stats.binned_statistic_dd()?

自闭症网瘾萝莉.ら submitted on 2019-12-06 15:49:07
I have an Nx3 array mm. The function call

    c, edg, idx = scipy.stats.binned_statistic_dd(mm, [], statistic='count', bins=(30,20,10), range=((3,5),(2,8),(4,6)))

returns idx, which is a 1d array of ints that represents the bin in which each element of mm falls, and edg, which is a list of 3 arrays holding the bin edges. What I need is to find the bin edges of a given bin, given its bin number in idx. For example, given idx = [24,153,...,72], I want to find the edges of, say, bin 153, i.e. where that bin falls in edg. Of course I can find the elements in bin 153 by mm[idx == 153], but not the edges. I posted this Nx3 case
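A minimal sketch of one way to recover the edges, assuming the flat bin number indexes a grid with one extra outlier bin on each side of every dimension (so each axis has `len(edges[d]) + 1` slots). The helper name `bin_bounds`, the random sample, and the dummy `values` array are illustrative assumptions; the helper only handles in-range bins.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical Nx3 sample lying inside the ranges used in the question.
mm = rng.uniform(low=(3, 2, 4), high=(5, 8, 6), size=(1000, 3))

# `values` is not used by statistic="count"; a dummy array is passed here.
res = stats.binned_statistic_dd(
    mm, np.ones(len(mm)), statistic="count",
    bins=(30, 20, 10), range=((3, 5), (2, 8), (4, 6)),
)
edges, idx = res.bin_edges, res.binnumber

# Per-axis grid size including the two outlier bins.
grid_shape = [len(e) + 1 for e in edges]


def bin_bounds(flat_bin):
    """Return [(low, high), ...] per dimension for one in-range flat bin number."""
    per_dim = np.unravel_index(flat_bin, grid_shape)
    # Per-dimension index i (1-based for in-range bins) spans e[i-1] .. e[i].
    return [(e[i - 1], e[i]) for e, i in zip(edges, per_dim)]


# Edges of the bin that the first sample fell into.
print(bin_bounds(idx[0]))
```

Passing `expand_binnumbers=True` to `binned_statistic_dd` should return the per-dimension indices directly, which would make the `np.unravel_index` step unnecessary.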

pd.qcut - ValueError: Bin edges must be unique

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-06 15:45:10
My data is here.

    q = pd.qcut(df['loss_percent'], 10)

    ValueError: Bin edges must be unique: array([ 0.38461538, 0.38461538, 0.46153846, 0.46153846, 0.53846154, 0.53846154, 0.53846154, 0.61538462, 0.69230769, 0.76923077, 1. ])

I have read through why-use-pandas-qcut-return-valueerror; however, I am still confused. I imagine that one of my values has a high frequency of occurrence and that this is breaking qcut. The first step is: how do I determine if that is indeed the case, and which value is the problem? Lastly, what kind of solution is appropriate given my data?

piRSquared: Using the solution in the
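A hedged sketch of how the diagnosis and two common workarounds could look. The series `loss` is made-up data standing in for `df['loss_percent']` (a few distinct values with heavy ties); which workaround is appropriate depends on whether unequal bin sizes or arbitrary tie-breaking is more acceptable for the real data.

```python
import pandas as pd

# Hypothetical stand-in for df['loss_percent'].
loss = pd.Series(
    [0.38461538] * 30 + [0.46153846] * 25 + [0.53846154] * 25
    + [0.61538462] * 10 + [0.69230769] * 5 + [0.76923077] * 3 + [1.0] * 2
)

# Step 1: which values dominate? Heavy ties make several quantiles coincide.
counts = loss.value_counts()
print(counts.head())

# Step 2: the decile edges qcut would use; repeated entries trigger the error.
edges = loss.quantile([i / 10 for i in range(11)])
has_duplicate_edges = edges.duplicated().any()

# Option A: merge identical edges and accept fewer than 10 bins.
q_drop = pd.qcut(loss, 10, duplicates="drop")

# Option B: rank first so ties are broken by order of appearance,
# which yields 10 equally sized bins.
q_rank = pd.qcut(loss.rank(method="first"), 10, labels=False)
```

Option A keeps identical values together at the cost of uneven bins; option B forces equal-sized bins but can split identical values across neighbouring bins.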

Binning longitude/latitude labeled data by census block ID

怎甘沉沦 submitted on 2019-12-06 14:58:57
Question: I have two data sets: one for crime in Chicago, labeled with longitude and latitude coordinates, and a shapefile of census blocks, also in Chicago. Is it possible in R to aggregate crimes within census blocks, given these two files? The purpose is to be able to map out the crimes by census block. Location for download of Chicago census tract data: https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Census-Blocks-2000/uktd-fzhd Location for download of crime data: https://data

Binning time series in R?

一世执手 submitted on 2019-12-06 13:00:49
Question: I'm new to R. My data has 600k objects defined by three attributes: Id, Date and TimeOfCall. TimeOfCall has the format 00:00:00 and ranges from 00:00:00 to 23:59:59. I want to bin the TimeOfCall attribute into 24 bins, each one representing an hourly slot (first bin 00:00:00 to 00:59:59, and so on). Can someone talk me through how to do this? I tried using cut(), but apparently my format is not numeric. Thanks in advance!

Answer 1: While you could convert to a formal time representation, in this case
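Because the format is fixed-width, the hourly bin can be read straight from the first two characters without any time parsing. A minimal Python sketch of that idea follows; the sample values and the name `hour_bin` are made up for illustration.

```python
from collections import Counter

# Hypothetical TimeOfCall values in the question's HH:MM:SS format.
times = ["00:00:00", "00:59:59", "01:00:00", "13:45:10", "23:59:59"]


def hour_bin(time_of_call):
    """Map an 'HH:MM:SS' string to its hourly bin, 0 through 23."""
    hour = int(time_of_call[:2])
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range in {time_of_call!r}")
    return hour


# Number of calls falling into each hourly slot.
calls_per_hour = Counter(hour_bin(t) for t in times)
print(sorted(calls_per_hour.items()))
```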

R - faster alternative to hist(XX, plot=FALSE)$count

时光总嘲笑我的痴心妄想 submitted on 2019-12-06 03:46:47
I am on the lookout for a faster alternative to R's hist(x, breaks=XXX, plot=FALSE)$count function, as I don't need any of the other output that is produced (I want to use it in an sapply call, requiring 1 million iterations in which this function would be called), e.g.

    x = runif(100000000, 2.5, 2.6)
    bincounts = hist(x, breaks=seq(0,3,length.out=100), plot=FALSE)$count

Any thoughts? A first attempt using table and cut:

    table(cut(x, breaks=seq(0,3,length.out=100)))

It avoids the extra output, but takes about 34 seconds on my computer:

    system.time(table(cut(x, breaks=seq(0,3,length.out=100)))
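For equal-width breaks, the bin index can be computed arithmetically and tallied in one pass, which skips the factor construction that `cut` performs. The sketch below shows that idea in Python with NumPy rather than R. The sample size is reduced, and the two approaches may differ for values lying exactly on a break, since NumPy bins are closed on the left while R's `hist` defaults to right-closed intervals.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(2.5, 2.6, size=100_000)  # far smaller than the question's 1e8
breaks = np.linspace(0, 3, 100)          # mirrors seq(0, 3, length.out=100)

# Counts only, via the general-purpose histogram routine.
counts, _ = np.histogram(x, bins=breaks)

# Equal-width shortcut: derive the bin index directly, then tally.
# Assumes every value lies inside [breaks[0], breaks[-1]).
width = breaks[1] - breaks[0]
bin_idx = np.floor((x - breaks[0]) / width).astype(np.intp)
fast_counts = np.bincount(bin_idx, minlength=len(breaks) - 1)
```

Both arrays hold one count per interval, i.e. `len(breaks) - 1` entries.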