binning

pd.qcut with values that are inf (infinity) ValueError: Bin edges must be unique:

て烟熏妆下的殇ゞ submitted on 2019-12-14 02:21:26
Question: I have a data set that is a ratio of two float numbers. Some values are inf (infinity) because of division by zero. How do I use pd.qcut/pd.cut with inf values? My data can be accessed here.
q = pd.qcut(df['ratio'], 10)
ValueError: Bin edges must be unique: array([ 1.20089207e+03, 6.02984295e+04, 1.26445577e+05, 2.29982770e+05, 5.13176079e+05, 1.28794976e+06, 4.96001538e+06, nan, nan, nan, inf])
Answer 1: You could replace np.inf with np.nan and then call dropna: q = pd.qcut(df.ratio.replace
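
A minimal sketch of the replace-then-dropna idea mentioned in the answer, assuming a small made-up `ratio` column (the question's real data is not shown here). The column name `ratio_bin` and the choice of 4 quantiles are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data: a ratio column in which
# divide-by-zero results are stored as inf.
df = pd.DataFrame(
    {"ratio": [1.0, 2.5, 4.0, 8.0, 16.0, 32.0, np.inf, 64.0, np.inf, 128.0]}
)

# Treat +/-inf as missing, then drop the missing values before binning.
finite = df["ratio"].replace([np.inf, -np.inf], np.nan).dropna()

# qcut now only sees finite values, so the quantile edges are well defined.
q = pd.qcut(finite, 4)

# Assignment aligns on the index; rows that held inf get NaN as their bin.
df["ratio_bin"] = q
```

Whether the inf rows should be dropped or instead placed in a separate "infinite" category depends on the analysis; this sketch simply leaves them unbinned.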

Conditionally binning

瘦欲@ submitted on 2019-12-12 21:08:50
Question: Is it possible to create a new column in a dataframe where the bins for 'X' depend on the value of one or more other columns? Example below. The bins for AR1, PO1 and RU1 differ from one another. So far I can only get bins computed over all values in 'X'.
import pandas as pd
import numpy as np
import string
import random
N = 100
J = [2012,2013,2014]
K = ['A','B','C','D','E','F','G','H']
L = ['h','d','a']
S = ['AR1','PO1','RU1']
np.random.seed(0)
df = pd.DataFrame( {'X': np.random.uniform(1,10,N), 'Y
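
One possible way to bin 'X' with different edges per group is to loop over the groups and call `pd.cut` with that group's edges. This is only a sketch: the column name `group`, the `edges` dictionary and its values are all hypothetical, since the question's full dataframe and desired bins are cut off above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        # Hypothetical grouping column using the labels from the question.
        "group": rng.choice(["AR1", "PO1", "RU1"], size=100),
        "X": rng.uniform(1, 10, size=100),
    }
)

# Hypothetical per-group bin edges; each group gets its own set.
edges = {
    "AR1": [1, 4, 7, 10],
    "PO1": [1, 5.5, 10],
    "RU1": [1, 2, 3, 10],
}

# Bin each group's X values with that group's edges.
# labels=False returns the integer bin number instead of an interval.
df["X_bin"] = np.nan
for key, idx in df.groupby("group").groups.items():
    df.loc[idx, "X_bin"] = pd.cut(
        df.loc[idx, "X"], bins=edges[key], labels=False, include_lowest=True
    )
```

The explicit loop keeps the group key available for the dictionary lookup; for many groups a `groupby(...).transform` with the same body may be tidier.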

Pandas - Group/bins of data per longitude/latitude

陌路散爱 submitted on 2019-12-12 07:55:15
Question: I have a bunch of geographical data as below. I would like to group the data into bins of 0.2 degrees in longitude AND 0.2 degrees in latitude. While this is trivial to do for either latitude or longitude alone, what is the most appropriate way of doing it for both variables?
|User_ID  |Latitude  |Longitude|Datetime           |u    |v    |
|---------|----------|---------|-------------------|-----|-----|
|222583401|41.4020375|2.1478710|2014-07-06 20:49:20|0.3  | 0.2 |
|287280509|41.3671346|2.0793115|2013-01-30 09:25:47|0.2  | 0
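
A minimal sketch of one way to bin on both coordinates at once: compute an integer cell index per axis with `floor(value / 0.2)` and group on the pair. The first two rows come from the table above; the third row and the column names `lat_bin` and `lon_bin` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "User_ID": [222583401, 287280509, 1],      # last ID is hypothetical
        "Latitude": [41.4020375, 41.3671346, 41.45],
        "Longitude": [2.1478710, 2.0793115, 2.15],
    }
)

step = 0.2  # bin width in degrees, as stated in the question

# Integer index of the 0.2-degree cell along each axis.
df["lat_bin"] = np.floor(df["Latitude"] / step).astype(int)
df["lon_bin"] = np.floor(df["Longitude"] / step).astype(int)

# Grouping on both indices yields one group per 0.2 x 0.2 degree cell.
counts = df.groupby(["lat_bin", "lon_bin"]).size()
```

Multiplying a bin index by `step` recovers the lower edge of the cell, which can be handy for labelling.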

Binning variables in a dataframe with input bin data from another dataframe

本秂侑毒 submitted on 2019-12-12 05:25:59
Question: As a beginner-level R user, despite having read (1) numerous posts about binning and grouping here on SO and (2) the documentation for the data.table and dplyr packages, I still can't figure out how to apply those packages to bin continuous and factor variables for later use in credit scoring modelling. Problem: Build a code-efficient, easily customisable, more or less automated solution for binning variables with minimal hard-coding. These variables used to be binned with a

How to catch the index of immediate greater number in other matrix?

你。 submitted on 2019-12-11 22:27:18
Question: Consider this example:
a=rand(5,1)
b=rand(5,1);
bs=sum(b);
B=b./bs;
cB=cumsum(B)
%OUTPUT
a = 0.7803 0.3897 0.2417 0.4039 0.0965
cB = 0.0495 0.4030 0.7617 0.9776 1.0000
Now I want the position of the number in cB that is immediately greater than each number in a. That is, I want 5 positions, one for each number in a. So my output should be P= [4;2;2;3;2]. Please help.
Answer 1: The suggestions by the others are decent, but both miss the point as they are inefficient for large problems. This
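
Because cB is a cumulative sum and therefore sorted, a binary search finds each position without comparing every pair. The sketch below uses NumPy's `np.searchsorted` rather than MATLAB, and it is not necessarily the approach the truncated answer goes on to describe. The values are the ones printed in the question.

```python
import numpy as np

a = np.array([0.7803, 0.3897, 0.2417, 0.4039, 0.0965])
cB = np.array([0.0495, 0.4030, 0.7617, 0.9776, 1.0000])

# side="right" returns, for each value in a, the index of the first element
# of cB that is strictly greater than it. Adding 1 converts NumPy's 0-based
# index to the 1-based positions used in the question.
P = np.searchsorted(cB, a, side="right") + 1
# P is [4, 2, 2, 3, 2], matching the expected output in the question
```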

Convert year-month string column into quarterly bins

痴心易碎 submitted on 2019-12-11 12:25:51
Question: I am currently working with a large phenology data set, where there are multiple observations of trees for a given month. I want to assign these observations to three-month clusters or bins. I am currently using the following code:
Cluster.GN <- ifelse(Master.feed.parts.gn$yr.mo=="2007.1", 1, ifelse(Master.feed.parts.gn$yr.mo=="2007.11", 1,.... ifelse(Master.feed.parts.gn$yr.mo=="2014.05", 17, NA)
This code works, but it is very cumbersome, as there are over 50 months. I have had trouble
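
The nested ifelse chain can be replaced by arithmetic on a running month count. The sketch below is in pandas rather than R and assumes zero-padded "YYYY.MM" strings and that the first three-month bin starts at the earliest observed month; the sample values and the names `yr_mo` and `cluster` are illustrative.

```python
import pandas as pd

# Hypothetical sample of year.month strings in "YYYY.MM" form.
yr_mo = pd.Series(["2007.10", "2007.11", "2007.12", "2008.01", "2008.03", "2008.04"])

# Split into integer year and month columns (columns 0 and 1).
parts = yr_mo.str.split(".", expand=True).astype(int)

# Running month count, so consecutive months differ by exactly 1
# even across a year boundary.
month_index = parts[0] * 12 + (parts[1] - 1)

# Three-month bins counted from the earliest month, numbered from 1.
cluster = (month_index - month_index.min()) // 3 + 1
# cluster is [1, 1, 1, 2, 2, 3]
```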

How to bin x,y,z vectors into matrix (R)

孤街浪徒 submitted on 2019-12-11 07:56:49
Question: This question was migrated from Cross Validated because it can be answered on Stack Overflow. Migrated 5 years ago. This is probably a basic question, but I haven't been able to Google anything helpful after trying for days. I have an R dataframe with x,y,z tuples, where z is a response to x and y and can be modeled as a surface.
> head(temp)
         x         y        z
1 36.55411  965.7779 1644.779
2 42.36912  978.9721 1643.957
3 58.34699 1183.7426 1846.123
4 53.55439 1232.2696 1990.707
5 50.76167 1115.2049 1281
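
One reading of the question is that each matrix cell should hold the mean z of the points falling in that (x, y) bin. Under that assumption, a NumPy sketch can use `np.histogram2d` twice, once weighted by z and once unweighted. The four complete rows from the output above are used; the bin edges are hypothetical.

```python
import numpy as np

# The four complete rows shown in the question.
x = np.array([36.55411, 42.36912, 58.34699, 53.55439])
y = np.array([965.7779, 978.9721, 1183.7426, 1232.2696])
z = np.array([1644.779, 1643.957, 1846.123, 1990.707])

# Hypothetical bin edges: two bins along each axis.
x_edges = np.array([30.0, 45.0, 60.0])
y_edges = np.array([900.0, 1100.0, 1300.0])

# Sum of z per cell (weights=z) and number of points per cell.
z_sum, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges], weights=z)
count, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])

# Mean z per cell; empty cells come out as NaN (0 / 0).
with np.errstate(invalid="ignore", divide="ignore"):
    z_mean = z_sum / count
```

Rows of `z_mean` correspond to x bins and columns to y bins.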

How to handle NaNs in binning with numpy add.reduceat?

荒凉一梦 submitted on 2019-12-11 06:48:07
Question: I'm using the numpy reduceat method for binning data. Background: I'm processing measurement data sampled at high frequencies, and I need to down-sample them by extracting bin means from bins of a certain size. Since I have millions of samples, I need something fast. In principle, this works like a charm:
import numpy as np
def bin_by_npreduceat(v, nbins):
    # np.int was removed in NumPy 1.24; the builtin int is equivalent here
    bins = np.linspace(0, len(v), nbins+1, True).astype(int)
    return np.add.reduceat(v, bins[:-1]) / np.diff(bins)
The problem is: NaNs can
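
Assuming the goal is a per-bin mean that ignores NaNs, one way to keep the reduceat approach is to sum with NaNs replaced by zero and divide by the count of valid samples in each bin. The function name `bin_means_ignoring_nan` is made up for this sketch; the bin construction follows the question's code.

```python
import numpy as np

def bin_means_ignoring_nan(v, nbins):
    """Mean of each bin, skipping NaN samples (NaN for all-NaN bins)."""
    v = np.asarray(v, dtype=float)
    # Same bin start indices as in the question's function.
    bins = np.linspace(0, len(v), nbins + 1, endpoint=True).astype(int)
    valid = ~np.isnan(v)
    # Sum with NaNs replaced by 0 so they do not poison the bin sum.
    sums = np.add.reduceat(np.where(valid, v, 0.0), bins[:-1])
    # Number of non-NaN samples in each bin.
    counts = np.add.reduceat(valid.astype(int), bins[:-1])
    # A bin with no valid samples yields 0 / 0 = NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return sums / counts

means = bin_means_ignoring_nan([1.0, 2.0, np.nan, 4.0, np.nan, np.nan], 3)
# means is [1.5, 4.0, nan]
```

This costs two reduceat passes instead of one but stays fully vectorised.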

How do I get identical results from the old hist and the new histcounts functions in Matlab

天涯浪子 submitted on 2019-12-11 04:46:43
Question: I am trying to replace a use of the old hist function with the new histcounts, which is faster at binning and counting than hist. However, I am struggling to get exactly the same results from histcounts that I got from hist. I am aware that histcounts returns the bin edges rather than the bin centers. However, the counts should be identical and the bin edges should be convertible to bin centers, as far as I understand. The Matlab reference page for replacing hist with histcounts (see http

Convert year-month string to three month bins with gaps - how to assign contiguous ascending values?

守給你的承諾、 submitted on 2019-12-11 04:15:38
Question: I have used the code below to "bin" a year.month string into three-month bins. The problem is that I want each bin to have a number that corresponds to where the bin occurs chronologically (i.e. first bin = 1, second bin = 2, etc.). Right now, the first month bin is assigned the number 4, and I am not sure why. Any help would be highly appreciated!
> head(Master.feed.parts.gn$yr.mo, n=20)
[1] "2007.10" "2007.10" "2007.10" "2007.11" "2007.11" "2007.11" "2007.11" "2007.12" "2008.01" [10]
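
If the existing bin numbers are already in chronological order but do not start at 1 or skip values where a three-month window has no observations, a dense rank renumbers them contiguously. This pandas sketch uses made-up bin numbers in `raw_bin`; the question's own R code is not shown in the excerpt above.

```python
import pandas as pd

# Hypothetical bin numbers: they start at 4 and skip 6 and 8.
raw_bin = pd.Series([4, 4, 5, 7, 7, 9])

# A dense rank maps the distinct values, in ascending order, to 1, 2, 3, ...
# without leaving gaps.
contiguous = raw_bin.rank(method="dense").astype(int)
# contiguous is [1, 1, 2, 3, 3, 4]
```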