missing-data

How to remove more than 2 consecutive NA's in a column?

梦想的初衷 submitted on 2019-12-07 08:12:34
Question: I am new to R. My data frame has two columns, col1 ("Timestamp") and col2 ("Values"). I have to remove the rows that belong to runs of more than 2 consecutive NA values in col2. My data frame looks like this:

Timestamp  | values
-----------|--------
2011-01-02 | 2
2011-01-03 | 3
2011-01-04 | NA
2011-01-05 | 1
2011-01-06 | NA
2011-01-07 | NA
2011-01-08 | 8
2011-01-09 | 6
2011-01-10 | NA
2011-01-11 | NA
2011-01-12 | NA
2011-01-13 | 2

I would like to remove more than 2 duplicate rows based on the second column. Expected output -

How to replace missing values with group mode in Pandas?

我们两清 submitted on 2019-12-06 16:49:27
I followed the method in this post to replace missing values with the group mode, but ran into "IndexError: index out of bounds".

    df['SIC'] = df.groupby('CIK').SIC.apply(lambda x: x.fillna(x.mode()[0]))

I guess this is probably because some groups have all missing values and therefore no mode. Is there a way to get around this? Thank you!

Mode is quite difficult, given that there really isn't any agreed-upon way to deal with ties. Plus it's typically very slow. Here's one way that will be "fast". We'll define a function that calculates the mode for each group, then we can fill the missing
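The excerpt is cut off before the answer's code, so the following is only a minimal sketch of the idea it describes (compute one mode per group, then fill from it), not the original answer. The helper name `fill_with_group_mode` and the sample data are hypothetical, and taking the smallest value when there is a tie is an assumption. Groups whose values are all missing are simply left as NaN instead of raising an IndexError.

```python
import numpy as np
import pandas as pd


def fill_with_group_mode(df, group_col, value_col):
    """Return value_col with NaNs replaced by the mode of each group.

    Hypothetical helper: ties are resolved by taking the smallest mode,
    and groups with no observed values are left as NaN.
    """
    def first_mode(s):
        modes = s.mode()  # NaN is dropped by default; empty if all values are missing
        return modes.iloc[0] if not modes.empty else np.nan

    # One mode per group, indexed by the group key.
    group_modes = df.groupby(group_col)[value_col].agg(first_mode)
    # Look up each row's group mode and use it only where the value is missing.
    return df[value_col].fillna(df[group_col].map(group_modes))


# Made-up sample: group 2 has no observed SIC at all.
sample = pd.DataFrame({
    'CIK': [1, 1, 1, 2, 2, 3],
    'SIC': [10.0, 10.0, np.nan, np.nan, np.nan, 7.0],
})
sample['SIC'] = fill_with_group_mode(sample, 'CIK', 'SIC')
```

Computing the modes once per group and mapping them back avoids calling `fillna` inside a per-group lambda, which is where the original one-liner fails on empty modes.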

Obtain unstandardized factor scores from factor analysis in R

﹥>﹥吖頭↗ submitted on 2019-12-06 16:19:51
I'm conducting a factor analysis of several variables in R using factanal() (but am open to using other packages). I want to determine each case's factor score, but I want the factor scores to be unstandardized and on the original metric of the input variables. When I run the factor analysis and obtain the factor scores, they are standardized with a normal distribution of mean=0, SD=1, and are not on the original metric of the input variables. How can I obtain unstandardized factor scores that have the same metric as the input variables? Ideally, this would mean a similar mean, sd, range, and

Efficient solution for forward filling missing values in a pandas dataframe column?

孤街醉人 submitted on 2019-12-06 15:48:06
I need to forward fill values in a column of a dataframe within groups. I should note that the first value in a group is never missing by construction. I have the following solutions at the moment.

    df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, 2, np.nan, np.nan]})

    # desired output
    a b
    1 1
    1 1
    2 2
    2 2
    2 2

Here are the three solutions that I've tried so far.

    # really slow solutions
    df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
    df['b'] = df.groupby('a')['b'].fillna(method='ffill')

    # much faster solution, but more memory intensive and ugly all around
    tmp = df
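For comparison with the attempts quoted above, here is a minimal sketch that uses the group-by object's own `ffill()` method instead of `fillna(method='ffill')`. It is not taken from the truncated post, and whether it is faster on the asker's data is not something this sketch can confirm. In recent pandas versions the `method=` argument of `fillna` is deprecated, so `ffill()` is also the spelling less likely to emit warnings.

```python
import numpy as np
import pandas as pd

# Same toy frame as in the question.
df = pd.DataFrame({'a': [1, 1, 2, 2, 2],
                   'b': [1, np.nan, 2, np.nan, np.nan]})

# Forward fill 'b' separately inside each group of 'a';
# values never leak from one group into the next.
df['b'] = df.groupby('a')['b'].ffill()
```

Because the first value of every group is never missing (as the question states), no NaN remains after this step.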

filling in missing data in one data frame with info from another

江枫思渺然 submitted on 2019-12-06 15:31:05
Question: There are two data sets, A and B, as below:

    A <- data.frame(TICKER=c("00EY","00EY","00EY","00EY","00EY"),
                    CUSIP=c(NA,NA,"48205A10","48205A10","48205A10"),
                    OFTIC=c(NA,NA,"JUNO","JUNO","JUNO"),
                    CNAME=c(NA,NA, "JUNO", "JUNO","JUNO"),
                    ANNDATS=c("2015-01-13","2015-01-13","2015-01-13","2015-01-13","2015-01-13"),
                    ANALYS=c(00076659,00105887,00153117,00148921,00086659),
                    stringsAsFactors = F)

    B <- data.frame(TICKER=c("00EY","00EY","00EY","00EY"),
                    CUSIP=c("48205A10","48205A10","48205A10","48205A10"),
                    OFTIC

R - Plotting a line with missing NA values

别等时光非礼了梦想. submitted on 2019-12-06 13:17:31
I have the following data.frame, "subset":

    Time                 A   B   C
    2016-10-07 06:16:46  NA  NA  41
    2016-10-07 06:26:27  40  39  42
    2016-10-07 06:38:23  NA  40  NA
    2016-10-07 06:41:06  42  42  44
    2016-10-07 06:41:06  NA  42  44
    2016-10-07 06:41:06  NA  42  44
    2016-10-07 06:41:07  44  43  48
    2016-10-07 06:41:41  NA  43  48
    2016-10-07 06:42:44  45  42  48
    2016-10-07 06:48:40  46  45  48

I would like to have a plot where "Time" is the x-axis, "A" is a line and "B" and "C" are points. However, when I plot this, the only line that appears for "A" is the one connecting the last 2 dots (45 and 46), because these are the only 2 consecutive

Fill up missing values using the other data?

允我心安 submitted on 2019-12-06 12:44:40
    A <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00EF", "00EF", "00FR", "00FR"),
                    Item_B = c(NA, NA, NA, NA, "JAMES RIVER", NA, NA))
    B <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00FR", "00FR"),
                    Item_B = c("JAMES RIVER", NA, "JAMES RIVER", "RICE MIDSTREAM", "RICE MIDSTREAM"))

Expected:

    A <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00EF", "00EF", "00FR", "00FR"),
                    Item_B = c("JAMES RIVER", "JAMES RIVER", "JAMES RIVER", "JAMES RIVER", "JAMES RIVER", "RICE MIDSTREAM", "RICE MIDSTREAM"))
    B <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00FR", "00FR"),
                    Item_B = c("JAMES RIVER",

Multi-Indexed fillna in Pandas

拜拜、爱过 submitted on 2019-12-06 09:24:41
Question: I have a multi-indexed dataframe and I'm looking to backfill missing values within a group. The dataframe I have currently looks like this:

    df = pd.DataFrame({
        'group': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2,
        'Date': ["2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30", "2013-08-06", "2013-09-03", "2013-10-01", "2013-07-09", "2013-08-06", "2013-09-03", "2013-07-09", "2013-09-03"],
        'Value': [np.nan, np.nan, np.nan, 9, 4, 40, 18, np.nan, np.nan, 5, np.nan, 2]})
    df.Date = df[
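The question's own setup code is cut off after `df.Date = df[`, so the index construction below is an assumption: the sketch parses `Date` with `pd.to_datetime` and builds a (`group`, `Date`) MultiIndex, which may differ from what the asker did. With that in place, grouping on the `group` level and calling `bfill()` backfills each group independently.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2,
    'Date': ["2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30",
             "2013-08-06", "2013-09-03", "2013-10-01", "2013-07-09",
             "2013-08-06", "2013-09-03", "2013-07-09", "2013-09-03"],
    'Value': [np.nan, np.nan, np.nan, 9, 4, 40, 18,
              np.nan, np.nan, 5, np.nan, 2]})

# Assumed setup (the original is truncated): dates parsed, then a two-level index.
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index(['group', 'Date']).sort_index()

# Backfill inside each group only; a group's trailing NaNs would stay NaN
# rather than borrowing a value from the next group.
df['Value'] = df.groupby(level='group')['Value'].bfill()
```

Grouping by index level keeps the MultiIndex intact, so the result lines up with the original rows without a reset or merge.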

Find the missing values in Julia like R's is.na function

南笙酒味 submitted on 2019-12-06 08:32:51
The Julia 1.0.0 documentation says this about missing values in Julia and R:

    In Julia, missing values are represented by the missing object rather than by NA. Use ismissing(x) instead of isna(x). The skipmissing function is generally used instead of na.rm=TRUE (though in some particular cases functions take a skipmissing argument).

Here is example code in R that I would like to duplicate in Julia:

    > v = c(1, 2, NA, 4)
    > is.na(v)
    [1] FALSE FALSE TRUE FALSE

(First note that is.na is the R function's correct spelling, not isna as shown in the quote above, but that is not my point.) If I follow

R - Calculate difference (similarity measure) between similar datasets

时间秒杀一切 submitted on 2019-12-06 06:08:41
I have seen many questions that touch on this topic but haven't yet found an answer. If I have missed a question that answers this one, please flag this and point us to it. Scenario: We have a benchmark dataset and two imputation methods. We systematically delete values from the benchmark and apply each imputation method, which gives us three datasets: the benchmark, imputedData1 and imputedData2. Question: Is there a function that can produce a number representing the difference between the benchmark and imputedData1 and/or the difference between the benchmark and imputedData2?