missing-data | 易学教程

How to remove more than 2 consecutive NA's in a column?

Question: I am new to R. In my data frame I have col1 ("Timestamp") and col2 ("Values"). I have to remove the rows that form runs of more than 2 consecutive NAs in col2. My data frame looks like the one below:

Timestamp  | values
-----------|--------
2011-01-02 | 2
2011-01-03 | 3
2011-01-04 | NA
2011-01-05 | 1
2011-01-06 | NA
2011-01-07 | NA
2011-01-08 | 8
2011-01-09 | 6
2011-01-10 | NA
2011-01-11 | NA
2011-01-12 | NA
2011-01-13 | 2

I would like to remove more than 2 duplicate rows based on the second column. Expected output -
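
The question is about R, but the run-length idea is the same in any data-frame library. Below is a minimal pandas sketch. It assumes that "more than 2 consecutive" means rows in NA runs of length three or more are dropped while shorter runs are kept; the expected output is cut off above, so that reading is an assumption. The names `run_id` and `run_len` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Timestamp": pd.date_range("2011-01-02", periods=12, freq="D"),
    "values": [2, 3, np.nan, 1, np.nan, np.nan, 8, 6, np.nan, np.nan, np.nan, 2],
})

is_na = df["values"].isna()
# A new run starts whenever the NA flag changes; cumsum gives each run an id.
run_id = (is_na != is_na.shift()).cumsum()
# Length of the run that each row belongs to.
run_len = run_id.map(run_id.value_counts())
# Drop rows that are NA and belong to a run longer than 2 (assumed rule).
result = df[~(is_na & (run_len > 2))]
print(result)
```

With this rule the single NA on 2011-01-04 and the pair on 2011-01-06 and 2011-01-07 stay, and only the three-row run from 2011-01-10 to 2011-01-12 is removed.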

How to replace missing values with group mode in Pandas?

I followed the method in this post to replace missing values with the group mode, but I get "IndexError: index out of bounds".

    df['SIC'] = df.groupby('CIK').SIC.apply(lambda x: x.fillna(x.mode()[0]))

I guess this is probably because some groups have all missing values and therefore do not have a mode. Is there a way to get around this? Thank you!

Mode is quite difficult, given that there really isn't any agreed-upon way to deal with ties. Plus, it's typically very slow. Here's one way that will be "fast". We'll define a function that calculates the mode for each group, then we can fill the missing
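
The answer's own "fast" approach is cut off above, so the following is not that method. It is a minimal sketch of the simplest guard: skip the fill when a group has no mode. The sample data and the helper name `fill_with_mode` are made up for illustration; ties are resolved by taking the first (smallest) mode, which is just one possible convention.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: group 3 has only missing values, so it has no mode.
df = pd.DataFrame({
    "CIK": [1, 1, 1, 2, 2, 3, 3],
    "SIC": [10.0, 10.0, np.nan, 20.0, np.nan, np.nan, np.nan],
})

def fill_with_mode(s):
    """Fill NaN with the group's mode; leave all-NaN groups untouched."""
    modes = s.mode()  # NaN is ignored, so this is empty for an all-NaN group
    if modes.empty:
        return s
    return s.fillna(modes.iloc[0])

df["SIC"] = df.groupby("CIK")["SIC"].transform(fill_with_mode)
print(df)
```

Groups 1 and 2 are filled with 10 and 20 respectively, while group 3 stays NaN instead of raising an IndexError.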

Obtain unstandardized factor scores from factor analysis in R

I'm conducting a factor analysis of several variables in R using factanal() (but am open to using other packages). I want to determine each case's factor score, but I want the factor scores to be unstandardized and on the original metric of the input variables. When I run the factor analysis and obtain the factor scores, they are standardized with a normal distribution of mean=0, SD=1, and are not on the original metric of the input variables. How can I obtain unstandardized factor scores that have the same metric as the input variables? Ideally, this would mean a similar mean, sd, range, and

Efficient solution for forward filling missing values in a pandas dataframe column?

I need to forward fill values in a column of a dataframe within groups. I should note that the first value in a group is never missing by construction. I have the following solutions at the moment.

    df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, 2, np.nan, np.nan]})

    # desired output
    a  b
    1  1
    1  1
    2  2
    2  2
    2  2

Here are the three solutions that I've tried so far.

    # really slow solutions
    df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
    df['b'] = df.groupby('a')['b'].fillna(method='ffill')

    # much faster solution, but more memory intensive and ugly all around
    tmp = df
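
For comparison, here is a minimal sketch of two other ways to get the desired output. The first uses the group-wise `ffill()` method directly. The second relies on the question's stated guarantee that the first value of each group is never missing, plus the assumption that each group's rows are contiguous; under those conditions a plain column-wide forward fill cannot leak values across groups. No timing claims are made here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2, 2], 'b': [1, np.nan, 2, np.nan, np.nan]})

# Option 1: group-wise forward fill.
grouped = df.groupby('a')['b'].ffill()

# Option 2: plain forward fill. Only safe because every group starts with a
# non-missing value and the rows of each group are contiguous (assumption).
plain = df['b'].ffill()

df['b'] = grouped
print(df)
```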

filling in missing data in one data frame with info from another

Question: There are two data sets, A and B, as below:

    A <- data.frame(TICKER=c("00EY","00EY","00EY","00EY","00EY"),
                    CUSIP=c(NA,NA,"48205A10","48205A10","48205A10"),
                    OFTIC=c(NA,NA,"JUNO","JUNO","JUNO"),
                    CNAME=c(NA,NA, "JUNO", "JUNO","JUNO"),
                    ANNDATS=c("2015-01-13","2015-01-13","2015-01-13","2015-01-13","2015-01-13"),
                    ANALYS=c(00076659,00105887,00153117,00148921,00086659),
                    stringsAsFactors = F)
    B <- data.frame(TICKER=c("00EY","00EY","00EY","00EY"),
                    CUSIP=c("48205A10","48205A10","48205A10","48205A10"),
                    OFTIC

R - Plotting a line with missing NA values

I have the following data.frame, "subset":

    Time                 A   B   C
    2016-10-07 06:16:46  NA  NA  41
    2016-10-07 06:26:27  40  39  42
    2016-10-07 06:38:23  NA  40  NA
    2016-10-07 06:41:06  42  42  44
    2016-10-07 06:41:06  NA  42  44
    2016-10-07 06:41:06  NA  42  44
    2016-10-07 06:41:07  44  43  48
    2016-10-07 06:41:41  NA  43  48
    2016-10-07 06:42:44  45  42  48
    2016-10-07 06:48:40  46  45  48

I would like to have a plot where "Time" is the x-axis, "A" is a line, and "B" and "C" are points. However, when I plot this, the only line that appears for "A" is the one connecting the last 2 dots (45 and 46), because these are the only 2 consecutive

Fill up missing values using the other data?

    A <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00EF", "00EF", "00FR", "00FR"),
                    Item_B = c(NA, NA, NA, NA, "JAMES RIVER", NA, NA))
    B <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00FR", "00FR"),
                    Item_B = c("JAMES RIVER", NA, "JAMES RIVER", "RICE MIDSTREAM", "RICE MIDSTREAM"))

Expected:

    A <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00EF", "00EF", "00FR", "00FR"),
                    Item_B = c("JAMES RIVER", "JAMES RIVER", "JAMES RIVER", "JAMES RIVER", "JAMES RIVER", "RICE MIDSTREAM", "RICE MIDSTREAM"))
    B <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00FR", "00FR"),
                    Item_B = c("JAMES RIVER",
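
The question is written in R; the following is a pandas sketch of one way to reach the expected `A`, shown only to illustrate the lookup idea. It assumes every `Item_A` key maps to exactly one `Item_B` name, which holds in the sample but is not stated in the question. The name `lookup` is illustrative.

```python
import numpy as np
import pandas as pd

# The two frames from the question, transcribed to pandas.
A = pd.DataFrame({
    "Item_A": ["00EF"] * 5 + ["00FR"] * 2,
    "Item_B": [np.nan, np.nan, np.nan, np.nan, "JAMES RIVER", np.nan, np.nan],
})
B = pd.DataFrame({
    "Item_A": ["00EF", "00EF", "00EF", "00FR", "00FR"],
    "Item_B": ["JAMES RIVER", np.nan, "JAMES RIVER", "RICE MIDSTREAM", "RICE MIDSTREAM"],
})

# Build one known name per key from both frames (assumes one name per key).
lookup = (
    pd.concat([A, B])
    .dropna(subset=["Item_B"])
    .drop_duplicates("Item_A")
    .set_index("Item_A")["Item_B"]
)

# Fill only the missing cells of A, matching on Item_A.
A["Item_B"] = A["Item_B"].fillna(A["Item_A"].map(lookup))
print(A)
```

The same `fillna(... .map(lookup))` line could be applied to `B` if its missing cells should be filled as well.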

Multi-Indexed fillna in Pandas

Question: I have a multi-indexed dataframe and I'm looking to backfill missing values within a group. The dataframe I have currently looks like this:

    df = pd.DataFrame({
        'group': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2,
        'Date': ["2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30", "2013-08-06", "2013-09-03", "2013-10-01", "2013-07-09", "2013-08-06", "2013-09-03", "2013-07-09", "2013-09-03"],
        'Value': [np.nan, np.nan, np.nan, 9, 4, 40, 18, np.nan, np.nan, 5, np.nan, 2]})
    df.Date = df[
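
The setup code is cut off above, so the index construction in this sketch is an assumption: `Date` is parsed with `pd.to_datetime` and the frame is indexed by `group` and `Date`. Given that, a group-wise backfill can be written by grouping on the `group` index level.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2,
    'Date': ["2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30",
             "2013-08-06", "2013-09-03", "2013-10-01", "2013-07-09",
             "2013-08-06", "2013-09-03", "2013-07-09", "2013-09-03"],
    'Value': [np.nan, np.nan, np.nan, 9, 4, 40, 18, np.nan, np.nan, 5, np.nan, 2]})

# Assumed index setup (the original lines are truncated in the excerpt).
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index(['group', 'Date'])

# Backfill within each group so values never cross a group boundary.
df['Value'] = df.groupby(level='group')['Value'].bfill()
print(df)
```

Each leading gap is filled from the next observed value in the same group, so `group_a` starts with 9, `group_b` with 5, and `group_c` with 2.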

Find the missing values in Julia like R's is.na function

The Julia 1.0.0 documentation says this about missing values in Julia and R:

    In Julia, missing values are represented by the missing object rather than by NA. Use ismissing(x) instead of isna(x). The skipmissing function is generally used instead of na.rm=TRUE (though in some particular cases functions take a skipmissing argument).

Here is example code in R that I would like to duplicate in Julia:

    > v = c(1, 2, NA, 4)
    > is.na(v)
    [1] FALSE FALSE  TRUE FALSE

(First note that is.na is the R function's correct spelling, not isna as shown in the quote above, but that is not my point.) If I follow

R - Calculate difference (similarity measure) between similar datasets

I have seen many questions that touch on this topic but haven't yet found an answer. If I have missed a question that does answer this one, please mark this and point us to it.

Scenario: We have a benchmark dataset and imputation methods. We systematically delete values from the benchmark and fill them back in using two different imputation methods. Thus we have a benchmark, imputedData1 and imputedData2.

Question: Is there a function that can produce a number that represents the difference between the benchmark and imputedData1 and/or the difference between the benchmark and imputedData2?
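
One common choice for such a number is the root-mean-square error (RMSE) over the cells that were deleted. The sketch below is written in Python rather than R, and it is one possible measure, not the answer from the original thread. The function name `rmse`, the `mask` argument, and the tiny arrays are all hypothetical, and it assumes purely numeric data.

```python
import numpy as np

def rmse(benchmark, imputed, mask=None):
    """Root-mean-square difference, optionally restricted to masked cells."""
    benchmark = np.asarray(benchmark, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    diff = benchmark - imputed
    if mask is not None:
        # Only compare the cells that were deleted and then imputed.
        diff = diff[np.asarray(mask, dtype=bool)]
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical toy data: two cells were deleted and imputed by each method.
benchmark = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[False, True], [True, False]])
imputed1 = np.array([[1.0, 2.5], [3.0, 4.0]])
imputed2 = np.array([[1.0, 2.0], [5.0, 4.0]])

score1 = rmse(benchmark, imputed1, mask)
score2 = rmse(benchmark, imputed2, mask)
print(score1, score2)
```

A lower score means the imputed values sit closer to the benchmark, so in this toy case the first method would rank ahead of the second. Restricting the comparison to the deleted cells avoids diluting the score with cells that were never imputed.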