missing-data

R: replace NA with item from vector

霸气de小男生 submitted on 2019-11-27 14:07:37
Question: I am trying to replace some missing values in my data with the average values from a similar group. My data looks like this:
      X  Y
    1 x  y
    2 x  y
    3 NA y
    4 x  y
And I want it to look like this:
      X Y
    1 x y
    2 x y
    3 y y
    4 x y
I wrote this, and it worked:
    for (i in 1:nrow(data.frame)) {
      if (is.na(data.frame$X[i])) {
        data.frame$X[i] <- data.frame$Y[i]
      }
    }
But my data.frame is almost half a million lines long, and the for/if statements are pretty slow. What I want is something like is.na(data.frame$X) <

Fill in missing pandas data with previous non-missing value, grouped by key

痴心易碎 submitted on 2019-11-27 13:03:08
Question: I am dealing with pandas DataFrames like this:
       id    x
    0   1   10
    1   1   20
    2   2  100
    3   2  200
    4   1  NaN
    5   2  NaN
    6   1  300
    7   1  NaN
I would like to replace each NaN 'x' with the previous non-NaN 'x' from a row with the same 'id' value:
       id    x
    0   1   10
    1   1   20
    2   2  100
    3   2  200
    4   1   20
    5   2  200
    6   1  300
    7   1  300
Is there some slick way to do this without manually looping over rows?
Answer 1: You could perform a groupby/forward-fill operation on each group:
    import numpy as np
    import pandas as pd
    df = pd.DataFrame({'id': [1,1,2
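The answer excerpt above is cut off, so the following is only a minimal sketch of a groupby/forward-fill, not necessarily the answerer's exact code. The DataFrame is rebuilt from the table in the question, and `GroupBy.ffill` is used as one way to carry the last non-missing value forward within each `id`.

```python
import numpy as np
import pandas as pd

# Data reconstructed from the table in the question.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 1, 2, 1, 1],
    "x": [10, 20, 100, 200, np.nan, np.nan, 300, np.nan],
})

# Forward-fill x separately within each id group.
# The result is aligned to the original index, so row order is preserved.
df["x"] = df.groupby("id")["x"].ffill()

print(df)
```

A leading NaN in a group (one with no earlier value for that `id`) would stay NaN under this approach, because there is nothing to carry forward.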

Handling missing/incomplete data in R--is there function to mask but not remove NAs?

杀马特。学长 韩版系。学妹 submitted on 2019-11-27 11:35:44
Question: As you would expect from a DSL aimed at data analysis, R handles missing/incomplete data very well. For instance, many R functions have an na.rm flag that, when set to TRUE, removes the NAs before computing:
    > v = mean(c(5, NA, 6, 12, NA, 87, 9, NA, 43, 67), na.rm=TRUE)
    > v   # 32.71429, the mean of the non-missing values 5, 6, 12, 87, 9, 43, 67
But if you want to deal with NAs before the function call, you need to do something like this:
to remove each 'NA' from a vector:
    vx = vx[!is.na(vx)]
to remove each 'NA' from a vector and replace it w/ a '0':

How to fill NAs with LOCF by factors in data frame, split by country

主宰稳场 submitted on 2019-11-27 11:18:40
Question: I have the following data frame (simplified), in which the country variable is a factor and the value variable has missing values:
    country value
    AUT     NA
    AUT     5
    AUT     NA
    AUT     NA
    GER     NA
    GER     NA
    GER     7
    GER     NA
    GER     NA
The following generates the above data frame:
    data <- data.frame(country=c("AUT", "AUT", "AUT", "AUT", "GER", "GER", "GER", "GER", "GER"),
                       value=c(NA, 5, NA, NA, NA, NA, 7, NA, NA))
Now, I would like to replace the NA values in each country subset using the method last observation carried

Python, Pandas : Return only those rows which have missing values

穿精又带淫゛_ submitted on 2019-11-27 10:43:06
Question: While working in Pandas in Python... I'm working with a dataset that contains some missing values, and I'd like to return a dataframe which contains only those rows which have missing data. Is there a nice way to do this? (My current method to do this is an inefficient "look to see what index isn't in the dataframe without the missing values, then make a df out of those indices.")
Answer 1: You can use any(axis=1) to check for at least one True per row, then filter with boolean indexing: null_data =
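The answer excerpt above is truncated, so this is a hedged sketch of the technique it describes (a row-wise `any` over the null mask, then boolean indexing), not a reproduction of the answerer's code. The DataFrame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; column names are for illustration only.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["w", "x", None, "z"],
})

# isnull() gives a boolean frame; any(axis=1) is True for rows
# that contain at least one missing value.
mask = df.isnull().any(axis=1)

# Boolean indexing keeps only the rows flagged by the mask.
rows_with_missing = df[mask]

print(rows_with_missing)
```

The complement, `df[~mask]`, gives the rows with no missing values, which is the same set of rows `df.dropna()` would keep.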

How to replace NA (missing values) in a data frame with neighbouring values

会有一股神秘感。 submitted on 2019-11-27 10:10:16
Question:
    862 2006-05-19 6.241603 5.774208
    863 2006-05-20       NA       NA
    864 2006-05-21       NA       NA
    865 2006-05-22 6.383929 5.906426
    866 2006-05-23 6.782068 6.268758
    867 2006-05-24 6.534616 6.013767
    868 2006-05-25 6.370312 5.856366
    869 2006-05-26 6.225175 5.781617
    870 2006-05-27       NA       NA
I have a data frame x like the one above with some NAs, which I want to fill using neighbouring non-NA values. For example, the value for 2006-05-20 would be the average of the values for 2006-05-19 and 2006-05-22. How can I do this?
Answer 1: Properly formatted, your data looks like this: 862 2006-05-19

R - Fill missing dates by group

空扰寡人 submitted on 2019-11-27 09:44:22
In my data, there exist observations for some IDs in some months and not for others, e.g.
    dat <- data.frame(c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4),
                      c(rep(30, 2), rep(25, 5), rep(20, 3)),
                      c('2017-01-01', '2017-02-01', '2017-04-01', '2017-02-01', '2017-01-01',
                        '2017-02-01', '2017-03-01', '2017-01-01', '2017-02-01', '2017-04-01'))
    colnames(dat) <- c('id', 'value', 'date')
I would like to, for each id value, insert a row that includes the month(s) missing for that id and NA for value. Is there a way to (somewhat) concisely do this for all months in seq(min(as.Date(dat$date)), max(as.Date(dat$date)), by

Can't drop NAN with dropna in pandas

烈酒焚心 submitted on 2019-11-27 09:34:25
I import pandas as pd, run the code below, and get the following result.
Code:
    traindataset = pd.read_csv('/Users/train.csv')
    print(traindataset.dtypes)
    print(traindataset.shape)
    print(traindataset.iloc[25,3])
    traindataset.dropna(how='any')
    print(traindataset.iloc[25,3])
    print(traindataset.shape)
Output:
    TripType                   int64
    VisitNumber                int64
    Weekday                   object
    Upc                      float64
    ScanCount                  int64
    DepartmentDescription     object
    FinelineNumber           float64
    dtype: object
    (647054, 7)
    nan
    nan
    (647054, 7)
    [Finished in 2.2s]
From the result, the dropna line doesn't work because the row number doesn't change and there is still NAN
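One likely explanation for the output above is that `DataFrame.dropna` returns a new DataFrame by default and leaves the original untouched, so the result has to be assigned. The sketch below shows this on a small hypothetical frame; the column names are borrowed from the question, but the values are made up.

```python
import numpy as np
import pandas as pd

# Small hypothetical stand-in for the CSV in the question.
traindataset = pd.DataFrame({
    "Upc": [1.0, np.nan, 3.0],
    "ScanCount": [1, 2, 3],
})

# Calling dropna without using its return value changes nothing:
# the cleaned copy is created and immediately discarded.
traindataset.dropna(how="any")
shape_after_discarded_call = traindataset.shape

# Assigning the result keeps the cleaned copy.
cleaned = traindataset.dropna(how="any")

print(shape_after_discarded_call)  # original shape, NaN row still present
print(cleaned.shape)               # one row fewer
```

Reassigning to the same name (`traindataset = traindataset.dropna(how="any")`) has the same effect when the original frame is no longer needed.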

How do I deal with NAs in residuals in a regression in R?

泪湿孤枕 submitted on 2019-11-27 09:06:49
So I am having some issues with NA values in the residuals of an lm cross-sectional regression in R. The issue isn't the NA values themselves, it's the way R presents them. For example:
    test$residuals
    #          1          2          4          5
    #  0.2757677 -0.5772193 -5.3061303  4.5102816
    test$residuals[3]
    #        4
    # -5.30613
In this simple example an NA value will make one of the residuals go missing. When I extract the residuals I can clearly see the third index missing. So far so good, no complaints here. The problem is that the corresponding numeric vector is now one item shorter, so the third element is actually the one labelled 4. How

Select NA in a data.table in R

£可爱£侵袭症+ submitted on 2019-11-27 08:02:29
How do I select all the rows that have a missing value in the primary key in a data table?
    DT = data.table(x=rep(c("a","b",NA),each=3), y=c(1,3,6), v=1:9)
    setkey(DT,x)
Selecting for a particular value is easy:
    DT["a",]
Selecting for the missing values seems to require a vector search. One cannot use binary search. Am I correct?
    DT[NA,]        # does not work
    DT[is.na(x),]  # does work
Fortunately, DT[is.na(x),] is nearly as fast as (e.g.) DT["a",], so in practice, this may not really matter much:
    library(data.table)
    library(rbenchmark)
    DT = data.table(x=rep(c("a","b",NA),each=3e6), y=c(1,3,6), v=1:9)