missing-data | 易学教程

R: replace NA with item from vector

Question: I am trying to replace some missing values in my data with the average values from a similar group. My data looks like this:

      X  Y
    1 x  y
    2 x  y
    3 NA y
    4 x  y

And I want it to look like this:

      X Y
    1 x y
    2 x y
    3 y y
    4 x y

I wrote this, and it worked:

    for (i in 1:nrow(data.frame)) {
      if (is.na(data.frame$X[i])) {
        data.frame$X[i] <- data.frame$Y[i]
      }
    }

But my data.frame is almost half a million rows long, and the for/if statements are pretty slow. What I want is something like is.na(data.frame$X) <

Fill in missing pandas data with previous non-missing value, grouped by key

Question: I am dealing with pandas DataFrames like this:

       id    x
    0   1   10
    1   1   20
    2   2  100
    3   2  200
    4   1  NaN
    5   2  NaN
    6   1  300
    7   1  NaN

I would like to replace each NaN 'x' with the previous non-NaN 'x' from a row with the same 'id' value:

       id    x
    0   1   10
    1   1   20
    2   2  100
    3   2  200
    4   1   20
    5   2  200
    6   1  300
    7   1  300

Is there some slick way to do this without manually looping over rows?

Answer 1: You could perform a groupby/forward-fill operation on each group:

    import numpy as np
    import pandas as pd
    df = pd.DataFrame({'id': [1,1,2
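The answer's code is cut off in this listing, so the following is only a minimal sketch of a groupby/forward-fill, not necessarily the answerer's exact code. The data is rebuilt from the table in the question, and it assumes a pandas version in which `ffill()` is available on grouped Series.

```python
import numpy as np
import pandas as pd

# Data rebuilt from the question's table (NaN marks the missing 'x' values).
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 1, 2, 1, 1],
    "x": [10, 20, 100, 200, np.nan, np.nan, 300, np.nan],
})

# Forward-fill 'x' within each 'id' group; the original row order is kept.
df["x"] = df.groupby("id")["x"].ffill()

print(df)
```

A group whose first row is NaN would keep that NaN, because there is no earlier value in the group to carry forward.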

Handling missing/incomplete data in R--is there function to mask but not remove NAs?

Question: As you would expect from a DSL aimed at data analysis, R handles missing/incomplete data very well. For instance, many R functions have an na.rm flag that, when set to TRUE, removes the NAs:

    >>> v = mean( c(5, NA, 6, 12, NA, 87, 9, NA, 43, 67), na.rm=T)
    >>> v
    (5, 6, 12, 87, 9, 43, 67)

But if you want to deal with NAs before the function call, you need to do something like this:

to remove each 'NA' from a vector:

    vx = vx[!is.na(a)]

to remove each 'NA' from a vector and replace it w/ a '0':

How to fill NAs with LOCF by factors in data frame, split by country

Question: I have the following data frame (simplified), in which the country variable is a factor and the value variable has missing values:

    country value
    AUT     NA
    AUT     5
    AUT     NA
    AUT     NA
    GER     NA
    GER     NA
    GER     7
    GER     NA
    GER     NA

The following generates the above data frame:

    data <- data.frame(country=c("AUT", "AUT", "AUT", "AUT", "GER", "GER", "GER", "GER", "GER"),
                       value=c(NA, 5, NA, NA, NA, NA, 7, NA, NA))

Now, I would like to replace the NA values in each country subset using the method last observation carried

Python, Pandas : Return only those rows which have missing values

Question: I'm working in pandas with a dataset that contains some missing values, and I'd like to return a DataFrame that contains only those rows which have missing data. Is there a nice way to do this? (My current method is an inefficient "look to see what index isn't in the DataFrame without the missing values, then make a df out of those indices.")

Answer 1: You can use any with axis=1 to check for at least one True per row, then filter with boolean indexing: null_data =
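The answer's code is truncated here, so this is a hedged sketch of the technique it describes (a per-row `any` over the null mask, followed by boolean indexing). The toy DataFrame and its column names `a` and `b` are made up for illustration and are not from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is missing 'a', row 2 is missing 'b'.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", None, "w"],
})

# isnull() gives a boolean mask per cell; any(axis=1) reduces it to one
# boolean per row that is True when at least one cell in the row is missing.
has_missing = df.isnull().any(axis=1)

# Boolean indexing keeps only the rows flagged above.
null_data = df[has_missing]

print(null_data)
```

Negating the mask (`df[~has_missing]`) would select the complete rows instead.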

How to replace NA (missing values) in a data frame with neighbouring values

Question:

    862 2006-05-19 6.241603 5.774208
    863 2006-05-20       NA       NA
    864 2006-05-21       NA       NA
    865 2006-05-22 6.383929 5.906426
    866 2006-05-23 6.782068 6.268758
    867 2006-05-24 6.534616 6.013767
    868 2006-05-25 6.370312 5.856366
    869 2006-05-26 6.225175 5.781617
    870 2006-05-27       NA       NA

I have a data frame x like the one above with some NAs, which I want to fill using neighboring non-NA values. For example, the value for 2006-05-20 would be the average of the values for 2006-05-19 and 2006-05-22. How do I do this?

Answer 1: Properly formatted, your data looks like this: 862 2006-05-19

R - Fill missing dates by group

In my data, there exist observations for some IDs in some months and not for others, e.g.

    dat <- data.frame(c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4),
                      c(rep(30, 2), rep(25, 5), rep(20, 3)),
                      c('2017-01-01', '2017-02-01', '2017-04-01', '2017-02-01', '2017-01-01',
                        '2017-02-01', '2017-03-01', '2017-01-01', '2017-02-01', '2017-04-01'))
    colnames(dat) <- c('id', 'value', 'date')

I would like to, for each id value, insert a row that includes the month(s) missing for that id and NA for value. Is there a way to (somewhat) concisely do this for all months in seq(min(as.Date(dat$date)), max(as.Date(dat$date)), by

Can't drop NAN with dropna in pandas

I import pandas as pd, run the code below, and get the following result.

Code:

    traindataset = pd.read_csv('/Users/train.csv')
    print traindataset.dtypes
    print traindataset.shape
    print traindataset.iloc[25,3]
    traindataset.dropna(how='any')
    print traindataset.iloc[25,3]
    print traindataset.shape

Output:

    TripType                   int64
    VisitNumber                int64
    Weekday                   object
    Upc                      float64
    ScanCount                  int64
    DepartmentDescription     object
    FinelineNumber           float64
    dtype: object
    (647054, 7)
    nan
    nan
    (647054, 7)
    [Finished in 2.2s]

From the result, the dropna line doesn't work because the row number doesn't change and there is still NAN
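No answer survives in this listing, so the following is only a sketch of one likely explanation: by default `DataFrame.dropna` returns a new DataFrame and leaves the original untouched, so a call whose result is discarded has no visible effect. The toy values below are made up; only the column names `Upc` and `ScanCount` are borrowed from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV in the question.
traindataset = pd.DataFrame({
    "Upc": [1.0, np.nan, 3.0],
    "ScanCount": [1, 2, 3],
})

# The result is discarded here, so traindataset itself is unchanged.
traindataset.dropna(how="any")
shape_after_discarded_call = traindataset.shape

# Assigning the returned DataFrame keeps the filtered rows.
cleaned = traindataset.dropna(how="any")

print(shape_after_discarded_call)  # still includes the NaN row
print(cleaned.shape)               # the NaN row has been dropped
```

Note that `cleaned` keeps the original index labels, so label-based and position-based lookups can differ after rows are dropped.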

How do I deal with NAs in residuals in a regression in R?

I am having some issues with NA values in the residuals of an lm cross-sectional regression in R. The issue isn't the NA values themselves, it's the way R presents them. For example:

    test$residuals
    #          1          2          4          5
    #  0.2757677 -0.5772193 -5.3061303  4.5102816
    test$residuals[3]
    #        4
    # -5.30613

In this simple example an NA value will make one of the residuals go missing. When I extract the residuals I can clearly see the third index missing. So far so good, no complaints here. The problem is that the corresponding numeric vector is now one item shorter, so the third index is actually the fourth. How

Select NA in a data.table in R

How do I select all the rows that have a missing value in the primary key in a data table?

    DT = data.table(x=rep(c("a","b",NA),each=3), y=c(1,3,6), v=1:9)
    setkey(DT,x)

Selecting for a particular value is easy:

    DT["a",]

Selecting for the missing values seems to require a vector search. One cannot use binary search. Am I correct?

    DT[NA,]        # does not work
    DT[is.na(x),]  # does work

Fortunately, DT[is.na(x),] is nearly as fast as (e.g.) DT["a",], so in practice this may not really matter much:

    library(data.table)
    library(rbenchmark)
    DT = data.table(x=rep(c("a","b",NA),each=3e6), y=c(1,3,6), v=1:9)