missing-data | 易学教程

Fill NaN values

Question: I have a dataframe:

    TIMESTAMP            P_ACT_KW  PERIODE_TARIF  P_SOUSCR
    2016-01-01 00:00:00  116       HC             250
    2016-01-01 00:10:00  121       HC             250
    2016-01-01 00:20:00  121       NaN            250

To use this dataframe, I must fill the NaN values with HC or HP based on this condition: if the hour extracted from TIMESTAMP is in {0, 1, 2, 3, 4, 5, 22, 23}, I replace NaN with HC, otherwise with HP. I wrote this function:

    def prep_data(data):
        data['PERIODE_TARIF'] = np.where(data['PERIODE_TARIF'] in (0, 1, 2, 3, 4, 5, 22, 23), 'HC', 'HP')
        return data

But
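The function quoted in this excerpt tests the tariff column itself against the hour set, rather than the hour taken from TIMESTAMP. The following is a minimal sketch of the fill the question describes, assuming TIMESTAMP can be parsed by `pd.to_datetime`. The names `fill_tariff` and `OFF_PEAK_HOURS` are hypothetical, and the fourth row of the sample data is invented to exercise the HP branch.

```python
import numpy as np
import pandas as pd

# Hours that map to the off-peak tariff "HC" (taken from the question).
OFF_PEAK_HOURS = [0, 1, 2, 3, 4, 5, 22, 23]


def fill_tariff(data: pd.DataFrame) -> pd.DataFrame:
    """Fill missing PERIODE_TARIF values from the hour of TIMESTAMP."""
    out = data.copy()
    hours = pd.to_datetime(out["TIMESTAMP"]).dt.hour
    # Tariff implied by the hour, computed for every row.
    inferred = pd.Series(
        np.where(hours.isin(OFF_PEAK_HOURS), "HC", "HP"), index=out.index
    )
    # Only rows where the tariff is missing take the inferred value.
    out["PERIODE_TARIF"] = out["PERIODE_TARIF"].fillna(inferred)
    return out


df = pd.DataFrame({
    "TIMESTAMP": [
        "2016-01-01 00:00:00",
        "2016-01-01 00:10:00",
        "2016-01-01 00:20:00",
        "2016-01-01 12:00:00",  # invented row, not in the question
    ],
    "P_ACT_KW": [116, 121, 121, 130],
    "PERIODE_TARIF": ["HC", "HC", np.nan, np.nan],
    "P_SOUSCR": [250, 250, 250, 250],
})
filled = fill_tariff(df)
```

Using `fillna` with a Series leaves the existing HC/HP labels untouched, which `np.where` on its own would overwrite.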

Issue with NA values in R

I feel this should be something easy. I have looked across the internet, but I keep getting error messages. I have done plenty of analytics in the past but am new to R and programming. I have a pretty basic function to calculate means across columns of data:

    columnmean <- function(y) {
      nc <- ncol(y)
      means <- numeric(nc)
      for (i in 1:nc) {
        means[i] <- mean(y[, i])
      }
      means
    }

I'm in RStudio and testing it using the included 'airquality' dataset. When I load the AQ dataset and run my function:

    data("airquality")
    columnmean(airquality)

I get back:

    NA NA 9.957516 77.882353 6.993464 15.803922

Because the first two

Fill missing value based on probability of occurrence

This is what my data.table/dataframe looks like:

    library(data.table)
    dt <- fread('
    STATE ZIP
    PA    19333
    PA    19327
    PA    19333
    PA    NA
    PA    19355
    PA    19333
    PA    NA
    PA    19355
    PA    NA
    ')

I have three missing values in the ZIP column. I want to fill the missing values with non-missing sample values of ZIP according to their probability of occurring in the dataset. For example, ZIP 19333 occurs three times in the dataset, ZIP 19355 occurs twice, and 19327 occurs once. So ZIP 19333 has a 50% probability of occurring in the dataset for PA, 19355 has a 33.33% chance, and 19327 has a 16.67% chance of
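The idea in this excerpt, drawing each missing ZIP from the observed ZIPs in proportion to how often they occur, can be sketched as follows. This is an illustration in Python with pandas and NumPy rather than the R/data.table setup of the question, and the helper name `fill_by_frequency` and the `seed` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def fill_by_frequency(values: pd.Series, seed: int = 0) -> pd.Series:
    """Replace missing entries with draws weighted by observed frequency."""
    rng = np.random.default_rng(seed)
    # Relative frequency of each non-missing value (3/6, 2/6, 1/6 here).
    probs = values.dropna().value_counts(normalize=True)
    out = values.copy()
    missing = out.isna()
    # One weighted draw per missing entry.
    out[missing] = rng.choice(
        probs.index.to_numpy(), size=int(missing.sum()), p=probs.to_numpy()
    )
    return out


# ZIP column from the question; NaN marks the three missing values.
zips = pd.Series(
    [19333, 19327, 19333, np.nan, 19355, 19333, np.nan, 19355, np.nan]
)
filled = fill_by_frequency(zips)
```

The draws are random, so which ZIP lands in which gap depends on the seed. With several states, the same helper could be applied per group, for example through `groupby("STATE")["ZIP"].transform(fill_by_frequency)`.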

reshape from base vs dcast from reshape2 with missing values

With this data frame,

    df <- expand.grid(id="01", parameter=c("blood", "saliva"), visit=c("V1", "V2", "V3"))
    df$value <- c(1:6)
    df$sex <- rep("f", 6)
    df

    > df
      id parameter visit value sex
    1 01     blood    V1     1   f
    2 01    saliva    V1     2   f
    3 01     blood    V2     3   f
    4 01    saliva    V2     4   f
    5 01     blood    V3     5   f
    6 01    saliva    V3     6   f

When I reshape it in the "wide" format, I get identical results with both the base reshape function and the dcast function from reshape2.

    reshape(df, timevar="visit", idvar=c("id", "parameter", "sex"), direction="wide")

      id parameter sex value.V1 value.V2 value.V3
    1 01     blood   f        1        3        5
    2 01    saliva   f        2        4        6

SQL Server Interpolate Missing rows

I have the following table which records a value per day. The problem is that sometimes days are missing. I want to write a SQL query that will (1) return the missing days and (2) calculate the missing value using linear interpolation. So from the following source table:

    Date        Value
    --------------------
    2010/01/10  10
    2010/01/11  15
    2010/01/13  25
    2010/01/16  40

I want to return:

    Date        Value
    --------------------
    2010/01/10  10
    2010/01/11  15
    2010/01/12  20
    2010/01/13  25
    2010/01/14  30
    2010/01/15  35
    2010/01/16  40

Any help would be greatly appreciated.

    declare @MaxDate date
    declare @MinDate date
    select @MaxDate = MAX
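To make the target result concrete, the two steps described here (add the missing days, then interpolate linearly between known values) can be sketched in pandas. This is an illustration only, not the T-SQL solution the question asks for, and the variable names are hypothetical.

```python
import pandas as pd

# Source table from the question.
src = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2010-01-10", "2010-01-11", "2010-01-13", "2010-01-16"]
    ),
    "Value": [10, 15, 25, 40],
})

# Step 1: reindex to a daily calendar; missing days appear with NaN values.
daily = src.set_index("Date").asfreq("D")

# Step 2: interpolate linearly in time between the known neighbours.
daily["Value"] = daily["Value"].interpolate(method="time")

result = daily.reset_index()
```

Here `result` should hold the seven consecutive days from 2010-01-10 to 2010-01-16, with the gaps filled along the straight line between the surrounding known values.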

How can I find the index of all NA in a dataframe column?

I have a dataframe and in a particular column I want to find the index of all NA values. How can I do it?

    which(is.na(my.df$col.I.care.about))

Source: https://stackoverflow.com/questions/23070665/how-can-i-find-the-index-of-all-na-in-a-dataframe-column

pandas - merging with missing values

There appears to be a quirk with the pandas merge function. It considers NaN values to be equal, and will merge NaNs with other NaNs:

    >>> foo = DataFrame([
    ...     ['a',1,2],
    ...     ['b',4,5],
    ...     ['c',7,8],
    ...     [np.NaN,10,11]
    ... ], columns=['id','x','y'])
    >>> bar = DataFrame([
    ...     ['a',3],
    ...     ['c',9],
    ...     [np.NaN,12]
    ... ], columns=['id','z'])
    >>> pd.merge(foo, bar, how='left', on='id')
    Out[428]:
        id   x   y    z
    0    a   1   2    3
    1    b   4   5  NaN
    2    c   7   8    9
    3  NaN  10  11   12

    [4 rows x 4 columns]

This is unlike any RDB I've seen; normally missing values are treated with agnosticism and won't be merged together as if they are equal. This is especially
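One possible way to get the database-like behaviour the question expects is to drop rows with a missing key from the right-hand frame before merging. The sketch below is only one option and not necessarily the answer given in the original thread. It uses `np.nan`, since the `np.NaN` alias is not available in recent NumPy releases, and the names `naive` and `sql_like` are hypothetical.

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame(
    [["a", 1, 2], ["b", 4, 5], ["c", 7, 8], [np.nan, 10, 11]],
    columns=["id", "x", "y"],
)
bar = pd.DataFrame(
    [["a", 3], ["c", 9], [np.nan, 12]],
    columns=["id", "z"],
)

# Default behaviour: the NaN key in foo matches the NaN key in bar.
naive = pd.merge(foo, bar, how="left", on="id")

# Removing NaN keys from the right side first means the NaN row in foo
# finds no partner and keeps a missing z, as in a SQL left join.
sql_like = pd.merge(foo, bar.dropna(subset=["id"]), how="left", on="id")
```

Because this is a left join, only the right-hand frame needs filtering; the row with the missing key in `foo` is still kept in the output.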

Partially merge two datasets and fill in NAs in R

I have two datasets.

a = raw dataset with thousands of observations of different weather events:

       STATE EVTYPE
    1  AL    WINTER STORM
    2  AL    TORNADO
    3  AL    TSTM WIND
    4  AL    TSTM WIND
    5  AL    TSTM WIND
    6  AL    HAIL
    7  AL    HIGH WIND
    8  AL    TSTM WIND
    9  AL    TSTM WIND
    10 AL    TSTM WIND

b = a dictionary table, which has a standard spelling for some weather events:

      EVTYPE              evmatch
    1 HIGH SURF ADVISORY  <NA>
    2 COASTAL FLOOD       COASTAL FLOOD
    3 FLASH FLOOD         FLASH FLOOD
    4 LIGHTNING           LIGHTNING
    5 TSTM WIND           <NA>
    6 TSTM WIND (G45)     <NA>

Both are merged into df_new by EVTYPE:

    library(dplyr)
    df_new <- left_join(a, b, by = c("EVTYPE"))

    STATE
