missing-data | 易学教程

R: variable exclusion from formula not working in presence of missing data

Question: I'm building a model in R while excluding the 'office' column in the formula (it sometimes contains hints of the class I predict). I'm training on 'train' and predicting on 'test':

    > model <- randomForest::randomForest(tc ~ . - office, data=train, importance=TRUE, proximity=TRUE)
    > prediction <- predict(model, test, type = "class")

The prediction came back as all NAs:

    > head(prediction)
    [1] <NA> <NA> <NA> <NA> <NA> <NA>
    Levels: 2668 2752 2921 3005

The reason is that test$office contains NAs:

    >

How to combine duplicate rows in pandas?

Question: How can I combine duplicate rows in pandas, filling in missing values? In the example below, some rows have missing values in the c1 column, but the c2 column has duplicates that can be used as an index to look up and fill in those missing values.

The input data looks like this:

          c1 c2
    id
    0   10.0  a
    1    NaN  b
    2   30.0  c
    3   10.0  a
    4   20.0  b
    5    NaN  c

Desired output:

       c1 c2
    0  10  a
    1  20  b
    2  30  c

But how to do this? Here is the code to generate the example data:

    import pandas as pd
    df = pd.DataFrame({ 'c1': [10
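One possible approach, sketched here rather than taken from the original answers: group on c2 and take the first non-missing c1 in each group. The frame below is rebuilt from the table in the question, since the original generating code is cut off.

```python
import numpy as np
import pandas as pd

# Example data rebuilt from the table shown in the question.
df = pd.DataFrame(
    {
        "c1": [10, np.nan, 30, 10, 20, np.nan],
        "c2": ["a", "b", "c", "a", "b", "c"],
    },
    index=pd.Index(range(6), name="id"),
)

# GroupBy.first() returns the first non-null value per group,
# so the NaN entries are skipped in favour of the duplicate rows.
combined = df.groupby("c2", as_index=False)["c1"].first()[["c1", "c2"]]
print(combined)
```

This assumes duplicates of the same c2 never disagree on c1; if they can, the first non-missing value wins silently.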

How to efficiently extrapolate missing data for multiple variables

Question: I have panel data in which numerous variables are missing observations before certain years. The years vary across variables. What is an efficient way to extrapolate the missing data points across multiple columns? I'm thinking of something as simple as extrapolation from a linear trend, but I'm hoping to find an efficient way to apply the prediction to multiple columns. Below is a sample data set with missingness similar to what I'm dealing with. In this example, I'm hoping to fill in the NA
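The sample data from the question is cut off, so the following is only a sketch of the linear-trend idea in Python with made-up column names (year, gdp, pop) and made-up values. It fits a straight line to the observed points of each column and uses it to fill the missing ones; for real panel data the same helper would be applied once per panel unit.

```python
import numpy as np
import pandas as pd

def extrapolate_linear(df, x_col, value_cols):
    """Fill NaNs in each value column from a linear fit on its observed points."""
    out = df.copy()
    x = out[x_col].to_numpy(dtype=float)
    for col in value_cols:
        y = out[col].to_numpy(dtype=float)
        known = ~np.isnan(y)
        if known.sum() < 2:
            continue  # a line needs at least two observed points
        slope, intercept = np.polyfit(x[known], y[known], deg=1)
        # Keep observed values, replace only the missing ones.
        out[col] = np.where(known, y, slope * x + intercept)
    return out

# Hypothetical data: each variable starts being observed in a different year.
panel = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003, 2004],
    "gdp": [np.nan, np.nan, 3.0, 4.0, 5.0],
    "pop": [np.nan, 20.0, 30.0, 40.0, 50.0],
})
filled = extrapolate_linear(panel, "year", ["gdp", "pop"])
print(filled)
```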

Pandas read_csv, reading a boolean with missing values specified as an int

Question: I am trying to import a CSV into a pandas DataFrame. I have boolean variables denoted with 1s and 0s, where missing values are identified with a -9. When I try to specify the dtype as boolean, I get a host of different errors, depending on what I try.

Sample data (test.csv):

    var1, var2
    0, 0
    0, 1
    1, 3
    -9, 0
    0, 2
    1, 7

I try to specify the dtype as I import:

    dtype_dict = {'var1':'bool','var2':'int'}
    nan_dict = {'var1':[-9]}
    foo = pd.read_csv('test.csv',dtype=dtype_dict, na_values=nan_dict)

I get
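A NumPy bool column cannot hold a missing value, which is one likely source of trouble here. A hedged workaround, assuming a pandas version that has the nullable "boolean" dtype (1.0 or later): let read_csv turn -9 into NaN first, then convert the column afterwards. The sample file is inlined with io.StringIO so the sketch is self-contained.

```python
import io
import pandas as pd

# Sample data from the question, inlined instead of read from test.csv.
csv_text = "var1, var2\n0, 0\n0, 1\n1, 3\n-9, 0\n0, 2\n1, 7\n"

foo = pd.read_csv(
    io.StringIO(csv_text),
    skipinitialspace=True,       # the file has a space after each comma
    na_values={"var1": [-9]},    # -9 marks a missing value in var1 only
)

# var1 arrives as float64 (0.0, 1.0, NaN); the nullable dtype keeps the gap as <NA>.
foo["var1"] = foo["var1"].astype("boolean")
print(foo.dtypes)
print(foo)
```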

Pandas: Filling data for missing dates

Question: Let's say I've got the following table:

    ProdID  Date      Val1  Val2  Val3
    Prod1   4/1/2019  1     3     4
    Prod1   4/3/2019  2     3     54
    Prod1   4/4/2019  3     4     54
    Prod2   4/1/2019  1     3     3
    Prod2   4/2/2019  1     3     4
    Prod2   4/3/2019  2     4     4
    Prod2   4/4/2019  2     5     3

The Prod2 entries are populated correctly, as we've got the data from 4/1/2019 to 4/4/2019. Prod1 has one missing date, 4/2/2019. I would like to find the missing dates for all ProdIDs and fill in Val1 to Val3 with data copied from the last previous entry. For instance, I would like to copy
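A minimal sketch of one way to do this, assuming the dates are month/day/year and that every product should cover the overall date range of the table: reindex each product onto the full daily range and forward-fill the values.

```python
import pandas as pd

df = pd.DataFrame({
    "ProdID": ["Prod1"] * 3 + ["Prod2"] * 4,
    "Date": ["4/1/2019", "4/3/2019", "4/4/2019",
             "4/1/2019", "4/2/2019", "4/3/2019", "4/4/2019"],
    "Val1": [1, 2, 3, 1, 1, 2, 2],
    "Val2": [3, 3, 4, 3, 3, 4, 5],
    "Val3": [4, 54, 54, 3, 4, 4, 3],
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Every calendar day between the earliest and latest date in the table.
full_range = pd.date_range(df["Date"].min(), df["Date"].max(), freq="D")

pieces = []
for prod_id, group in df.groupby("ProdID"):
    values = group.set_index("Date")[["Val1", "Val2", "Val3"]]
    # Missing days appear as NaN rows, then take the previous day's values.
    values = values.reindex(full_range).ffill()
    values.insert(0, "ProdID", prod_id)
    pieces.append(values.rename_axis("Date").reset_index())

filled = pd.concat(pieces, ignore_index=True)
print(filled)
```

If a product's first date is itself missing, there is nothing to copy forward and those rows stay NaN.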

Random slope for time in subject not working in lme4

Question: I cannot add a random slope to this model with lme4 (1.1-7):

    > difJS<-lmer(JS~Tempo+(Tempo|id),dat,na.action=na.omit)
    Error: number of observations (=274) <= number of random effects (=278) for term (Tempo | id); the random-effects parameters and the residual variance (or scale parameter) are probably unidentifiable

With nlme it works:

    > JSprova<-lme(JS~Tempo,random=~1+Tempo|id,data=dat,na.action=na.omit)
    > summary(JSprova)
    Linear mixed-effects model fit by REML
     Data: dat
      AIC BIC
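The numbers in the error message can be checked by hand. A rough sketch in Python, assuming Tempo is numeric so that the term (Tempo | id) contributes one random intercept and one random slope per id:

```python
# Figures taken from the lme4 error message in the question.
n_obs = 274
n_random_effects = 278

# Assumption: intercept + slope, i.e. two random effects per subject.
effects_per_subject = 2
n_subjects = n_random_effects // effects_per_subject

avg_obs_per_subject = n_obs / n_subjects

# The check in the error message fails when observations <= random effects.
passes_check = n_obs > n_random_effects

print(n_subjects, round(avg_obs_per_subject, 2), passes_check)
```

Under that assumption there are 139 subjects with fewer than two usable observations each on average, which is too little data to separate a per-subject slope from residual noise.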

fill in missing data for group by unique ID [duplicate]

Question: This question already has answers here: Filling missing value in group (3 answers). Closed 20 days ago.

My clinical data structure looks like this:

    patientid <- c(100,100,100,101,101,101,102,102,102,104,104,104)
    group <- c(1,1,NA,2,NA,NA,1,1,1,2,2,NA)
    Data <- data.frame(patientid=patientid, group=group)

If there is missing data, the NA should become the same value as the other group value for the same patient id. In other words, a patient is always in the same group and the missing data
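The question is about R, but the same idea can be sketched in pandas for comparison: broadcast each patient's first non-missing group value to all of that patient's rows. The frame below is a direct translation of the R vectors above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "patientid": [100, 100, 100, 101, 101, 101, 102, 102, 102, 104, 104, 104],
    "group": [1, 1, np.nan, 2, np.nan, np.nan, 1, 1, 1, 2, 2, np.nan],
})

# transform("first") takes the first non-null group per patient
# and repeats it for every row of that patient.
data["group"] = data.groupby("patientid")["group"].transform("first")
print(data)
```

This relies on the stated rule that a patient is always in the same group; a patient with no observed group at all would stay NaN.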

Replace NA in a series of variables with different types of missing

Question: This is my data.

    # A tibble: 10 x 6
          id main  s_0   s_1   s_2   s_3
       <dbl> <fct> <fct> <fct> <fct> <fct>
     1     1 5     75    A     4     110
     2     2 NA    NA    NA    NA    NA
     3     3 11    13    NA    7     769
     4     4 NA    NA    NA    NA    NA
     5     5 11    NA    NA    NA    835
     6     6 13    39    NA    4     NA
     7     7 NA    NA    NA    NA    NA
     8     8 19    42    D     6     654
     9     9 20    4     NA    7     577
    10    10 NA    NA    NA    NA    NA

As you can see, the column main indicates whether respondents answered the questions in the other columns (s_0:s_3) or not. IDs 2, 4, 7 and 10 were not eligible for the rest; other participants, however, can answer or miss (s

Filling Missing sales value with zero and calculate 3 month average in PySpark

Question: I want to add the missing months with zero sales and calculate a 3-month average in PySpark.

My input:

    product  specialty  date       sales
    A        pharma     1/3/2019   50
    A        pharma     1/4/2019   60
    A        pharma     1/5/2019   70
    A        pharma     1/8/2019   80
    A        ENT        1/8/2019   50
    A        ENT        1/9/2019   65
    A        ENT        1/11/2019  40

My output:

    product  specialty  date      sales  3month_avg_sales
    A        pharma     1/3/2019  50     16.67
    A        pharma     1/4/2019  60     36.67
    A        pharma     1/5/2019  70     60
    A        pharma     1/6/2019  0      43.33
    A        pharma     1/7/2019  0      23.33
    A        pharma     1/8/2019  80     26.67
    A        ENT        1/8/2019  50     16
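The expected output implies two rules: dates are day/month/year on the first of each month, and the average is always the last three months' sales divided by 3, with months that have no row counted as 0. The sketch below illustrates that logic in pandas, not PySpark, so it shows the calculation rather than a Spark solution.

```python
import pandas as pd

sales = pd.DataFrame({
    "product": ["A"] * 7,
    "specialty": ["pharma"] * 4 + ["ENT"] * 3,
    "date": ["1/3/2019", "1/4/2019", "1/5/2019", "1/8/2019",
             "1/8/2019", "1/9/2019", "1/11/2019"],
    "sales": [50, 60, 70, 80, 50, 65, 40],
})
# Assumption read off the expected output: day/month/year dates.
sales["date"] = pd.to_datetime(sales["date"], format="%d/%m/%Y")

pieces = []
for (product, specialty), group in sales.groupby(["product", "specialty"], sort=False):
    # Every month start between the group's first and last sale.
    months = pd.date_range(group["date"].min(), group["date"].max(), freq="MS")
    monthly = group.set_index("date")["sales"].reindex(months, fill_value=0)
    piece = pd.DataFrame({
        "product": product,
        "specialty": specialty,
        "date": months,
        "sales": monthly.to_numpy(),
    })
    # Sum of up to three months, always divided by 3 as in the expected output.
    rolling_sum = monthly.rolling(3, min_periods=1).sum()
    piece["3month_avg_sales"] = (rolling_sum / 3).round(2).to_numpy()
    pieces.append(piece)

result = pd.concat(pieces, ignore_index=True)
print(result)
```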

Replace dots in a float column with nan in Python

Question: I have a data frame df like this:

    df = pd.DataFrame([
        {'Name': 'Chris', 'Item Purchased': 'Sponge', 'Cost': 22.50},
        {'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': '.........'},
        {'Name': 'Filip', 'Item Purchased': 'Spoon', 'Cost': '...'}],
        index=['Store 1', 'Store 1', 'Store 2'])

I want to replace the missing values in the 'Cost' column with np.nan. So far I have tried:

    df['Cost']=df['Cost'].str.replace("\.\.+", np.nan)

and

    df['Cost']=re.sub('\.\.+',np.nan,df['Cost'])

but neither of
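One way around the regex attempts, offered as a sketch: since 'Cost' should end up numeric anyway, pd.to_numeric with errors="coerce" turns anything that is not a number, including the runs of dots, into NaN in a single step.

```python
import pandas as pd

df = pd.DataFrame(
    [
        {"Name": "Chris", "Item Purchased": "Sponge", "Cost": 22.50},
        {"Name": "Kevyn", "Item Purchased": "Kitty Litter", "Cost": "........."},
        {"Name": "Filip", "Item Purchased": "Spoon", "Cost": "..."},
    ],
    index=["Store 1", "Store 1", "Store 2"],
)

# Values that cannot be parsed as numbers become NaN; the column ends up float64.
df["Cost"] = pd.to_numeric(df["Cost"], errors="coerce")
print(df)
```

Note that this also blanks out any other non-numeric entry, not only dots, which may or may not be wanted.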