missing-data

R gbm handling of missing values

回眸只為那壹抹淺笑 posted on 2019-12-03 01:25:24
Does anyone know how gbm in R handles missing values? I can't seem to find any explanation using Google.

To explain what gbm does with missing predictors, let's first visualize a single tree of a gbm object. Suppose you have a gbm object mygbm. Using pretty.gbm.tree(mygbm, i.tree=1) you can visualize the first tree of mygbm, e.g.:

      SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight    Prediction
    0       46  1.629728e+01        1         5           9      26.462908   1585 -4.396393e-06
    1       45  1.850000e+01        2         3           4      11.363868    939 -4.370936e-04
    2       -1  2.602236e-04       -1        -1          -1       0.000000    271  2.602236e-04
    3       -1 -7.199873e-04       -1

How to create “NA” for missing data in a time series

故事扮演 posted on 2019-12-02 14:58:50
I have several files of data that look like this:

    X code year month day pp
    1 4515 1953     6   1 0
    2 4515 1953     6   2 0
    3 4515 1953     6   3 0
    4 4515 1953     6   4 0
    5 4515 1953     6   5 3.5

Sometimes there is data missing, but I don't have NAs; the rows simply don't exist. I need to create NAs when the data is missing. I thought I could start by identifying when that occurs by converting the data to a zoo object and checking for strict regularity (I have never used zoo before). I used the following code:

    z.date <- paste(CET$year, CET$month, CET$day, sep="/")
    z <- read.zoo(CET, order.by = z.date)
    reg <- is.regular(z, strict = TRUE)

But
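The question is about R and zoo, but the underlying idea (build the full daily calendar, then align the data to it so absent days become missing values) can be sketched in Python with pandas. This is only an illustration under assumptions: the toy frame below copies the column names from the question, and the gap on day 3 is made up for the example.

```python
import pandas as pd

# Hypothetical data shaped like the question's files; day 3 is deliberately absent.
df = pd.DataFrame({
    "code": [4515, 4515, 4515, 4515],
    "year": [1953, 1953, 1953, 1953],
    "month": [6, 6, 6, 6],
    "day": [1, 2, 4, 5],
    "pp": [0.0, 0.0, 0.0, 3.5],
})

# Build a proper date index from the year/month/day columns.
dates = pd.DatetimeIndex(pd.to_datetime(df[["year", "month", "day"]]))
pp = pd.Series(df["pp"].to_numpy(), index=dates, name="pp")

# Every calendar day between the first and last observation.
full_range = pd.date_range(dates.min(), dates.max(), freq="D")

# Reindexing inserts NaN for each day that had no row in the original data.
filled = pp.reindex(full_range)
print(filled)
```

Days that were present keep their values; only the dates missing from the original rows come back as NaN.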

Counting not NA's for values of some column for each value of another row [duplicate]

偶尔善良 posted on 2019-12-02 13:27:34
This question already has an answer here: dplyr count non-NA value in group by [duplicate] (3 answers)

In R, let's say I have a data frame with two columns, Fam and Prop, both categorical. Fam has repeated names like Algea, Fungi, etc., and Prop has categorical numbers and NAs. How can I get a table that tells me, for each value of Fam, how many values of Prop are not NA? Example:

    Fam    Prop
    -------------
    Algea  one
    Fungi  two
    Algea  NA
    Algea  three
    Fungi  one
    Fungi  NA

Output:

    Algea 2
    Fungi 2

I know using the count function should be a direction for the solution but can't seem to solve
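The question is about R (the linked duplicate uses dplyr). For comparison, the same per-group count of non-missing values can be sketched in Python with pandas, where `count()` on a grouped column skips missing entries. The frame below simply re-enters the example table from the question.

```python
import pandas as pd

# The example table from the question; None stands in for NA.
df = pd.DataFrame({
    "Fam": ["Algea", "Fungi", "Algea", "Algea", "Fungi", "Fungi"],
    "Prop": ["one", "two", None, "three", "one", None],
})

# count() counts only non-missing values within each Fam group.
counts = df.groupby("Fam")["Prop"].count()
print(counts.to_dict())  # {'Algea': 2, 'Fungi': 2}
```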

How to fill null values in a Dataset using python that matches with two other columns?

会有一股神秘感。 posted on 2019-12-02 08:46:39
Question: I have a Titanic dataset. Among its attributes, I was working mainly on:
1. Age
2. Embark (the port from which passengers embarked; there are 3 ports in total: S, Q and C)
3. Survived (0 for did not survive, 1 for survived)
I was filtering out the useless data. Then I needed to fill the null values present in Age. So I counted how many passengers survived and didn't survive for each Embark value, i.e. S, Q and C. I found the mean age of passengers who survived and who did not survive after embarking from each S,Q
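One possible reading of the goal is: fill each missing Age with the mean Age of passengers who share the same Embark and Survived values. A minimal pandas sketch of that reading follows. The toy rows are invented for illustration, and the column names are taken from the question's description, so they may not match the real dataset's headers.

```python
import pandas as pd

# Hypothetical toy data; values are made up, column names follow the question.
df = pd.DataFrame({
    "Embark": ["S", "S", "S", "Q", "Q", "C", "C", "C"],
    "Survived": [1, 1, 0, 0, 0, 1, 1, 1],
    "Age": [20.0, None, 40.0, 30.0, None, 50.0, 10.0, None],
})

# Mean Age of each (Embark, Survived) group, broadcast back to every row.
# Missing ages are skipped when the mean is computed.
group_mean = df.groupby(["Embark", "Survived"])["Age"].transform("mean")

# Replace only the missing ages; known ages are left untouched.
df["Age_filled"] = df["Age"].fillna(group_mean)
print(df)
```

If a group has no known ages at all, its mean is itself missing, so those rows stay unfilled and need a fallback such as the overall mean.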

Pandas fill missing values of a column based on the datetime values of another column

不想你离开。 posted on 2019-12-02 07:20:55
Question: Python newbie here, this is my first question. I tried to find a solution in similar SO questions, like this one, this one, and also this one, but I think my problem is different. Here's my situation: I have a fairly large dataset with two columns: Date (datetime object) and session_id (integer). The timestamps refer to the moment when a certain action occurred during an online session. My problem is that I have all the dates, but I am missing some of the corresponding session_id values.

Calculate mean of each column ignoring missing data with awk

心已入冬 posted on 2019-12-02 07:13:39
Question: I have a large tab-separated data table with thousands of rows and dozens of columns, and it has missing data marked as "na". For example,

    na    0.93  na  0     na    0.51
    1     1     na  1     na    1
    1     1     na  0.97  na    1
    0.92  1     na  1     0.01  0.34

I would like to calculate the mean of each column, but making sure that the missing data are ignored in the calculation. For example, the mean of column 1 should be 0.97. I believe I could use awk but I am not sure how to construct the command to do this for all columns and account

Pandas fill missing values of a column based on the datetime values of another column

妖精的绣舞 posted on 2019-12-02 06:25:47
Python newbie here, this is my first question. I tried to find a solution in similar SO questions, like this one, this one, and also this one, but I think my problem is different. Here's my situation: I have a fairly large dataset with two columns: Date (datetime object) and session_id (integer). The timestamps refer to the moment when a certain action occurred during an online session. My problem is that I have all the dates, but I am missing some of the corresponding session_id values. What I would like to do is to fill these missing values using the date column: If the action occurred

Issue with NA values in R

半腔热情 posted on 2019-12-02 04:56:12
Question: I feel this should be something easy. I have looked across the internet, but I keep getting error messages. I have done plenty of analytics in the past but am new to R and programming. I have a pretty basic function to calculate means across columns of data:

    columnmean <- function(y) {
      nc <- ncol(y)
      means <- numeric(nc)
      for (i in 1:nc) {
        means[i] <- mean(y[, i])
      }
      means
    }

I'm in RStudio and testing it using the included 'airquality' dataset. When I load the AQ dataset and run my function: data("airquality"

Calculate mean of each column ignoring missing data with awk

自古美人都是妖i posted on 2019-12-02 04:15:47
I have a large tab-separated data table with thousands of rows and dozens of columns, and it has missing data marked as "na". For example,

    na    0.93  na  0     na    0.51
    1     1     na  1     na    1
    1     1     na  0.97  na    1
    0.92  1     na  1     0.01  0.34

I would like to calculate the mean of each column, but making sure that the missing data are ignored in the calculation. For example, the mean of column 1 should be 0.97. I believe I could use awk but I am not sure how to construct the command to do this for all columns and account for missing data. All I know how to do is to calculate the mean of a single column but it treats the
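The question asks for awk. As an alternative, not the awk answer itself, the same calculation can be sketched in Python with pandas: declare "na" as the missing-value marker while reading, and the column means then skip those cells by default. The sample below re-enters the table from the question; reading it from a string instead of a file is just a convenience for the example.

```python
import io

import pandas as pd

# The sample table from the question, tab-separated, with "na" marking missing cells.
raw = (
    "na\t0.93\tna\t0\tna\t0.51\n"
    "1\t1\tna\t1\tna\t1\n"
    "1\t1\tna\t0.97\tna\t1\n"
    "0.92\t1\tna\t1\t0.01\t0.34\n"
)

# na_values tells the parser to treat the literal string "na" as missing.
table = pd.read_csv(io.StringIO(raw), sep="\t", header=None, na_values=["na"])

# mean() ignores missing values by default (skipna=True).
col_means = table.mean()
print(col_means)
```

The first column's mean comes out at about 0.97, matching the value expected in the question. A column that is entirely "na", like the third one here, has no values to average, so its mean is reported as missing.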

mean-before-after imputation in R

孤街醉人 posted on 2019-12-02 02:59:56
Question: I'm new to R. My question is how to impute a missing value using the mean of the values before and after the missing data point. For example, use the mean of the values directly above and below each NA as the imputed value:
- mean for row number 3 is 38.5
- mean for row number 7 is 32.5

    age
    52.0
    27.0
    NA
    23.0
    39.0
    32.0
    NA
    33.0
    43.0

Thank you.

Answer 1: Here is a solution using na.locf from the zoo package, which replaces each NA with the nearest non-NA value before it (the default) or after it (with fromLast=TRUE):

    0.5*(na.locf(x, fromLast=TRUE) + na.locf(x))
    [1]
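For readers working in Python, the same trick from the answer (average a forward fill and a backward fill) can be sketched with pandas. This is an illustrative translation, not the R answer itself. Note that with the values exactly as listed above, the neighbors of row 3 are 27.0 and 23.0, so this sketch gives 25.0 there rather than the 38.5 quoted in the question; row 7 gives 32.5 as expected.

```python
import pandas as pd

# The age column as listed in the question; None stands in for NA.
age = pd.Series([52.0, 27.0, None, 23.0, 39.0, 32.0, None, 33.0, 43.0], name="age")

# ffill() carries the previous known value forward, bfill() carries the next one back.
# Averaging the two leaves known values unchanged and fills each gap with the
# mean of its nearest known neighbors.
imputed = 0.5 * (age.ffill() + age.bfill())
print(imputed.tolist())
# [52.0, 27.0, 25.0, 23.0, 39.0, 32.0, 32.5, 33.0, 43.0]
```

A missing value at the very start or end has only one neighbor, so it stays missing with this approach, and a run of consecutive missing values all receive the same average.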