missing-data

Reading multiple files and calculating mean based on user input

a 夏天 submitted on 2019-11-26 16:36:38
I am trying to write a function in R which takes three inputs: directory, pollutant, and id. I have a directory on my computer full of CSV files (over 300). What the function should do is shown in the prototype below:

    pollutantmean <- function(directory, pollutant, id = 1:332) {
        ## 'directory' is a character vector of length 1 indicating
        ## the location of the CSV files

        ## 'pollutant' is a character vector of length 1 indicating
        ## the name of the pollutant for which we will calculate the
        ## mean; either "sulfate" or "nitrate".

        ## 'id' is an integer vector indicating the monitor ID numbers
        ## to be

Replace missing values with column mean

蓝咒 submitted on 2019-11-26 16:05:59
I am not sure how to loop over each column to replace the NA values with the column mean. When I replace the values for a single column using the following, it works well:

    Column1[is.na(Column1)] <- round(mean(Column1, na.rm = TRUE))

The code for looping over columns is not working:

    for(i in 1:ncol(data)){
      data[i][is.na(data[i])] <- round(mean(data[i], na.rm = TRUE))
    }

The values are not replaced. Can someone please help me with this?

A relatively simple modification of your code should solve the issue:

    for(i in 1:ncol(data)){
      data[is.na(data[,i]), i] <- mean(data[,i], na.rm = TRUE)
    }

If DF is

R - Fill missing dates by group

偶尔善良 submitted on 2019-11-26 14:49:06
Question: In my data, there exist observations for some IDs in some months and not for others, e.g.

    dat <- data.frame(c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4),
                      c(rep(30, 2), rep(25, 5), rep(20, 3)),
                      c('2017-01-01', '2017-02-01', '2017-04-01', '2017-02-01',
                        '2017-01-01', '2017-02-01', '2017-03-01', '2017-01-01',
                        '2017-02-01', '2017-04-01'))
    colnames(dat) <- c('id', 'value', 'date')

For each id value, I would like to insert a row for each month missing for that id, with NA for value. Is there a way to

Replace NA with previous or next value, by group, using dplyr

拥有回忆 submitted on 2019-11-26 14:39:27
I have a data frame which is arranged in descending order of date.

    ps1 = data.frame(userID = c(21,21,21,22,22,22,23,23,23),
                     color = c(NA,'blue','red','blue',NA,NA,'red',NA,'gold'),
                     age = c('3yrs','2yrs',NA,NA,'3yrs',NA,NA,'4yrs',NA),
                     gender = c('F',NA,'M',NA,NA,'F','F',NA,'F')
    )

I wish to impute (replace) NA values with the previous values, grouped by userID. If the first row of a userID has NA, then replace it with the next set of values for that userID group. I am trying to use the dplyr and zoo packages, something like this, but it's not working:

    cleanedFUG <- filteredUserGroup %>% group_by(UserID)

Select NA in a data.table in R

蓝咒 submitted on 2019-11-26 13:57:31
Question: How do I select all the rows that have a missing value in the primary key of a data.table?

    DT = data.table(x=rep(c("a","b",NA),each=3), y=c(1,3,6), v=1:9)
    setkey(DT,x)

Selecting for a particular value is easy:

    DT["a",]

Selecting for the missing values seems to require a vector search. One cannot use binary search. Am I correct?

    DT[NA,]        # does not work
    DT[is.na(x),]  # does work

Answer 1: Fortunately, DT[is.na(x),] is nearly as fast as (e.g.) DT["a",], so in practice, this may not really matter much:

Remove NA values from a vector

拥有回忆 submitted on 2019-11-26 12:45:59
I have a huge vector which has a couple of NA values, and I'm trying to find the max value in that vector (the vector is all numbers), but I can't do this because of the NA values. How can I remove the NA values so that I can compute the max?

Trying ?max, you'll see that it actually has a na.rm = argument, set by default to FALSE. (That's the common default for many other R functions, including sum(), mean(), etc.) Setting na.rm=TRUE does just what you're asking for:

    d <- c(1, 100, NA, 10)
    max(d, na.rm=TRUE)

If you do want to remove all of the NAs, use this idiom instead:

    d <- d[!is.na(d)

Replace NA in column with value in adjacent column

不问归期 submitted on 2019-11-26 11:50:48
This question is related to a post with a similar title (replace NA in an R vector with adjacent values). I would like to scan a column in a data frame and replace NAs with the value in the adjacent cell. In the aforementioned post, the solution did not replace the NA with the value from the adjacent vector (e.g. the adjacent element in the data matrix); it was a conditional replacement with a fixed value. Below is a reproducible example of my problem:

    UNIT <- c(NA,NA, 200, 200, 200, 200, 200, 300, 300, 300,300)
    STATUS <-c('ACTIVE','INACTIVE','ACTIVE','ACTIVE','INACTIVE','ACTIVE','INACTIVE',

Can't drop NAN with dropna in pandas

人盡茶涼 submitted on 2019-11-26 11:36:14
Question: I import pandas as pd, run the code below, and get the following result.

Code:

    traindataset = pd.read_csv('/Users/train.csv')
    print traindataset.dtypes
    print traindataset.shape
    print traindataset.iloc[25,3]
    traindataset.dropna(how='any')
    print traindataset.iloc[25,3]
    print traindataset.shape

Output:

    TripType                   int64
    VisitNumber                int64
    Weekday                   object
    Upc                      float64
    ScanCount                  int64
    DepartmentDescription     object
    FinelineNumber           float64
    dtype: object
    (647054, 7)
    nan
    nan
    (647054, 7)
    [Finished in 2.2s]
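The unchanged shape in the output above is consistent with how `dropna` behaves by default: it returns a new DataFrame and leaves the original untouched. A minimal sketch, assuming a small made-up frame in place of train.csv (only the column names are taken from the question; the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv; the values are invented for illustration.
df = pd.DataFrame({
    "TripType": [5, 30, 8],
    "Upc": [6.8e10, np.nan, 2.2e9],
    "FinelineNumber": [1000.0, np.nan, 4504.0],
})

df.dropna(how="any")             # result is discarded, so df is unchanged
print(df.shape)

cleaned = df.dropna(how="any")   # keep the DataFrame that dropna returns
print(cleaned.shape)
```

Passing `inplace=True` is the other common option. Note that the surviving rows keep their original index labels, so positional lookups with `iloc` may point at different rows afterwards.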

How to get Python to gracefully format None and non-existing fields [duplicate]

孤街浪徒 submitted on 2019-11-26 11:01:26
Question: This question already has answers here: Leaving values blank if not passed in str.format (7 answers). Closed 5 years ago.

If I write in Python:

    data = {'n': 3, 'k': 3.141594, 'p': {'a': 7, 'b': 8}}
    print('{n}, {k:.2f}, {p[a]}, {p[b]}'.format(**data))
    del data['k']
    data['p']['b'] = None
    print('{n}, {k:.2f}, {p[a]}, {p[b]}'.format(**data))

I get:

    3, 3.14, 7, 8
    Traceback (most recent call last):
      File "./funky.py", line 186, in <module>
        print('{n}, {k:.2f}, {p[a]}, {p[b]}
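One possible approach, sketched here rather than taken from the thread, is to subclass `string.Formatter` so that a failed lookup or a `None` value prints a placeholder instead of raising. The class name `GracefulFormatter` and the placeholder strings `~~` and `--` are hypothetical choices for illustration:

```python
import string

_MISSING = object()  # sentinel marking a field that could not be looked up


class GracefulFormatter(string.Formatter):
    """Hypothetical helper: print placeholders instead of raising."""

    def __init__(self, missing="~~", none="--"):
        super().__init__()
        self.missing = missing  # shown when a field does not exist
        self.none = none        # shown when a field exists but is None

    def get_field(self, field_name, args, kwargs):
        # Catch failed key, index, and attribute lookups for this field.
        try:
            return super().get_field(field_name, args, kwargs)
        except (KeyError, IndexError, AttributeError, TypeError):
            return _MISSING, field_name

    def format_field(self, value, format_spec):
        # Skip the format spec (e.g. ".2f") when there is nothing to format.
        if value is _MISSING:
            return self.missing
        if value is None:
            return self.none
        return super().format_field(value, format_spec)


fmt = GracefulFormatter()
template = "{n}, {k:.2f}, {p[a]}, {p[b]}"
full = fmt.format(template, n=3, k=3.141594, p={"a": 7, "b": 8})
partial = fmt.format(template, n=3, p={"a": 7, "b": None})
print(full)     # 3, 3.14, 7, 8
print(partial)  # 3, ~~, 7, --
```

Handling the placeholder in `format_field` keeps the format spec from being applied to a value that cannot accept it, which is what triggers the error in the question.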

python format string unused named arguments [duplicate]

筅森魡賤 submitted on 2019-11-26 09:25:32
Question: This question already has an answer here: partial string formatting (16 answers).

Let's say I have:

    action = '{bond}, {james} {bond}'.format(bond='bond', james='james')

this will output:

    'bond, james bond'

Next we have:

    action = '{bond}, {james} {bond}'.format(bond='bond')

this will output:

    KeyError: 'james'

Is there some workaround to prevent this error from happening, something like:

    if KeyError: ignore, leave it alone (but do parse others)
    compare format string with available named
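A common workaround, shown here as a sketch and not necessarily the answer given in the thread, is `str.format_map` with a dict subclass whose `__missing__` method puts the placeholder back. The class name `KeepMissing` is a hypothetical choice:

```python
class KeepMissing(dict):
    """Hypothetical helper: unknown names are re-emitted as '{name}'."""

    def __missing__(self, key):
        return "{" + key + "}"


template = "{bond}, {james} {bond}"

# Only 'bond' is supplied, so '{james}' is left in place for later.
action = template.format_map(KeepMissing(bond="bond"))
print(action)  # bond, {james} bond

# The remaining field can be filled in by a second pass.
final = action.format_map(KeepMissing(james="james"))
print(final)   # bond, james bond
```

This covers plain `{name}` fields only. A missing field that carries a format spec or an attribute or index lookup (for example `{x:.2f}` or `{p[a]}`) would need a custom `string.Formatter` instead.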