Question
I'm very new to this, so let me know if this is asking too much. I am trying to subset panel data in R into two categories: one with complete information for each variable and one with incomplete information. My data looks like this:
Person Year Income Age Sex
1 2003 1500 15 1
1 2004 1700 16 1
1 2005 2000 17 1
2 2003 1400 25 0
2 2004 1900 26 0
2 2005 2000 27 0
What I need to do is go through each column (except columns 1 and 2) and, if the data is complete for a variable, return it to one data set; otherwise, put it in a different data set. A variable is defined by the id in the first column plus the column name; in the data above, an example is person1Income. Here is my meta code and an example of what it should do given the above data. Note: I refer to variables by their id and then the column name, so the variable person1Income is the first three rows of column three.
for (each variable in all columns except 1 and 2 in data set) {
  if (variable is FULL) { put in data set "completes" }
  else { put in data set "incompletes" }
}
completes = person1Income, person2Income, person1Age, person2Age, person1Sex, person2Sex
incompletes = {empty because the above info is full}
I understand if someone can't answer this question completely, but any help is appreciated. Also if my goal is not clear, let me know and I will try to clarify.
tl;dr I can't yet explain it in one sentence so...sorry.
Edit: visualization of what I mean by complete and incomplete variables. screenshot
Answer 1:
Using your picture, here's a stab at what you want. It may be long-winded and others may have a more elegant way of doing it, but it gets the job done:
library("reshape2")
con <- textConnection("Person Year Income Age Sex
1 2003 1500 15 1
1 2004 1700 16 1
1 2005 2000 17 1
2 2003 1400 25 0
2 2004 1900 NA 0
2 2005 2000 27 0
3 2003 NA 25 0
3 2004 1900 NA 0
3 2005 2000 27 0")
pnls <- read.table(con, header=TRUE)
# reformat table for easier processing
pnls2 <- melt(pnls, id.vars = "Person")
# and select those rows that relate to values
# of income and age
pnls2 <- subset(pnls2, variable %in% c("Income", "Age"))
# create column of names in desired format (e.g Person1Age etc)
pnls2$name <- paste("Person", pnls2$Person, pnls2$variable, sep="")
# Collect full set of unique names
name.set <- unique(pnls2$name)
# find the incomplete set
incomplete <- unique( pnls2$name[ is.na(pnls2$value) ])
# then find the complement of the incomplete set
complete <- setdiff(name.set, incomplete)
# These two now contain list of complete and incomplete variables
complete
incomplete
If you are not familiar with melting and the reshape2 package, you may want to run it line by line and examine the value of pnls2 at different stages to see how this works.
EDIT: adding code to compile the values, as requested by @bstockton. I am sure there is a much more appropriate R idiom for this but, once again, in the absence of better answers, this works:
# use these lists of complete and incomplete variable names
# as keys to collect lists of values for each variable name
compile <- function(keys) {
  holder <- list()
  for (n in keys) {
    # pull the values belonging to this variable name
    holder[[n]] <- subset(pnls2, name == n)$value
  }
  return(as.data.frame(holder))
}
complete.recs <- compile(complete)
incomplete.recs <- compile(incomplete)
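For comparison, the same two name lists can probably be built without reshape2. The following is only a sketch in base R, assuming the same pnls data as above; has_na and all_names are names made up for this illustration:

```r
# Same sample data as above, read directly from text
pnls <- read.table(header = TRUE, text = "Person Year Income Age Sex
1 2003 1500 15 1
1 2004 1700 16 1
1 2005 2000 17 1
2 2003 1400 25 0
2 2004 1900 NA 0
2 2005 2000 27 0
3 2003 NA 25 0
3 2004 1900 NA 0
3 2005 2000 27 0")

vars <- c("Income", "Age")

# Person-by-variable logical matrix: TRUE where that person has
# at least one NA in that variable
has_na <- sapply(pnls[vars], function(col) tapply(is.na(col), pnls$Person, any))

# Matching matrix of names in the Person1Age style
all_names <- outer(rownames(has_na), colnames(has_na),
                   function(p, v) paste0("Person", p, v))

incomplete <- all_names[has_na]
complete   <- all_names[!has_na]
```

With this data, incomplete should hold Person3Income, Person2Age and Person3Age, and complete the remaining three names.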
Answer 2:
Let's assume this is in a data.frame named 'dfrm':
completes <- dfrm[ complete.cases(dfrm[-(1:2)]) ,]
incompletes <- dfrm[ !complete.cases(dfrm[-(1:2)]) ,]
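As a quick illustration of what these two lines do, here is a small sketch on made-up values patterned on the question's data, with one Income value set to NA as an assumption. Note that complete.cases() classifies whole rows, so this splits person-years rather than person-by-variable blocks:

```r
# Hypothetical panel: the question's layout with one Income value missing
dfrm <- data.frame(
  Person = c(1, 1, 1, 2, 2, 2),
  Year   = rep(2003:2005, 2),
  Income = c(1500, NA, 2000, 1400, 1900, 2000),
  Age    = c(15, 16, 17, 25, 26, 27),
  Sex    = c(1, 1, 1, 0, 0, 0)
)

# TRUE for rows with no NA in any column other than Person and Year
ok <- complete.cases(dfrm[-(1:2)])

completes   <- dfrm[ok, ]
incompletes <- dfrm[!ok, ]

nrow(completes)    # 5
nrow(incompletes)  # 1

# Ids of persons with at least one incomplete row
bad_ids <- unique(dfrm$Person[!ok])
```

Here only the 2004 row of person 1 lands in incompletes; the other two rows for that person stay in completes.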
Thanks to @WojciechSobala for noticing my missing parens. For the question of identifying which column the missing values are in, one could create a list. The list of ids is simple. The identification of which columns have missing values is also fairly easy to provide, but I have no idea what you mean by "the values in that column that correspond to the id variable", since they are all NA. For the identification step, you can use:
apply(incompletes, 1, function(x) c(x[1], x[2], which(is.na(x[-(1:2)]))))
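If column names are more convenient than positions, something along these lines might work. The incompletes frame here is a hypothetical two-row stand-in, and na_mask and missing_cols are illustrative names:

```r
# Hypothetical rows that each contain a missing value
incompletes <- data.frame(
  Person = c(2, 3),
  Year   = c(2004, 2003),
  Income = c(1900, NA),
  Age    = c(NA, 25),
  Sex    = c(0, 0)
)

# Logical matrix with one column per variable (Person and Year dropped)
na_mask <- is.na(incompletes[-(1:2)])

# For each row, the names of the columns that are NA
missing_cols <- lapply(seq_len(nrow(na_mask)),
                       function(i) colnames(na_mask)[na_mask[i, ]])

# Label each entry by its person and year
names(missing_cols) <- paste(incompletes$Person, incompletes$Year)
```

Using lapply() rather than apply() keeps the result a list even when rows differ in how many values are missing.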
I see now what you are asking. I don't have a solution yet but let me show you a couple of R functions that might help when it comes to enumerating and working with the categories that are formed by cross-classifying on two column values:
dat <- structure(list(Person = c(1L, 1L, 1L, 2L, 2L, 2L), Year = c(2003L,
2004L, 2005L, 2003L, 2004L, 2005L), Income = c(1500L, NA, 2000L,
1400L, 1900L, 2000L), Age = c(15L, 16L, 17L, 25L, 26L, 27L),
Sex = c(1L, 1L, 1L, 0L, 0L, 0L)), .Names = c("Person", "Year",
"Income", "Age", "Sex"), row.names = c(NA, -6L), class = "data.frame")
completes <- lapply( split(dat[ , 3:5], dat$Person), function(x) sapply(x, function(y) { if( all( !is.na(y)) ) { y } else { NA} }) )
$`1`
$`1`$Income
[1] NA
$`1`$Age
[1] 15 16 17
$`1`$Sex
[1] 1 1 1
$`2`
Income Age Sex
[1,] 1400 25 0
[2,] 1900 26 0
[3,] 2000 27 0
incompletes <- lapply( split(dat[ , 3:5], dat$Person), function(x) sapply(x, function(y) { if( !all( !is.na(y)) ) { y } else { NA} }) )
$`1`
$`1`$Income
[1] 1500 NA 2000
$`1`$Age
[1] NA
$`1`$Sex
[1] NA
$`2`
Income Age Sex
NA NA NA
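One possible way to turn these per-person blocks into the two data sets the question asks for is sketched below. This is not the answerer's solution, only an illustration; it assumes a balanced panel (every person has the same number of rows), and wide and has_na are made-up names:

```r
# Same values as dat above
dat <- data.frame(
  Person = c(1L, 1L, 1L, 2L, 2L, 2L),
  Year   = c(2003L, 2004L, 2005L, 2003L, 2004L, 2005L),
  Income = c(1500L, NA, 2000L, 1400L, 1900L, 2000L),
  Age    = c(15L, 16L, 17L, 25L, 26L, 27L),
  Sex    = c(1L, 1L, 1L, 0L, 0L, 0L)
)

# One block of measurement columns per person
blocks <- split(dat[3:5], dat$Person)

# Put the blocks side by side, renaming columns to person1Income etc.
# Assumes every block has the same number of rows.
wide <- do.call(cbind, lapply(names(blocks), function(id) {
  b <- blocks[[id]]
  rownames(b) <- NULL
  names(b) <- paste0("person", id, names(b))
  b
}))

# TRUE for every person-by-variable column containing an NA
has_na <- sapply(wide, anyNA)

completes   <- wide[, !has_na, drop = FALSE]
incompletes <- wide[, has_na, drop = FALSE]
```

With the dat above, incompletes should contain the single column person1Income and completes the other five.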
Source: https://stackoverflow.com/questions/10202638/sub-setting-panel-data