I need help with something that might be fairly simple in R. I want to refer to a range of columns in a data frame (e.g., extracting a few select variables). However, I don't know their column numbers. Normally, if I wanted to extract columns 4-10 i would say mydata[,4:10].
However, given that I don't know the column numbers, I would want to refer to them by name. Is there an easy way to do this? in sas or spss it is fairly easy to refer to a range of variables by name. Alternatively, is there an easy way to figure out which column number corresponds to a variable name in R?
A column number can be identified from a column name within a data frame as follows:
which(colnames(mydf)=="a")
where mydf is a data frame and a is the name of the column the column number is required for.
(Source)
This can be used to create a column range:
firstcol = which(colnames(x)=="a")
lastcol = which(colnames(x)=="b")
mydf[c(firstcol:lastcol)]
Getting a range of columns can be done in several ways. subset(data.frame, select = name4:name10)
, works but is quite long. I used that before I got annoyed writing long commands for a simple thing. I made a function to tackle the naming columns / not remembering column numbers in large data frames:
coln <- function(X){
y <- rbind(seq(1,ncol(X)))
colnames(y) <- colnames(X)
rownames(y) <- "col.number"
return(y)}
Here is how it works:
df <- data.frame(a = 1:10, b =10:1, c = 1:10)
coln(df)
a b c
col.number 1 2 3
Now you can call them with numbers and still look at names.
Use %in%
in combination with names()
. It's useful for grabbing a group of columns from a data frame. You can negate the expression when you want to keep just a subset and drop the rest. Type ?"%in%"
at the R Console prompt for more details.
set.seed(1234)
mydf <- data.frame(A = runif(5, 1, 2),
B = runif(5, 3, 4),
C = runif(5, 5, 6),
D = runif(5, 7, 8),
E = runif(5, 9, 10))
mydf
keep.cols <- c('A','D','E')
mydf[, names(mydf) %in% keep.cols]
drop.cols <- c('A','B','C')
mydf[, !names(mydf) %in% drop.cols]
The data frame:
> mydf
A B C D E
1 1.113703 3.640311 5.693591 7.837296 9.316612
2 1.622299 3.009496 5.544975 7.286223 9.302693
3 1.609275 3.232551 5.282734 7.266821 9.159046
4 1.623379 3.666084 5.923433 7.186723 9.039996
5 1.860915 3.514251 5.292316 7.232226 9.218800
A subset of columns:
> mydf[, names(mydf) %in% keep.cols]
A D E
1 1.113703 7.837296 9.316612
2 1.622299 7.286223 9.302693
3 1.609275 7.266821 9.159046
4 1.623379 7.186723 9.039996
5 1.860915 7.232226 9.218800
Keeping a subset of columns and dropping the rest:
> mydf[, !names(mydf) %in% drop.cols]
D E
1 7.837296 9.316612
2 7.286223 9.302693
3 7.266821 9.159046
4 7.186723 9.039996
5 7.232226 9.218800
I think I figured it out, but it's a bit ornery. Here's an example using mtcars to get the columns between hp and vs. do.call
usually means there is a simpler way, though.
mtcars[do.call(seq, as.list(match(c("hp", "vs"), colnames(mtcars))))]
Here is a fun little function that combines the ideas behind Largh's answer with a handy function call. To use it, just enter
call.cols(mydata, "firstvarname", "lastvarname")
call.cols <- function(df, startvar, endvar) {
col.num <- function(df){
var.nums <- seq(1,ncol(df))
names(var.nums) <- colnames(df)
return(var.nums)
}
start.num <- as.numeric(col.num(df)[startvar])
end.num <- as.numeric(col.num(df)[endvar])
range.num <- start.num:end.num
return(df[range.num])
}
I plan to expand this to use for scale creation for psychometric research.
来源:https://stackoverflow.com/questions/20369145/refer-to-range-of-columns-by-name-in-r