Question
I have three columns. The first holds a unique ID; the second and third hold string data with some NA values. I need to split each value in column 2 into separate columns, and do the same for column 3. I am building a function as follows, using for loops. I need to split the values after the third letter. For example, in the V1 column below, I need to break AAAbbb into AAA and bbb and put them in separate columns. I know I can use substr to do this. I am new to R, please help.
UID * V1 * V2 *
Z001NL * AAAbbb * IADSFO *
Z001NP * IADSFO * NA *
Z0024G * SFOHNL * NLSFO0 *
Here's my code.
test=read.csv("c:/some/path/in/windows/test.csv", header=TRUE)
substring_it = function(test)
{
for(i in 1:3){
for(j in 2:3){
answer = transform(test, code 1 = substr((test[[j,i]]), 1, 3), code2 = substr((test[j,i]), 4, 6))
}
}
return(answer)
}
hello = substring_it(test)
test will be my data frame that I will read in.
I need this as my output
UID * V1.1 * V1.2 * V2.1 * V2.2
Z001NL * AAA * bbb * IAD * SFO
Z001NP * IAD * SFO * NA * NA
Z0024G * SFO * HNL * NLS * SFO
Answer 1:
You can use sapply to apply a function to each element of a vector. This could be useful here, since you could use sapply on the columns of your original data frame (test) to create the columns for your new data frame.
Here's a solution that does this:
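test = data.frame(UID = c('Z001NL', 'Z001NP', 'Z0024G'),
                  V1 = c('AAAbbb', 'IADSFO', 'SFOHNL'),
                  V2 = c('IADSFO', NA, 'NLSFO0'))
substring_it = function(x){
  # x is a data frame: column 1 is the ID, columns 2 and 3 hold the strings to split
  # USE.NAMES = FALSE stops sapply from naming each result after its input string
  # (for character columns, data.frame() would otherwise use those names as row names)
  c1 = sapply(x[,2], function(s) substr(s, 1, 3), USE.NAMES = FALSE)
  c2 = sapply(x[,2], function(s) substr(s, 4, 6), USE.NAMES = FALSE)
  c3 = sapply(x[,3], function(s) substr(s, 1, 3), USE.NAMES = FALSE)
  c4 = sapply(x[,3], function(s) substr(s, 4, 6), USE.NAMES = FALSE)
  return(data.frame(UID=x[,1], c1, c2, c3, c4))
}
substring_it(test)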
# returns:
# UID c1 c2 c3 c4
#1 Z001NL AAA bbb IAD SFO
#2 Z001NP IAD SFO <NA> <NA>
#3 Z0024G SFO HNL NLS FO0
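As a side note, substr is itself vectorized over its first argument, so the sapply wrapper is not strictly required. Below is a minimal sketch of the same split without it. The function name substring_vec is hypothetical, and the sketch assumes the ID is in column 1 and the strings to split are in columns 2 and 3, as in the sample data.

```r
# Same sample data as in the answer above
test <- data.frame(UID = c("Z001NL", "Z001NP", "Z0024G"),
                   V1  = c("AAAbbb", "IADSFO", "SFOHNL"),
                   V2  = c("IADSFO", NA, "NLSFO0"),
                   stringsAsFactors = FALSE)

# substring_vec is a hypothetical name, not part of the original answer
substring_vec <- function(df) {
  data.frame(UID = df[[1]],
             c1 = substr(df[[2]], 1, 3),  # first three characters of column 2
             c2 = substr(df[[2]], 4, 6),  # characters 4 to 6 of column 2
             c3 = substr(df[[3]], 1, 3),  # NA inputs stay NA
             c4 = substr(df[[3]], 4, 6),
             stringsAsFactors = FALSE)
}

result_vec <- substring_vec(test)
```

Calling substr once per column avoids the per-element function call and should give the same values as the sapply version.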
EDIT: Here's a way to loop over the columns if you have to do this many times. I'm not sure what order your original data frame's columns are in, or what order you want the new data frame's columns to end up in, so you may need to adjust the "pos" counter. I also assumed the columns to be split are columns 2 through 201 ("colindex"), so you'll probably have to change that.
newcolumns = list()
pos = 1 # counter for column index of new data frame
for(colindex in 2:201){
  # each source column yields two new columns: characters 1-3 and characters 4-6
  newcolumns[[pos]] = sapply(test[,colindex], function(s) substr(s, 1, 3), USE.NAMES = FALSE)
  newcolumns[[pos+1]] = sapply(test[,colindex], function(s) substr(s, 4, 6), USE.NAMES = FALSE)
  pos = pos+2
}
newdataframe = data.frame(UID = test[,1], newcolumns)
# update "names(newdataframe)" as needed
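If you would rather not hard-code the column range or rename the columns by hand, the loop can be wrapped in a small helper. The following is only a sketch under assumptions: split_columns, cols and width are hypothetical names, the first column is taken to be the ID, and every other column is split into two pieces of three characters each, named in the V1.1 / V1.2 style from the question.

```r
# Sample data from the question
test <- data.frame(UID = c("Z001NL", "Z001NP", "Z0024G"),
                   V1  = c("AAAbbb", "IADSFO", "SFOHNL"),
                   V2  = c("IADSFO", NA, "NLSFO0"),
                   stringsAsFactors = FALSE)

# split_columns is a hypothetical helper; by default it splits every
# column except the first
split_columns <- function(df, cols = names(df)[-1], width = 3) {
  pieces <- lapply(cols, function(col) {
    s <- as.character(df[[col]])  # also handles factor columns
    out <- data.frame(substr(s, 1, width),
                      substr(s, width + 1, 2 * width),
                      stringsAsFactors = FALSE)
    names(out) <- paste0(col, c(".1", ".2"))  # e.g. V1.1, V1.2
    out
  })
  # keep the ID column first, then all of the split columns in order
  cbind(df[1], do.call(cbind, pieces))
}

result <- split_columns(test)
```

For the sample data, the last value of V2 ("NLSFO0") splits into "NLS" and "FO0", and the NA entry stays NA in both new columns.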
Source: https://stackoverflow.com/questions/20783034/r-substr-function-on-multiple-columns