R substr function on multiple columns

问题

I have 3 columns. First column has unique ID, second and third columns have string data and some NA data. I need to extract info from column 2 and put it in separate columns and do the same thing for column 3. I am building a function as follows, using for loops. I need to split the columns after the third letter. [For example in the V1 column below, I need to break AAAbbb as AAA and bbb and put them in separate columns. I know I can use substr to do this. I am new to R, please help.

UID * V1 * V2 *

Z001NL * AAAbbb * IADSFO *

Z001NP * IADSFO * NA *

Z0024G * SFOHNL * NLSFO0 *

Here's my code.

test=read.csv("c:/some/path/in/windows/test.csv", header=TRUE)

substring_it = function(test)
{
for(i in 1:3){
for(j in 2:3){
answer = transform(test, code 1 = substr((test[[j,i]]), 1, 3), code2 = substr((test[j,i]), 4, 6))

}
}
return(answer)

}

hello = substring_it(test)

test will be my data frame that I will read in.

I need this as my output

UID * V1.1 * V1.2 * V2.1 * V2.2

Z001NL * AAA * bbb * IAD * SFO

Z001NP * IAD * SFO * NA * NA

Z0024G * SFO * HNL * NLS * SFO

回答1:

You can use sapply to apply a function to each element of a vector - this could be useful here, since you could use sapply on the columns of your original data frame (test) to create the columns for your new data frame.

Here's a solution that does this:

test = data.frame(UID = c('Z001NL', 'Z001NP', 'Z0024G'), 
  V1 = c('AAAbbb', 'IADSFO', 'SFOHNL'),
  V2 = c('IADSFO', NA, 'NLSFO0'))

substring_it = function(x){
  # x is a data frame
  c1 = sapply(x[,2], function(x) substr(x, 1, 3))
  c2 = sapply(x[,2], function(x) substr(x, 4, 6))
  c3 = sapply(x[,3], function(x) substr(x, 1, 3))
  c4 = sapply(x[,3], function(x) substr(x, 4, 6))
  return(data.frame(UID=x[,1], c1, c2, c3, c4))
}

substring_it(test)
# returns:
#     UID  c1  c2   c3   c4
#1 Z001NL AAA bbb  IAD  SFO
#2 Z001NP IAD SFO <NA> <NA>
#3 Z0024G SFO HNL  NLS  FO0

EDIT: here's a way to loop over columns if you have to do this a bunch of times. I'm not sure what order your original data frame's columns are in and what order you want the new data frame's columns to end up in, so you may need to play around with the "pos" counter. I also assumed the columns to be split were columns 2 thru 201 ("colindex"), so you'll probably have to change that.

newcolumns = list()
pos = 1 #counter for column index of new data frame
for(colindex in 2:201){
    newcolumns[[pos]] = sapply(test[,colindex], function(x) substr(x, 1, 3))
    newcolumns[[pos+1]] = sapply(test[,colindex], function(x) substr(x, 4, 6))
    pos = pos+2
}
newdataframe = data.frame(UID = test[,1], newcolumns)
# update "names(newdataframe)" as needed

来源：https://stackoverflow.com/questions/20783034/r-substr-function-on-multiple-columns

标签

function

apply

lapply

sapply