Question
Let's say I have a data table called YC that looks like this:
Categories: colsums: tillTF:
ID: cat NA 0
MA NA 0
spayed NA 0
declawed NA 0
black NA 0
3 NA 0
no 57 1
claws NA 0
calico NA 0
4 NA 0
no 42 1
striped NA 0
0.5 NA 0
yes 84 1
not fixed NA 0
declawed NA 0
black NA 0
0.2 NA 0
yes 19 1
0.2 NA 0
yes 104 1
NH NA 0
spayed NA 0
claws NA 0
striped NA 0
12 NA 0
no 17 1
black NA 0
4 NA 0
yes 65 1
ID: DOG NA 0
MA NA 0
...
Only it's 1) not actually a pivot table, it's inconsistently formatted to look like one, and 2) the data is much more complicated, and was entered inconsistently over the course of a few decades. The only assumption that can be safely made about the data is that there are 12 variables associated with each record, and they are always entered in the same order.
My goal is to parse this data so that each attribute and its associated numeric record are in the appropriate columns in a single row, like this:
Cat MA spayed declawed black 3 no 57
Cat MA spayed claws calico 0.5 no 42
Cat MA not fixed declawed black 0.2 yes 19
Cat MA not fixed declawed black 0.2 yes 104
Cat NH spayed claws striped 12 no 17
Cat NH spayed claws black 4 yes 65
Dog MA ....
I've written a for loop which identifies a "record" and then re-writes values in an array by reading backwards up the column in the data table until another "record" is reached. I'm new to R, and so wrote out my ideal loop without knowing whether it was possible.
array<-rep(0, length(7))
for (i in 1:7)
if(YC$tillTF[i]==1){
array[7]<-(YC$colsums[i])
array[6]<-(YC$Categories[i])
array[5]<-(YC$Categories[i-1])
array[4]<-(YC$Categories[i-2])
array[3]<-(YC$Categories[i-3])
array[2]<-(YC$Categories[i-4])
array[1]<-(YC$Categories[i-5])
}
YC_NT<-rbind(array)
Once array is filled in, I want to loop through YC and create a new row in YC_NT for each unique record:
for (i in 8:length(YC$tillTF))
if (YC$tillTF[i]==1){
array[8]<-(YC$colsums[i])
array[7]<-(YC$Categories[i])
if (YC$tillTF[i-1]==0){
array[6]<-YC$Categories[i-1]
}else{
rbind(array, YC_NT)}
if (YC$tillTF[i-2]==0){
array[5]<-YC$Categories[i-2]
}else{
rbind(array, YC_NT)}
if(YC$tillTF[i-3]==0){
array[4]<-YC$Categories[i-3]
}else{
rbind(array, YC_NT)}
if(YC$tillTF[i-4]==0){
array[3]<-YC$Categories[i-4]
}else{
rbind(array, YC_NT)}
if(YC$tillTF[i-5]==0){
array[2]<-YC$Categories[i-5]
}else{
rbind(array, YC_NT)}
if(YC$tillTF[i-6]==0){
array[1]<-YC$Categories[i-6]
}else{
rbind(array, YC_NT)}
}else{
array<-array}
When I run this loop within a function on my data, I get my YC_NT data table back containing a single row. After spending a few days searching, I don't know whether there is an R function that can add the vector array to the last row of a data table without giving it a unique name every time. My questions:
1) Is there a function that would add a vector called array to the end of a data table without overwriting a previous row called array?
2) If no such function exists, how could I create a new name for array every time my for loop reaches a new numeric record?
Thanks for your help,
Answer 1:
rbind or rbind.fill (from the plyr package) should do the trick. Alternatively, you can insert a row more efficiently with code such as:
df[nrow(df) + 1,] <- newrow
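Note that rbind returns a new object rather than modifying its arguments, so inside a loop its result has to be assigned back; in the question's loop the result of rbind(array, YC_NT) is discarded. A minimal sketch of the two usual patterns follows, where the records list is hypothetical sample data standing in for the vectors built in the loop:

```r
# Hypothetical records standing in for the vectors built inside the loop
records <- list(
  c("cat", "MA", "spayed", "declawed", "black", "3", "no", "57"),
  c("cat", "MA", "spayed", "claws", "calico", "4", "no", "42")
)

# Option 1: assign the result of rbind() back on every iteration
YC_NT <- NULL
for (rec in records) {
  YC_NT <- rbind(YC_NT, rec)
}

# Option 2: collect the rows in a list and bind them once at the end,
# which avoids copying the growing table on every iteration
rows <- vector("list", length(records))
for (i in seq_along(records)) {
  rows[[i]] <- records[[i]]
}
YC_NT2 <- do.call(rbind, rows)
```

Both variants produce a character matrix with one row per record; no unique name per row is needed.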
Answer 2:
So I'm going to assume that a record ends every time tillTF=1 and a new one begins on the next line, and that the n variables specified for the next subject replace just the last n variables, while the previous values all remain the same. I'm also assuming that all records are "complete" in that the last line has tillTF=1. (To make the last statement true, I removed the last two lines from your sample.)
Here's how I might read the data in
dog <- read.fwf("dog.txt", widths=c(22,11,7), skip=1, stringsAsFactors=F)
dog$V1 <- gsub("\\s{2,}","",dog$V1)
dog$V2 <- gsub("\\s","",dog$V2)
dog$V3 <- as.numeric(gsub("\\s","",dog$V3))
So I read in the data here and strip off the extra spaces. Now I will add an ID column giving each record a unique ID, incrementing that value every time tillTF=1. Then I'll split the data on that ID value:
dog$ID<-c(0, cumsum(dog$V3[-nrow(dog)]))
dv <- lapply(split(dog, dog$ID), function(x) {
c(x$V1, x$V2[nrow(x)])}
)
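To see how the shifted cumsum assigns IDs, here is a small sketch on a hypothetical tillTF vector (not the real data). Dropping the last element and prepending 0 delays each increment by one row, so the row with tillTF=1 stays in the group it closes:

```r
# Hypothetical flag column: 1 marks the last line of each record
tillTF <- c(0, 0, 1, 0, 1, 1)

# Shift the running count down by one row so the flagged row
# keeps the ID of the record it terminates
id <- c(0, cumsum(tillTF[-length(tillTF)]))
print(id)

# Row indices per record
groups <- split(seq_along(tillTF), id)
print(groups)
```

Here rows 1 to 3 form the first record, rows 4 and 5 the second, and row 6 the third.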
Now I'll go through the list with Reduce and each time replace the last n variables with the n variables for a given ID.
trans <- Reduce(function(a,b) {
a[(length(a)-length(b)+1):length(a)] <- b
a
}, dv, accumulate=T)
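As an illustration of what Reduce with accumulate=TRUE does here, the sketch below uses a shortened, hypothetical dv list with four fields per record. The helper name fill_tail is also made up for this example:

```r
# Hypothetical, shortened records: the first is complete,
# later ones list only the trailing fields that changed
dv <- list(
  c("cat", "MA", "black", "no"),
  c("calico", "no"),
  c("NH", "striped", "yes")
)

# Overwrite the last length(b) entries of the running record a with b
fill_tail <- function(a, b) {
  a[(length(a) - length(b) + 1):length(a)] <- b
  a
}

# accumulate = TRUE keeps every intermediate record, not just the final one
trans <- Reduce(fill_tail, dv, accumulate = TRUE)
print(trans)
```

The second element becomes cat, MA, calico, no and the third becomes cat, NH, striped, yes, so each record inherits the leading fields it did not restate.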
Now I'll put all the data together with tabs and then use read.table to process the data, do all the proper data conversions, and create a data frame:
dd<-read.table(text=sapply(trans, paste0, collapse="\t"), sep="\t")
That gives
# print(dd)
V1 V2 V3 V4 V5 V6 V7 V8
1 ID: cat MA spayed declawed black 3.0 no 57
2 ID: cat MA spayed claws calico 4.0 no 42
3 ID: cat MA spayed claws striped 0.5 yes 84
4 ID: cat MA not fixed declawed black 0.2 yes 19
5 ID: cat MA not fixed declawed black 0.2 yes 104
6 ID: cat NH spayed claws striped 12.0 no 17
7 ID: cat NH spayed claws black 4.0 yes 65
So as you can see, I left the "ID:" on but it should be easy enough to strip that off. But these commands do the basic reshaping for you. There are fewer arrays and if statements and rbinding in the solution which is nice, but I encourage you to make sure you understand each line if you want to use it.
Also note that my output is slightly different than your expected output; you are missing the "84" value and have the calico with "42" listed as "0.5" rather than "4.0". So let me know if I was wrong in how I interpreted the data or perhaps correct the example output.
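For the final steps, a minimal sketch on two hypothetical accumulated records shows why the fields are joined with tabs (values such as "not fixed" contain a space, so a whitespace separator would split them) and one way to strip the "ID:" prefix afterwards. The regular expression is an assumption about how the prefix is formatted:

```r
# Hypothetical accumulated records; "not fixed" contains a space
trans <- list(
  c("ID: cat", "MA", "not fixed", "0.2", "yes", "19"),
  c("ID: cat", "NH", "spayed", "12", "no", "17")
)

# Join each record with tabs and let read.table infer the column types
dd <- read.table(text = sapply(trans, paste0, collapse = "\t"),
                 sep = "\t", stringsAsFactors = FALSE)

# Drop the leading "ID:" label and any spaces that follow it
dd$V1 <- sub("^ID:\\s*", "", dd$V1)
str(dd)
```

The numeric fields come back as numeric or integer columns, while the text fields stay as character columns.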
Source: https://stackoverflow.com/questions/24122131/parsing-data-in-r-alternative-to-rbind-which-can-be-put-in-for-loop-to-writ