Question
I have fileA, in which the information is displayed by intervals: if consecutive positions are assigned the same value, they are grouped into one interval.
start end value label
123 78000 0 romeo #value 0 at positions 123 to 77999 included.
78000 78004 56 romeo #value 56 at positions 78000, 78001, 78002 and 78003.
78004 78005 12 romeo #value 12 at position 78004.
78006 78008 21 juliet #value 21 at positions 78006 and 78007.
78008 78056 8 juliet #value 8 at positions 78008 to 78055 included.
The intervals I am interested in are listed in fileB:
start end label
77998 78005 romeo
78007 78012 juliet
[EDIT] The labels in fileA were originally pulled in from fileB, so it is safe to assume that the labels are always equivalent for overlapping intervals.
I am trying to extract the information for every individual position covered by the intervals in the second file, a process that I will call "deconvolution" for lack of a better word. The output, fileC, should look like this:
position value label
77998 0 romeo
77999 0 romeo
78000 56 romeo
78001 56 romeo
78002 56 romeo
78003 56 romeo
78004 12 romeo
78007 21 juliet
78008 8 juliet
78009 8 juliet
78010 8 juliet
78011 8 juliet
This is my code:
#read from tab-delimited text files which do not contain column names
A<-read.table("fileA.txt",sep="\t",colClasses=c("numeric","numeric","numeric","character"))
B<-read.table("fileB.txt",sep="\t",colClasses=c("numeric","numeric","character"))
#create empty data.frame for the output
C <- data.frame (1,2,3)
C <- C[-1,]
#add column names
colnames(A)<-c("start","end","value","label")
colnames(B)<-c("start","end","label")
colnames(C)<-c("position","value","label")
#extract position information
deconvolute <- function(x,y,z) {
for x$label %in% y$label {
#compute sequence of overlapping positions
overlap<-seq(max(x$start,y$start),x$end,1)
z$position<-overlap
#assign corresponding values to the other columns
z$value<-rep(x$value,length(overlap))
z$label<-rep(x$label,length(overlap))
}
}
deconvolute(A,B,C)
I am getting a lot of syntax errors in my function. I would be very happy if someone could help me fix them.
Answer 1:
# create sequence of positions
s <- unlist(apply(B, MARGIN=1, FUN=function(x) seq(x[1], as.numeric(x[2])-1)))
s
[1] 77998 77999 78000 78001 78002 78003 78004 78007 78008 78009 78010 78011
# matching between files A and B
pos <- unlist(sapply(s, FUN=function(x)
which(
apply(A, MARGIN=1, FUN=function(y) as.numeric(y[1])<=as.numeric(x) & as.numeric(x) < as.numeric(y[2])))
))
# new dataframe
deconvoluted <- data.frame(position = s, value = A$value[pos], label = A$label[pos])
deconvoluted
position value label
1 77998 0 romeo
2 77999 0 romeo
3 78000 56 romeo
4 78001 56 romeo
5 78002 56 romeo
6 78003 56 romeo
7 78004 12 romeo
8 78007 21 juliet
9 78008 8 juliet
10 78009 8 juliet
11 78010 8 juliet
12 78011 8 juliet
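For completeness, here is one possible repair of the loop-based function from the question, offered as a sketch rather than the only fix. The syntax errors come mainly from the `for` statement, which in R must have the form `for (variable in sequence)`. Beyond syntax, the original function never returned anything, and the overlap should stop at the smaller of the two `end` values rather than at `x$end`. The in-memory data frames below are stand-ins for fileA.txt and fileB.txt, and the helper name `pieces` is introduced here for illustration only. The sketch assumes half-open intervals (`end` is excluded), as described in the question.

```r
# Example data copied from the question (stand-ins for fileA.txt and fileB.txt)
A <- data.frame(
  start = c(123, 78000, 78004, 78006, 78008),
  end   = c(78000, 78004, 78005, 78008, 78056),
  value = c(0, 56, 12, 21, 8),
  label = c("romeo", "romeo", "romeo", "juliet", "juliet"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  start = c(77998, 78007),
  end   = c(78005, 78012),
  label = c("romeo", "juliet"),
  stringsAsFactors = FALSE
)

# x: interval table with values (fileA), y: intervals of interest (fileB)
deconvolute <- function(x, y) {
  pieces <- list()  # one small data frame per overlapping pair of rows
  for (i in seq_len(nrow(y))) {
    for (j in seq_len(nrow(x))) {
      # only compare rows that carry the same label
      if (x$label[j] != y$label[i]) next
      # overlap of the two half-open intervals [start, end)
      from <- max(x$start[j], y$start[i])
      to   <- min(x$end[j], y$end[i]) - 1
      if (from > to) next  # the two intervals do not overlap
      overlap <- seq(from, to)
      pieces[[length(pieces) + 1]] <- data.frame(
        position = overlap,
        value    = x$value[j],  # recycled to the length of overlap
        label    = x$label[j],
        stringsAsFactors = FALSE
      )
    }
  }
  # stack the pieces; returns NULL if nothing overlapped
  do.call(rbind, pieces)
}

C <- deconvolute(A, B)
print(C)
```

For the example data this gives the twelve rows listed as fileC in the question. The nested loop compares every row of B with every row of A, so it is meant for readability; for large files a vectorised lookup would likely be faster.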
Source: https://stackoverflow.com/questions/21626354/deconvoluting-intervals-into-position-information