How to build a data matrix from a mixed and messy CSV file?

失恋的感觉 2021-01-28 01:11

I have a huge .csv file like this:

Transcript Id   Gene Id(name)   Mirna Name  miTG score
ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p   1         


        
2 Answers
  • 2021-01-28 01:56

    Using this test data:

    Lines <- " Transcript Id   Gene Id(name)   Mirna Name  miTG score
    ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p   1
    UTR3    21:30717114-30717142    0.05994568  
    UTR3    21:30717414-30717442    0.13591267  
    ENST00000345080 ENSG00000187772 (LIN28B)    hsa-let-7a-5p   1
    UTR3    6:105526681-105526709   0.133514751"
    

    Read everything in and set nms, the column names for the output. Then compute the grouping vector cs as a cumulative sum, so that each ENST row shares a group number with the UTR3 rows that follow it. Within each group, the first (non-duplicated) row is the ENST row and the duplicated rows are its UTR3 rows. Merge the two sets of rows by group and keep the row with the highest MRE_score in each group:

    DF <- read.table(text = Lines, header = TRUE, fill = TRUE, as.is = TRUE, 
             check.names = FALSE)
    nms <- c("cs", names(DF)[1:5], "UTR3", "MRE_score") # out will have these names
    DF$cs <- cumsum(!is.na(DF$Mirna)) # groups each ENST row with its UTR3 rows
    dup <- duplicated(DF$cs) # FALSE for ENST rows and TRUE for UTR3 rows
    both <- merge(DF[!dup, ], DF[dup, ], by = "cs")[c(1:6, 11:12)]  # merge ENST & UTR3 rows
    names(both) <- nms
    both$MRE_score <- as.numeric(both$MRE_score)
    Rank <- function(x) rank(x, ties.method = "first")
    out <- both[ave(-both$MRE_score, both$cs, FUN = Rank) == 1, -1] # only keep largest score
    

    Here we get:

    > out
           Transcript              Id     Gene      Id(name) Mirna                  UTR3 MRE_score
    2 ENST00000286800 ENSG00000156273  (BACH1) hsa-let-7a-5p     1  21:30717414-30717442 0.1359127
    3 ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p     1 6:105526681-105526709 0.1335148
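
    In case the grouping step is hard to follow, here is a minimal sketch on a toy data frame. The data frame toy and its columns id, mirna and score are made up for illustration and are not part of the real file; only the cumsum / duplicated / ave pattern is taken from the code above.

    ```r
    # Toy stand-in for the parsed file (hypothetical values):
    # NA in `mirna` marks a UTR3 row, a number marks an ENST row.
    toy <- data.frame(
      id    = c("ENST_A", "UTR3", "UTR3", "ENST_B", "UTR3"),
      mirna = c(1, NA, NA, 1, NA),
      score = c(NA, 0.06, 0.14, NA, 0.13)
    )

    # Every ENST row bumps the running count, so the count labels the groups.
    toy$cs <- cumsum(!is.na(toy$mirna))   # 1 1 1 2 2

    # The first row of each group is the ENST row; the rest are its UTR3 rows.
    is_child <- duplicated(toy$cs)        # FALSE TRUE TRUE FALSE TRUE

    # Rank the negated scores within each group and keep rank 1,
    # i.e. the UTR3 row with the largest score per group.
    Rank <- function(x) rank(x, ties.method = "first")
    kids <- toy[is_child, ]
    top  <- kids[ave(-kids$score, kids$cs, FUN = Rank) == 1, ]
    ```

    Using ties.method = "first" means that exactly one row per group survives even when two UTR3 rows share the same score.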
    

    Note that the question refers to a CDS column, but it is not described and does not appear in the example output, so we ignored it.

  • 2021-01-28 02:02

    You could try to structure the CSV using regular expressions:

    textfile <- "ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p   1
    UTR3    21:30717114-30717142    0.05994568  
    UTR3    21:30717414-30717442    0.13591267  
    ENST00000345080 ENSG00000187772 (LIN28B)    hsa-let-7a-5p   1
    UTR3    6:105526681-105526709   0.133514751"
    txt <- readLines(textConnection(textfile))
    
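    # Header rows start with a transcript ID; every other row is a UTR3 row.
    sepr <- grepl("^ENST.*", txt)
    # Count the UTR3 rows that follow each header row.
    # This assumes every header row has at least one UTR3 row.
    r <- rle(sepr)
    r <- r$lengths[!r$values]
    
    # Parse the header rows into four fields, then repeat each one per UTR3 row.
    regex <- "(\\S+)\\s+(\\S+)\\s+(\\([^)]+\\)\\s+\\S+)\\s+(\\d+)"
    m <- regexec(regex, txt[sepr])
    m1 <- as.data.frame(t(sapply(regmatches(txt[sepr], m), "[", 2:5)))
    m1 <- m1[rep(seq_len(nrow(m1)), r),]
    
    # Parse the UTR3 rows into three fields.
    regex <- "(\\S+)\\s+(\\S+)\\s+(\\S+)"
    m <- regexec(regex, txt[!sepr])
    m2 <- as.data.frame(t(sapply(regmatches(txt[!sepr], m), "[", 2:4)))
    
    # Drop the constant "UTR3" label and bind the two parts side by side.
    df <- cbind(m1, m2[,-1])
    names(df) <- c("Transcript Id", "Gene Id(name)", "Mirna Name", "miTG score", "UTR3", "MRE_score")
    rownames(df) <- NULL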
    df
    # Transcript Id   Gene Id(name)                Mirna Name miTG score                  UTR3   MRE_score
    # 1 ENST00000286800 ENSG00000156273     (BACH1) hsa-let-7a-5p          1  21:30717114-30717142  0.05994568
    # 2 ENST00000286800 ENSG00000156273     (BACH1) hsa-let-7a-5p          1  21:30717414-30717442  0.13591267
    # 3 ENST00000345080 ENSG00000187772 (LIN28B)    hsa-let-7a-5p          1 6:105526681-105526709 0.133514751
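
    One caveat with the rle() counts: they only line up with the ENST rows if every ENST row is followed by at least one UTR3 row. A possible alternative, sketched here under the assumption that no field contains internal whitespace, is to split each line on whitespace and index the ENST rows with a cumulative sum. The ENST00000000000 row is a made-up placeholder with no UTR3 rows, added only to show that case, and putting the gene name in its own column is my assumption about the desired layout.

    ```r
    txt <- c(
      "ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p   1",
      "UTR3    21:30717114-30717142    0.05994568",
      "UTR3    21:30717414-30717442    0.13591267",
      # Hypothetical placeholder row without any UTR3 rows:
      "ENST00000000000 ENSG00000000000 (PLACEHOLDER) hsa-let-7a-5p   1",
      "ENST00000345080 ENSG00000187772 (LIN28B)    hsa-let-7a-5p   1",
      "UTR3    6:105526681-105526709   0.133514751"
    )

    txt     <- trimws(txt)              # drop stray leading/trailing blanks
    is_head <- grepl("^ENST", txt)      # TRUE for ENST rows
    grp     <- cumsum(is_head)          # index of the ENST row each line belongs to
    parts   <- strsplit(txt, "\\s+")    # whitespace-separated fields per line

    head_mat  <- do.call(rbind, parts[is_head])    # 5 fields per ENST row
    child_mat <- do.call(rbind, parts[!is_head])   # 3 fields per UTR3 row

    # Look up the owning ENST row for every UTR3 row, then bind the fields.
    out <- data.frame(
      head_mat[grp[!is_head], , drop = FALSE],
      child_mat[, 2:3, drop = FALSE],
      stringsAsFactors = FALSE
    )
    names(out) <- c("Transcript Id", "Gene Id", "Gene name", "Mirna Name",
                    "miTG score", "UTR3", "MRE_score")
    out$MRE_score <- as.numeric(out$MRE_score)
    ```

    ENST rows without UTR3 rows simply never get looked up, so they drop out of the result instead of shifting the alignment.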
    