Dictionary style replace multiple items

太阳男子 2020-11-22 05:02

I have a large data.frame of character data that I want to convert based on what is commonly called a dictionary in other languages.

Currently I am going about it li

10 answers
  • 2020-11-22 05:39
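    # Named character vector used as a lookup table: names are keys, elements are values
    map <- setNames(c("0101", "0102", "0103"), c("AA", "AC", "AG"))
    # foo[] keeps the data frame structure while replacing every cell
    foo[] <- map[unlist(foo)]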
    

    This assumes that map covers all the cases in foo. It would feel less like a 'hack', and be more efficient in both space and time, if foo were a matrix (of character()), in which case:

    matrix(map[foo], nrow=nrow(foo), dimnames=dimnames(foo))
    

    Both matrix and data frame variants run afoul of R's 2^31-1 limit on vector size when there are millions of SNPs and thousands of samples.
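
    If map might not cover every value, the lookup returns NA for the uncovered ones. A minimal sketch of one way to keep the original value in that case (the small foo below is made up for illustration, and falling back to the original value is an assumption about what is wanted):

```r
# Hypothetical two-column example; "AT" has no entry in map
foo <- data.frame(snp1 = c("AA", "AG"),
                  snp2 = c("AA", "AT"),
                  stringsAsFactors = FALSE)
map <- setNames(c("0101", "0102", "0103"), c("AA", "AC", "AG"))

vals <- unlist(foo)
# Use the mapped value where a key exists, otherwise keep the original value
recoded <- ifelse(vals %in% names(map), map[vals], vals)
foo[] <- recoded
foo
```

    Values without a key, including NA, pass through unchanged instead of being turned into NA.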

  • 2020-11-22 05:39

    Note: this answer started as an attempt to solve the much simpler problem posted in "How to replace all values in data frame with a vector of values?". Unfortunately, that question was closed as a duplicate of this one, so I'll suggest a solution based on replacing factor levels that covers both cases.


    If only a single vector (or one data frame column) needs its values replaced, and there is no objection to using factor, we can coerce the vector to a factor and change the factor levels as required:

    x <- c(1, 1, 4, 4, 5, 5, 1, 1, 2)
    x <- factor(x)
    x
    #[1] 1 1 4 4 5 5 1 1 2
    #Levels: 1 2 4 5
    replacement_vec <- c("A", "T", "C", "G")
    levels(x) <- replacement_vec
    x
    #[1] A A C C G G A A T
    #Levels: A T C G
    

    Using the forcats package, this can be done in a one-liner:

    x <- c(1, 1, 4, 4, 5, 5, 1, 1, 2)
    forcats::lvls_revalue(factor(x), replacement_vec)
    #[1] A A C C G G A A T
    #Levels: A T C G
    

    In case all values of multiple columns of a data frame need to be replaced, the approach can be extended.

    foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"), 
                      snp2 = c("AA", "AT", "AG", "AA"), 
                      snp3 = c(NA, "GG", "GG", "GC"), 
                      stringsAsFactors=FALSE)
    
    level_vec <- c("AA", "AC", "AG", "AT", "GC", "GG")
    replacement_vec <- c("0101", "0102", "0103", "0104", "0302", "0303")
    foo[] <- lapply(foo, function(x) forcats::lvls_revalue(factor(x, levels = level_vec), 
                                                           replacement_vec))
    foo
    #  snp1 snp2 snp3
    #1 0101 0101 <NA>
    #2 0103 0104 0303
    #3 0101 0103 0303
    #4 0101 0101 0302
    

    Note that level_vec and replacement_vec must have equal lengths.

    More importantly, level_vec should be complete, i.e., include all possible values in the affected columns of the original data frame. (Use unique(sort(unlist(foo))) to verify.) Otherwise, any value that is missing from level_vec will be coerced to <NA>. Note that this is also a requirement for Martin Morgan's answer.

    So, if only a few different values need to be replaced, you will probably be better off with one of the other answers, e.g., Ramnath's.
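
    The completeness check mentioned above could be sketched as follows. Here level_vec is deliberately incomplete, and the names observed and uncovered are introduced only for illustration:

```r
foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
                  snp2 = c("AA", "AT", "AG", "AA"),
                  snp3 = c(NA, "GG", "GG", "GC"),
                  stringsAsFactors = FALSE)

# Deliberately incomplete set of levels (assumption for this example)
level_vec <- c("AA", "AC", "AG")

# Distinct non-missing values in the data that level_vec does not cover
observed <- unique(unlist(foo))
uncovered <- setdiff(observed[!is.na(observed)], level_vec)

if (length(uncovered) > 0) {
  message("Values without a replacement: ", paste(uncovered, collapse = ", "))
}
```

    Running a check like this before recoding shows which values would otherwise silently become <NA>.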

  • 2020-11-22 05:40

    I used @Ramnath's answer, but made it read the mapping (what is to be replaced and what to replace it with) from a file, and used gsub rather than replace.

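    # Read the tab-separated mapping (columns "from" and "to")
    hrw <- read.csv("hgWords.txt", header=TRUE, stringsAsFactors=FALSE, encoding="UTF-8", sep="\t")
    
    # Apply each replacement in turn; seq_len() visits every row, not just the last one
    for (i in seq_len(nrow(hrw)))
    {
      document <- gsub(hrw$from[i], hrw$to[i], document, ignore.case=TRUE)
    }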
    

    hgWords.txt contains the following tab-separated values:

    "from"  "to"
    "AA"    "0101"
    "AC"    "0102"
    "AG"    "0103" 
    
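
    One thing to keep in mind with the gsub loop above is that gsub treats each "from" value as a regular expression and also replaces it inside longer strings. If only whole values should be replaced, a possible sketch is to anchor each pattern. The inline hrw and document below are made-up stand-ins for the file and the data, and the keys are assumed to contain no regex metacharacters:

```r
# Hypothetical in-memory mapping in place of reading hgWords.txt
hrw <- data.frame(from = c("AA", "AC", "AG"),
                  to   = c("0101", "0102", "0103"),
                  stringsAsFactors = FALSE)
document <- c("AA", "ag", "AAG")

for (i in seq_len(nrow(hrw))) {
  # ^ and $ restrict the match to the whole string
  pattern <- paste0("^", hrw$from[i], "$")
  document <- gsub(pattern, hrw$to[i], document, ignore.case = TRUE)
}
document
```

    With the anchors, "AAG" is left alone, whereas the unanchored loop would replace the "AA" inside it.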
  • 2020-11-22 05:42

    Using dplyr::recode:

    library(dplyr)
    
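    # funs() is deprecated and mutate_all() is superseded in current dplyr;
    # across() (dplyr >= 1.0.0) is the equivalent form
    mutate(foo, across(everything(), ~ recode(.x, "AA" = "0101", "AC" = "0102", "AG" = "0103",
                                              .default = NA_character_)))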
    
    #   snp1 snp2 snp3
    # 1 0101 0101 <NA>
    # 2 0103 <NA> <NA>
    # 3 0101 0103 <NA>
    # 4 0101 0101 <NA>
    