Question
In genomics research, you often have many strings containing duplicate gene names. I would like to find an efficient way to keep only the unique gene names in each string. The example below works, but is it possible to do this in one step, i.e., without splitting the entire string and then pasting the unique elements back together?
genes <- c("GSTP1;GSTP1;APC")
a <- unlist(strsplit(genes, ";"))
paste(unique(a), collapse=";")
[1] "GSTP1;APC"
Answer 1:
An alternative is to get the unique elements as a character vector:
unique(unlist(strsplit(genes, ";")))
#[1] "GSTP1" "APC"
Pasting them back together then gives the answer in a single expression:
paste(unique(unlist(strsplit(genes, ";"))), collapse = ";")
#[1] "GSTP1;APC"
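Since the question mentions many strings, the same split/unique/paste idea can be wrapped in a small vectorized helper. This is only a sketch: the function name `dedupe_genes` and its `sep` argument are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper (name and arguments are assumptions, not from the answer):
# applies split -> unique -> paste to every element of a character vector.
dedupe_genes <- function(x, sep = ";") {
  # fixed = TRUE treats sep as a literal string rather than a regex
  parts <- strsplit(x, sep, fixed = TRUE)
  # vapply guarantees one character string back per input element
  vapply(parts, function(p) paste(unique(p), collapse = sep), character(1))
}

genes <- c("GSTP1;GSTP1;APC", "APC;GSTP1;APC")
dedupe_genes(genes)
#[1] "GSTP1;APC" "APC;GSTP1"
```

The first occurrence of each gene is kept, so the original order within each string is preserved.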
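Answer 2:
Based on the example shown, perhaps a regex substitution would do (note that this only collapses a duplicate that immediately follows its first occurrence):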
gsub("(\\w+);\\1", "\\1", genes)
#[1] "GSTP1;APC"
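To see where the regex and the split-based approach differ, here is a small check on inputs that are not in the original question. The variable names `non_adjacent` and `triple` are invented for this illustration.

```r
# Illustrative inputs (not from the question)
non_adjacent <- "GSTP1;APC;GSTP1"   # the duplicate is separated by another gene
triple       <- "APC;APC;APC"       # the same gene three times in a row

# The backreference needs the repeat directly after the ';', so nothing matches here
regex_non_adjacent <- gsub("(\\w+);\\1", "\\1", non_adjacent)
regex_non_adjacent
#[1] "GSTP1;APC;GSTP1"

# A single gsub pass collapses the first pair only, leaving one duplicate behind
regex_triple <- gsub("(\\w+);\\1", "\\1", triple)
regex_triple
#[1] "APC;APC"

# The split/unique/paste approach handles both cases
split_non_adjacent <- paste(unique(unlist(strsplit(non_adjacent, ";"))), collapse = ";")
split_non_adjacent
#[1] "GSTP1;APC"
```

So the regex is fine for data shaped like the example, while the split-based version is the safer choice when duplicates can appear anywhere in the string.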
Source: https://stackoverflow.com/questions/38210469/keep-only-unique-elements-in-string-in-r