问题
I have a data frame where for each sample the columns can have multiple values, for example:
Gene Pvalue1 Pvalue2 Pvalue3 Beta
Ace 0.0381, ., 0.00357 0.01755, 0.001385 0.0037, NA , 0.039 -0.03,1,15
NOS NA 0.02 0.001, 0.00067 0.00009,25,30
I want to apply min()
and max()
for each gene's data (I have thousands of genes in total) in each column and get the smallest value for the pvalues but the largest value for columns such as the beta. So the output data would look like this:
Gene Pvalue1 Pvalue2 Pvalue3 Beta
Ace 0.00357 0.001385 0.0037 15
NOS NA 0.02 0.00067 30
I'm new to R and not sure if what I'm asking is possible, if there are multiple values in one cell are they viewed as strings?
回答1:
A possible solution using stringr
and dplyr
:
library(dplyr)
library(stringr)
getmin = function(col) str_extract_all(col,"[0-9\\.-]+") %>%
lapply(.,function(x) min(as.numeric(x),na.rm = T) ) %>%
unlist()
df %>%
mutate_at(names(df)[-1],getmin)
Gene Pvalue1 Pvalue2 Pvalue3 Beta
1 Ace 0.00357 0.001385 0.00370 -3e-02
2 NOS Inf 0.020000 0.00067 9e-05
Warning messages:
1: In FUN(X[[i]], ...) : NAs introduced by coercion
2: In min(as.numeric(x), na.rm = T) :
no non-missing arguments to min; returning Inf
The function getmin
extract the number with str_extract_all
:
str_extract_all(df$Pvalue2,"[0-9\\.-]+")
[[1]]
[1] "0.01755" "0.001385"
[[2]]
[1] "0.02"
It has the advantage of being insensible to space or other characters, but can extract just a dot. I then loop on this list to extract in each cell the minimum, and convert the list into a vector with unlist
. Using the as.numeric()
function convert the possible extracted .
to NA
.
the code df %>% mutate_at(names(df)[-1],getmin)
just apply this function on all columns exept the first one
edit: if you want to avoid inf values, you can use this slight modified version:
min2 = function(x) if(all(is.na(x))) NA else min(x,na.rm = T)
getmin = function(col) str_extract_all(col,"[0-9\\.-]+") %>%
lapply(.,function(x)min2(as.numeric(x)) ) %>%
unlist()
df %>%
mutate_at(names(df)[-1],getmin)
Gene Pvalue1 Pvalue2 Pvalue3 Beta
1 Ace 0.00357 0.001385 0.00370 -3e-02
2 NOS NA 0.020000 0.00067 9e-05
data:
df <- read.table(text = "
Gene Pvalue1 Pvalue2 Pvalue3 Beta
Ace 0.0381,.,0.00357 0.01755,0.001385 0.0037,NA,0.039 -0.03,1,15
NOS NA 0.02 0.001,0.00067 0.00009,25,30
",header = T)
回答2:
Here is a base R solution using regmatches
+ gregexpr
to sort out numbers, i.e.,
dPvalue <- t(apply(df[grep("Pvalue",names(df))], 1, function(v) {
unlist(Map(function(x) ifelse(length(x)>0, min(as.numeric(x)),NA), regmatches(v, gregexpr("-?\\d+(\\.\\d+)?",v))))
}))
Beta <- apply(df[grep("Beta",names(df))], 1, function(v) {
unlist(Map(function(x) ifelse(length(x)>0, max(as.numeric(x)),NA), regmatches(v, gregexpr("-?\\d+(\\.\\d+)?",v))))
})
dfout <- cbind(df["Gene"],Pvalue,Beta)
such that
> dfout
Gene Pvalue1 Pvalue2 Pvalue3 Beta
1 Ace 0.00357 0.001385 0.00370 15
2 NOS NA 0.020000 0.00067 30
DATA
df <- structure(list(Gene = structure(1:2, .Label = c("Ace", "NOS"), class = "factor"),
Pvalue1 = structure(c(1L, NA), .Label = "0.0381,.,0.00357", class = "factor"),
Pvalue2 = structure(1:2, .Label = c("0.01755,0.001385", "0.02"
), class = "factor"), Pvalue3 = structure(2:1, .Label = c("0.001,0.00067",
"0.0037,NA,0.039"), class = "factor"), Beta = structure(1:2, .Label = c("-0.03,1,15",
"0.00009,25,30"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
回答3:
Using data.table, convert wide-to-long, split on comma, get min for P-values and max for Betas, and finally convert back to long-to-wide.
library(data.table)
dt1 <- fread("
Gene Pvalue1 Pvalue2 Pvalue3 Beta
Ace 0.0381,.,0.00357 0.01755,0.001385 0.0037,NA,0.039 -0.03,1,15
NOS NA 0.02 0.001,0.00067 0.00009,25,30
")
dcast(
melt(dt1, id.vars = "Gene")[, paste0("col", 1:3) := lapply(tstrsplit(value, ","), as.numeric)
][, MinMax := ifelse(grepl("Pvalue", variable),
pmin(col1, col2, col3, na.rm = TRUE),
pmax(col1, col2, col3, na.rm = TRUE)) ],
Gene ~ variable, value.var = "MinMax")
# Gene Pvalue1 Pvalue2 Pvalue3 Beta
# 1: Ace 0.00357 0.001385 0.00370 15
# 2: NOS NA 0.020000 0.00067 30
# Warning message:
# In lapply(tstrsplit(value, ","), as.numeric) : NAs introduced by coercion
Note: same steps can be applied using dplyr/tidyr
.
回答4:
Another option is using data.table and matrixstats:
library(data.table)
library(matrixStats)
pval_cols <- grep("Pvalue", names(DT), fixed = TRUE, value = TRUE)
min_fun <- function(x) {
y <- tstrsplit(x, split = ",", fixed = TRUE)
y <- rowMins(sapply(y, as.numeric), na.rm = TRUE)
y <- replace(y, !is.finite(y), NA)
return(y)
}
DT[, (pval_cols) := lapply(.SD, min_fun)
, .SDcols = pval_cols][]
which gives:
> DT Gene Pvalue1 Pvalue2 Pvalue3 Beta 1: Ace 0.00357 0.001385 0.00370 -0.03,1,15 2: NOS NA 0.020000 0.00067 0.00009,25,30
For the Beta
-column(s) you can create a similar max_fun
: just replace rowMins
with rowMaxs
.
回答5:
Here's the general idea.
applyFunctionToString <- function(
string
, sep = ","
){
string <- gsub(" ", "", string)
string <- unlist(strsplit(string, sep))
string[string == "NA"] <- NA
numbers <- as.numeric(string)
min(numbers, na.rm = TRUE)
}
sapply(c("0.01755, 0.001385", "0.0037, NA , 0.039"), applyFunctionToString)
You actually want to get into string operations, convert each string into a numeric vector and then do your summary function (min
or max
).
The code I wrote here works in this instance, but you have to consider more factors:
- Do your strings contain other characters the need to be removed?
- What are missing values represented by?
You can also pass the function that you want to apply (min
for example), but then you have other questions like how do you pass additional arguments to that function (using ...
) - this would be out of scope.
Hope, it still helps a little.
来源:https://stackoverflow.com/questions/59786200/select-min-or-max-values-within-one-cell-delimited-string