Question
I am a student conducting a gene expression survival analysis in R. I have expression data for 249 patients across 6,000 genes, and I am using their event-free survival times and vital status as the response variables. When I ran the Cox regression on my dataset, I got extremely strange results (p-values of 0.00 and implausible hazard ratios). I have checked my code multiple times but cannot find my mistake: when I tried earlier with just one gene it worked fine, but when I test multiple genes using the '.' shorthand in the formula, I do not get proper results. I would highly appreciate any help and have attached both my code and output! Let me know if more information is needed.
library(survival)
options(expressions = 5e5)
firstSplitData <- read.delim("/Users/menon/OneDrive/Desktop/csrsef files/FirstSplitDataFrame.txt")
firstInitialData <- data.frame(firstSplitData)
firstEventFreeTime <- firstInitialData[ , c("EFST")]
firstVitalStatus <- firstInitialData[, c("Status")]
#create a temporary object to use in the final object in order to be able to use '.'
temporaryObj <- Surv(as.numeric(firstEventFreeTime), firstVitalStatus == 2)
firstFinalData <- data.frame(SurvObj = temporaryObj)
#bind the two together for the final data
firstFinalData <- cbind(firstFinalData, firstInitialData[, 2:ncol(firstInitialData)])
#create final cox model
firstCox <- coxph(SurvObj ~ ., data = firstFinalData)
summary(firstCox)$coefficients
And here is (some of) my output:
> summary(firstCox)$coefficients
coef exp(coef) se(coef) z Pr(>|z|)
EFST 3.644083e-03 1.003651e+00 0.0001340611 27.1822581 1.052851e-162
Status -2.926090e+00 5.360625e-02 0.3182658189 -9.1938542 3.790122e-20
AADACL3 1.502153e+02 1.728460e+65 0.3665374081 409.8224582 0.000000e+00
AADACL4 5.857192e+01 2.738174e+25 0.3681708023 159.0889828 0.000000e+00
ACADM 2.455978e+02 4.589695e+106 0.2175220391 1129.0710334 0.000000e+00
ACAP3 4.093913e+02 6.256964e+177 0.2756635268 1485.1121632 0.000000e+00
ACOT11 1.940976e+01 2.688751e+08 0.3251033140 59.7033512 0.000000e+00
ACOT7 -2.841794e+02 3.823403e-124 0.3139848504 -905.0736377 0.000000e+00
ACTB -5.562202e+01 6.976896e-25 0.3173481100 -175.2713234 0.000000e+00
ACTL8 -4.017414e+02 3.356676e-175 0.3435128215 -1169.5093020 0.000000e+00
ACTRT2 -7.613568e+01 8.603881e-34 0.2861088372 -266.1074036 0.000000e+00
ADC -1.244476e+02 8.976070e-55 0.3201452217 -388.7223972 0.000000e+00
ADPRHL2 4.887427e+01 1.681998e+21 0.2895110526 168.8165913 0.000000e+00
AGMAT 7.266946e+02 Inf 0.4295874196 1691.6104194 0.000000e+00
AGO1 3.352041e+02 3.778188e+145 0.2633158947 1273.0111995 0.000000e+00
...
And here is what dput(firstFinalData[1:10, 1:10]) produces:
structure(list(SurvObj = structure(c(444, 5553, 5296, 922, 205,
47, 401, 245, 263, 5564, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0), .Dim = c(10L,
2L), .Dimnames = list(NULL, c("time", "status")), type = "right", class = "Surv"),
EFST = c(444L, 5553L, 5296L, 922L, 205L, 47L, 401L, 245L,
263L, 5564L), Status = c(2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 1L), AADACL3 = c(5.52132, 5.64712, 5.45876, 5.71481,
5.1269, 5.88764, 5.08912, 4.91729, 5.65387, 5.59824), AADACL4 = c(5.17251,
5.41843, 5.10969, 5.23402, 4.60353, 5.70923, 5.02245, 5.1466,
4.8355, 4.83986), ACADM = c(7.47834, 7.43494, 7.91155, 7.86337,
8.39009, 6.16251, 7.83793, 7.71742, 6.98061, 7.78087), ACAP3 = c(7.80589,
8.00354, 7.75014, 7.61566, 7.55267, 7.9449, 7.20561, 7.99776,
7.72778, 7.43355), ACOT11 = c(6.75915, 6.30386, 6.38214,
6.54392, 6.64743, 6.78981, 6.42641, 6.58761, 6.66693, 6.53731
), ACOT7 = c(8.11807, 8.38011, 7.8349, 8.43645, 8.11502,
8.0109, 7.6866, 8.55327, 8.17004, 7.44455), ACTB = c(10.8227,
11.4556, 11.4216, 11.332, 10.9536, 9.83797, 11.2352, 11.5006,
11.1817, 10.895)), row.names = c(NA, 10L), class = "data.frame")
Thank you so much!
Edit:
I also get this warning message when I run firstCox <- coxph(SurvObj ~ ., data = firstFinalData):
In fitter(X, Y, istrat, offset, init, control, weights = weights, :
Ran out of iterations and did not converge
Answer 1:
If you want to fit a separate Cox regression model for each gene, each with a single predictor, you can use the following code on your posted example data. First, remove the survival object in the first column.
myData <- firstFinalData[, -1]
library(survival)
firstCox <- coxph(Surv(EFST, Status) ~ ., data = myData)
Fitting all predictors at once returns the warning you reported (too many predictors):
Warning message:
In fitter(X, Y, istrat, offset, init, control, weights = weights, :
Ran out of iterations and did not converge
To run multiple univariate models, first create a list of univariate formulas:
formulas <- sapply(names(myData)[3:9], function(x) as.formula(paste('Surv(EFST, Status) ~ ',x)))
Create a list of models using the coxph function:
models <- lapply(formulas, function(x) coxph(x, data=myData))
Extract the hazard ratios (exp(coef)), 95% confidence intervals, and p-values:
res <- lapply(models, function(x) return(cbind(HR=exp(coef(x)), exp(confint(x)), Pval=coef(summary(x))[5])))
res
$AADACL3
HR 2.5 % 97.5 % Pval
AADACL3 0.1858129 0.008579879 4.024119 0.283442
$AADACL4
HR 2.5 % 97.5 % Pval
AADACL4 0.8481017 0.02748128 26.17333 0.9249839
...
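The per-gene results can also be collected into a single table, which is easier to sort and filter than a list once there are thousands of genes. The following is a minimal sketch, not part of the original answer. It rebuilds two gene columns from the question's dput() output, and fitOneGene and resultTable are hypothetical names introduced only for illustration.

```r
library(survival)

# Two gene columns copied from the question's dput() output (10 patients)
myData <- data.frame(
  EFST    = c(444, 5553, 5296, 922, 205, 47, 401, 245, 263, 5564),
  Status  = c(2, 1, 1, 2, 2, 2, 2, 2, 2, 1),
  AADACL3 = c(5.52132, 5.64712, 5.45876, 5.71481, 5.1269,
              5.88764, 5.08912, 4.91729, 5.65387, 5.59824),
  AADACL4 = c(5.17251, 5.41843, 5.10969, 5.23402, 4.60353,
              5.70923, 5.02245, 5.1466, 4.8355, 4.83986)
)

# Every column other than the survival time and status is treated as a gene
genes <- setdiff(names(myData), c("EFST", "Status"))

# Hypothetical helper: fit one univariate Cox model, return a one-row summary
fitOneGene <- function(gene) {
  fit <- coxph(as.formula(paste("Surv(EFST, Status) ~", gene)), data = myData)
  s <- summary(fit)
  data.frame(
    gene  = gene,
    HR    = s$conf.int[1, "exp(coef)"],
    lower = s$conf.int[1, "lower .95"],
    upper = s$conf.int[1, "upper .95"],
    Pval  = s$coefficients[1, "Pr(>|z|)"]
  )
}

# Stack the one-row summaries and sort by p-value
resultTable <- do.call(rbind, lapply(genes, fitOneGene))
resultTable <- resultTable[order(resultTable$Pval), ]
print(resultTable)
```

With only ten patients the intervals are very wide, so this is meant to show the mechanics rather than to produce meaningful estimates.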
Answer 2:
Except for the first two coefficients (EFST and Status), the coefficients for all other genes are extremely large in magnitude (either strongly negative or strongly positive), which gives very large negative/positive z-statistics and explains the p-values of zero you're seeing.
I'm not sure I fully understand what you're doing. Wouldn't regressing on 6,000 genes with only 249 patients mean that you have far more parameters than observations?
In that case you're running into overfitting issues, which would explain the parameter estimates.
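The point about having more parameters than observations can be checked numerically. The sketch below uses simulated values, not the asker's data, with the same dimensions as in the question (249 patients, 6,000 genes), and the variable names are illustrative. The rank of a predictor matrix cannot exceed its number of rows, so at most 249 of the 6,000 coefficients can be determined independently. That is consistent with the huge estimates and the convergence warning.

```r
set.seed(1)  # simulated data only, not the asker's expression values

nPatients <- 249
nGenes <- 6000

# Simulated expression matrix: one row per patient, one column per gene
expr <- matrix(rnorm(nPatients * nGenes), nrow = nPatients, ncol = nGenes)

# Rank = number of linearly independent columns; it is bounded by the row count
exprRank <- qr(expr)$rank

cat("genes (parameters):", nGenes, "\n")
cat("patients (rows):   ", nPatients, "\n")
cat("matrix rank:       ", exprRank, "\n")
```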
Answer 3:
Don't include the Surv() object in the data frame.
firstFinalData <- firstFinalData[,-1]
firstCox <- coxph(Surv(EFST, Status) ~ ., data = firstFinalData)
It should work (edit: on a smaller number of variables).
As Maurits Evers says, running a model with 6,000 predictor variables (genes) on only 249 subjects will result in convergence problems. Consider reducing the number of genes (or obtaining more patients!).
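One way to see why the Surv object should stay out of the data frame is to check how R expands '.' in each layout. The following is a small sketch using five rows and one gene column from the question's dput() output. The names toy, withSurv, dotWithSurv and dotWithoutSurv are illustrative. R expands '.' to every column of the data that is not named on the left-hand side of the formula. When the response is the stored SurvObj column, EFST and Status therefore become predictors, which matches the first two rows of the output in the question.

```r
library(survival)

# First five rows and one gene column from the question's dput() output
toy <- data.frame(
  EFST    = c(444, 5553, 5296, 922, 205),
  Status  = c(2, 1, 1, 2, 2),
  AADACL3 = c(5.52132, 5.64712, 5.45876, 5.71481, 5.1269)
)

# Layout from the question: a Surv column stored next to EFST and Status
withSurv <- cbind(data.frame(SurvObj = Surv(toy$EFST, toy$Status == 2)), toy)

# Inspect which predictors '.' expands to in each layout
dotWithSurv    <- attr(terms(SurvObj ~ ., data = withSurv), "term.labels")
dotWithoutSurv <- attr(terms(Surv(EFST, Status) ~ ., data = toy), "term.labels")

print(dotWithSurv)     # EFST and Status are included as predictors
print(dotWithoutSurv)  # only the gene column remains
```

Writing Surv(EFST, Status) directly in the formula keeps the time and status columns out of the predictor set without dropping them from the data frame.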
Source: https://stackoverflow.com/questions/60386233/getting-p-values-of-zero-in-cox-regression-r