Question

I am a student conducting a gene expression survival analysis in R. I have expression data for 6,000 genes in 249 patients, with each patient's event-free survival time and vital status as the response. When I ran a Cox regression on my dataset, I got extremely strange results (p-values of 0.00 and implausible hazard ratios). I have checked my code multiple times but cannot find the mistake. When I tried earlier with just one gene, it worked fine, but when I test multiple genes using the '.' shorthand in the formula, I do not get proper results. I would highly appreciate any help and have attached both my code and output. Let me know if more information is needed.

library(survival)
options(expressions = 5e5)
firstSplitData <- read.delim("/Users/menon/OneDrive/Desktop/csrsef files/FirstSplitDataFrame.txt")
firstInitialData <- data.frame(firstSplitData)
firstEventFreeTime <- firstInitialData[ , c("EFST")] 
firstVitalStatus <- firstInitialData[, c("Status")]
#create a temporary object to use in the final object in order to be able to use '.'
temporaryObj <- Surv(as.numeric(firstEventFreeTime), firstVitalStatus == 2)
firstFinalData <- data.frame(SurvObj = temporaryObj)
#bind the two together for the final data 
firstFinalData <- cbind(firstFinalData, firstInitialData[, 2:ncol(firstInitialData)])
#create final cox model
firstCox <- coxph(SurvObj ~ ., data =  firstFinalData)
summary(firstCox)$coefficients

And here is (some of) my output:

> summary(firstCox)$coefficients
                     coef     exp(coef)     se(coef)             z      Pr(>|z|)
EFST         3.644083e-03  1.003651e+00 0.0001340611    27.1822581 1.052851e-162
Status      -2.926090e+00  5.360625e-02 0.3182658189    -9.1938542  3.790122e-20
AADACL3      1.502153e+02  1.728460e+65 0.3665374081   409.8224582  0.000000e+00
AADACL4      5.857192e+01  2.738174e+25 0.3681708023   159.0889828  0.000000e+00
ACADM        2.455978e+02 4.589695e+106 0.2175220391  1129.0710334  0.000000e+00
ACAP3        4.093913e+02 6.256964e+177 0.2756635268  1485.1121632  0.000000e+00
ACOT11       1.940976e+01  2.688751e+08 0.3251033140    59.7033512  0.000000e+00
ACOT7       -2.841794e+02 3.823403e-124 0.3139848504  -905.0736377  0.000000e+00
ACTB        -5.562202e+01  6.976896e-25 0.3173481100  -175.2713234  0.000000e+00
ACTL8       -4.017414e+02 3.356676e-175 0.3435128215 -1169.5093020  0.000000e+00
ACTRT2      -7.613568e+01  8.603881e-34 0.2861088372  -266.1074036  0.000000e+00
ADC         -1.244476e+02  8.976070e-55 0.3201452217  -388.7223972  0.000000e+00
ADPRHL2      4.887427e+01  1.681998e+21 0.2895110526   168.8165913  0.000000e+00
AGMAT        7.266946e+02           Inf 0.4295874196  1691.6104194  0.000000e+00
AGO1         3.352041e+02 3.778188e+145 0.2633158947  1273.0111995  0.000000e+00
...

And here is what dput(firstFinalData[1:10, 1:10]) produces:

structure(list(SurvObj = structure(c(444, 5553, 5296, 922, 205, 
47, 401, 245, 263, 5564, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0), .Dim = c(10L, 
2L), .Dimnames = list(NULL, c("time", "status")), type = "right", class = "Surv"), 
    EFST = c(444L, 5553L, 5296L, 922L, 205L, 47L, 401L, 245L, 
    263L, 5564L), Status = c(2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
    2L, 1L), AADACL3 = c(5.52132, 5.64712, 5.45876, 5.71481, 
    5.1269, 5.88764, 5.08912, 4.91729, 5.65387, 5.59824), AADACL4 = c(5.17251, 
    5.41843, 5.10969, 5.23402, 4.60353, 5.70923, 5.02245, 5.1466, 
    4.8355, 4.83986), ACADM = c(7.47834, 7.43494, 7.91155, 7.86337, 
    8.39009, 6.16251, 7.83793, 7.71742, 6.98061, 7.78087), ACAP3 = c(7.80589, 
    8.00354, 7.75014, 7.61566, 7.55267, 7.9449, 7.20561, 7.99776, 
    7.72778, 7.43355), ACOT11 = c(6.75915, 6.30386, 6.38214, 
    6.54392, 6.64743, 6.78981, 6.42641, 6.58761, 6.66693, 6.53731
    ), ACOT7 = c(8.11807, 8.38011, 7.8349, 8.43645, 8.11502, 
    8.0109, 7.6866, 8.55327, 8.17004, 7.44455), ACTB = c(10.8227, 
    11.4556, 11.4216, 11.332, 10.9536, 9.83797, 11.2352, 11.5006, 
    11.1817, 10.895)), row.names = c(NA, 10L), class = "data.frame")

Thank you so much!

Edit:

I also get this warning message when I run firstCox <- coxph(SurvObj ~ ., data = firstFinalData):

In fitter(X, Y, istrat, offset, init, control, weights = weights,  :
  Ran out of iterations and did not converge

Answer 1:

If you want to fit a separate Cox regression model for each gene (one predictor per model), you can use the following code with your posted example data. First, remove the survival object in the first column.

myData <- firstFinalData[, -1]

library(survival)
firstCox <- coxph(Surv(EFST, Status) ~ ., data = myData)

This returns the warning you mentioned (too many predictors):

Warning message:
In fitter(X, Y, istrat, offset, init, control, weights = weights,  :
  Ran out of iterations and did not converge

To run multiple univariate models, first create a list of univariate formulas:

formulas <- sapply(names(myData)[3:9], function(x) as.formula(paste('Surv(EFST, Status) ~ ',x)))

Create a list of models using the coxph function:

models <- lapply(formulas, function(x) coxph(x, data=myData))

Extract the hazard ratios (exp(coef)), 95% confidence intervals, and p-values:

res <- lapply(models, function(x) cbind(HR = exp(coef(x)), exp(confint(x)), Pval = coef(summary(x))[, "Pr(>|z|)"]))
res

$AADACL3
               HR       2.5 %   97.5 %     Pval
AADACL3 0.1858129 0.008579879 4.024119 0.283442

$AADACL4
               HR      2.5 %   97.5 %      Pval
AADACL4 0.8481017 0.02748128 26.17333 0.9249839
...
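To scale the per-gene approach to thousands of genes, it can help to collect one row per gene in a single data frame and adjust the p-values for multiple testing. The following is a minimal sketch, not part of the original answer: the data are simulated, the gene names (GENE1 to GENE5) are hypothetical, and the Benjamini-Hochberg adjustment is an assumption about what a reasonable correction might be.

```r
library(survival)

# Simulated stand-in for the real data (assumption): 50 patients, 5 "genes".
set.seed(1)
n <- 50
myData <- data.frame(
  EFST   = rexp(n, rate = 0.01),           # follow-up time
  Status = sample(1:2, n, replace = TRUE)  # 1 = censored, 2 = event
)
for (g in paste0("GENE", 1:5)) myData[[g]] <- rnorm(n, mean = 7)

genes <- setdiff(names(myData), c("EFST", "Status"))

# Fit one univariate Cox model per gene and keep one summary row for each.
rows <- lapply(genes, function(g) {
  fit <- coxph(as.formula(paste("Surv(EFST, Status) ~", g)), data = myData)
  ci  <- exp(confint(fit))  # 1 x 2 matrix: lower and upper 95% limits
  data.frame(gene    = g,
             HR      = unname(exp(coef(fit))),
             lower95 = ci[1, 1],
             upper95 = ci[1, 2],
             p       = summary(fit)$coefficients[1, "Pr(>|z|)"])
})
res <- do.call(rbind, rows)

# Adjust for testing many genes (Benjamini-Hochberg false discovery rate).
res$p_adj <- p.adjust(res$p, method = "BH")
res[order(res$p), ]
```

With the real data, `genes` would be every column except EFST and Status, and the sorted table gives a ranked list of candidate genes.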

Answer 2:

Apart from the first two coefficients (EFST and Status), the coefficients for the genes are extremely large in magnitude, either positive or negative, relative to their standard errors. This produces enormous z-statistics, which explains the p-values of zero you're seeing.

I'm not sure I fully understand what you're doing. Wouldn't regressing on 6,000 genes with data from only 249 patients mean that you have far more parameters than observations?

In that case, you're running into overfitting issues, which would explain the parameter estimates.

Answer 3:

Don't include the Surv() object in the data frame.

firstFinalData <- firstFinalData[,-1]
firstCox <- coxph(Surv(EFST, Status) ~ ., data =  firstFinalData)

It should work (edit: on a smaller number of variables).
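The reason this matters can be checked by looking at how '.' expands. The sketch below uses a tiny made-up data frame (the values and the names GENE1 and GENE2 are hypothetical) and only inspects the formula terms, without fitting a model. When the response is written out as Surv(EFST, Status), those two columns are left out of '.'; when the response is a stored column such as SurvObj, EFST and Status are treated as ordinary predictors, which matches the EFST and Status rows in the question's output.

```r
library(survival)

# Tiny made-up data frame (assumption) with the same column roles as the question.
d <- data.frame(EFST   = c(5, 8, 12, 3),
                Status = c(2, 1, 2, 2),
                GENE1  = c(5.1, 5.6, 5.4, 5.9),
                GENE2  = c(7.2, 7.9, 7.5, 7.1))

# Response written out in the formula: '.' skips EFST and Status.
good <- attr(terms(Surv(EFST, Status) ~ ., data = d), "term.labels")

# Response stored as a column: '.' now picks up EFST and Status as predictors.
d2  <- cbind(data.frame(SurvObj = Surv(d$EFST, d$Status == 2)), d)
bad <- attr(terms(SurvObj ~ ., data = d2), "term.labels")

good  # "GENE1" "GENE2"
bad   # "EFST" "Status" "GENE1" "GENE2"
```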

As Maurits Evers says, running a model with 6,000 predictor variables (genes) on only 249 subjects will result in convergence problems. Consider reducing the number of genes (or obtaining more patients!)
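One simple way to reduce the number of genes before fitting is to keep only the most variable ones. This is a sketch under stated assumptions rather than a recommendation from the answer: the expression matrix is simulated, the gene names are hypothetical, and the cutoff of 10 genes is arbitrary. The filter ignores survival entirely, so it does not reuse the outcome to pick predictors.

```r
# Simulated expression values (assumption): 20 patients x 100 genes.
set.seed(42)
expr <- as.data.frame(matrix(rnorm(20 * 100, mean = 7), nrow = 20))
names(expr) <- paste0("GENE", seq_len(100))

# Rank genes by variance across patients and keep the 10 most variable.
gene_var <- sapply(expr, var)
keep     <- names(sort(gene_var, decreasing = TRUE))[1:10]
reduced  <- expr[, keep]

dim(reduced)  # 20 rows, 10 columns
```

The reduced set can then be bound to EFST and Status and passed to coxph as in the code above.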

Source: https://stackoverflow.com/questions/60386233/getting-p-values-of-zero-in-cox-regression-r

Tags

bioinformatics

survival-analysis

cox-regression

Getting P-Values of Zero in Cox Regression: R
