R and Python Give Different Results (Median, IQR, Mean, and STD)

悲哀的现实 2021-01-28 15:56

I am doing feature scaling on my data, and R and Python give me different scaled results. They also disagree on many of the underlying summary statistics (median, IQR, mean, and standard deviation).

1 answer
    小蘑菇 (original poster)
    2021-01-28 16:50

    tl;dr there are a few potential differences in algorithms even for such simple summary statistics, but given that you're seeing differences across the board and even in relatively simple computations such as the median, I think the problem is more likely that the values are getting truncated/modified/losing precision somehow in the transfer between platforms.

    (This is more of an extended comment than an answer, but it was getting awkwardly long.)

    • you're unlikely to get much farther without a reproducible example; there are various ways to create examples to test hypotheses for the differences, but it's better if you do so yourself rather than making answerers do it.

    • how are you transferring data to/from Python/R? Is there some rounding in the representation used in the transfer? (What do you get for max/min, which should be based on a single number with no floating-point computations? How about if you drop one value to get an odd-length vector and take the median?)

    • medians: I was originally going to say that this could be a function of different ways to define quantile interpolation for an even-length vector, but the definition of the median is somewhat simpler than general quantiles, so I'm not sure. The differences you're reporting above seem way too big to be driven by floating-point computation in this case (since the computation is just an average of two values of similar magnitude).

    • IQRs: similarly, there are different possible definitions of percentiles/quantiles: see ?quantile in R, which documents nine definitions selectable through the type argument (the default is type 7).

    • median() vs summary(): R's summary() reports values at reduced precision (often useful for a quick overview); this is a common source of confusion.

    • mean/sd: there are some possible subtleties in the algorithm here -- for example, R uses extended precision internally when summing, to reduce instability; I don't know whether Python does. However, this shouldn't make as big a difference as you're seeing unless the data are a bit weird:

     > x <- rnorm(1000000, mean=0, sd=1)
     > mean(x)
     [1] 0.001386724
     > sum(x)/length(x)
     [1] 0.001386724
     > mean(x)-sum(x)/length(x)
     [1] -1.734723e-18
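
    The same check can be sketched on the Python side. This is a minimal illustration using only the standard library, not a description of what any particular Python package does internally; the function names, the seed, and the "weird" data are made up for the example. It compares a plain running total with math.fsum(), which returns a correctly rounded sum.

```python
import math
import random


def naive_mean(values):
    """Mean via a plain left-to-right running total in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


def fsum_mean(values):
    """Mean via math.fsum, which tracks the low-order bits lost by rounding."""
    return math.fsum(values) / len(values)


# Well-behaved data: the two means differ only by rounding noise.
rng = random.Random(0)  # arbitrary seed, chosen only for repeatability
x = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
diff_ordinary = abs(naive_mean(x) - fsum_mean(x))

# "Weird" data (hypothetical): huge values that cancel, plus small ones.
y = [1e16, 1.0, -1e16, 1.0]
naive_weird = naive_mean(y)  # the first 1.0 is absorbed when added to 1e16
exact_weird = fsum_mean(y)   # the true mean is 2.0 / 4 = 0.5

print(diff_ordinary, naive_weird, exact_weird)
```

    For ordinary data the difference is far too small to explain the discrepancies in the question; it only becomes visible for inputs like y, where large values cancel.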
    

    Similarly, there are more- and less-stable ways to compute a variance/standard deviation.
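
    As an illustration of that last point, here is a hedged sketch in plain Python. The function names and the data are hypothetical, and this is not how R or any Python library computes a variance internally. The one-pass "sum of squares minus squared sum" shortcut is algebraically correct but suffers from cancellation when the mean is large relative to the spread; the two-pass version subtracts the mean first. Both divide by n - 1, as R's var() and sd() do.

```python
import math


def var_one_pass(values):
    """Shortcut formula (sum(x^2) - sum(x)^2 / n) / (n - 1); numerically unstable."""
    n = len(values)
    s = 0.0
    s2 = 0.0
    for v in values:
        s += v
        s2 += v * v
    return (s2 - s * s / n) / (n - 1)


def var_two_pass(values):
    """Compute the mean first, then sum squared deviations from it."""
    n = len(values)
    mean = math.fsum(values) / n
    ss = math.fsum((v - mean) ** 2 for v in values)
    return ss / (n - 1)


# Small spread on top of a large offset; the exact sample variance is 30.
data = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]

unstable = var_one_pass(data)  # comes out negative on IEEE-754 doubles
stable = var_two_pass(data)    # 30.0

print(unstable, stable)
```

    Apart from stability, the denominator convention is worth checking when comparing platforms: R's sd() divides by n - 1, while numpy.std() divides by n unless ddof=1 is passed.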
