I have a data frame, and I want to produce a table of summary statistics (number of valid numeric values, mean and sd) by group for each of three columns. I can't seem
colSums(!is.na(x))
should work.
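As a minimal sketch of how that count extends to n, mean and sd within each group using only base R: the data frame `df`, its columns `g`, `a`, `b`, and the helper `summarise_col` are made up for illustration and are not the asker's data.

```r
# Hypothetical data: one grouping column and two numeric columns with NAs
df <- data.frame(
  g = c("x", "x", "x", "y", "y", "y"),
  a = c(1, 2, NA, 4, 5, 6),
  b = c(NA, NA, 3, 1, 2, 3)
)

# Valid (non-NA) values per column, leaving out the grouping column
n_valid <- colSums(!is.na(df[, c("a", "b")]))

# n, mean and sd of a single vector, skipping NAs
summarise_col <- function(v) {
  c(n = sum(!is.na(v)), mean = mean(v, na.rm = TRUE), sd = sd(v, na.rm = TRUE))
}

# Apply the helper to column a within each level of g (one row per group)
by_group <- t(sapply(split(df$a, df$g), summarise_col))

print(n_valid)
print(by_group)
```

Wrapping the last step in lapply() over the columns of interest would repeat it for each of the three columns.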
These are a few add-on packages that might help (see Quick-R):
Using the Hmisc package
library(Hmisc)
describe(mydata)
# n, nmiss, unique, mean, 5,10,25,50,75,90,95th percentiles
# 5 lowest and 5 highest scores
Using the pastecs package
library(pastecs)
stat.desc(mydata)
# nbr.val, nbr.null, nbr.na, min, max, range, sum,
# median, mean, SE.mean, CI.mean, var, std.dev, coef.var
Using the psych package
library(psych)
describe(mydata)
# item name, item number, nvalid, mean, sd,
# median, mad, min, max, skew, kurtosis, se
I'd use describe.by from the psych package (it is called describeBy in current versions of psych):
> describe.by(biastable, as.factor(Nominal))
group: 1
var n mean sd median trimmed mad min max range skew kurtosis se
Nominal 1 9 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.00 NaN NaN 0.00
Actual 2 8 0.12 0.01 0.12 0.12 0.01 0.11 0.13 0.03 0.09 -1.47 0.00
LinPred 3 8 0.99 0.08 0.98 0.99 0.10 0.89 1.09 0.20 0.04 -1.70 0.03
QuadPred 4 8 0.99 0.08 0.99 0.99 0.10 0.88 1.09 0.20 -0.04 -1.64 0.03
------------------------------------------------------------------------
group: 3
var n mean sd median trimmed mad min max range skew kurtosis se
Nominal 1 9 3.00 0.00 3.00 3.00 0.00 3.00 3.00 0.00 NaN NaN 0.00
Actual 2 9 0.37 0.03 0.36 0.37 0.03 0.32 0.42 0.10 0.15 -1.50 0.01
LinPred 3 9 3.12 0.24 3.05 3.12 0.30 2.79 3.50 0.71 0.15 -1.52 0.08
QuadPred 4 9 3.10 0.23 3.06 3.10 0.34 2.79 3.46 0.67 0.12 -1.51 0.08
------------------------------------------------------------------------
group: 6
var n mean sd median trimmed mad min max range skew kurtosis se
Nominal 1 9 6.00 0.00 6.00 6.00 0.00 6.00 6.00 0.00 NaN NaN 0.00
Actual 2 9 0.71 0.04 0.70 0.71 0.04 0.66 0.78 0.12 0.46 -1.30 0.01
LinPred 3 9 6.02 0.30 5.91 6.02 0.28 5.61 6.47 0.86 0.28 -1.43 0.10
QuadPred 4 9 5.99 0.31 5.93 5.99 0.25 5.55 6.49 0.94 0.26 -1.26 0.10
------------------------------------------------------------------------
group: 10
var n mean sd median trimmed mad min max range skew kurtosis se
Nominal 1 9 10.00 0.00 10.00 10.00 0.00 10.00 10.00 0.00 NaN NaN 0.00
Actual 2 9 1.16 0.07 1.14 1.16 0.09 1.06 1.25 0.19 0.09 -1.71 0.02
LinPred 3 9 9.85 0.60 9.76 9.85 0.74 9.16 10.72 1.56 0.24 -1.76 0.20
QuadPred 4 9 9.79 0.62 9.63 9.79 0.72 9.05 10.78 1.72 0.27 -1.65 0.21
------------------------------------------------------------------------
group: 30
var n mean sd median trimmed mad min max range skew kurtosis se
Nominal 1 9 30.00 0.00 30.00 30.00 0.00 30.00 30.00 0.00 NaN NaN 0.00
Actual 2 9 3.53 0.22 3.51 3.53 0.21 3.25 3.85 0.60 0.23 -1.58 0.07
LinPred 3 9 30.08 1.55 29.88 30.08 1.44 27.70 32.66 4.96 0.21 -1.27 0.52
QuadPred 4 9 29.92 1.51 30.00 29.92 1.44 27.44 32.38 4.94 0.04 -1.22 0.50
------------------------------------------------------------------------
group: 50
var n mean sd median trimmed mad min max range skew kurtosis se
Nominal 1 9 50.00 0.00 50.00 50.00 0.00 50.00 50.00 0.00 NaN NaN 0.00
Actual 2 9 5.91 0.51 5.82 5.91 0.43 5.43 6.94 1.51 0.90 -0.73 0.17
LinPred 3 9 50.40 3.98 48.77 50.40 3.21 44.89 57.37 12.48 0.49 -1.16 1.33
QuadPred 4 9 50.24 3.97 48.91 50.24 2.65 44.49 57.01 12.52 0.39 -1.21 1.32
------------------------------------------------------------------------
group: 150
var n mean sd median trimmed mad min max range skew kurtosis se
Nominal 1 9 150.00 0.00 150.00 150.00 0.00 150.00 150.00 0.00 NaN NaN 0.00
Actual 2 6 17.23 0.97 17.20 17.23 0.67 15.90 18.80 2.90 0.25 -1.23 0.39
LinPred 3 6 147.19 8.11 147.01 147.19 11.13 138.04 155.39 17.36 -0.01 -2.22 3.31
QuadPred 4 6 147.77 7.95 147.48 147.77 10.95 139.60 157.78 18.17 0.07 -2.10 3.25
------------------------------------------------------------------------
group: 250
var n mean sd median trimmed mad min max range skew kurtosis se
Nominal 1 9 250.00 0.00 250.00 250.00 0.00 250.00 250.00 0.00 NaN NaN 0.00
Actual 2 9 28.83 1.18 28.70 28.83 0.89 27.10 31.20 4.10 0.59 -0.57 0.39
LinPred 3 9 246.29 10.57 245.98 246.29 9.31 231.46 264.81 33.35 0.33 -1.26 3.52
QuadPred 4 9 251.51 8.84 248.45 251.51 5.08 240.41 268.30 27.89 0.62 -1.04 2.95
>
What are "blank values" and "text values"? If you have a numeric vector, then you could have NAs (is.na()), Infs (is.infinite()), NaNs (is.nan()) and "valid" numeric values.
For "valid" numeric values (in the sense above) you could use is.finite():
is.finite(c(1,NA,Inf,NaN))
# [1] TRUE FALSE FALSE FALSE
sum( is.finite(c(1,NA,Inf,NaN)) )
# [1] 1
So colSums(is.numeric(x)) could be done like colSums(is.finite(x)).
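A small sketch of that per-column count, assuming `x` may be a data frame: is.finite() accepts vectors and matrices but not a data frame directly, so it is applied column by column here. The example data and the names `n_finite` and `n_finite_m` are invented for illustration.

```r
# Hypothetical data frame containing NA, Inf and NaN
x <- data.frame(
  u = c(1, NA, Inf, NaN),
  v = c(2, 3, 4, NA)
)

# Count finite values in each column, one column at a time
n_finite <- vapply(x, function(col) sum(is.finite(col)), integer(1))

# Alternative for an all-numeric data frame: convert to a matrix first
n_finite_m <- colSums(is.finite(as.matrix(x)))

print(n_finite)
```

The matrix route only makes sense when every column is numeric, since as.matrix() would otherwise produce a character matrix.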
Can you use something like this?
length(unique(x))
Does complete.cases() (or sum(complete.cases(x))) do what you want?
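For illustration, complete.cases() flags the rows that have no missing value in any column, so it counts fully observed rows rather than valid values per column. The data frame `d` below is hypothetical.

```r
# Hypothetical data with one NA in each column, on different rows
d <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(1, 2, NA, 4)
)

ok <- complete.cases(d)  # TRUE where the row has no NA in any column
n_complete <- sum(ok)    # number of fully observed rows

print(ok)          # TRUE FALSE FALSE TRUE
print(n_complete)  # 2
```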