I want to select columns from a data frame so that the resulting continuous column-sequences are as long as possible, while the number of rows with NAs is as small as possible, because they have to be dropped afterwards.
(The reason I want to do this is, that I want to run TraMineR::seqsubm()
to automatically get a matrix of transition costs (by transition probability) and later run cluster::agnes()
on it. TraMineR::seqsubm()
doesn't like NA
states and cluster::agnes()
with NA
states in the matrix doesn't necessarily make much sense.)
For that purpose I already wrote a working function that computes by principle all possible power-subsets and checks them for NA
s. It works well with this toy data d
which represents a 10x5 matrix:
> d
id X1 X2 X3 X4 X5
1 A 1 11 21 31 41
2 B 2 12 22 32 42
3 C 3 13 23 33 NA
4 D 4 14 24 34 NA
5 E 5 15 25 NA NA
6 F 6 16 26 NA NA
7 G 7 17 NA NA NA
8 H 8 18 NA NA NA
9 I 9 NA NA NA NA
10 J 10 NA NA NA NA
11 K NA NA NA NA NA
The problem now is that I actually want to apply the algorithm to survey data that would represent a 34235 x 17 matrix!
My code has been reviewed on Code Review, but still cannot be applied to the real data.
I am aware that with this approach there would be a huge calculation. (Presumably too huge for non-supercomputers?!)
Does anyone know a more suitable approach?
I show you the already enhanced function by @minem from Code Review:
seqRank2 <- function(d, id = "id") {
require(matrixStats)
# change structure, convert to matrix
ii <- as.character(d[, id])
dm <- d
dm[[id]] <- NULL
dm <- as.matrix(dm)
rownames(dm) <- ii
your.powerset = function(s){
l = vector(mode = "list", length = 2^length(s))
l[[1]] = numeric()
counter = 1L
for (x in 1L:length(s)) {
for (subset in 1L:counter) {
counter = counter + 1L
l[[counter]] = c(l[[subset]], s[x])
}
}
return(l[-1])
}
psr <- your.powerset(ii)
psc <- your.powerset(colnames(dm))
sss <- lapply(psr, function(x) {
i <- ii %in% x
lapply(psc, function(y) dm[i, y, drop = F])
})
cn <- sapply(sss, function(x)
lapply(x, function(y) {
if (ncol(y) == 1) {
if (any(is.na(y))) return(NULL)
return(y)
}
isna2 <- matrixStats::colAnyNAs(y)
if (all(isna2)) return(NULL)
if (sum(isna2) == 0) return(NA)
r <- y[, !isna2, drop = F]
return(r)
}))
scr <- sapply(cn, nrow)
scc <- sapply(cn, ncol)
namesCN <- sapply(cn, function(x) paste0(colnames(x), collapse = ", "))
names(scr) <- namesCN
scr <- unlist(scr)
names(scc) <- namesCN
scc <- unlist(scc)
m <- t(rbind(n.obs = scr, sq.len = scc))
ag <- aggregate(m, by = list(sequence = rownames(m)), max)
ag <- ag[order(-ag$sq.len, -ag$n.obs), ]
rownames(ag) <- NULL
return(ag)
}
Yielding:
> seqRank2(d)
sequence n.obs sq.len
1 X1, X2, X3, X4 4 4
2 X1, X2, X3 6 3
3 X1, X2, X4 4 3
4 X1, X3, X4 4 3
5 X2, X3, X4 4 3
6 X1, X2 8 2
7 X1, X3 6 2
8 X2, X3 6 2
9 X1, X4 4 2
10 X2, X4 4 2
11 X3, X4 4 2
12 X1 10 1
13 X2 8 1
14 X3 6 1
15 X4 4 1
16 X5 2 1
> system.time(x <- seqRank2(d))
user system elapsed
1.93 0.14 2.93
In this case I would choose X1, X2, X3, X4
, X1, X2, X3
or X2, X3, X4
because they're continuous and yield an appropriate number of observations.
Expected output:
So for toy data d
the expected output would be something like:
> seqRank2(d)
sequence n.obs sq.len
1 X1, X2, X3, X4 4 4
2 X1, X2, X3 6 3
3 X2, X3, X4 4 3
4 X1, X2 8 2
5 X2, X3 6 2
6 X3, X4 4 2
7 X1 10 1
8 X2 8 1
9 X3 6 1
10 X4 4 1
11 X5 2 1
And at the end the function should run properly on the huge matrix d.huge
which leads to errors at the moment:
> seqRank2(d.huge)
Error in vector(mode = "list", length = 2^length(s)) :
vector size cannot be infinite
Toy data d
:
d <- structure(list(id = structure(1:11, .Label = c("A", "B", "C",
"D", "E", "F", "G", "H", "I", "J", "K"), class = "factor"), X1 = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, NA), X2 = c(11L, 12L, 13L,
14L, 15L, 16L, 17L, 18L, NA, NA, NA), X3 = c(21L, 22L, 23L, 24L,
25L, 26L, NA, NA, NA, NA, NA), X4 = c(31L, 32L, 33L, 34L, NA,
NA, NA, NA, NA, NA, NA), X5 = c(41L, 42L, NA, NA, NA, NA, NA,
NA, NA, NA, NA)), row.names = c(NA, -11L), class = "data.frame")
Toy data d.huge
:
d.huge <- setNames(data.frame(matrix(1:15.3e5, 3e4, 51)),
c("id", paste0("X", 1:50)))
d.huge[, 41:51] <- lapply(d.huge[, 41:51], function(x){
x[which(x %in% sample(x, .05*length(x)))] <- NA
x
})
Appendix (see comments latest answer):
d.huge <- read.csv("d.huge.csv")
d.huge.1 <- d.huge[sample(nrow(d.huge), 3/4*nrow(d.huge)), ]
d1 <- seqRank3(d.huge.1, 1.27e-1, 1.780e1)
d2 <- d1[complete.cases(d1), ]
dim(d2)
names(d2)
This takes less than one second on the huge data
l1 = combn(2:length(d), 2, function(x) d[x[1]:x[2]], simplify = FALSE)
# If you also need "combinations" of only single columns, then uncomment the next line
# l1 = c(d[-1], l1)
l2 = sapply(l1, function(x) sum(complete.cases(x)))
score = sapply(1:length(l1), function(i) NCOL(l1[[i]]) * l2[i])
best_score = which.max(score)
best = l1[[best_score]]
The question was unclear about how to rank the various combinations. We can use different scoring formulae to generate different preferences. For example, to weight number of rows versus columns separately we can do
col_weight = 2
row_weight = 1
score = sapply(1:length(l1), function(i) col_weight*NCOL(l1[[i]]) + row_weight * l2[i])
Convert to matrix and calculate Na counts for each column:
dm <- is.na(d[, -1])
na_counts <- colSums(dm)
x <- data.frame(na_counts = na_counts, non_na_count = nrow(dm) - na_counts)
x <- as.matrix(x)
# create all combinations for column indexes:
nx <- 1:nrow(x)
psr <- do.call(c, lapply(seq_along(nx), combn, x = nx, simplify = FALSE))
# test if continuous:
good <- sapply(psr, function(y) !any(diff(sort.int(y)) != 1L))
psr <- psr[good == T] # remove non continuous
# for each combo count nas and non NA:
s <- sapply(psr, function(y) colSums(x[y, , drop = F]))
# put all together in table:
res <- data.frame(var_count = lengths(psr), t(s))
res$var_indexes <- sapply(psr, paste, collapse = ',')
res
# var_count na_counts non_na_count var_indexes
# 1 1 1 10 1
# 2 1 3 8 2
# 3 1 5 6 3
# 4 1 7 4 4
# 5 1 9 2 5
# 6 2 4 18 1,2
# 7 2 8 14 2,3
# 8 2 12 10 3,4
# 9 2 16 6 4,5
# 10 3 9 24 1,2,3
# 11 3 15 18 2,3,4
# 12 3 21 12 3,4,5
# 13 4 16 28 1,2,3,4
# 14 4 24 20 2,3,4,5
# 15 5 25 30 1,2,3,4,5
# choose
As var indexes are sorted, for speed we can use simply:
good <- sapply(psr, function(y) !any(diff(y) != 1L))
Just to clarify, the seqsubm
function from TraMineR
has no problem at all with NAs, nor with sequences of different length. However, the function expects a state sequence object (to be created with seqdef
) as input.
The function seqsubm
is for computing substitution costs (i.e. dissimilarities) between states by means of different methods. You probably refer to the method ('TRATE'
) that derives the costs from the observed transition probabilities, namely as 2-p(i|j) - p(j|i), where p(i|j) is the probability to be in state i in t when we were in state j in t-1. So, all we need are the transition probabilities, which can easily be estimated from a set of sequences of different length or with gaps within them.
I illustrate below using the ex1
data that ships with TraMineR
. (Due to the high number of different states in your toy example, the resulting matrix of substitution costs would be too large (28 x 28) for this illustration.)
library(TraMineR)
data(ex1)
sum(is.na(ex1))
# [1] 38
sq <- seqdef(ex1[1:13])
sq
# Sequence
# s1 *-*-*-A-A-A-A-A-A-A-A-A-A
# s2 D-D-D-B-B-B-B-B-B-B
# s3 *-D-D-D-D-D-D-D-D-D-D
# s4 A-A-*-*-B-B-B-B-D-D
# s5 A-*-A-A-A-A-*-A-A-A
# s6 *-*-*-C-C-C-C-C-C-C
# s7 *-*-*-*-*-*-*-*-*-*-*-*-*
sm <- seqsubm(sq, method='TRATE')
round(sm,digits=3)
# A-> B-> C-> D->
# A-> 0 2.000 2 2.000
# B-> 2 0.000 2 1.823
# C-> 2 2.000 0 2.000
# D-> 2 1.823 2 0.000
Now, it is not clear to me what you want to do with the state dissimilarities. Inputting them in a clustering algorithm, you would cluster the states. If you want to cluster the sequences, then you should first compute dissimilarities between sequences (using seqdist
and possibly passing the matrix of substitution costs returned by seqsubm
as sm
argument) and then input the resulting distance matrix in the clustering algorithm.
来源:https://stackoverflow.com/questions/53613882/how-to-get-the-largest-possible-column-sequence-with-the-least-possible-row-nas