Question
I am fairly new to R and have recently been using data.table heavily for a project that manipulates large data sets, specifically genome data. One of the columns is the chromosome number/name, formatted as "chr_", where _ is 1-22, X, or Y. Because the data is sorted by chromosomal position, this column is a natural primary key for my data. However, setting it as the key produces an unwanted result: the values are sorted in lexicographic order rather than natural numeric order (i.e. the order is 1,10,11,...,19,2,20,...,X,Y rather than 1,2,...,9,10,11,...,19,20,...,X,Y). I looked at the documentation for the factor() function, which includes an option, ordered, that implicitly reads the factor levels as ordered. However, I do not know of a way to specify that the chromosome column should be an ordered factor: the only related options are stringsAsFactors (which would convert all strings to factors, and that would be highly inefficient considering the number of non-unique strings in other columns) and colClasses, where I don't know of any method for casting columns to implicitly ordered factors.
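To make the ordering problem concrete, here is a minimal base R sketch. It uses sort() with method = "radix" (which sorts character vectors in C-locale order) as a stand-in for the keyed sort; that substitution is an assumption for illustration, not the exact data.table code path.

```r
# All 24 chromosome labels in their natural order
chr <- paste0("chr", c(1:22, "X", "Y"))

# Plain character sort: "chr10" lands before "chr2"
lex <- sort(chr, method = "radix")
print(head(lex, 5))

# Position of "chr2" after the lexicographic sort
print(which(lex == "chr2"))
```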
Does anyone know of an implementation of implicitly ordered factors for fread(), or any efficient method for data.table to convert a character column to an ordered factor?
NOTE:
I am mainly looking for the most efficient implementations, preferably ones that directly cast the column to an ordered factor during the read itself.
Answer 1:
From the description, it seems like this might help:
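set.seed(42)
# Example data: the 24 chromosome names in random order
dat <- data.frame(chrN = sample(paste0("chr", c(1:22, "X", "Y")), 24, replace = FALSE), value = rnorm(24), stringsAsFactors = FALSE)
library(gtools)
# Sort the rows in natural (mixed alphanumeric) order
dat[mixedorder(dat[, 1]), ]
# Build an ordered factor whose levels follow that natural order
ordered(dat[, 1], levels = mixedsort(unique(dat[, 1])))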
#[1] chr22 chrY chr7 chr18 chr13 chr10 chr14 chr3 chr11 chr16 chrX chr19
#[13] chr12 chr17 chr5 chr9 chr8 chr1 chr15 chr6 chr4 chr21 chr2 chr20
#24 Levels: chr1 < chr2 < chr3 < chr4 < chr5 < chr6 < chr7 < chr8 < ... < chrY
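If gtools is not available, the same levels can be derived in base R. The sketch below is only an illustration: chr_levels is a hypothetical helper (not part of any package), and it assumes every label has the form "chr" followed by either a number or a letter such as X or Y.

```r
# Hypothetical helper: return the unique labels with numeric suffixes first
# (ascending), followed by the non-numeric suffixes in alphabetical order.
chr_levels <- function(x) {
  u <- unique(x)
  suffix <- sub("^chr", "", u)
  num <- suppressWarnings(as.integer(suffix))  # NA for X, Y
  is_num <- !is.na(num)
  num[!is_num] <- 0L                           # placeholder, ties broken by suffix
  u[order(!is_num, num, suffix)]
}

chrN <- c("chr10", "chrX", "chr2", "chr1", "chrY", "chr19", "chr2")
f <- factor(chrN, levels = chr_levels(chrN), ordered = TRUE)

print(levels(f))
print(sort(f))
```

Sorting the factor then follows the level order rather than the character order.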
Answer 2:
Just specify the levels for the factor directly.
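# Example data: 100 chromosome labels drawn at random
d <- data.frame(chr = sample(c(1:22, "X", "Y"), 100, replace = TRUE))
# Fix the level order explicitly instead of relying on the default sort
d$chr <- factor(d$chr, levels = c(1:22, "X", "Y"))
ordered(d$chr)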
The output is
[1] 8 8 4 18 6 4 8 17 14 17 8 Y 16 3 15 22 9 16 11 17 12 17 12 11 18
[26] 16 X 10 15 7 18 6 Y Y 21 13 21 2 2 Y 21 8 4 21 X 6 12 19 14 10
[51] 7 15 10 19 4 21 20 14 18 4 4 11 7 14 17 17 2 9 1 11 16 17 19 14 1
[76] 19 12 18 18 13 10 17 21 18 17 Y Y 4 21 19 17 5 Y X 7 8 18 22 13 5
24 Levels: 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9 < 10 < 11 < 12 < 13 < ... < Y
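The same idea can be carried over to data.table. The following is a rough sketch, assuming the conversion is done right after the read rather than inside fread() itself; the toy input and the column names chr and pos are made up for illustration.

```r
library(data.table)

# Toy input standing in for a real file (hypothetical columns chr and pos)
lines <- c("chr,pos", "chr10,5", "chr2,7", "chrX,1", "chr1,3", "chr2,4")
dt <- fread(text = lines)  # chr is read as a character column

# Explicit levels in natural chromosome order
chr_levels <- paste0("chr", c(1:22, "X", "Y"))

# Convert by reference, so the table is not copied
dt[, chr := factor(chr, levels = chr_levels, ordered = TRUE)]

# Keying on a factor sorts by level order, not by the label text
setkey(dt, chr, pos)
print(dt)
```

Because := modifies the column in place, only the one column is rewritten, which matters for large tables.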
Source: https://stackoverflow.com/questions/25853575/fread-read-certain-row-as-implicitly-ordered-factor