fread: read certain row as implicitly ordered factor

Posted by 一笑奈何 on 2019-12-10 11:54:35

Question


I am fairly new to R and have recently been using data.table heavily for a project that manipulates large data sets, specifically genome data. One of the columns is the chromosome name, formatted as "chr_", where _ is 1-22, X, or Y. Because the data is sorted by chromosomal position, this column is a natural primary key. However, setting it as the key sorts the values lexicographically rather than in natural numeric order (i.e. the order is 1, 10, 11, ..., 19, 2, 20, ..., X, Y rather than 1, 2, ..., 9, 10, 11, ..., 19, 20, ..., X, Y).

I looked at the documentation for factor(), which has an ordered argument that treats the factor levels as ordered. However, I do not know how to tell fread() that the chromosome column should be an ordered factor. The only related options are stringsAsFactors (which would convert all string columns to factors, and that would be highly inefficient given the number of non-unique strings in the other columns) and colClasses, for which I don't know of any way to cast a column to an ordered factor.

Does anyone know of an implementation of implicitly ordered factors for fread(), or any efficient method for data.table to convert a character column to an ordered factor?

NOTE:

I am mainly looking for the most efficient implementations, preferably ones that directly cast the column to an ordered factor during the read itself.


Answer 1:


From the description, it seems like this might help:

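library(gtools)  # provides mixedorder() and mixedsort()

set.seed(42)
# Example data: the 24 chromosome names in random order
dat <- data.frame(chrN = sample(paste0("chr", c(1:22, "X", "Y")), 24, replace = FALSE),
                  value = rnorm(24), stringsAsFactors = FALSE)

# Sort the rows in natural (mixed alphanumeric) order
dat[mixedorder(dat[, 1]), ]

# Or convert the column to an ordered factor with naturally sorted levels
ordered(dat[, 1], levels = mixedsort(unique(dat[, 1])))
#[1] chr22 chrY  chr7  chr18 chr13 chr10 chr14 chr3  chr11 chr16 chrX  chr19
#[13] chr12 chr17 chr5  chr9  chr8  chr1  chr15 chr6  chr4  chr21 chr2  chr20
#24 Levels: chr1 < chr2 < chr3 < chr4 < chr5 < chr6 < chr7 < chr8 < ... < chrY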



Answer 2:


Just specify the levels for the factor directly.

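# Example data: 100 chromosome labels drawn at random
d <- data.frame(chr = sample(c(1:22, "X", "Y"), 100, replace = TRUE))
# Fix the level order explicitly, then print the column as an ordered factor
d$chr <- factor(d$chr, levels = c(1:22, "X", "Y"))
ordered(d$chr)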

The output is

[1] 8  8  4  18 6  4  8  17 14 17 8  Y  16 3  15 22 9  16 11 17 12 17 12 11 18
[26] 16 X  10 15 7  18 6  Y  Y  21 13 21 2  2  Y  21 8  4  21 X  6  12 19 14 10
[51] 7  15 10 19 4  21 20 14 18 4  4  11 7  14 17 17 2  9  1  11 16 17 19 14 1 
[76] 19 12 18 18 13 10 17 21 18 17 Y  Y  4  21 19 17 5  Y  X  7  8  18 22 13 5 
24 Levels: 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9 < 10 < 11 < 12 < 13 < ... < Y
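
Neither answer above uses data.table itself, so here is a minimal sketch of the same explicit-levels idea applied to a data.table after fread(). The inline text and the column names chr and pos are made up for illustration. As far as I know, fread() has no option that produces an ordered factor with custom levels during the read, so this sketch converts the column by reference immediately afterwards, which avoids copying the table.

```r
library(data.table)

# Hypothetical input; in practice this would be fread("your_file.tsv").
txt <- "chr\tpos\nchr10\t5\nchr2\t7\nchrX\t1\nchr1\t9\nchr2\t3\n"
DT <- fread(text = txt)   # chr is read as a character column

# Natural chromosome order, spelled out explicitly.
chr_levels <- paste0("chr", c(1:22, "X", "Y"))

# Convert in place (by reference) to an ordered factor.
DT[, chr := factor(chr, levels = chr_levels, ordered = TRUE)]

# Keying sorts a factor column by its level order, not alphabetically.
setkey(DT, chr, pos)
print(DT)
```

With the key set this way, chr2 sorts before chr10, and rows within a chromosome are ordered by pos. If the set of chromosome names is not known in advance, the levels could instead be derived with mixedsort(unique(DT$chr)) from gtools, as in Answer 1.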


Source: https://stackoverflow.com/questions/25853575/fread-read-certain-row-as-implicitly-ordered-factor
