`setattr` on `levels` preserving unwanted duplicates (R data.table)

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-10 14:38:45

Question


Key issue: using setattr to change level names keeps unwanted duplicates.

I am cleaning some data in which several factor levels that should be identical appear as two or more distinct levels. (This error is mostly due to typos and file-encoding issues.) I have 153K factors, and about 5% need to be corrected.

Example

In the following example, the vector has three levels, two of which need to be collapsed into one.

  incorrect <- factor(c("AOB", "QTX", "A_B"))   # this is how the data were entered
  correct   <- factor(c("AOB", "QTX", "AOB"))   # this is how the data *should* be

  > incorrect
  [1] AOB QTX A_B
  Levels: A_B AOB QTX   <~~ Note that "A_B" should be "AOB"

  > correct
  [1] AOB QTX AOB
  Levels: AOB QTX

The vector is part of a data.table.
Everything works fine when using the levels<- function to change the level names.
However, if using setattr, then unwanted duplicates are preserved.

library(data.table)

mydt1 <- data.table(id=1:3, incorrect, key="id")
mydt2 <- data.table(id=1:3, incorrect, key="id")



# assigning levels, duplicate levels are dropped
levels(mydt1$incorrect) <- gsub("_", "O", levels(mydt1$incorrect))

# using setattr, duplicate levels are not dropped
setattr(mydt2$incorrect, "levels", gsub("_", "O", levels(mydt2$incorrect)))

                # RESULTS
# Assigning Levels       # Using `setattr`
> mydt1$incorrect        >     mydt2$incorrect
[1] AOB QTX AOB          [1] AOB QTX AOB
Levels: AOB QTX          Levels: AOB AOB QTX   <~~~ Notice the duplicate level

Any thoughts on why this happens, and/or any options to change this behavior (e.g., something like ..., droplevels=TRUE)? Thanks


Answer 1:


setattr is a low level, brute force way to change attributes by reference. It doesn't know that the "levels" attribute is special. levels<- has more functionality inside it, but I suspect you may have found that levels(DT$col)<-newlevels will copy the whole of DT (base <-), hence for speed you looked to setattr.
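To make the difference concrete, here is a minimal sketch on two bare factor vectors (the names f and g are just for illustration, and it assumes data.table is installed). It only shows that levels<- merges the duplicates and renumbers the integer codes, while setattr swaps the attribute and leaves the codes alone:

```r
library(data.table)

f <- factor(c("AOB", "QTX", "A_B"))
g <- factor(c("AOB", "QTX", "A_B"))   # a separate object with the same data

# levels<- knows about factors: duplicate names are merged into one level
levels(f) <- gsub("_", "O", levels(f))

# setattr just overwrites the attribute in place; the codes are untouched
setattr(g, "levels", gsub("_", "O", levels(g)))

length(levels(f))                # 2 levels left after merging
length(levels(g))                # still 3 entries, one of them a duplicate
anyDuplicated(levels(g)) > 0     # TRUE
max(as.integer(f))               # 2: codes were renumbered
max(as.integer(g))               # 3: codes still point at the old slots
```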

I wouldn't say incorrect btw. It's a correct and valid factor, but just happens to have duplicate levels.

To drop the duplicate levels, I think (untested; factorCol stands for your factor column):

mydt1[,factorCol:=factor(factorCol)]

should do it. It's possible to go faster than that by finding which levels you've changed, changing the integers to point to the first of the duplicates, and then removing the dups from the levels. The call to factor() basically starts from scratch (i.e. it coerces the whole factor to character and rematches).
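A rough sketch of that faster route, using the example data from the question. This is one possible way to do it rather than a tested recipe: the helper names (old_levels, fixed, new_levels, remap, codes) are made up for illustration, and the factor is rebuilt directly from its integer codes with structure() instead of going through character:

```r
library(data.table)

dt <- data.table(id = 1:3,
                 incorrect = factor(c("AOB", "QTX", "A_B")),
                 key = "id")

old_levels <- levels(dt$incorrect)
fixed      <- gsub("_", "O", old_levels)   # corrected names, may contain dups
new_levels <- unique(fixed)                # "AOB" "QTX"

# For each old level, the position of its corrected name in new_levels,
# so duplicates all point at the first occurrence
remap <- match(fixed, new_levels)

# Translate the integer codes and rebuild the factor without a character pass
codes <- remap[as.integer(dt$incorrect)]
dt[, incorrect := structure(codes, levels = new_levels, class = "factor")]

levels(dt$incorrect)         # "AOB" "QTX"
as.character(dt$incorrect)   # "AOB" "QTX" "AOB"
```

The only work proportional to the number of rows is the integer lookup remap[...]; the string handling touches just the levels, which is where the saving over factor() would come from.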



Source: https://stackoverflow.com/questions/14757333/setattr-on-levels-preserving-unwanted-duplicates-r-data-table
