I've looked through a number of posts about ordering factors, but haven't quite found a match for my problem. Unfortunately, my knowledge of R is still pretty rudimentary.
I have a subset of an archaeological artifact catalog that I'm working with. I'm trying to cross-tabulate diagnostic historical artifact types and site testing locations. Easy enough with ddply or tapply.
My problem is that I want to sort the artifact types (a factor) by their mean diagnostic date (number/year), and I keep getting them alphabetically. I know I need to make it an ordered factor, but can't figure out how to order it by the year value in the other column.
IDENTIFY MIDDATE
engine-turned fine red stoneware 1769
white salt-glazed stoneware, scratch blue 1760
wrought nail, 'L' head 1760
yellow lead-glazed buff earthenware 1732
...
Which needs to be ordered:
IDENTIFY MIDDATE
yellow lead-glazed buff earthenware 1732
white salt-glazed stoneware, scratch blue 1760
wrought nail, 'L' head 1760
engine-turned fine red stoneware 1769
...
The factor (IDENTIFY) needs to be ordered by the Date (MIDDATE). I thought I had it with
Catalog$IDENTIFY<-factor(Catalog$IDENTIFY,levels=Catalog$MIDDATE,ordered=TRUE)
But get the warning:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels)
else paste0(labels,: duplicated levels will not be allowed
in factors anymore
IDENTIFY has ~130 factor levels and many have the same value for MIDDATE, so I need to order IDENTIFY by MIDDATE and another column TYPENAME.
A little more detail:
I have a dataframe Catalog
, which breaks down (i.e. str(Catalog)
) as:
> str(Catalog)
'data.frame': 2211 obs. of 15 variables:
$ TRENCH : Factor w/ 7 levels "DRT 1","DRT 2",..: 1 1 1 1 1 1 1 1 1 1 ...
$ U_TYPE : Factor w/ 3 levels "EU","INC","STP": 1 1 1 1 1 1 1 1 1 1 ...
$ U_LBL : Factor w/ 165 levels "001","005","007",..: 72 72 72 72 72 72 ...
$ L_STRAT : Factor w/ 217 levels "#2-7/25","[3]",..: 4 4 4 4 4 4 89 89 89 89 ...
$ START : num 0 0 0 0 0 0 39.4 39.4 39.4 39.4 ...
$ END : num 39.4 39.4 39.4 39.4 39.4 39.4 43.2 43.2 43.2 43.2 ...
$ Qty : int 1 1 3 5 1 1 6 8 1 1 ...
$ MATNAME : Factor w/ 6 levels "Ceramics","Chipped Stone",..: 1 1 1 5 5 6 ...
$ TYPENAME: Factor w/ 9 levels "Architectural Hardware",..: 9 9 9 1 1 3 9 ...
$ CATNAME : Factor w/ 32 levels "Biface","Bottle Glass",..: 24 29 29 6 24 ...
$ IDENTIFY: Factor w/ 112 levels "amethyst bottle glass",..: 17 91 96 71 103 ...
$ BEGDATE : int 1820 1820 1830 1835 1700 NA 1670 1762 1800 1720 ...
$ ENDDATE : int 1900 1970 1860 1875 1820 NA 1795 1820 1820 1780 ...
$ OCC_LBL : Ord.factor w/ 5 levels "Late 19th Century"<..: 2 1 2 2 4 5 4 3 ...
$ MIDDATE : num 1860 1895 1845 1855 1760 ...
I need to make IDENTIFY
an ordered factor, and reorder by MIDDATE
-> TYPENAME
-> alpha by IDENTIFY
.
What I'm not really getting is how to reorder by combined orders from multiple columns.
I would just do this in the db, but a lot of what I'm running are weighted means within all sorts of cross-tabulations (e.g. weighted mean depths below ground surface for artifact classes by location)...
...doable in Access, but messy, and unpredictable. Far easier and cleaner to manage in R, but I'd rather not to have to manually re-sort the resulting tables.
What I'm trying to produce is a number of things along these lines:
>xtab.Catalog<-tapply(Catalog$Qty,list(Catalog$IDENTIFY,Catalog$TRENCH),sum)
IDENTIFY DRT1 DRT2 DRT3 DRT4 DRT5 DRT6
Staffordshire stoneware 4 NA NA NA NA NA
undecorated delftware 6 4 NA NA NA NA
unidentified wrought nail 15 9 3 1 3 NA
white salt-glazed stoneware 6 1 1 NA 2 1
white salt-glazed scratch blue 1 NA NA NA NA NA
white stoneware, slip-dipped NA NA NA NA NA NA
wrought nail, 'L' head 2 NA NA NA NA NA
wrought nail, 'rose' head 62 21 4 NA 1 1
wrought nail, 'T' head 2 NA 1 NA NA 1
yellow lead-glazed 12 NA NA NA 1 3
...
...but I need them to sort in the logical (i.e. chronological/type) order instead of alphabetical.
Here's a reproducible sample, with solution:
set.seed(0)
a = sample(1:20,replace=F)
b = sample(1:20,replace=F)
f = as.factor(letters[1:20])
> a
[1] 18 6 7 10 15 4 13 14 8 20 1 2 9 5 3 16 12 19 11 17
> b
[1] 16 18 4 12 3 5 6 1 15 10 19 17 9 11 2 8 20 7 13 14
> f
[1] a b c d e f g h i j k l m n o p q r s t
Levels: a b c d e f g h i j k l m n o p q r s t
Now for the new factor:
fn = factor(f, levels=unique(f[order(a,b,f)]), ordered=TRUE)
> fn
[1] a b c d e f g h i j k l m n o p q r s t
20 Levels: k < l < o < f < n < b < c < i < m < d < s < q < g < h < e < ... < j
Sorted on 'a', next 'b' and finally 'f' itself (although in this example, 'a' has no repeated values).
I recommend the following dplyr-based approach (h/t daattali) that can be extended to as many columns as you like:
library(dplyr)
Catalog <- Catalog %>%
arrange(MIDDATE, TYPENAME) %>% # sort your dataframe
mutate(IDENTIFY = factor(IDENTIFY, unique(IDENTIFY))) # reset your factor-column based on that order
来源:https://stackoverflow.com/questions/10758243/sort-a-factor-based-on-value-in-one-or-more-other-columns