Factors ordered vs. levels

迷失自我 2020-12-03 18:35

Can someone explain what the "ordered" parameter of factor() is used for in R?

R says:

ordered
logical flag to determine if the levels should be

2 Answers
  • 2020-12-03 18:40

    Let's do some reading.

    From ?factor:

    levels an optional vector of the values that x might have taken. The default is the unique set of values taken by as.character(x), sorted into increasing order of x. Note that this set can be specified as smaller than sort(unique(x)).

    So if levels is left unspecified, factor() sorts the unique values of x for you and uses that sorted sequence as the order of the levels.
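
    As a small illustration of that default (the vectors x_num and x_chr are made up for this sketch), note that numeric input is sorted numerically before being turned into level labels:

    ```r
    # Numeric input: levels follow the numeric order of x, not the
    # alphabetical order of the strings "10", "2", "33".
    x_num <- c(10, 2, 33, 2)
    levels(factor(x_num))
    # [1] "2"  "10" "33"

    # Character input: levels follow the sorted order of the strings.
    x_chr <- c("low", "medium", "high")
    levels(factor(x_chr))
    # [1] "high"   "low"    "medium"
    ```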

    As Ben mentioned, the question of how ordered and unordered factors differ in practice is much more complicated and usually relies on a presumption that you know a reasonable amount of statistics. The documentation only says:

    Ordered factors differ from factors only in their class, but methods and the model-fitting functions treat the two classes quite differently.

    Again, as Ben mentions, many model fitting routines will treat ordered and unordered factors very differently because they have very different statistical meanings and interpretations. A detailed summary of the statistical differences is probably way beyond the scope of a StackOverflow answer.
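
    Without going into the statistics, a minimal sketch can at least show where the two classes diverge. The variable names are invented for illustration, and the contrast output assumes R's default options(contrasts = ...) setting, which uses treatment contrasts for unordered factors and polynomial contrasts for ordered ones:

    ```r
    sizes <- c("low", "medium", "high", "low", "high", "medium")
    lv <- c("low", "medium", "high")

    f_unordered <- factor(sizes, levels = lv)
    f_ordered   <- factor(sizes, levels = lv, ordered = TRUE)

    # The only structural difference is the class attribute.
    class(f_unordered)
    # [1] "factor"
    class(f_ordered)
    # [1] "ordered" "factor"

    # The underlying integer codes are the same.
    identical(as.integer(f_unordered), as.integer(f_ordered))
    # [1] TRUE

    # Model-fitting functions pick different contrasts for each class
    # (assuming the default contrasts option has not been changed).
    colnames(contrasts(f_unordered))
    # [1] "medium" "high"
    colnames(contrasts(f_ordered))
    # [1] ".L" ".Q"
    ```

    The ".L" and ".Q" columns are the linear and quadratic polynomial terms, which is why the coefficients of a model fitted on an ordered factor read so differently.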

  • 2020-12-03 18:54

    I'll replace your vector of names with a more intuitive example, one where an order actually makes sense:

    heights <- c("low","medium","high")
    
    heights1 <- factor(heights, ordered = TRUE)
    heights1
    # [1] low    medium high  
    # Levels: high < low < medium
    
    heights2 <- factor(heights) # ordered = FALSE by default
    heights2
    # [1] low    medium high  
    # Levels: high low medium
    

    The order of the levels might not be the one you expect: when you don't set an explicit order, the levels of a character vector are sorted alphabetically.

    To set an explicit order we can do as follows:

    heights1 <- factor(heights, levels = heights, ordered = TRUE)
    heights1
    # [1] low    medium high  
    # Levels: low < medium < high
    
    heights2 <- factor(heights, levels = heights)
    heights2
    # [1] low    medium high  
    # Levels: low medium high
    

    Levels can't be duplicated, so when x contains repeated values you might want to use factor(x, levels = unique(x)) instead. In that case the levels are ordered by their first appearance in x.
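
    A short sketch of that case, using a made-up vector x with repeated values. The exact message for the duplicated levels depends on the R version, so it is captured rather than shown:

    ```r
    x <- c("medium", "low", "medium", "high", "low")

    # Passing x itself as levels fails (or warns, in older R versions)
    # because "medium" and "low" occur more than once.
    res <- tryCatch(
      factor(x, levels = x),
      error = function(e) conditionMessage(e),
      warning = function(w) conditionMessage(w)
    )
    res

    # unique(x) keeps one copy of each value, in order of first appearance.
    f <- factor(x, levels = unique(x))
    levels(f)
    # [1] "medium" "low"    "high"
    ```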

    So now the levels are in the same order on both sides, but wait, one of them is supposed to be "unordered". The vocabulary is misleading: the levels of an unordered factor still have a position, and setting that position is possible and even useful, for instance if you want to tweak your layouts with ggplot2.
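
    To make the distinction concrete, here is a minimal sketch (names invented for illustration): sorting follows the level order for both kinds of factor, while comparison operators such as < are only defined for the ordered one.

    ```r
    lv <- c("low", "medium", "high")
    h_unordered <- factor(c("high", "low", "medium"), levels = lv)
    h_ordered   <- factor(c("high", "low", "medium"), levels = lv, ordered = TRUE)

    # Sorting follows the level order in both cases.
    as.character(sort(h_unordered))
    # [1] "low"    "medium" "high"
    as.character(sort(h_ordered))
    # [1] "low"    "medium" "high"

    # Comparisons are only defined for the ordered version.
    h_ordered < "medium"
    # [1] FALSE  TRUE FALSE

    # For the unordered factor, R returns NA and warns that '<' is
    # not meaningful for factors; the warning is suppressed here.
    cmp <- suppressWarnings(h_unordered < "medium")
    cmp
    # [1] NA NA NA
    ```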

    However, as mentioned by @joran and @thomas, statistical models will consider categorical variables differently depending on whether they are ordered or not.

    The use case that led me here, however, is calling max and min on factors, in particular inside aggregation functions.

    See the question "Aggregate with max and factors" and its accepted answer, where defining the factor as ordered is necessary.

    We had this:

    # > df1
    #    id height
    # 1   1    low          
    # 2   1   high         
    # 3   2 medium          
    # 4   2    low          
    # 5   3 medium          
    # 6   3 medium          
    # 7   4    low          
    # 8   4    low          
    # 9   5 medium          
    # 10  5 medium
    

    With unordered factors we couldn't aggregate:

    # aggregate(height ~ id,df1,max)
    # Error in Summary.factor(c(2L, 2L), na.rm = FALSE) : 
    # ‘max’ not meaningful for factors
    

    With ordered factors we can!

    # aggregate(height ~ id,df1,max)
    #   id height
    # 1  1   high
    # 2  2 medium
    # 3  3 medium
    # 4  4    low
    # 5  5 medium
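
    For anyone who wants to reproduce this, the following sketch rebuilds df1 from the printed table above (the construction code is my own reconstruction, not taken from the linked question) and runs the same aggregate() call on both versions of the column:

    ```r
    # Rebuild the data shown above; height starts as an unordered factor.
    lv <- c("low", "medium", "high")
    df1 <- data.frame(
      id = rep(1:5, each = 2),
      height = factor(
        c("low", "high", "medium", "low", "medium",
          "medium", "low", "low", "medium", "medium"),
        levels = lv
      )
    )

    # max() is not defined for unordered factors, so aggregate() fails;
    # the error message is captured instead of stopping the script.
    err <- tryCatch(
      aggregate(height ~ id, df1, max),
      error = function(e) conditionMessage(e)
    )
    err

    # Marking the same levels as ordered makes max() meaningful.
    df1$height <- factor(df1$height, levels = lv, ordered = TRUE)
    res <- aggregate(height ~ id, df1, max)
    res
    ```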
    