Sorting data frame based on month-year time format

孤城傲影 2021-01-03 06:29

I'm struggling with something very basic: sorting a data frame based on a time format (month-year, or “%B-%y” in this case). My goal is to calculate various monthly statis

6 answers
  • 2021-01-03 06:41

    It would be easier to have separate Month and Year factors, with their levels in the correct order, and use tapply on the combination of both factors, e.g.:

    ## The Month factor
    tmp09 <- within(tmp09,
                    Month <- droplevels(factor(strftime(ExitTime, format = "%B"),
                                                        levels = month.name)))
    ## For @Jura25's locale we can't use the built-in English constant month.name;
    ## instead, we can use this solution from ?month.name:
    ## format(ISOdate(2000, 1:12, 1), "%B")
    tmp09 <- within(tmp09,
                    Month <- droplevels(factor(strftime(ExitTime, format = "%B"),
                                                        levels = format(ISOdate(2000, 1:12, 1), "%B"))))
    ##
    ## And the Year factor
    tmp09 <- within(tmp09, Year <- factor(strftime(ExitTime, format = "%Y")))
    

    Which gives us (in my locale):

    > head(tmp09)
       Instrument AccountValue   monthYear   ExitTime    Month Year
    1         JPM         6997    april-07 2007-04-10    April 2007
    2         JPM         7261      mei-07 2007-05-29      May 2007
    3         JPM         7545     juli-07 2007-07-18     July 2007
    4         JPM         7614     juli-07 2007-07-19     July 2007
    5         JPM         7897 augustus-07 2007-08-22   August 2007
    10        JPM         7423 november-07 2007-11-02 November 2007
    

    Then use tapply with both factors:

    > with(tmp09, tapply(AccountValue, list(Month, Year), sum))
              2007
    April     6997
    May      21197
    July     29147
    August    7897
    November  7423
    

    or via aggregate:

    > with(tmp09, aggregate(AccountValue, list(Month = Month, Year = Year), sum))
         Month Year     x
    1    April 2007  6997
    2      May 2007 21197
    3     July 2007 29147
    4   August 2007  7897
    5 November 2007  7423
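
    If the month names themselves are not needed as labels, the locale question can be sidestepped. The following is a minimal sketch, assuming a small hypothetical data frame df with the same ExitTime and AccountValue columns as above; it builds the Month factor from the numeric "%m" format, which does not depend on LC_TIME.

    ```r
    # Hypothetical sample data mirroring the ExitTime and AccountValue columns
    df <- data.frame(
      ExitTime = as.Date(c("2007-11-02", "2007-04-10", "2007-05-29", "2007-05-14")),
      AccountValue = c(7423, 6997, 7261, 6992)
    )

    # "%m" yields "01".."12" in every locale, so the level order is always chronological
    df$Month <- droplevels(factor(format(df$ExitTime, "%m"),
                                  levels = sprintf("%02d", 1:12)))
    df$Year <- factor(format(df$ExitTime, "%Y"))

    # Sum per Month x Year; rows come out in level (calendar) order
    monthly <- with(df, tapply(AccountValue, list(Month, Year), sum))
    monthly
    ```

    The trade-off is that the rows are labelled "04", "05", "11" rather than with month names.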
    
  • 2021-01-03 06:45

    Edit: I misunderstood the question at first. Copy the data given in the question to the clipboard first, then run:

    > tmp09 <- read.table(file="clipboard", header=TRUE)
    > Sys.setlocale(category="LC_TIME", locale="Dutch_Belgium.1252")
    [1] "Dutch_Belgium.1252"
    
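    # create a POSIXlt variable from monthYear; note that the year is hard-coded
    # as 2007 and the trailing "07" is parsed as the day of the month (%d)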
    > tmp09$d <- strptime(paste("2007", tmp09$monthYear, sep="-"), "%Y-%B-%d")
    
    # create ordered factor
    > tmp09$dFac <- droplevels(cut(tmp09$d, breaks="month", ordered=TRUE))
    > tmp09[order(tmp09$d), ]
       Instrument AccountValue   monthYear   ExitTime          d       dFac
    1         JPM         6997    april-07 2007-04-10 2007-04-07 2007-04-01
    2         JPM         7261      mei-07 2007-05-29 2007-05-07 2007-05-01
    11        KFT         6992      mei-07 2007-05-14 2007-05-07 2007-05-01
    12        KFT         6944      mei-07 2007-05-21 2007-05-07 2007-05-01
    3         JPM         7545     juli-07 2007-07-18 2007-07-07 2007-07-01
    4         JPM         7614     juli-07 2007-07-19 2007-07-07 2007-07-01
    13        KFT         7069     juli-07 2007-07-09 2007-07-07 2007-07-01
    14        KFT         6919     juli-07 2007-07-16 2007-07-07 2007-07-01
    5         JPM         7897 augustus-07 2007-08-22 2007-08-07 2007-08-01
    10        JPM         7423 november-07 2007-11-02 2007-11-07 2007-11-01
    
    > Tmp09Totals <- tapply(tmp09$AccountValue, tmp09$dFac, sum)
    > Tmp09Totals
    2007-04-01 2007-05-01 2007-07-01 2007-08-01 2007-11-01 
          6997      21197      29147       7897       7423
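
    As a variant, the same cut() idea can be applied to the ExitTime column directly, which avoids both the locale setting and the hard-coded year. This is only a sketch, assuming a hypothetical data frame df with ExitTime and AccountValue columns like those above.

    ```r
    # Hypothetical sample data with the same column names as the question
    df <- data.frame(
      ExitTime = as.Date(c("2007-07-18", "2007-04-10", "2007-05-29", "2007-07-19")),
      AccountValue = c(7545, 6997, 7261, 7614)
    )

    # cut() on a Date with breaks = "month" labels each row with the first day
    # of its month; droplevels() removes months that have no rows (here June)
    df$MonthStart <- droplevels(cut(df$ExitTime, breaks = "month"))

    # The factor levels are already in chronological order
    totals <- tapply(df$AccountValue, df$MonthStart, sum)
    totals
    ```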
    
  • 2021-01-03 06:46

    Try using the "yearmon" class in zoo, as it sorts chronologically. Below we create the sample DF data frame and then add a YearMonth column of class "yearmon". Finally we perform the aggregation. The actual processing is just the last two lines (the rest only creates the sample data frame).

    Lines <-   "Instrument AccountValue   monthYear   ExitTime
    JPM         6997    april-07 2007-04-10
    JPM         7261      mei-07 2007-05-29
    JPM         7545     juli-07 2007-07-18
    JPM         7614     juli-07 2007-07-19
    JPM         7897 augustus-07 2007-08-22
    JPM         7423 november-07 2007-11-02
    KFT         6992      mei-07 2007-05-14
    KFT         6944      mei-07 2007-05-21
    KFT         7069     juli-07 2007-07-09
    KFT         6919     juli-07 2007-07-16"
    library(zoo)
    DF <- read.table(textConnection(Lines), header = TRUE)
    
    DF$YearMonth <- as.yearmon(DF$ExitTime)
    aggregate(AccountValue ~ YearMonth + Instrument, DF, sum)
    

    This gives the following:

    > aggregate(AccountValue ~ YearMonth + Instrument, DF, sum)
      YearMonth Instrument AccountValue
    1  Apr 2007        JPM         6997
    2  May 2007        JPM         7261
    3  Jul 2007        JPM        15159
    4  Aug 2007        JPM         7897
    5  Nov 2007        JPM         7423
    6  May 2007        KFT        13936
    7  Jul 2007        KFT        13988
    

    A slightly different approach and output uses read.zoo directly. It produces one column per instrument and one row per year/month. We read in the columns assigning them appropriate classes using "NULL" for the monthYear column since we won't use that one. We also specify that the time index is the 3rd column of the remaining columns and that we want the input split into columns by the 1st column. FUN=as.yearmon indicates that we want the time index to be converted from "Date" class to "yearmon" class and we aggregate everything using sum.

    z <- read.zoo(textConnection(Lines),  header = TRUE, index = 3, 
         split = 1, colClasses = c("character", "numeric", "NULL", "Date"),
         FUN = as.yearmon, aggregate = sum)
    

    The resulting zoo object looks like this:

    > z
               JPM   KFT
    Apr 2007  6997    NA
    May 2007  7261 13936
    Jul 2007 15159 13988
    Aug 2007  7897    NA
    Nov 2007  7423    NA
    

    We may prefer to keep it as a zoo object to take advantage of other functionality in zoo, or we can convert it to a data frame. data.frame(Time = time(z), coredata(z)) makes the time a separate column, while as.data.frame(z) uses row names for the time. fortify.zoo(z) also works.
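
    If zoo is not available, a similar grouping can be approximated in base R. The sketch below is a substitute technique, not the "yearmon" class: it uses a zero-padded "%Y-%m" string as the key, which happens to sort chronologically as plain text. The data frame is a hypothetical subset of the sample data above.

    ```r
    # Hypothetical subset of the sample data, built without textConnection()
    DF <- data.frame(
      Instrument = c("JPM", "JPM", "JPM", "KFT", "KFT"),
      AccountValue = c(7423, 7545, 7614, 6992, 6944),
      ExitTime = as.Date(c("2007-11-02", "2007-07-18", "2007-07-19",
                           "2007-05-14", "2007-05-21"))
    )

    # "2007-05" < "2007-07" < "2007-11" holds for plain string comparison
    DF$YearMonth <- format(DF$ExitTime, "%Y-%m")

    # Same formula interface as in the yearmon version
    res <- aggregate(AccountValue ~ YearMonth + Instrument, DF, sum)
    res
    ```

    Unlike "yearmon", the key is just a character string, so it carries no date arithmetic.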

  • 2021-01-03 06:49

    It looks like the main problem is how to sort a sequence of Month-Year strings chronologically. The easiest way is to prepend "01-" to each Month-Year string, parse the result as a regular date, and sort on that. So take your final data frame Tmp09Totals, and do this:

    monYear <- rownames(Tmp09Totals)
    sortedMonYear <- format(sort( as.Date( paste('01-', monYear, sep = ''),
                                           '%d-%B-%y')), 
                           '%B-%y')
    Tmp09Totals[ sortedMonYear, , drop = FALSE]
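
    The parsing step can be checked in isolation. In the sketch below the month-year labels are hypothetical and are generated with format() in whatever locale the session uses, on the assumption that month names written by format() in that locale can be read back by as.Date().

    ```r
    # Hypothetical labels, deliberately out of chronological order
    dates <- as.Date(c("2007-11-02", "2007-04-10", "2007-08-22"))
    monYear <- format(dates, "%B-%y")

    # Prepend a day so that the string is a complete date, then parse it
    parsed <- as.Date(paste0("01-", monYear), format = "%d-%B-%y")

    # Order the original labels by the parsed dates
    sortedMonYear <- monYear[order(parsed)]
    sortedMonYear
    ```

    Using order() on the parsed dates, rather than sorting and re-formatting, returns the original strings untouched.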
    
  • 2021-01-03 06:50

    You could reorder the factor levels with the reorder function.

    tmp09$monthYear <- reorder(tmp09$monthYear, as.numeric(as.Date(tmp09$ExitTime)))
    

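    The trick is to use the numeric representation of the date, i.e. the number of days since 1970-01-01 (see ?Date). reorder applies its default FUN = mean to these values within each monthYear level and sorts the levels by the result.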
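
    A small self-contained sketch of this, assuming a hypothetical data frame df whose monthYear labels would sort incorrectly in alphabetical order:

    ```r
    # Hypothetical sample data; alphabetically the labels would sort as
    # april, augustus, juli, mei
    df <- data.frame(
      monthYear = c("augustus-07", "april-07", "mei-07", "juli-07", "mei-07"),
      ExitTime = as.Date(c("2007-08-22", "2007-04-10", "2007-05-29",
                           "2007-07-18", "2007-05-14")),
      AccountValue = c(7897, 6997, 7261, 7545, 6992)
    )

    # Levels are ordered by the mean numeric date within each label
    df$monthYear <- reorder(factor(df$monthYear), as.numeric(df$ExitTime))
    levels(df$monthYear)

    # tapply() reports results in level order, which is now chronological
    totals <- tapply(df$AccountValue, df$monthYear, sum)
    totals
    ```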

  • 2021-01-03 07:04

    An old post, but worthy of a data.table approach.

    Read in the data and set the locale as described by @caracal:

    > Sys.setlocale(category="LC_TIME", locale="Dutch_Belgium.1252")
    [1] "Dutch_Belgium.1252"
    > tmp09 <- read.table(file="clipboard", header=TRUE)
    > tmp09$ExitTime <- as.Date(tmp09$ExitTime)
    

    Summarise the data as requested:

    > require(data.table)
    > data.table(tmp09)[, 
    +                   .(Tmp09Total = sum(AccountValue)),
    +                   by = .(Date = format(ExitTime, "%B-%y"))]
              Date Tmp09Total
    1:    april-07       6997
    2:      mei-07      21197
    3:     juli-07      29147
    4: augustus-07       7897
    5: november-07       7423
    